Hosting a static blog on something like GitHub Pages is an extremely popular way to get blogging without the hassle of a full-blown CMS. One major downside, however, is that because the content is static, there’s no easy way to provide search functionality.
Luckily, with a couple of free tools, you can add dynamic search to your static blog very easily. I want to show you how to create a free site search for your GitHub Pages blog.
Step 1: Crawl
First, you will need to use an import.io Crawler to grab all of the content from your site, so that you can index and then search it.
We’ve got some tutorials that introduce you to building a Crawler if you’ve never done one before.
When you build a Crawler for your blog, you can map columns for “title” and “subtitle”, which do what they say on the tin. You can then pick up the rest of the content using an “images” column (type IMAGE), a “content” column (type STRING) and a “links” column (type LINK), which can be mapped to pull out all of the images, paragraphs of text, and links on the page respectively.
You can train a single column to pick up multiple paragraphs at the same time:
A couple of final tips: firstly, train more than the 5 required pages – I trained over 10 to make sure I had coverage of all post formats. Additionally, when you are running your crawler, make sure you give it enough start pages so that it can navigate to all of your posts – it will only follow 10 links deep by default, and you may need to specify more URLs so you get all the posts in your archive.
If you would like to see an example of a crawler doing these things, you can check out the crawler for my blog. This is its crawl configuration:
Step 2: Search index provider
As a quick way to host a search index for my blog, I decided to try Facetflow: not only do they host Elasticsearch indexes for you, but they also have a free sandbox plan with a generous allowance of 5,000 documents and 500MB of storage.
Once you’re signed up, they will show you your connection details:
Step 3: Creating your index
To help you create the index with some sensible defaults, I have written some Python utilities called Gabriel, named after Gabriel John Utterson, the lawyer who investigates Dr Jekyll in Robert Louis Stevenson’s novella (Jekyll being the name of the software that powers GitHub Pages).
Clone the Gabriel repository on to your machine, and then configure your Facetflow credentials: copy the “es.json.template” file into “es.json”, and populate your details. You will need to change the “host” property, and put your Facetflow API key in the “credentials: username” field.
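As a rough illustration of what that configuration describes, here is a minimal sketch of reading an “es.json”-style file and turning it into connection settings. The exact layout of the real template is not shown in this post, so the “index” key, the empty password, the use of HTTP Basic authentication and the example host are all assumptions; only the “host” property and the “credentials: username” field come from the instructions above.

```python
import base64
import json

# Hypothetical contents of es.json; the real template in the Gabriel
# repository may be laid out differently.
EXAMPLE_ES_JSON = """
{
  "host": "https://example-cluster.invalid",
  "index": "blog",
  "credentials": {"username": "YOUR_API_KEY", "password": ""}
}
"""


def load_connection(raw_json):
    """Parse es.json-style settings into a base URL and a Basic auth header."""
    config = json.loads(raw_json)
    creds = config["credentials"]
    # Assumption: the API key is sent as the Basic auth username.
    token = "{}:{}".format(creds["username"], creds.get("password", ""))
    auth = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "base_url": "{}/{}".format(config["host"].rstrip("/"), config["index"]),
        "headers": {"Authorization": "Basic " + auth},
    }


connection = load_connection(EXAMPLE_ES_JSON)
print(connection["base_url"])
```

Keeping the credentials in a separate file like this means the scripts themselves never need to contain your API key.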
Once these two files are set up, you simply need to run the “create_index.py” script. This script will then create the appropriate index in your Facetflow account.
(I have also provided the “delete_index.py” script which will delete the index and data if you need to.)
Step 4: Indexing your content
Now that we have created our Crawler and our search index is ready for the data, it’s time to run the Crawler and populate the index so that we have content to search.
There are a few configuration files you need to fill in prior to being able to do this.
Next, you need your crawl configuration. There’s an example in “crawl.json.example”, but this will crawl my blog – you can get the “crawl.json” file for your own crawler by opening it in the import.io tool, then choosing the “Export settings” option:
The final configuration file is based on “mapping.json.template” – if you used the same column names as I outlined above, you can simply copy this file into “mapping.json”. If you have slightly different column names, you can modify this file to match those columns.
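To make the idea of a column mapping concrete, here is a minimal sketch of how crawler column names might be translated into index field names. The real format of “mapping.json” is not shown in this post, so the flat “column: field” layout, the “subheading” field name and the apply_mapping helper are all hypothetical.

```python
import json

# Hypothetical mapping: crawler column name -> index field name.
# The real mapping.json.template may use a different structure.
EXAMPLE_MAPPING = json.loads("""
{
  "title": "heading",
  "subtitle": "subheading",
  "content": "content",
  "images": "images",
  "links": "links"
}
""")


def apply_mapping(row, mapping):
    """Rename the mapped columns found in a row; unmapped columns are dropped."""
    return {field: row[column] for column, field in mapping.items() if column in row}


row = {
    "title": "Hello world",
    "content": ["First paragraph.", "Second paragraph."],
    "extra": 1,  # not in the mapping, so it is left out of the document
}
document = apply_mapping(row, EXAMPLE_MAPPING)
print(document)
```

If your Crawler uses different column names, only the left-hand side of a mapping like this would need to change.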
Once you have auth.json, crawl.json, es.json, index_mapping.json and mapping.json all set up, you are ready to run your crawler and put the data into your Elasticsearch index on Facetflow.
First, you need to start the Python script. “server.py” listens for pages of data that the import.io crawler finds, and sends them to Facetflow.
Once this is started, it is time to run the import.io crawler. The basic instructions for running the command line crawler are on our knowledge base. But if you have import.io downloaded in your home directory on Linux, for example, you could run a command like this from the Gabriel directory: “~/import.io/import.io -crawl crawl.json auth.json”.
Once the crawler is running, it will show you the pages of data on the command line. There will also be a row printed by the Python server for each row that it processes and sends to your Elasticsearch index.
When it has finished, the import.io crawler will output “Crawl finished” and then exit, and you can stop your Python script (with Ctrl+C). Facetflow should now show the number of indexed blog posts in your control panel:
Whenever you need to re-index your content (for example, after editing blog posts or creating new ones), you can repeat the two steps above. Because each page’s URL is used as its ID, re-indexing a page overwrites its existing entry rather than creating a duplicate, so your changes are picked up correctly. You could even run it as a scheduled task on your machine or a server to update your index automatically.
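The reason re-indexing is safe can be sketched in a few lines: if the document ID is derived from the page URL, indexing the same page twice targets the same address in the index. This is only an illustration and not the code in “server.py”; the build_index_request helper, the “post” type name, the percent-encoding of the ID and the example host are assumptions. The request is built but never sent.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_index_request(base_url, doc_type, page_url, document):
    """Build (but do not send) a PUT request whose document ID is the page URL."""
    # Percent-encode the URL so it fits into a single path segment.
    doc_id = quote(page_url, safe="")
    target = "{}/{}/{}".format(base_url.rstrip("/"), doc_type, doc_id)
    body = json.dumps(document).encode("utf-8")
    return Request(target, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


first = build_index_request("https://example-cluster.invalid/blog", "post",
                            "http://example.com/2014/hello/", {"heading": "Hello"})
second = build_index_request("https://example-cluster.invalid/blog", "post",
                             "http://example.com/2014/hello/", {"heading": "Hello, edited"})

print(first.full_url)
print(first.full_url == second.full_url)  # same page URL, same document address
```

Both requests point at the same document, so the second one would replace the first instead of adding a new entry.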
Step 5: Searching your content
Now that you have indexed your content, it’s time to search it! Facetflow shows you some examples of doing this, but if you want to make full use of the power of Elasticsearch, you want a URL something like this:
This is for my example index, and shows you how to combine the “q” parameter (search terms – don’t forget to URL-encode it!), the “default field” (heading and content), and pagination parameters from Elasticsearch’s query API to get scored results for content.
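If you are building that URL from code, for example in the script behind a search box, a minimal sketch might look like the following. It uses Elasticsearch’s URI search parameters “q”, “df”, “from” and “size”; the build_search_url helper, the example host, the index name and the page size are assumptions, and the sketch passes a single default field rather than reproducing my own index’s setup.

```python
from urllib.parse import urlencode


def build_search_url(base_url, terms, default_field="content", page=0, page_size=10):
    """Build an Elasticsearch URI search URL with URL-encoded search terms."""
    params = {
        "q": terms,                 # search terms; urlencode handles the escaping
        "df": default_field,        # field searched when the query names none
        "from": page * page_size,   # offset of the first result to return
        "size": page_size,          # number of results per page
    }
    return "{}/_search?{}".format(base_url.rstrip("/"), urlencode(params))


search_url = build_search_url("https://example-cluster.invalid/blog",
                              "static site search", page=1)
print(search_url)
```

Letting urlencode do the escaping avoids the easy mistake of pasting raw search terms containing spaces or ampersands straight into the query string.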
What have you come up with?
Hopefully this has inspired you to go off and build your own search page for your blog or website with data provided by an import.io data source. If you build something you’re proud of, let us know! You can either send us a tweet or drop us an email to email@example.com.
As always, if you need any support with the import.io tool or with code, just drop us a line to firstname.lastname@example.org and we’ll be more than happy to help out!