Crawling data to Amazon Kinesis

Recently I showed you how to use the Crawler’s command-line capability to index blog data into ElasticSearch. This was a reasonably low-volume use case – even the longest-running blogs will have only a few thousand posts to index.

If you are working with a significant volume of data – into the gigabytes – then you are going to need more infrastructure to handle the volume and load.

Today, at the Data Summit, I presented one way to deal with the data from huge crawls as it is retrieved: Amazon Kinesis.

Kinesis is Amazon’s distributed, redundant and (most importantly) highly scalable queue service. It is designed to ingest huge volumes of data very quickly, and then allow you to process that data in a highly parallelised manner.

Pushing crawled data into Kinesis requires just one Python script – a modified version of the one we used with the ElasticSearch example. For this demo, the script also conveniently sets up a Kinesis stream before pushing the data in.
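To give a feel for what such a script involves, here is a minimal sketch rather than the script from the demo. It assumes the boto3 library, and it assumes the Crawler hands over one JSON document per line on standard input. The stream name `crawl-demo`, the region and the helper names are all hypothetical.

```python
import hashlib
import sys

STREAM_NAME = "crawl-demo"  # hypothetical stream name
REGION = "us-east-1"        # assumed region; use your own


def ensure_stream(client, name, shard_count=1):
    """Create the stream if needed, then block until it is ready to use."""
    try:
        client.create_stream(StreamName=name, ShardCount=shard_count)
    except client.exceptions.ResourceInUseException:
        pass  # the stream already exists
    client.get_waiter("stream_exists").wait(StreamName=name)


def push_lines(client, stream_name, lines):
    """Send each non-empty line as one Kinesis record; return how many were sent."""
    sent = 0
    for line in lines:
        payload = line.strip().encode("utf-8")
        if not payload:
            continue
        # The partition key decides which shard receives the record.
        # Hashing the payload spreads records evenly (an illustrative choice).
        key = hashlib.md5(payload).hexdigest()
        client.put_record(StreamName=stream_name, Data=payload, PartitionKey=key)
        sent += 1
    return sent


def main():
    # Imported here so the helpers above can be tried without AWS access.
    import boto3

    client = boto3.client("kinesis", region_name=REGION)
    ensure_stream(client, STREAM_NAME)
    count = push_lines(client, STREAM_NAME, sys.stdin)
    print(f"{count} records sent to {STREAM_NAME}", file=sys.stderr)
```

Calling `main()` at the bottom of the file turns this into something you can pipe crawled data into. For larger crawls, batching records with `put_records` would cut down on the number of API calls.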

Once the command-line Crawler is configured to push the crawled data into the Python script, it is then a case of using the data that has been pushed into Kinesis. For this, we again have a small Python script which reads items out of Kinesis and displays them on the screen.
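A reader along those lines might look like the sketch below. Again, this is an illustration under assumptions rather than the demo script itself: it uses boto3, reads from the oldest available record in each shard, and the stream name, region and function names are hypothetical.

```python
import time

STREAM_NAME = "crawl-demo"  # hypothetical stream name
REGION = "us-east-1"        # assumed region; use your own


def read_shard(client, stream_name, shard_id, handle, max_polls=None, pause=1.0):
    """Poll one shard from its oldest record, passing each item to handle().

    Returns the number of records handled. With max_polls=None the loop runs
    until Kinesis stops returning a next iterator (i.e. the shard is closed).
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest stored record
    )["ShardIterator"]

    handled = 0
    polls = 0
    while iterator and (max_polls is None or polls < max_polls):
        response = client.get_records(ShardIterator=iterator, Limit=100)
        for record in response["Records"]:
            handle(record["Data"].decode("utf-8"))
            handled += 1
        iterator = response.get("NextShardIterator")
        polls += 1
        time.sleep(pause)  # stay under the per-shard read rate limit
    return handled


def main():
    # Imported here so read_shard() can be tried without AWS access.
    import boto3

    client = boto3.client("kinesis", region_name=REGION)
    for shard in client.list_shards(StreamName=STREAM_NAME)["Shards"]:
        # Display each item on the screen; ten polls per shard keeps the demo short.
        read_shard(client, STREAM_NAME, shard["ShardId"], print, max_polls=10)
```

Reading shards one after another is fine for a demo. To process a busy stream in parallel you would normally run one reader per shard.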

In addition to writing your own Kinesis stream processors, you can connect Kinesis to other Amazon data storage and analysis tools, such as Elastic MapReduce, DynamoDB and the Redshift petabyte-scale data warehouse.

This may be a small demo showing you how to get started with a Crawler and Kinesis, but it points to something bigger. You can quickly scale it out to many Crawlers on the command line, Kinesis scales elastically, and the retrieved data can be analysed in a highly parallelised manner. Together these mean that large-scale data acquisition, along with cost-effective processing and analysis of that data, is something technical teams of any size can achieve.
