“Gartner believes that enterprise data will grow 650 percent in the next five years, while IDC argues that the world’s information now doubles about every year and a half. IDC says that in 2011 we created 1.8 zettabytes (or 1.8 trillion GBs) of information, which is enough data to fill 57.5 billion 32GB Apple iPads, enough iPads to build a Great iPad Wall of China twice as tall as the original.” (Source)
Data is everywhere.
For the first time in history a large portion of the world’s data is in one place: the World Wide Web. Never before have we been so connected to each other, to our possessions, and to technology as we are today.
Wondering how big “Big Data” really is? Consider the following summary statements from The GovLab Index:
- How much data existed in the digital universe as of 2012: 2.7 zettabytes (1 zettabyte = 1 billion terabytes)
- Increase in the quantity of Internet data from 2005 to 2012: +1,696%
- Percent of the world’s data created in the last two years: 90
- Number of exabytes (=1 billion gigabytes) created every day in 2012: 2.5; that number doubles every month
- How much information in the digital universe is created and consumed by consumers (video, social media, photos, etc.) in 2012: 68%
- The world’s annual effective capacity to exchange information through telecommunication networks in 1986, 2007, and (predicted) 2013: 281 petabytes, 65 exabytes, 667 exabytes
- Increase in data collection volume year-over-year in 2012: 400%
From these numbers, two things are clear: (1) data is not going anywhere; and (2) the internet is essentially a giant, living data set that grows every second of every day.
The huge amount of data being uploaded and shared on the web creates a massive opportunity for businesses looking to learn more about their competitors, their products, their processes, their markets, and their customers.
To make the right business decisions today, you must rely on data. If you’re not doing it or thinking about it, you’re going to be left behind.
Don’t take our word for it, though. If you want to learn how companies of all sizes are benefiting from Big Data today, read through these articles:
By now, maybe we’ve convinced you of the opportunity that exists and how important data is to the continued success of your business. If so, that’s great! But before you start making decisions based on data, you need to figure out how you’re going to collect that data from the web, and how you’re going to make it ‘process ready’. This post will help you decide by walking you through 3 main data collection options: (1) code your own scraper; (2) visual data collection tools; (3) web data as a service. For each option, you’ll get an overview, the tools available, and a score card from us to help you decide which option is right for you.
Let’s begin with the first option:
1. The “Code Your Own” Option
Overview
If you are a developer, you have the option of building your own web collection tool using the frameworks listed below. These days, however, with the availability of option 2 (tools) and option 3 (web data as a service), this may not be the best choice given how much time is required to build and maintain your own custom web collection tool.
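To give a sense of what “coding your own” involves, here is a minimal sketch of a scraper in Python using the widely used requests and BeautifulSoup libraries. The URL and CSS selectors are placeholders, not taken from any real site; a production scraper would also need error handling, throttling and storage.

```python
# A minimal sketch of a "code your own" scraper using requests and BeautifulSoup.
# The URL and the CSS selectors below are hypothetical -- every site needs its own.
import requests
from bs4 import BeautifulSoup

def scrape_products(url):
    """Fetch a page and pull out product names and prices."""
    response = requests.get(url, headers={"User-Agent": "my-data-collector/1.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    products = []
    # ".product", ".name" and ".price" are placeholder selectors; inspect the
    # target page's HTML to find the real ones.
    for item in soup.select(".product"):
        products.append({
            "name": item.select_one(".name").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        })
    return products

if __name__ == "__main__":
    for product in scrape_products("https://example.com/headphones"):
        print(product)
```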
Frameworks available
Score Card:
People cost: This option requires a dedicated developer to get started, plus part-time developer time for maintenance.
Time to value: Allowing 1-3 months from project start to getting process-ready data is realistic, depending on the complexity of the project and your developer’s experience with extracting web data.
Ongoing: If you are collecting data from the web on an ongoing basis (e.g. monthly reviews from Amazon), it’s worth bearing in mind that web scrapers typically break when the websites they collect data from change. This creates ongoing maintenance work for the person who built the scraper in the first place, which can be costly when you depend on a single person (i.e. your developer).
Additional work required:
- Data QA (i.e. ensuring what is collected matches what is on the website)
- To monitor for change across a large data set (e.g. monitoring for new product reviews) you will need to diff your data files (see the sketch after this list)
- You will still need to monitor for breakage, as website changes break your scrapers.
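As a concrete example of the diffing point above, here is a rough sketch of comparing two collection runs to spot newly added records. The file names and the review_id key are assumptions; adapt them to however your scraper stores its output.

```python
# A rough sketch of diffing two collection runs to find new records
# (e.g. newly posted product reviews). The file names and the "review_id"
# key are assumptions -- adjust to your own data layout.
import json

def load_reviews(path):
    """Load a run's output and index it by review ID."""
    with open(path, encoding="utf-8") as f:
        return {row["review_id"]: row for row in json.load(f)}

previous = load_reviews("reviews_previous.json")
current = load_reviews("reviews_current.json")

new_ids = set(current) - set(previous)
print(f"{len(new_ids)} new reviews since the last run")
for review_id in sorted(new_ids):
    print(current[review_id])
```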
2. The “Tools” Option
Overview
These visual tools enable non-developers to get data from websites, which reduces the technical barrier to getting web data. Most of them are free or low-cost, which is great if you can’t code and don’t have budget. Warning: they still require set-up and maintenance, so while they don’t take up as much developer resource, you still have to build a team and a process to manage them, which is a serious consideration when planning your project. The question worth asking yourself is: do you want to spend your team’s time on data collection, or just focus on using the data (analysis and insight)?
Other things that need to be considered with the tools option (as with coding your own scraper) are:
- data QA process (ensuring what you collect from the web is actually what is on the website)
- managing query volume so that you don’t overload sites and get blocked
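On the query-volume point, here is a minimal sketch of throttling requests so a site isn’t overloaded. The delay value and user agent string are assumptions; always check the target site’s robots.txt and terms of service.

```python
# A minimal sketch of pacing requests so you don't overload a site and get
# blocked. The 5-second delay is an assumed, polite default -- not a rule.
import time
import requests

REQUEST_DELAY_SECONDS = 5  # assumed delay between requests

def fetch_pages(urls):
    """Fetch a list of URLs, pausing between each request."""
    session = requests.Session()
    session.headers["User-Agent"] = "my-data-collector/1.0 (contact@example.com)"
    pages = []
    for url in urls:
        response = session.get(url, timeout=30)
        response.raise_for_status()
        pages.append(response.text)
        time.sleep(REQUEST_DELAY_SECONDS)  # pause before the next request
    return pages
```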
Tools available
Mozenda (budget: $2000-10,000/year)
Connotate (budget: $50,000-100,000/year)
Outwit (budget: ~$200 desktop software purchase)
import.io (budget: Free)
Kimono Labs (budget: Free)
Kapow Software (budget: $50,000-100,000/year SaaS)
Score card:
People cost: This option requires non-technical folks to get started and ongoing resource for maintenance. It’s worth noting that for most data collection projects you will still need a developer to write scaffolding scripts around the scraping tool to get the result you want.
Time to value: Allowing 1-2 months from project start to getting process-ready data is realistic, depending on the complexity of the project and your experience with extracting web data.
Ongoing: If you are collecting data from the web on an ongoing basis (e.g. monthly reviews from Amazon), it’s worth bearing in mind that web scrapers (including scraping tools) typically break when the websites they collect data from change. This creates ongoing maintenance work for the person who built the scraper in the first place, which can be costly when you depend on a single person.
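One way to catch that breakage early is a simple post-run sanity check. The sketch below assumes a particular set of required fields and a minimum record count; both are placeholders you would tune to your own project.

```python
# A rough sketch of a breakage check: after each run, verify that the scraped
# records still contain the fields you expect and that the volume hasn't
# collapsed. The field names and thresholds are assumptions.
REQUIRED_FIELDS = {"product_name", "price", "review_text"}  # assumed schema
MIN_EXPECTED_RECORDS = 100                                   # assumed floor

def check_run(records):
    """Return a list of problems found in the latest scrape; empty means healthy."""
    problems = []
    if len(records) < MIN_EXPECTED_RECORDS:
        problems.append(f"Only {len(records)} records collected; the site layout may have changed.")
    for i, record in enumerate(records):
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            problems.append(f"Record {i} is missing fields: {sorted(missing)}")
    return problems

if __name__ == "__main__":
    latest_run = [{"product_name": "Headphones", "price": "$99"}]  # stand-in data
    for problem in check_run(latest_run):
        print("WARNING:", problem)
```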
Additional work required:
- Data QA (i.e. ensuring what is collected matches what is on the website)
- To monitor for change across a large data set (e.g. monitoring for new product reviews) you will need to diff your data files
- You will still need to monitor for breakage, as website changes break your scrapers.
- For more complex drill-down data collection projects you will need to chain your data scrapers together, which will require custom coding (for example, finding all headphones on Amazon and then drilling down to get the reviews for each headphone set will require 2-3 scrapers chained together)
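For illustration, here is a rough sketch of what chaining two scrapers looks like: the first lists product pages in a category, the second pulls the reviews from each page. The selectors and URLs are hypothetical; a real Amazon drill-down would be considerably more involved (pagination, throttling, anti-blocking).

```python
# A rough sketch of chaining two scrapers for a drill-down collection:
# scraper 1 lists product pages in a category, scraper 2 visits each product
# page and pulls its reviews. All selectors and URLs are hypothetical.
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")

def list_product_urls(category_url):
    """Scraper 1: collect links to every product in a category."""
    soup = get_soup(category_url)
    return [a["href"] for a in soup.select("a.product-link")]  # placeholder selector

def get_reviews(product_url):
    """Scraper 2: collect the reviews on a single product page."""
    soup = get_soup(product_url)
    return [r.get_text(strip=True) for r in soup.select(".review-text")]  # placeholder selector

def collect_category_reviews(category_url):
    """Chain the two scrapers: list products first, then fetch reviews for each."""
    return {url: get_reviews(url) for url in list_product_urls(category_url)}
```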
3. The “Data as a Service” Option
Overview
This is the most recent category in the web data space. It’s the best option for folks who have budget and want to focus their efforts on data analytics rather than on managing the data collection process. It enables users to specify website URLs (e.g. bestbuy.com), a data schema (e.g. product_name, description, price, etc.) and a refresh frequency (daily, weekly, monthly), and get the data delivered on that schedule. With no need for set-up or ongoing management, the process-ready data can plug straight into your data analytics stack.
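To make that concrete, here is a sketch of what such a job specification typically looks like. The field names and format are illustrative only; the actual specification and submission API depend on the provider you choose.

```python
# An illustrative "data as a service" job specification: state the site, the
# schema you want back, and how often to refresh. Field names and values are
# assumptions, not any particular provider's format.
import json

job_spec = {
    "site": "bestbuy.com",
    "schema": ["product_name", "description", "price"],
    "refresh": "weekly",   # daily / weekly / monthly
    "delivery": "csv",     # assumed delivery format
}

# The provider handles crawling, QA and breakage monitoring, and delivers
# process-ready files on the chosen schedule.
print(json.dumps(job_spec, indent=2))
```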
Tools available
Import.io DaaS offering (note: please let us know if you find any other options in this category!)
Score card:
People cost: This option requires no in-house resource for the build or for ongoing management.
Time to value: Allowing 2-4 weeks to get process-ready data is realistic.
Ongoing: No ongoing maintenance
Additional Work Required: No ongoing additional work required to get process ready data.
We hope this overview helps you find the right web data collection option for your business. If you have any questions, please reach out.
What other questions do you have for us on this subject? Ask us below!