Like many businesses today, you know you need external data to drive insights.
What you may not know is how to get your hands on that data. You would think in this day and age, like pretty much everything else, you could just buy what you want. And you can. Kind of.
“Big data will finally forge the last links of the value chain that will help companies drive more operational efficiencies from existing investments.”
– Sashi Reddi, VP of CSC’s Big Data and Analytics group
“Analytics has become a key competitive weapon. Leaders are addressing the diversity of data and unlocking the value of data via algorithms tuned to anticipate and deliver customer value.”
– Carrie Johnson, Sr VP at Forrester Research
“2016 will be the year of data productization and agile data integration. Clear definition of concepts will become even more important whereas speed and clarity will provide the competitive edge.”
– Yves Mulkers, Founder of 7wData
“2016 will be the year of deep learning. Data will move from experimental to deployed technology in image recognition, language understanding, and exceed human performance in many areas.”
– Gregory Piatetsky, Founder of KDNuggets
In this guide, we put ourselves in your shoes. The shoes of someone who needs to get external web data into their company. And we answer the all-important question:
How do I get external data into my business?
1. Build an in-house solution (DIY)
“Building an in-house solution is perfect for small to medium-sized businesses that have tight budget requirements for their data and analytics initiatives. Good data scientists should be able to build you some web scrapers, as well as analyze and visualize results from the data they scrape.”
Before deciding to sign a contract with a data provider, it’s worth considering if you can build something in-house. This typically involves employing software engineers to build web scrapers and a platform upon which those scrapers can run, as well as the hardware and software they need.
This solution works best for companies whose data needs are relatively small or come from a small number of sites. The more types of sites you need data from and/or the more complex those sites are, the more complicated the web scrapers will be to write and maintain – and the higher the costs.
Because of the high internal cost of such technical resources, there is a risk that you won’t be able to add them at speed – a typical project takes anywhere from one to three months. In addition, because extracting data isn’t a core competency of your business (or you wouldn’t be reading this article), there’s a danger that any roadblocks will take a long time to solve without expert help.
Web scrapers are typically brittle and liable to break every time a website changes. Every time a web scraper breaks, it needs to be adapted and sometimes rewritten, which is why in-house solutions require you to retain your developers full-time.
For example, if you were collecting addresses from SMBs, you would likely need to build a new web scraper for each website you wanted data from – since each SMB’s site is different. If you need 10,000 addresses from 10,000 different businesses, you are going to need either a very large team or a very long time.
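To make the per-site problem concrete, here is a minimal sketch using Python’s standard library. The site names, page markup and extraction rules are all hypothetical – the point is only that each site structures its address differently, so each one needs its own extraction logic:

```python
# Hypothetical markup from two imaginary SMB sites -- note the different structure.
import re

PAGE_A = '<div class="contact"><span id="addr">12 High St, Leeds</span></div>'
PAGE_B = '<footer><p>Find us at: 4 Market Rd, York</p></footer>'

def scrape_site_a(html):
    # Site A wraps the address in a span with id="addr".
    match = re.search(r'<span id="addr">(.*?)</span>', html)
    return match.group(1) if match else None

def scrape_site_b(html):
    # Site B buries the address in a labelled footer paragraph.
    match = re.search(r'Find us at: (.*?)</p>', html)
    return match.group(1) if match else None

# One set of rules per site: 10,000 sites means 10,000 of these.
scrapers = {"site-a.example": scrape_site_a, "site-b.example": scrape_site_b}
pages = {"site-a.example": PAGE_A, "site-b.example": PAGE_B}

addresses = {site: scrapers[site](html) for site, html in pages.items()}
print(addresses)
```

Neither extraction rule works on the other site’s page, and either one silently breaks the moment its site’s markup changes – which is exactly the maintenance burden described above.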
To be scalable, most modern web data extraction requires specialist proprietary software built on patented algorithms that take years to develop – which is why many companies choose to rely on tools and data services for their data collection.
2. Integrate with a data tool
If you want to control your data extraction in-house, but don’t have the technical expertise to build a full-stack solution, you can subscribe to a data scraping platform that will do some of the work for you.
To extract data from the web, you need a means of collecting and parsing the data from a web page (i.e. a web scraper) as well as a platform that can run that scraper over a bunch of different pages. Some tools will provide you with one or the other of those services, while others provide both.
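The two pieces described above – collecting a page and parsing structured fields out of it – can be sketched with nothing but Python’s standard library. This is an illustrative toy, not any particular tool’s implementation: the page content is inlined rather than fetched over HTTP, and the markup and field names are invented:

```python
from html.parser import HTMLParser

# In a real scraper this HTML would be fetched (e.g. via urllib.request);
# it is inlined here so the parsing step stands on its own.
SAMPLE_PAGE = """
<html><body>
  <h2 class="product">Blue Widget</h2><span class="price">$9.99</span>
  <h2 class="product">Red Widget</h2><span class="price">$12.50</span>
</body></html>
"""

class ProductParser(HTMLParser):
    """Pull (product, price) pairs out of the sample markup."""
    def __init__(self):
        super().__init__()
        self._field = None      # which field the next text node belongs to
        self._current = None    # product name awaiting its price
        self.products = []      # collected (name, price) tuples

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if tag == "h2" and "product" in classes:
            self._field = "name"
        elif tag == "span" and "price" in classes:
            self._field = "price"

    def handle_data(self, data):
        if self._field == "name":
            self._current = data.strip()
        elif self._field == "price":
            self.products.append((self._current, data.strip()))
        self._field = None

parser = ProductParser()
parser.feed(SAMPLE_PAGE)
print(parser.products)
```

The scraping tools listed below wrap roughly this logic in a visual interface, and the platform half of the equation is what runs code like this across many pages on a schedule.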
Tools such as Scrapinghub and ScraperWiki give you a cloud platform for running web scrapers you build in-house. Tools like OutWit and ScrapeBox let you build a web scraper without writing any code, but leave you to run it on your own platform. Still others – Import.io, KimonoLabs, Connotate, Mozenda, ParseHub, FiveFilters and WebScraper.io – provide both.
Choosing between these tools depends on your budget, technical experience, data volume, website complexity and the format you want to consume the data in.
Either way, most tools provide a semi (sometimes fully) visual way of performing the task, reducing many of the technical barriers to extracting data. Most of them are also pretty low-cost (some are even free) making them a good solution if you’re low on budget.
However, these tools still require set-up and maintenance. So, while you may not need as many internal technical resources, you will still need a team to build, run and manage them. You will also remain wholly responsible for data quality – another very time-consuming task.
Not sure which tool to use? This list of 28 big data tools should help you figure out which is best.
3. Find a ‘vertical specific’ data provider
Depending on your industry, there may or may not be a data supplier who will sell you a standardized dataset of relevant industry information.
How they get this data depends on the type of business. Some, like Factual (a location data provider), scrape and clean data from other sites around the web. Others, like Yellow Pages, sell data they collect through their business. And still others, like Bloomberg, source data from corporate registries and stock exchanges.
No matter how they get their data, these providers typically develop specialized technology dedicated to processing only the data that is relevant to their vertical, meaning a provider that specializes in locations likely won’t be able to sell you pricing data.
Specialized technology does mean that you will be able to get lots of good quality data very quickly. But, it also means that you won’t be able to customize the data you receive. These are out-of-the-box solutions that won’t allow you to add or subtract sources. The challenge for many businesses, then, is finding a provider that sells exactly the data that they need. For the most part these data providers are only part of a business’s external data solution.
The other thing to consider when buying vertical-specific data is that everyone else in your market will also have access to the same data. It will still provide you with good insights, but those insights won’t be exclusive to you.
4. Buy data as a service (DaaS)
The final way to collect data from the web is to pay another company to get it for you. DaaS companies are specialists in data extraction, not your market, which means you will have to provide them with a spec of what data you want and where you want it from. However, they will have far superior knowledge of how to extract that data efficiently and at scale.
Using a data service means that you don’t have to hire additional internal resources or incur any maintenance costs. It also allows you to specify exactly what data you want so that you don’t have to sift through extraneous information. Some services will also do the post-processing and quality checking for you – reducing the load on your internal analysis teams.
Another benefit of an external service provider is that they will consult with you to get you exactly the data you need, in the format you need. And they will have additional specialist tooling that enables them to provide data from the web that other solutions can’t.
Most data services will want you to commit to a 12-month contract, but the best ones will have opt-out clauses in case the data proves unsatisfactory. When choosing a provider, look for one that is willing to do a proof of concept to demonstrate data quality and has a good track record with its other clients.
Factors to consider when buying data from a provider
Before you break out your checkbook, there’s a lot to consider.
Some data providers will supply you with custom built, flexible datasets while others will sell you standard “out-of-the-box” solutions, and still others will offer you a terminal which contains a large amount of standardized data in a graphical user interface. You might even consider building a team to gather the data for you in-house.
When you’re assessing the different types of data providers to choose from, consider the following dimensions.
Completeness of data
Will the selected data solution provide all of the data you need, or is it likely that you will have to supplement it with additional data sources?
Flexibility of dataset/schema
In the future you may need to add new data sources or data points to your data solution in response to changing business requirements. How easy is it to add to or extend the data solution with additional data sources?
Speed of data delivery
How quickly can your chosen solution provide you with data? And how quickly can it respond to new requests and changes in spec?
Cost
Depending on your approach, there will be a mix of internal and external costs associated with your data solution. Internal costs may include technology infrastructure and development, training, and setup and integration expenses. External costs may include the data provider subscription, maintenance fees, and set-up costs with the data services provider.
Ease of swap out
How easy will it be to swap out the data solution in the future in favour of an alternative data solution? If you decide to change providers you don’t want to lose all the data you’ve collected so far.
Provider focus
Is collecting data the main focus of the data provider you’ve chosen, or is it more of a side business?
If the data you need is critical, you have to do your due diligence and make sure the company you work with will still be around in two years’ time, and will be able to serve your needs long term.
The data provider you choose will depend on what data you want to buy, where you want it from, how often you want it refreshed and how process ready you need it to be. Some of the criteria listed above may be more important to your business than others.
Which solution is right for you?
There is no one single solution to sourcing data for your business. However, there are some guidelines.
Best: Buy from a DaaS provider
There’s no need to reinvent the wheel! Building a solution in-house – for anything but the simplest use-cases – makes very little financial or strategic sense. Today’s solutions allow you to buy the data you need at a much lower cost, with a higher degree of quality, flexibility, scale, and value. Building in-house is risky, expensive and a huge commitment. If you have budget and need a lot of data, this is the way to go.
Pros:
- Everything from the data to the delivery method is customizable
- Data quality checks built in
- Specialist tools and algorithms mean they can extract data more efficiently
- Lets you focus on ‘core business’

Cons:
- May require a long-term contract
- Longevity of small companies is a risk
Good: Integrate with a web scraping tool
If you don’t have budget, try a free, no-code-required web scraper platform to build out an MVP and prove the value of the data you need before pushing internally to buy that data from a reputable source.
Doing it this way eliminates 90% of the overhead you would incur trying to use a freelancer or offshore team to ‘try scraping’. It will also start to give you some insight into the world of web data extraction.
Pros:
- You control the data that is returned
- Don’t have to build and host a full-stack solution
- Tool support teams are on hand if you get stuck
- Skill level required is less than building an in-house solution

Cons:
- Tools require a learning curve
- Time spent using the tool and maintaining sources
- Limited functionality
OK: Vertical specific data provider
If you can buy the data direct from the source, in a format you need, then this may be an option for you. Just know that your data will then be limited to the data they collect, and you will have very little visibility into market coverage and data accuracy beyond what they tell you.
Pros:
- Most comprehensive data in that area
- A long-standing provider is unlikely to miss a data delivery or go bust

Cons:
- No customization of data or sources
- May need multiple suppliers for a total view of the market
- Insights are not exclusive
Bad: Scrape it yourself
As a last resort, or if you only need data from a few sources, you can build a scraper yourself. This solution will require copious knowledge and internal resources, but it will also offer you total control over your project and data.
Pros:
- 100% control of project
- Good for ‘one-off’ scrapes

Cons:
- IT and infrastructure overheads
- Cost of maintenance
- No SLA or fallback position
- Must hire, train and retain an internal team
- Not a core competency for most businesses
- No access to advanced algorithms