If you’re one of the 125k KimonoLabs users who got this message last week…
“After almost 2 years of building and growing kimono, we couldn’t be happier to announce that the kimono team is joining Palantir.”
…you’re probably wondering what to do next.
There’s no denying that kimono was a useful service with some great features and if you’ve come to rely on any one of them for your data collection, finding and learning a new tool can seem daunting. To help you sort through the field, we’ve broken down kimono’s scraping features and listed the possible alternatives.
A quick disclaimer. Yes, we are (or I guess were) a direct competitor to KimonoLabs’ free scraping product. But this is not just going to be a post about how awesome we are. There is no one-size-fits-all solution to data scraping. The tool you choose will depend on your technical skills, data requirements, budget and how the product speaks to you. Obviously, we hope you’ll eventually choose us, but if you don’t, that’s okay. At the end of the day, we just want you to have access to the data you need.
Right, let’s get started.
A major benefit of using any scraping service is that you don’t have to host anything yourself. Everything is done in the cloud. All the website page views, data normalization, and transformation get handled on someone else’s server.
These days, most scraping platforms are cloud hosted – as is practically everything else – so you shouldn’t have any trouble finding one. A few of the big names to check out are Import.io, Mozenda, Cloudscrape, ParseHub and Scrapinghub.
Both Mozenda and Import.io allow you to run your API and see your data on the web, but in order to build a point-and-click API, you need to download an app. Mozenda’s is only available on Windows while Import.io’s can be used on any operating system.
If you’re looking for another Chrome extension, you could try data-miner.io, or webscraper.io. Or for a solely web-based scraping experience with no extensions or downloading, you can use Cloudscrape, Semantics3, Apifier, or import.io Magic.
Magic is a special type of data extraction that automatically finds and pulls your data out of a page with just a URL.
If your data is spread across multiple pages with the same structure (like a product list that goes on for several pages), pagination is the quickest way to access the data from all the subsequent pages. You just train your API on the first page and then select the “next” button, and the API should be able to grab the same information from however many pages follow.
Diffbot and Semantics3 are the only free scrapers that have true pagination options. If you’re an Import.io user (true pagination is coming soon), you can use the Bulk API tool to achieve the same effect. This tutorial will show you how.
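If you’re curious what pagination looks like under the hood, here is a minimal sketch in Python. It is not how any of the tools above implement it. The `crawl_pages` function, the `fetch_page` callback and the `FAKE_SITE` dictionary are all made up for illustration, and the fake site stands in for real HTTP requests so the example runs offline.

```python
def crawl_pages(first_url, fetch_page):
    """Follow "next" links starting at first_url and collect the rows of every page.

    fetch_page is any callable that takes a URL and returns a dict with a
    "rows" list and an optional "next" URL (hypothetical shape, for illustration).
    """
    rows = []
    seen = set()  # guards against a "next" link that loops back on itself
    url = first_url
    while url and url not in seen:
        seen.add(url)
        page = fetch_page(url)
        rows.extend(page["rows"])
        url = page.get("next")
    return rows


# Stand-in for a real website: three pages sharing the same structure.
FAKE_SITE = {
    "/list?page=1": {"rows": ["A", "B"], "next": "/list?page=2"},
    "/list?page=2": {"rows": ["C", "D"], "next": "/list?page=3"},
    "/list?page=3": {"rows": ["E"], "next": None},
}

all_rows = crawl_pages("/list?page=1", FAKE_SITE.__getitem__)
print(all_rows)
```

In a real scraper, `fetch_page` would download the page and pull out both the rows and the target of the “next” button.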
Writing your own Regex allows you to be more specific about the data you extract. Point-and-click extraction tools like kimono work by trying to generate the XPath and Regex required to access the data. However, depending on how complex the data or website is, the tool may not always get it right, which is why being able to override the automatically generated XPath or Regex is incredibly useful.
Of course, this assumes that you have enough of an understanding of Regex to be able to write your own. Honestly, it’s not as hard as it might sound. If you can’t write Regex, but you’d like to, this w3schools tutorial will teach you everything you need to know.
There are lots of other scraping services which allow you to write your own Regex, including Import.io (you can also specify your own XPaths), Diffbot, Mozenda, Cloudscrape and ParseHub.
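To give you a feel for what a hand-written Regex does, here is a small Python sketch that pulls dollar amounts out of a messy string. The sample text and the `extract_prices` helper are invented for this example, and the pattern is only one reasonable way to match a price, so treat it as a starting point rather than a rule.

```python
import re

# Hypothetical text, as a point-and-click tool might capture it from a product page.
raw = "Now only $1,299.99 (was $1,499.00) - free shipping"

# A "$" followed by digits (commas allowed as thousands separators)
# and an optional decimal part. The parentheses capture the number only.
PRICE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def extract_prices(text):
    """Return every dollar amount found in text as a float."""
    return [float(match.replace(",", "")) for match in PRICE.findall(text)]


prices = extract_prices(raw)
print(prices)
```

The same pattern works in most Regex flavors, so you could try pasting it into the Regex field of whichever tool you end up using.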
Getting data from sites with infinite scroll is tricky. Kimono had this feature and then removed it in June 2015. We put it on here, however, because it is one of the most requested features for any scraping tool.
Cloudscrape, Apifier, ParseHub and data-miner.io all have some function to deal with infinite scroll.
If you’re an Import.io user, you can use this workaround to deal with some infinite scroll sites.
A lot of websites follow a very specific URL pattern. In these cases, you can generate a specific list of URLs to extract data from. In addition to being able to run these URLs through your API, kimono also had a way of generating them.
Import.io, Diffbot, Cloudscrape, ParseHub, FiveFilters and data-miner.io all allow you to paste a load of URLs into your API. However, only Apifier allows you to both generate the URLs and run them through your API.
If your preferred scraper doesn’t have a way to generate a list, you can also use Google Sheets or Excel to achieve the same effect.
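If you’d rather script it than use a spreadsheet, a few lines of Python will do the same job. This is just a sketch: `generate_urls` is a helper written for this example, and the example.com pattern is a placeholder for whatever pattern your target site follows.

```python
def generate_urls(template, start, stop, step=1):
    """Fill the {page} placeholder in template for each number from start to stop (inclusive)."""
    return [template.format(page=n) for n in range(start, stop + 1, step)]


# Hypothetical URL pattern; swap in the pattern of the site you are scraping.
urls = generate_urls("https://example.com/products?page={page}", 1, 5)
for url in urls:
    print(url)
```

You can then paste the printed list into any of the tools above that accept bulk URLs.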
Source URLs for one API from another
One scraping strategy is to scrape a specific set of URLs that you source from another kimono API – this is useful for scenarios where the links to a page you want to crawl are located in a central place (e.g., an overview page of products, people, real estate etc), but the actual data you want is on another page (such as a product page).
At Import.io, we call this functionality Chained APIs and it’s a lot more efficient than trying to crawl an entire website. Mozenda, Cloudscrape and Trooclick API also offer similar functionality.
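To make the idea concrete, here is a rough Python sketch of the first half of a chain: collecting the detail-page links from an overview page. It uses only the standard library, and the HTML snippet, the `product` class name and the example.com address are all assumptions made up for the example. It is not how Chained APIs (or anyone else’s version) work internally.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and self.css_class in classes and attributes.get("href"):
            self.links.append(attributes["href"])


def source_urls(overview_html, base_url, css_class):
    """Return absolute URLs for the matching links on an overview page."""
    parser = LinkCollector(css_class)
    parser.feed(overview_html)
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical overview page: two product links and one navigation link.
overview = """
<ul>
  <li><a class="product" href="/item/1">Widget</a></li>
  <li><a class="product" href="/item/2">Gadget</a></li>
  <li><a class="nav" href="/about">About</a></li>
</ul>
"""

detail_urls = source_urls(overview, "https://example.com/catalog", "product")
print(detail_urls)
```

The second half of the chain is simply running your product-page extractor over each URL in `detail_urls`, which is why this approach touches far fewer pages than a full crawl.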
Cloning an API allows you to make an exact copy of another API. This is especially useful if you want to try another configuration without potentially losing all your previous work, or if you want to make a copy of an API someone else has made.
Import.io also allows you to clone APIs, only we call it “Duplicate”.
Kimono also allowed you to run your APIs on a schedule (hourly, daily, weekly or monthly), which is handy if your data changes regularly. Weekly runs were limited to 10,000 pages and daily runs to 1,000 pages. Even so, it is much nicer not to have to log in once a day and hit “run”.
Diffbot, Mozenda, Cloudscrape and Scrapinghub all offer the same ability, each with varying limits to the frequency at which you can run your API. Other tools require you to refresh the data manually or write a Python script to do it for you.
Want to know when your data changes or there is new data? With kimono, you could set up an email alert, which it would send you anytime it detected new or different data.
ParseHub appears to be the only service that offers email alerts for free, while Import.io and Connotate both offer them to enterprise customers.
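If your tool of choice doesn’t offer alerts, you can roll a basic change check yourself. The Python sketch below fingerprints each run’s data and compares it with the previous run. The `fingerprint` and `has_changed` helpers and the sample rows are invented for the example, and actually sending the email is left out.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable SHA-256 hash of the scraped rows."""
    # sort_keys makes the hash independent of the order of keys within a row.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint, rows):
    """True if rows differ from the data behind previous_fingerprint."""
    return fingerprint(rows) != previous_fingerprint


# Hypothetical results from two consecutive runs of the same scraper.
yesterday = [{"name": "Widget", "price": 9.99}]
today = [{"name": "Widget", "price": 8.99}]

saved = fingerprint(yesterday)
print(has_changed(saved, yesterday))  # False
print(has_changed(saved, today))      # True
```

Store the fingerprint between runs (a text file is enough) and trigger your alert whenever `has_changed` returns `True`.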
Google Sheets integration
Once you’ve got some data kicking around in kimono, you want to be able to use it. Kimono had a handy add-on for Google Spreadsheets that automatically let you import data fetched by your kimono APIs directly into Google spreadsheets. Once imported, the cells containing the data remain linked to the kimono APIs – this lets you refresh your data with just one click.
While they’re not specifically add-ons for sheets, both import.io (see it here) and Cloudscrape have similar functionality that allows you to import your API into a Google Sheet and refresh it from within the sheet.
In addition, if any of you are on the Blockspring train (which you should be), you can use their sheets add-on to import data from loads of different websites (they have connections to lots of different APIs) including an integration with Import.io’s Magic API.
If you don’t want to use your data in Google Sheets or pipe it directly into your app via the API (like this), kimono offered the ability to download your data as JSON, CSV or RSS. We’ve listed each download method and the services that support it as part of their free version in the table below.
Import.io, Mozenda, Cloudscrape, Parsehub, dataminer
Import.io, Diffbot, Cloudscrape, Semantics3, Parsehub, TrooclickAPI
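And if a service only gives you one of these formats, converting between them is usually straightforward. As a rough illustration, this Python sketch turns a JSON array of records into CSV using only the standard library. The `json_to_csv` helper and the sample data are made up for the example, and it assumes a flat list of objects, which real exports may not be.

```python
import csv
import io
import json


def json_to_csv(json_text):
    """Convert a JSON array of flat objects into CSV text."""
    rows = json.loads(json_text)

    # Build the header from every key seen, in order of first appearance.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue()


# Hypothetical download: the second record has an extra "stock" field.
sample = '[{"name": "Widget", "price": 9.99}, {"name": "Gadget", "price": 19.5, "stock": 3}]'
print(json_to_csv(sample))
```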
Features are only part of what makes a tool great. In the end, the tool you choose will come down to how quick and easy it is to use and how much the workflow makes sense to you. All of the tools we’ve mentioned include a free option or a free trial, so once you’ve narrowed it down to 2 or 3 that have all the features you want, you can give them a whirl without worrying about investing too heavily in a system you might not like.
Obviously, we’re a bit biased. At Import.io, we’ve worked hard to create a product that is both easy to use and super powerful. Our features allow you to extract tons of data in minutes (or seconds) without having to code a thing. But you don’t have to take our word for it…
“What is different about Import.io is its very intuitive, visual method for extracting web data, whether it is on a single web page or multiple pages.” – Gregory Piatetsky (@kdnuggets)