Originally posted on February 25th, 2016
If you’re one of the 125k KimonoLabs users who got this message last week…
“After almost 2 years of building and growing kimono, we couldn’t be happier to announce that the kimono team is joining Palantir.”
…you’re probably wondering what to do next.
KimonoLabs users are more than likely still in a state of shock at the news. There’s no denying that Kimono was a useful service with some great features, and if you’ve come to rely on any of them for your data collection, finding and learning a new tool can seem daunting. To help you sort through the field, we’ve broken down KimonoLabs’ web scraping features and listed the alternatives you’ll want to consider.
A quick disclaimer. Yes, we are (or I guess were) a direct competitor to KimonoLabs’ free web scraping product.
Right, let’s get started.
Cloud hosting and Web-based
A major benefit of using any web scraping service is that you don’t have to host anything yourself. Everything is done in the cloud. All of the website page views, data normalization, and transformation are handled on someone else’s servers. One of the nice things about kimono was that everything worked in-browser.
Import.io is 100% SaaS. We store your data in the Import.io cloud, accessible using a web browser on any device. All you need is a web connection and you can get to work.
If your data is spread across multiple web pages in the same structure (like a product list that goes on for several pages), pagination is the quickest way to access the data from all the subsequent pages. You just train your API on the first page, select the “next” button, and the API can then grab the same information from however many pages follow.
Import.io’s pagination features allow you to extract data that is contained on several web pages. When our pagination detector does not automatically detect multiple pages, you can simply show Import.io the next button to get more data. Or you can use the URL generator to look for useful patterns, such as page numbers or categories, and automatically generate all of the URLs you need in a matter of seconds. It’s easy to use and gets your team results quickly.
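If you were to do this by hand instead of with a point-and-click tool, following a “next” link looks roughly like the sketch below. This uses the requests and BeautifulSoup libraries, and the URL and CSS selectors are placeholders invented for illustration, not a real site.

```python
# Minimal sketch of following a "next" link across paginated listings.
# The URL and selectors are placeholders for illustration only.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = "https://example.com/products?page=1"
rows = []

while url:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")

    # Grab the same fields from every page.
    for item in soup.select(".product"):
        rows.append({
            "name": item.select_one(".name").get_text(strip=True),
            "price": item.select_one(".price").get_text(strip=True),
        })

    # Follow the "next" button until it disappears.
    next_link = soup.select_one("a.next")
    url = urljoin(url, next_link["href"]) if next_link else None
```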
Writing your own regex allows you to be more specific about the data you extract. Point-and-click extraction tools like kimono work by trying to generate the XPath and regex required to access the data. However, depending on how complex the data or website is, the tool may not always get it right, which is why being able to override the automatically generated XPath or regex is incredibly useful.
Import.io gives you even more choices in this area. With Import.io Extract, you can use both regex and XPath if you choose. But if you are more comfortable with Excel functions, you can use Import.io Transform to get only the data you need in each column. Transform allows you to clean, prepare, and wrangle the web data you have extracted using over 100 Excel-like functions and formulas. Once a transformation is created, it is applied every time your scheduled data extraction runs, leaving you with clean, ready-to-use data for your AI or machine learning project or for reporting. All of the data, both original and transformed, is stored in Import.io’s cloud-based service, available for further analysis.
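If you’ve never written one of these overrides yourself, here is a minimal sketch of pulling the same field two ways, once with an XPath expression and once with a regular expression. The HTML snippet, XPath, and pattern are invented for illustration.

```python
# Minimal sketch: extracting a price with XPath (via lxml) and with a regex.
# The HTML snippet, XPath expression, and pattern are illustrative only.
import re
from lxml import html

page = html.fromstring('<div class="product"><span class="price">$19.99</span></div>')

# XPath: target the element by its place and attributes in the document tree.
price_text = page.xpath('//span[@class="price"]/text()')[0]

# Regex: pull the numeric part out of the extracted text.
price_value = re.search(r"\$([\d.]+)", price_text).group(1)

print(price_text)    # "$19.99"
print(price_value)   # "19.99"
```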
A lot of websites follow a very predictable URL pattern. In these cases, you can generate a specific list of URLs to extract data from. In addition to letting you run these URLs through your API, kimono also had a way of generating them.
Import.io’s URL generator is the quickest way to generate multiple URLs by using the patterns found in the URLs. URL parameters include categories, search terms, and page numbers.
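The idea itself is simple: substitute page numbers, search terms, or categories into a URL template. As a rough sketch (the template below is made up, not a real site):

```python
# Minimal sketch of generating URLs from a pattern of categories and page numbers.
# The URL template is a placeholder for illustration only.
categories = ["laptops", "tablets", "phones"]
pages = range(1, 6)

urls = [
    f"https://example.com/{category}?page={page}"
    for category in categories
    for page in pages
]

print(len(urls))  # 15 URLs, one per category/page combination
```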
Source URLs for one API from another
One web scraping strategy is to scrape a specific set of URLs that you source from another kimono API. This is especially useful for scenarios where the links to the pages you want to crawl are located in a central place (e.g., an overview page of products, people, or real estate), but the actual data you want is on another page (such as a product page).
At Import.io, we call this scraping functionality chained extractors and it’s a lot more efficient than trying to crawl an entire website. With chained extractors, a list page can be linked with detail pages from each item on that list. For instance, a top level list has some data about each item, but when you click on each item you get a detail page with more data. Import.io allows you to pull all of the detail page data at the same time.
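In code, this list-page-to-detail-page pattern looks roughly like the two-step sketch below (requests and BeautifulSoup again; the URL and selectors are placeholders for illustration).

```python
# Minimal sketch of the list page -> detail page pattern behind chained extraction.
# The URL and selectors are placeholders for illustration only.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

list_url = "https://example.com/products"
list_page = BeautifulSoup(requests.get(list_url).text, "html.parser")

# Step 1: collect the detail-page links from the overview page.
detail_urls = [
    urljoin(list_url, a["href"]) for a in list_page.select(".product a.details")
]

# Step 2: visit each detail page and pull the richer record.
records = []
for detail_url in detail_urls:
    detail = BeautifulSoup(requests.get(detail_url).text, "html.parser")
    records.append({
        "url": detail_url,
        "name": detail.select_one("h1").get_text(strip=True),
        "description": detail.select_one(".description").get_text(strip=True),
    })
```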
Cloning an API allows you to make an exact copy of another API. This is especially useful if you want to try another configuration without risking your previous work, or if you want to copy an API someone else has made.
Import.io also allows you to clone APIs.
Kimono also allowed you to run your APIs on a schedule, whether hourly, daily, weekly, or monthly, which is handy if your data changes regularly. How you set it up was entirely up to you, although weekly runs were limited to 10,000 pages and daily runs to 1,000 pages. Even so, it was much nicer than having to log in once a day and hit “run”.
Import.io performs in much the same way by giving you exactly the data you want, when you want it. You can schedule simple extractions, chained extractors, interactive extractors, and reports to run hourly, daily, weekly, monthly, or on a custom schedule. We know you and your team are busy, so tailor everything to meet the demands of your schedule.
Want to know when your data changes or there is new data? With kimono, you could set up an email alert, which it would send you anytime it detected new or different data. It was a simple way to stay on top of everything.
With Import.io, you can customize your email notification preferences and get a direct link to your web data emailed to you. When used with change reports, you can be notified by email if anything changes on a website. Stay on top of all the changes without delay.
If you don’t want to use your data in Google Sheets or pipe it directly into your app via the API (like this), kimono offered the ability to download your data as JSON, CSV, or RSS.
Import.io provides JSON, CSV, and Excel downloads. You can also download images and files, as well as take a screenshot of each web page.
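Whichever service produced them, files in these formats load straight into your analysis tools. As a minimal sketch in Python (the file names are placeholders for whatever you export):

```python
# Minimal sketch of loading downloaded CSV and JSON extracts for analysis.
# The file names are placeholders for illustration only.
import csv
import json

with open("extraction.csv", newline="", encoding="utf-8") as f:
    csv_rows = list(csv.DictReader(f))

with open("extraction.json", encoding="utf-8") as f:
    json_rows = json.load(f)

print(len(csv_rows), len(json_rows))
```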
Features are only part of what makes a web scraping tool great. In the end, the software you choose will ultimately come down to how quick and easy it is to use and how much the workflow makes sense to you. In other words, does it fit the results you want and the way you work?
Obviously, we’re a bit biased. At Import.io, we’ve worked hard to create a product that is both easy to use and super powerful. Our features allow you to extract tons of data in minutes (or seconds) without having to code a thing. We think we’ve created an amazing tool that will meet all your team’s web scraping needs. But you don’t have to take our word for it…
“What is different about Import.io is its very intuitive, visual method for extracting web data, whether it is on a single web page or multiple pages.” – Gregory Piatetsky (@kdnuggets)