Mike Shilov, Founder of Scraping.pro
I asked Michael Shilov, creator of the popular Scraping.pro, to give us his thoughts on web scraping and the future of the structured web. Can the two co-exist?
Big Data has become a hot topic over the past year. What do you think the reason for this is?
I think this is obvious. It’s difficult to imagine today’s world without data. When I got involved in IT, a 10 MB hard drive seemed gigantic, and today hard drives capable of storing terabytes of data are standard! Besides, the largest “drive” today is the Internet, which contains an immeasurable amount of data and expands at a mind-blowing speed. We just need to learn to separate the wheat from the chaff, and that’s what big data technologies are all about.
Can you tell us a bit more about Scraping.pro?
Scraping.pro is a blog dedicated to the problem of extracting data from the Internet. I started it in 2012, and, as far as I know, it is the largest web scraping resource at the moment. Back in 2004, I wrote my first commercial scraper, but failed to make it a commercial success. There are a number of similar products on today’s market, and the goal of scraping.pro is not to promote a product or service of its own, but to analyze the market for such solutions and publish reviews of them.
Sounds great. Do you use web data in your everyday life?
Of course! I think we all start our mornings by reading some news online 🙂 I also have an IT business (unrelated to web scraping), and the web is its main source of data for marketing and decision-making. Besides, people often contact me about web data scraping, and most of their questions relate to their business in one way or another.
Do you have any tips & tricks for people who want to turn unstructured data from the Web into structured data?
The thing is, this is still a fairly complex task. Products vary from “low-level”, where you need to be familiar with things like regex, XPath, CSS, HTTP and such, to “high-level”, where all you need to do is click on the data you want to extract. The first type is usually more universal, but requires some technical skills. The second one works even for inexperienced users, but is often not efficient enough for solving more complex tasks. That’s why I truly appreciate the efforts made by import.io and similar services to find the golden mean.
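To give a rough idea of what the “low-level” end involves, here is a minimal sketch in Python using only the standard library. The sample HTML, the class names, and the helper names (`NameParser`, `extract_names`, `extract_prices`) are assumptions made up for illustration; a real scraper would first fetch the page over HTTP and would have to cope with far messier markup.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page; a real scraper would download this over HTTP.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">$9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">$24.50</span></li>
</ul>
"""


class NameParser(HTMLParser):
    """Collects the text of every <span class="name"> element."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs
        if tag == "span" and ("class", "name") in attrs:
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())


def extract_names(html):
    """Structure-aware extraction: walk the markup and pick out named spans."""
    parser = NameParser()
    parser.feed(html)
    return parser.names


def extract_prices(html):
    """Pattern-based extraction: a dollar sign, digits, a dot, two digits."""
    return [float(p) for p in re.findall(r"\$(\d+\.\d{2})", html)]


if __name__ == "__main__":
    print(extract_names(SAMPLE_HTML))
    print(extract_prices(SAMPLE_HTML))
```

Even this toy case shows the trade-off: the code is flexible, but you have to know the page structure and write a pattern for each field, which is exactly the work that “high-level” point-and-click tools try to hide.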
What do you think the future holds for the Structured Web and web data?
There is no doubt that connections between data on the Internet will grow (remember, it once started with the good old hypertext), and the speed of this process depends on how commercially profitable it will be. However, I don’t think that the problem of data scraping will ever go away. Even if all websites eventually become structurally interconnected, there will always be a need to untangle this huge knot 🙂
For more news on web scraping, follow scraping.pro on Twitter (@ExtractWebData).