Hey guys, in my experience as a web scraping developer I have come across so many misconceptions about web scraping. So I thought it would be valuable to name and explain the biggest misunderstandings for you. Read the article or watch the video, then let me know what else you would add to the list!
6 Misunderstandings About Web Scraping
As web scraping is becoming more and more popular, I think we need to get things straight. After a little research on the internet and considering the questions I often get asked, I’ve found that these six misconceptions about web scraping are the most common. If you are totally new to web scraping or you are considering using it, the following should be helpful for you.
Web scraping is illegal
Let’s start with the biggest BS around web scraping. Web scraping is just like any other tool in the world: you can use it for good stuff and you can use it for bad stuff. Web scraping itself is not illegal. With that said, you should be super careful when scraping the web because it does matter how you use the scraped data. For example, scraping someone else’s content and then simply republishing it can probably get you in trouble. Importantly, in this case the scraping itself was okay. The problem is that you are stealing someone else’s work, which is not cool. Many times when web scraping is associated with legal issues, the real bad-boy move is what you do with the scraped data afterwards.
A great example of when web scraping can be illegal is when you try to scrape nonpublic data. Nonpublic data is anything that is not reachable by everyone on the web; maybe you have to log in to see it. In this case scraping is probably unethical and, depending on the context, possibly illegal. It also matters how nice you are technically when scraping a website, for example whether you follow the site’s robots.txt and keep your request rate low. To learn more, I urge you to check out the most frequent legal issues associated with web scraping!
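To make "being nice technically" a bit more concrete, here is a minimal sketch in Python using the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the "my-scraper" user agent are all made up for illustration; a real scraper would download the site's actual robots.txt instead. It only shows how to check whether a path is allowed and how long the site asks bots to wait between requests.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration; a real scraper would download the
# target site's own file instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "my-scraper"  # hypothetical bot name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, url: str) -> bool:
    """True if the rules allow our user agent to fetch this URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(parser: RobotFileParser, default: float = 1.0) -> float:
    """Seconds to wait between requests: the site's Crawl-delay if it sets one."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


parser = build_parser(ROBOTS_TXT)
for url in ("https://example.com/products/1", "https://example.com/private/report"):
    print(url, "allowed" if may_fetch(parser, url) else "disallowed")
print("wait between requests:", polite_delay(parser), "seconds")
```

In a real scraper you would call something like may_fetch before every request and sleep for polite_delay seconds between requests.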
You need to code
Some people think that you need to be a pro programmer to scrape data from websites. Actually there are many software solutions, so you don’t necessarily need to write any code. If you google “visual web scraper” or “web scraping software” you will find many possible solutions for your problem without coding.
Also keep in mind that although scraping a website without coding is great, it’s not applicable in many cases. If you have to further process the data (cleaning, deduplication, etc.), web scraping software can’t really help you. I suggest using software like Portia only when you need to scrape basic websites and don’t need further processing. In that case scraping software is the way to go.
Web scraping is cheap
Most people and businesses don’t want to deal with web scraping themselves. Quite often they hire a freelancer or a company that provides web scraping solutions. Now, just to get this straight, in most cases web scraping is cheap relative to the ROI it provides. At the same time, you should know that hiring a full-fledged web scraping service is gonna cost you money. If you do some quick research on how much different vendors and freelancers charge for web scraping services, you will find a huge difference. That’s because some companies and freelancers with higher rates do provide better services.
Also, you should figure out how complex your project is. For large, long-term projects I suggest hiring a vendor because they usually guarantee you’ll get your data every time, on time. Some web scraping companies also provide useful additional services, like further processing the data to fit into your system. On the other hand, if you have a basic one-time web scraping job, it might be better to hire a freelancer. It’s true almost every time that a one-time scraping job costs just a little money if you hire a freelancer, compared to a vendor.
The web scraper works forever
When building a scraper, we want it to work seamlessly forever and just deliver the data we need. Unfortunately it’s not that easy. The biggest challenge in web scraping is that websites are constantly changing. This is the nature of the current state of the internet. To keep up, we have to keep adjusting our scraper so we can trust that it delivers reliable and up-to-date data. Now, if you just set up your scraper with a freelancer dude, it’s gonna be a headache when the scraper breaks (and unfortunately it will, sooner or later). Either you need to find another freelancer to make it work again or, if you’re lucky, the one who built the scraper is available at the moment.
You’re in a good position if you’re using a web scraping service because the vendor will take care of all the problems and you will not even notice anything. The data keeps flowing as usual. So just keep in mind that if you need data continuously flowing into your system, you’ll need to watch your scraper and adjust it when it breaks.
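Watching your scraper can start with something very simple. Here is a minimal sketch, assuming each scraped item is a Python dict and that every item should have a "title" and a "price" (both field names are made up for illustration, as is the 20% threshold). If too many records from a run are missing required fields, the site's layout has probably changed and the scraper needs attention.

```python
REQUIRED_FIELDS = ("title", "price")  # hypothetical fields we expect in every record


def broken_ratio(records, required=REQUIRED_FIELDS):
    """Share of records where at least one required field is missing or empty."""
    if not records:
        return 1.0  # an empty result is treated as fully broken
    broken = sum(1 for rec in records if any(not rec.get(f) for f in required))
    return broken / len(records)


def scraper_looks_broken(records, threshold=0.2):
    """Flag a run when more than `threshold` of its records are incomplete."""
    return broken_ratio(records) > threshold


# Made-up sample runs: one before and one after an imaginary site redesign.
healthy_run = [{"title": "Laptop", "price": "$999"}, {"title": "Mouse", "price": "$15"}]
after_redesign = [{"title": "Laptop", "price": ""}, {"title": "Mouse", "price": None}]

print("healthy run broken?", scraper_looks_broken(healthy_run))
print("after redesign broken?", scraper_looks_broken(after_redesign))
```

A check like this could run after every scrape and send you an alert, so you find out about a broken scraper before bad data reaches your system.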
Web scraping is all about selecting data from the HTML
This one is a myth often told by programmers who have never built a real-world web scraper. I’ve heard this one so many times. Like “It’s no big deal bro, just write a regex and fetch the data from the HTML and you’re done.” Sure, web scraping is associated with fetching data from a website, but what really matters is how you can use that data to drive your business. Web scraping is much more than getting raw data out of a website.
Web scraping, when done correctly, involves cleaning messy data (because 99% of the time raw data from the web is plain unusable), deduplication, all sorts of filtering, integration with your current system, and maybe analytics and visualization. It’s complex. Now you might say that, hey, at the end of the day you just want to see the raw data and you don’t need any of the stuff just mentioned. That’s cool. But there’s a chance you’re leaving a massive amount of value on the table by not processing the data further.
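To give you a feel for what cleaning, filtering and deduplication can look like, here is a small sketch in Python. The records, the field names and the rules are all made up for illustration, and it assumes US-style prices like "$1,299.00"; a real pipeline would need rules that fit your own data.

```python
import re
from decimal import Decimal, InvalidOperation

# Made-up raw records, roughly as messy as scraped data tends to be.
RAW_RECORDS = [
    {"name": "  Laptop   Pro ", "price": "$1,299.00"},
    {"name": "laptop pro", "price": "1299"},   # duplicate of the first one
    {"name": "Mouse", "price": "N/A"},          # no usable price
    {"name": "Keyboard", "price": " $49.90 "},
]


def clean_name(raw):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(raw.split())


def parse_price(raw):
    """Turn a US-style price string into a Decimal, or None if it is not a number."""
    digits = re.sub(r"[^\d.]", "", raw)
    try:
        return Decimal(digits)
    except InvalidOperation:
        return None


def clean_records(raw_records):
    """Clean every record, drop unusable ones, and keep the first of each duplicate."""
    seen = set()
    cleaned = []
    for raw in raw_records:
        name = clean_name(raw["name"])
        price = parse_price(raw["price"])
        if not name or price is None:
            continue  # filtering: skip records we cannot use
        key = (name.casefold(), price)  # deduplication key ignores letter case
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "price": price})
    return cleaned


for record in clean_records(RAW_RECORDS):
    print(record["name"], record["price"])
```

Out of four raw records only two survive here: the Laptop Pro and the Keyboard. The duplicate and the record without a price are dropped.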
Any website can be scraped
Website owners can make it really hard for bots to scrape data. There’s a bunch of ways to harden a website against scraping, although in reality there’s no technical shield that could stop a full-fledged scraper from fetching data.
That being said, if the website has lots of scraper traps, captchas and other layers of defense against bots, then surely web scraping is not welcome there. In that case, you should think twice before scraping the website. Technically it’s possible to fight all types of bot defenses, but do you really want to? If the website proactively steps up against scrapers, then it’s not a good idea to scrape it anyway.
Of course there are more things I could mention. Today I just wanted to tell you about the ones I get asked about the most and that I feel are the most crucial when it comes to leveraging web scraping. Comment below, I would be glad to hear your thoughts!