This is a recap of our most recent webinar where we looked at advanced crawling techniques using import.io. Follow us down the garden XPath as we check out some features for confident users looking to get the most out of their crawlers.
This webinar is all about our advanced features. If you’re new to import, I recommend you watch this Getting Started webinar first, because we’ll be skipping some of the basics to get down into the real meat of what import can do. Advanced crawling, XPaths, URL templates – this webinar’s got all that and more.
Advanced Crawling Techniques
Once you’ve trained five (or more) pages for your crawler and uploaded it, you’ll arrive at the crawler settings page. Everything on this page is set to a default, so if you’re not comfortable changing the crawler’s settings – don’t worry. You should be able to run the crawler as is.
If, however, you believe in getting the most efficient crawl possible, then toggle over to the advanced mode (image right).
The most important thing to look at is your URL pattern in the “where to extract data” box. This determines whether or not the pages you want will be converted into data. These patterns are generated based on the URLs of the pages that you trained your crawler on. By default we make this pretty generic – because we want you to get maximum data – but if you only want data from a very specific category of a website, there’s no sense waiting for the crawler to go through the whole website and then having to filter all your data!
Now, how you modify your URL pattern all depends on what is in the URL of the pages you want. In this case, I only want to crawl ASOS for jeans (not everything). By looking at my training URLs I can see that the word “jeans” appears in every URL. That means I can change the URL template from this…
This means that the word “jeans” always has to be present in a URL for the crawler to extract data from it, so we’ll only get data for jeans on the ASOS website.
Another top tip for making your crawler super efficient is to paste the URL template you’ve just generated into the “where to crawl” box. This means that the crawler won’t even try to go to URLs that don’t match your URL pattern.
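import.io does this matching for you, but the underlying idea is just URL filtering. Here’s a minimal sketch in Python – the pattern and URLs are made up for illustration, and import.io’s own pattern syntax differs from a regex:

```python
import re

# Hypothetical rule: only extract data from ASOS pages whose URL
# contains "jeans". A regex stands in for import.io's pattern syntax.
PATTERN = re.compile(r"jeans", re.IGNORECASE)

def should_extract(url):
    """Return True if the crawler should extract data from this URL."""
    return bool(PATTERN.search(url))

urls = [
    "http://www.asos.com/asos/jeans-in-skinny-fit/prod/123",
    "http://www.asos.com/asos/cotton-shirt/prod/456",
    "http://www.asos.com/women/Jeans/cat/789",
]

# Keep only the URLs worth extracting from.
matching = [u for u in urls if should_extract(u)]
```

Applying the same rule to the “where to crawl” box is the equivalent of filtering *before* the crawler even visits a page, rather than after.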
Troubleshooting your Crawler
Sometimes, the crawler doesn’t run as planned. You might end up with pages that are blocked or unavailable. To find out where your crawler is going wrong, you can use what’s known as a log file.
If you scroll down to the bottom of the advanced crawler window, you will see the option to save a log file:
A log file is essentially just a file listing all of the URLs that the crawler has visited and which ones it has extracted data from. More importantly, it will also list all of the URLs that failed. By looking at these URLs you can figure out what it is about those pages that is confusing the crawler. You may need to do some extra training or readjust your URL pattern.
Solving Common Problems with XPaths
When you’re in data extraction mode – i.e. training your rows and columns – navigation is locked, which means that you can’t click on things in the page. This can be an issue if the data you want is hidden somewhere on the page. For example, on this ASOS page, the sizes are locked behind a dropdown list.
Getting this information is generally quite easy, because you can almost always use the raw XPath without any manipulation. To get it, open the page up in Chrome (yes you can use other browsers – I just like Chrome’s layout best), right click on the data you want and click “inspect element”.
You’ll see a box pop up at the bottom of your screen with a lot of scary-looking code in it. One of these lines of code should be highlighted in blue. Right click on this line and click “copy XPath”.
Then simply paste that XPath into the XPath override box in the advanced column settings.
You may have to play around with this a little bit depending on your website to find just the right line to get the XPath from.
Data that is not consistent across pages
The best way to explain this is through an example. On this page for the Manchester airport you can see there is a tick mark next to terminal 2…
But on this page the tick is in terminal 3…
That’s great if you’re a person and you’re looking at the site, but not so great if you’re extracting data – because all you’ll be able to get is the check mark image. We need a reference point on the page and from there we can navigate back to the correct piece of data.
In this instance, our reference point is the gif (the check mark). We are going to use an XPath to get to the gif and then navigate BACK to the terminal name.
If we right click on the gif and click on inspect element, we can see that it is an image with a specific file name:
Using an XPath, we can navigate to this filename by using:
This XPath will look everywhere in the HTML for an image with that specific filename.
We can then navigate up to its parent element using /.. so our XPath looks like this:
and then to the element before it using preceding-sibling::td (the td is because the data we need sits in a td of the HTML). Our final XPath is as follows:
What this means is that wherever in that chart the gif image appears, the program will always navigate to the part of the HTML before the gif (which is the terminal name), meaning that you get the location of each of these stores, no matter what the location is.
Check out the full archive of videos on our YouTube channel.