XPath 101: 3 common XPaths for extracting data

XPaths are what make the import.io app go ’round. Without them, we wouldn’t be able to extract data, so needless to say, they’re pretty important. They matter because when you click on data to train the tool, our algorithms work out the corresponding XPath for that data behind the scenes. This post covers the three most common uses of XPaths in import.io; but first, what is an XPath anyway?

What is an XPath?

Websites are made up of HTML (HyperText Markup Language), a markup language that describes the structure and content of a webpage. In simple terms, an XPath is a path that directs a program to a particular part of the HTML. Think of it as directions on a map.
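To make the "directions on a map" idea concrete, here is a minimal sketch using Python's standard-library ElementTree, which understands a subset of XPath. The page markup, the "product" id and the "price" class are invented for illustration and do not come from any real site; real-world HTML is often not well-formed XML and would usually need a proper HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html>
  <body>
    <div id="product">
      <h1>Example Camera</h1>
      <span class="price">199.00</span>
    </div>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# Step-by-step directions: from the root, go to body, then div, then h1.
title = root.find("./body/div/h1")

# "Search anywhere" directions with a landmark: any span whose class is "price".
price = root.find(".//span[@class='price']")

print(title.text)
print(price.text)
```

The first path spells out every turn, while the second one jumps straight to a landmark (the class attribute), which tends to survive small layout changes better.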

Most of the time your point-and-click training will be enough to direct the app to the data you need. Sometimes though, you will need more specific directions to get to data which is hidden or a bit tricky to extract.

Last year we released a feature that allows you to override our automatic XPaths by inserting ones you create manually. These are the three most common reasons you might use XPaths within import.io…

Hidden data

Sometimes, when training the program, you can’t actually see the data that you need – and if you can’t see it, you can’t click it. It might be hidden in a dropdown menu, inside a popup, or even in the metadata.

Even though you can’t see it on the rendered page, the data is still present in the HTML, and that means you can use the Inspect Element tool in Chrome to get the corresponding XPath.
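As a rough sketch of what such an XPath can look like, the example below pulls a value out of the page metadata and the options out of a hidden dropdown. The markup, the "colour" id and the values are assumptions made up for this illustration, and the code uses Python's standard-library ElementTree, which only supports a subset of XPath (so the attribute is read with .get() rather than a trailing /@content step).

```python
import xml.etree.ElementTree as ET

# Hypothetical page: neither the meta description nor the hidden
# dropdown would be visible (or clickable) in the rendered page.
PAGE = """<html>
  <head>
    <meta name="description" content="Compact camera with 20x zoom" />
  </head>
  <body>
    <select id="colour" style="display:none">
      <option value="blk">Black</option>
      <option value="slv">Silver</option>
    </select>
  </body>
</html>"""

root = ET.fromstring(PAGE)

# In a full XPath 1.0 engine: //meta[@name='description']/@content
meta = root.find(".//meta[@name='description']")
description = meta.get("content")

# Every option inside the hidden dropdown: //select[@id='colour']/option
colours = [opt.text for opt in root.findall(".//select[@id='colour']/option")]

print(description)
print(colours)
```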

Moving data

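Sometimes your data moves around the page, so the same field sits in a different position from one page to the next.

Since our algorithms can’t read the phrase “Memory Card Reader”, you’ll need to use a “following-sibling” XPath axis to anchor the extraction to that phrase and pick up the value that sits next to it.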
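The sketch below shows the idea on two invented spec tables where the “Memory Card Reader” row sits in a different position on each page. The table markup and values are assumptions for illustration only. Python's standard-library ElementTree does not implement the following-sibling axis, so the helper value_after_label (a hypothetical name) finds the label cell and steps to the next cell by hand; a full XPath 1.0 engine would accept the expression in FULL_XPATH directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical spec tables: the same label is in a different row on each page.
PAGE_A = """<table>
  <tr><td>Screen Size</td><td>15.6 in</td></tr>
  <tr><td>Memory Card Reader</td><td>SD, SDHC</td></tr>
</table>"""

PAGE_B = """<table>
  <tr><td>Memory Card Reader</td><td>microSD</td></tr>
  <tr><td>Screen Size</td><td>13.3 in</td></tr>
</table>"""

# What you would write for a full XPath 1.0 engine (shown for reference only):
FULL_XPATH = "//td[text()='Memory Card Reader']/following-sibling::td[1]"

def value_after_label(html, label):
    """Emulate following-sibling::td[1] anchored on a cell with the given text."""
    root = ET.fromstring(html)
    # Anchor: the row that has a <td> child whose text equals the label.
    row = root.find(f".//tr[td='{label}']")
    if row is None:
        return None
    cells = list(row)
    for i, cell in enumerate(cells):
        if cell.tag == "td" and (cell.text or "").strip() == label:
            # Step to the first <td> that follows the label cell.
            for nxt in cells[i + 1:]:
                if nxt.tag == "td":
                    return nxt.text
    return None

reader_a = value_after_label(PAGE_A, "Memory Card Reader")
reader_b = value_after_label(PAGE_B, "Memory Card Reader")
print(reader_a, "|", reader_b)
```

Because the lookup is anchored to the label text rather than to a row number, it returns the right value on both pages even though the row has moved.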

Outside your results

When you’re training data for a page with multiple results, our algorithm tries to generate XPath rows which each contain one full example of data – you can see this by the blue highlighting. But sometimes you may want data that is on the page, but not within the blue rows.

For example, each class represents a row of data, but the date of the class is outside your row results.

By anchoring your XPath to something that is constant in the HTML, you can navigate from it to the data point you want. In this instance we found the phrase “classname” in the HTML and navigated up to the day itself, giving us the day each class is on.
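Here is one way the "navigate up from a constant anchor" idea might look in code. The timetable markup below is invented, borrowing only the “classname” phrase from the example above, and the "day" class and h3 heading are assumptions. Python's standard-library ElementTree has no ancestor axis, so this sketch builds a child-to-parent map and walks upward by hand; in a full XPath 1.0 engine the relative expression in FULL_XPATH would do the same job from each row.

```python
import xml.etree.ElementTree as ET

# Hypothetical timetable: each <li class="classname"> is one row of results,
# but the day lives outside the row, in a heading further up the tree.
PAGE = """<div id="timetable">
  <div class="day">
    <h3>Monday</h3>
    <ul>
      <li class="classname">Yoga</li>
      <li class="classname">Spin</li>
    </ul>
  </div>
  <div class="day">
    <h3>Tuesday</h3>
    <ul>
      <li class="classname">Pilates</li>
    </ul>
  </div>
</div>"""

# Relative to each row, for a full XPath 1.0 engine (shown for reference only):
FULL_XPATH = "ancestor::div[@class='day']/h3"

root = ET.fromstring(PAGE)

# ElementTree elements do not know their parent, so record it ourselves.
parent_of = {child: parent for parent in root.iter() for child in parent}

def day_for(row):
    """Walk up from a row to the enclosing day block and read its heading."""
    node = row
    while node in parent_of:
        node = parent_of[node]
        if node.tag == "div" and node.get("class") == "day":
            heading = node.find("h3")
            return heading.text if heading is not None else None
    return None

rows = root.findall(".//li[@class='classname']")
schedule = [(row.text, day_for(row)) for row in rows]
print(schedule)
```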

Mastering XPaths Yourself

XPaths are particularly tricky because each one is specific to the website you are extracting data from. It’s helpful to remember that XPaths are directions on a map. You need to find a reference point that is always constant and navigate from there.

The best way to learn XPaths is by doing the w3schools tutorial.
