Q. What do a website and an iceberg have in common?
A. There is a titanic amount of important stuff lurking just below the surface of both.
Web designers and developers will know what I mean. Underneath the pretty facade of a website’s interface are thousands of lines of markup – an absolute gold mine for us data lovers.
…and import.io of course are here to show you how to access it.
What types of data are hidden in web pages?
- SEO titles
- Meta data
- Attribute based data
- Data in link values
If you right-click on a web page and click ‘view source’, you will see the code that generates the page you are on. Everything in that code is available to you: if you can see it there, you can have it.
1. SEO titles
Almost every page has a <title> tag, and the information it contains is valuable. Most SEOs will tell you it matters because it is one of the primary places where websites put the keywords they want Google to rank them for. Knowing this can be very useful.
I almost always include the title tag content in my APIs, because it helps give additional context to the data.
Here is the XPath: //title
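If you want to try the same lookup outside of import.io, here is a minimal sketch in Python using only the standard library. The page source and the TitleParser class are made up for illustration, and the parser does by hand what the //title XPath does in the tool.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only start collecting for the first <title> we meet.
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


# Made-up page source, standing in for whatever 'view source' shows you.
page_source = (
    "<html><head><title>Cheap Pizza Delivery | Example Takeaway</title></head>"
    "<body><h1>Menu</h1></body></html>"
)

parser = TitleParser()
parser.feed(page_source)
page_title = parser.title.strip()
print(page_title)
```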
2. Meta tags
Meta tags give search engines, and other software, information about the page: items like the author, their email, the publish date, the language, the revisit-after period and distribution info. There are many uses for this type of data.
Social Meta Tags
Facebook, LinkedIn and Twitter all define metadata elements that a site can provide to help them understand its pages. This data is relevant because it shows you how the company wants to be perceived on social media.
Pro tip: Open Graph is a great source of the types of meta tags websites use.
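To see what a page is declaring about itself, you can collect every meta tag into a dictionary. This is a rough sketch using Python's standard library; the sample markup, the author name and the MetaParser class are all invented for the example, and real pages will carry different keys.

```python
from html.parser import HTMLParser


class MetaParser(HTMLParser):
    """Collects <meta> tags into a dict keyed by name or property (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Classic meta tags use name="...", Open Graph tags use property="...".
        key = attributes.get("name") or attributes.get("property")
        if key and "content" in attributes:
            self.meta[key] = attributes["content"]


# Made-up <head> section for illustration.
page_source = """
<head>
  <meta name="author" content="Jane Doe">
  <meta name="language" content="en">
  <meta property="og:title" content="Example Takeaway">
  <meta name="twitter:card" content="summary">
</head>
"""

parser = MetaParser()
parser.feed(page_source)

# Keep only the social tags (Open Graph and Twitter prefixes).
social = {k: v for k, v in parser.meta.items() if k.startswith(("og:", "twitter:"))}
print(parser.meta)
print(social)
```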
3. Star Ratings
A lot of the time a star rating is visible on the page but is only a picture, or even worse an empty div container with a background image set via CSS (nightmare!).
However, the true data hunter knows that there is more than meets the eye. Most of the time, if you right-click on the star rating and click ‘inspect element’ (Chrome), it will show you the HTML for the stars. It normally looks something like this:
You can see, highlighted in yellow, the telltale attributes that we can extract data from. Even though there is no text on the page, we can clearly see that the star rating is embedded in the HTML tags.
Most of the time, our Magic and/or data suggest tools will be able to auto extract these for you, but if you need to get them manually…
Here is an XPath for that: //pathtohtmltag/a/@title
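As a rough illustration of pulling a rating out of an attribute, here is a sketch in Python. The markup, the star-rating class name and the "4.5 stars" title are assumptions made up for the example, not taken from any real site. Python's built-in ElementTree only understands a subset of XPath and needs well-formed markup, so the attribute is read with .get() rather than /@title.

```python
import re
import xml.etree.ElementTree as ET

# Made-up, well-formed markup: the stars are a CSS background image,
# and the only machine-readable value sits in the link's title attribute.
snippet = """
<body>
  <div class="star-rating">
    <a href="/reviews" title="4.5 stars"><span class="stars stars-45"></span></a>
  </div>
</body>
""".strip()

root = ET.fromstring(snippet)

# Rough equivalent of //div[@class='star-rating']/a/@title
link = root.find(".//div[@class='star-rating']/a")
title_text = link.get("title")

# Pull the number out of the attribute text.
rating = float(re.search(r"\d+(?:\.\d+)?", title_text).group(0))
print(title_text, rating)
```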
4. GPS coordinates
My personal favorite is getting the latitude and longitude from within a Google map. There are lots of different ways to embed a map, so you may have to play around to get yours working… but here is a common example.
Find a page with a map on it, like this hungry house page.
Then inspect element to find the hidden code that contains the data:
You have to use your eyes to find a latitude and longitude in the HTML. In this case you can see it buried in a link.
Because we hold the power of XPath, we can get that link into a column with this XPath: //a[contains(@href,'google')]/@href
And from there we just need to remove the extraneous bits of the href/link using a simple piece of RegEx to alter what is in the column: (?<=ll=)(.*?)(?=&)
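Here is how that regex behaves in Python, as a small sketch. The href is a made-up example of a Google Maps link, and the coordinates are arbitrary. Note that the pattern relies on the ll= value being followed by an &, so a link where ll is the last parameter would need a slightly different expression.

```python
import re

# Made-up link value, standing in for what the XPath above returns.
href = "https://maps.google.com/maps?ll=51.5074,-0.1278&z=15"

# Lookbehind for 'll=', lazily capture everything up to the next '&'.
match = re.search(r"(?<=ll=)(.*?)(?=&)", href)

# The match is 'lat,lng'; split it into two numbers.
latitude, longitude = (float(part) for part in match.group(0).split(","))
print(latitude, longitude)
```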
You could also have grabbed the latitude and longitude from the script, which is also on the page:
Just use XPath to grab the content of the script tag like this: //script
Regex out the bit you want like this: LatLng\((.*?)\)
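A quick sketch of the script approach in Python follows. The script text is invented for the example and assumes the page builds its map with a google.maps.LatLng(...) call. The parentheses around the capture group are escaped so that they match the literal brackets of the call.

```python
import re

# Made-up script content, standing in for what //script returns.
script_text = """
var position = new google.maps.LatLng(51.5074, -0.1278);
var map = new google.maps.Map(document.getElementById("map"), {center: position});
"""

# Escaped parentheses match the literal brackets; the inner group captures the arguments.
match = re.search(r"LatLng\((.*?)\)", script_text)

# float() tolerates the space left after the comma.
latitude, longitude = (float(part) for part in match.group(1).split(","))
print(latitude, longitude)
```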
Job done (again)!
Give it a try
You can see there is a whole wealth of data hidden underneath the surface of the visible webpage. Hopefully, by showing you some examples, and giving you some bits of code to play with, we’ve given you everything you need to take the plunge into the water to explore a little of what lies beneath.