5 New Advanced Data Extraction Features To Try Out

You asked for more powerful data extraction features. We took your feedback and set our engineers loose on the challenge. We are excited to announce 5 brand-new advanced data extraction features that will help you get data out of more websites:

  • Disable CSS
  • Default Column Values
  • Advanced Regex Support
  • Required Column Values
  • Raw HTML Extraction

If you are just starting to use Import.io, our Build an Extractor tutorial will help you get started. Come back to this article when you need advanced features to help with your data extraction.

On June 27th at 9:00 am PST, we are hosting a webinar where we go over our newest advanced features. We will walk you through real-world examples and answer any questions about your project. Sign up today to reserve your spot. We are limiting this webinar to the first 50 people.

Using Yelp.com as an example, we will show you how to supercharge your data extraction.

Disable CSS

Sometimes when you come to extract data via point-and-click in the “Website view”, the website does not display exactly as you would like it to, and you are unable to select the data that you want to extract.

In this example, the location data is not visible in the Yelp header as we would expect. By simply turning CSS off, we can easily select the text that we are after and extract it.

Use this technique to extract data from complicated, CSS-heavy sites.

Default Column Values

Set Default Value allows you to replace blank cells with a pre-set value. In this example, we want Realtors with no reviews to display “0 Reviews”.

Apply this feature to normalize your missing data fields for a cleaner data set.
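
For readers who post-process their exports in code, the same idea can be sketched in Python. This is only an illustration of the concept, not how Import.io implements Set Default Value; the `apply_default` helper, the column names, and the sample rows are all hypothetical.

```python
def apply_default(rows, column, default):
    """Return new rows where a blank or missing `column` is replaced by `default`."""
    filled = []
    for row in rows:
        value = row.get(column)
        # Treat None, empty strings, and whitespace-only strings as blank cells.
        is_blank = value is None or str(value).strip() == ""
        filled.append({**row, column: default if is_blank else value})
    return filled


# Hypothetical extracted rows: two realtors have no review count.
realtors = [
    {"name": "Realtor A", "reviews": "12 Reviews"},
    {"name": "Realtor B", "reviews": ""},
    {"name": "Realtor C"},
]

filled = apply_default(realtors, "reviews", "0 Reviews")
for row in filled:
    print(row["name"], "|", row["reviews"])
```

The helper builds new rows instead of modifying the originals, so the raw extraction stays available for comparison.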

Advanced Regex Support

In this example, we want only the opening times of the restaurant. Using a regular expression allows us to remove the “Open Now” string nestled in amongst the opening hours. Additionally, notice how Regex breaks apart each day into separate lines. Double win!

(Warning! Regular expressions are very powerful but can be a little daunting for the uninitiated.)

Start using Regex by learning the basics from this amazing regular expression online resource.
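
To make the opening-hours example concrete, here is a minimal Python sketch of the kind of pattern involved. The sample text and the exact pattern are assumptions made for illustration; the real Yelp markup and the expression you enter in Import.io may look different.

```python
import re

# Hypothetical text as it might be extracted, with "Open Now" mixed in.
raw_hours = (
    "Mon 11:00 am - 10:00 pm Tue 11:00 am - 10:00 pm Open Now "
    "Wed 11:00 am - 11:00 pm"
)

# One match per day: a day abbreviation followed by an opening and a closing time.
DAY_ENTRY = re.compile(
    r"(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+"
    r"\d{1,2}:\d{2}\s*[ap]m\s*-\s*\d{1,2}:\d{2}\s*[ap]m"
)


def opening_hours(text):
    """Return one string per day, skipping anything that is not a day entry."""
    return DAY_ENTRY.findall(text)


hours = opening_hours(raw_hours)
print("\n".join(hours))
```

Because the pattern only matches day entries, the "Open Now" string is left out without having to be removed explicitly, and each match lands on its own line.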

Required Column Values

Sometimes you are only looking for a few key pieces of data.

In this example, we set the price column as “required” in order to only return rows that have prices. This means that we don’t have to spend time cleaning the data in the spreadsheet after download.

Pre-filter your data using the Required feature to minimize post-processing.
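
As a rough picture of what a required column does, the Python sketch below drops every row whose price is blank. It is an illustration of the filtering idea only, not Import.io's implementation; the `require_column` helper and the sample listings are hypothetical.

```python
def require_column(rows, column):
    """Keep only the rows that have a non-blank value in `column`."""
    kept = []
    for row in rows:
        value = row.get(column)
        # A row passes only if the value exists and is not just whitespace.
        if value is not None and str(value).strip() != "":
            kept.append(row)
    return kept


# Hypothetical extracted rows: only two of them have a price.
listings = [
    {"name": "Listing A", "price": "$25"},
    {"name": "Listing B", "price": ""},
    {"name": "Listing C"},
    {"name": "Listing D", "price": "$40"},
]

priced = require_column(listings, "price")
print([row["name"] for row in priced])
```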

Raw HTML Extraction

It is now possible to extract the underlying HTML from a section of a web page rather than only being able to extract the values that are displayed in the browser.

In this example, we want to extract a large block of text from a blog post along with its formatting. Extracting the raw HTML gives us the content together with the HTML markup that specifies how it is displayed: which text is bold, which is a heading, which is a link and what URL it points to, and so on.

Extract hidden data in the source code through HTML Extraction.
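
The difference between displayed values and raw HTML can be shown with a short Python sketch that uses only the standard library. The HTML fragment and the `TextAndLinks` class are made up for illustration and say nothing about how Import.io works internally.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect the visible text and the link URLs found in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Link targets live in the markup, not in the text a browser displays.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text_parts.append(data)


# Hypothetical raw HTML as it might be extracted from a blog post.
raw_html = (
    "<h2>Our story</h2>"
    '<p>Read the <b>full</b> post <a href="https://example.com/post">here</a>.</p>'
)

parser = TextAndLinks()
parser.feed(raw_html)
parser.close()

visible_text = "".join(parser.text_parts)
print(visible_text)   # the text only, with the formatting lost
print(parser.links)   # URLs that are only present in the raw HTML
```

The text-only value loses the heading, the bold marker, and the link target, while the raw HTML keeps all three.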

We are here to help

Join us for our upcoming webinar on June 27th at 9:00 am PST. We are limiting the audience to the first 50 people who sign up. This is a great chance to ask our Import.io experts questions about your project.