Import.io User Guide

Chaining extractors tutorial


Many websites, such as Amazon, present search results as a list of products on one webpage, with each product linking to a separate webpage that contains its details. Retrieving the details for all the products requires two extractors. The first extractor captures a list of links (URLs) to the product detail pages. The second extractor takes the output of the first extractor as its input and collects data for each individual product. This method is known as chaining extractors, or the chain workflow.

This tutorial uses www.discogs.com to demonstrate how to extract product data using the chain workflow.

Step 1: Retrieving a list of products

Let’s build a links extractor.

The webpage https://www.discogs.com/search/?style_exact=House&style_exact=Disco contains a list of products.

To create a new extractor to capture a list of links to products on the webpage, perform the following steps:

Note: Alternatively, start over with an empty table, add a column, and train your extractor to locate the links.

  • Click Save in the upper right of the editor. The Save Extractor dialog box appears.
  • In the Extractor name box, enter a name for your extractor.
  • From the Schedule to run list, select Once. (There’s no need to run this tutorial extractor on a regular basis.)
  • Click Save and run. The first time you create a new extractor, you need to save and run the extractor to review a test run of your dataset. The extractor runs and Import.io returns to the dashboard Run history tab to show you the results of your test run.

Note: To see the completed Discogs links extractor, navigate to https://dash.import.io/27f7eb0f-ace2-451a-972c-a97e6d4e2032. To use it, click Duplicate on the extractor commands menu and create your own copy.

Step 2: Retrieving a full list of products

There are many, many pages of products available and we want more than just the first!

To add product links from additional pages, perform the following steps:

  • In a new browser window, navigate to https://www.discogs.com/search/?style_exact=House&style_exact=Disco.
  • Navigate to page 2, https://www.discogs.com/search/?style_exact=House&style_exact=Disco&page=2.
  • In the browser address bar, notice the page=2 parameter at the end of the URL. This parameter determines which page of products the browser displays. The Import.io URL generator uses this parameter to easily and quickly create a list of URLs containing as many pages as you desire.
  • Copy the entire URL to your browser clipboard.
  • Return to the browser window containing the Import.io dashboard.
  • On the dashboard, in the left-side navigation pane, select your extractor.
  • Click the Settings tab.
  • Click Show URL Generator.
  • Click Edit next to the URL box.
  • Enter the URL for page 2 https://www.discogs.com/search/?style_exact=House&style_exact=Disco&page=2.
  • Press Enter on your keyboard or click OK.

    The URL generator automatically finds the variable for the page= parameter. (If it doesn’t, you can add the parameter yourself by clicking, holding, and dragging your mouse over the 2.)

Let’s retrieve 1000 items. (Adjust the numbers to get more or fewer, as you desire.) There are 50 products per page, so we need 20 pages (1000 ÷ 50 = 20).

  • Next to Range of numbers, change the numbers to go from 1 to 20 (leave step set to 1).
  • Click Add to List. You now have 20 URLs in your list.
  • Click Save URLs.
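
The list the URL generator builds can be approximated outside Import.io. The following is a minimal Python sketch, not Import.io's own implementation: the function name page_urls and the constants are hypothetical, and it assumes the site keeps using the page= query parameter shown above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(template_url, first_page, last_page, step=1):
    """Return one URL per page number, replacing the `page` query parameter."""
    parts = urlsplit(template_url)
    # Keep every parameter except `page`. Repeated keys such as
    # `style_exact` survive because parse_qsl returns a list of pairs.
    base_params = [(k, v) for k, v in parse_qsl(parts.query) if k != "page"]
    urls = []
    for page in range(first_page, last_page + 1, step):
        query = urlencode(base_params + [("page", str(page))])
        urls.append(urlunsplit(parts._replace(query=query)))
    return urls


ITEMS_WANTED = 1000   # hypothetical target from this tutorial
ITEMS_PER_PAGE = 50   # products per search results page, per this tutorial
pages_needed = -(-ITEMS_WANTED // ITEMS_PER_PAGE)  # ceiling division

urls = page_urls(
    "https://www.discogs.com/search/?style_exact=House&style_exact=Disco&page=2",
    1,
    pages_needed,
)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Counting the generated URLs this way is a quick check that the range you entered covers the number of pages you intended.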

Step 3: Running the links extractor

Click Run URLs. Import.io runs the extractor on all 20 URLs, creating a list of links to 1000 product pages.

Step 4: Creating a details extractor

Each of the 1000 links generated in Step 3 points to a separate single-item page with details for that specific product. Let’s build a details extractor to extract details from single-item pages such as https://www.discogs.com/Lazydisco-More-Tigers-/release/8763230.

To create a new extractor to extract details from a single-item page, perform the following steps:

  • Open https://dash.import.io in a web browser.
  • Click New Extractor.
  • Enter the URL for the detail page https://www.discogs.com/Lazydisco-More-Tigers-/release/8763230.
  • Click Go. Import.io takes a moment to load and analyze the webpage, opens the editor, and displays the newly-created data table.
  • Click the Edit tab.
  • Click Start over with empty table.
  • Add and train columns using the Import.io point-and-click process. This example adds the following columns:
      • Artist
      • Album
      • Label
      • Format
      • Country
      • Genre
      • Style
  • In the editor commands bar, click the Advanced/Standard slider switch until the advanced options appear.
  • In the Rows dropdown list, select Single Row to guarantee your output returns as a single row of data.
  • Click Save in the upper right of the editor. The Save Extractor dialog box appears.
  • In the Extractor name box, enter a name for your extractor.
  • From the Schedule to run list, select Once. (There’s no need to run this tutorial extractor on a regular basis.)
  • Click Save and run. The first time you create a new extractor, you need to save and run the extractor to review a test run of your dataset. The extractor runs and Import.io returns to the dashboard Run history tab to show you the results of your test run.

Note: To see the completed Discogs details extractor, navigate to https://dash.import.io/122dfedd-83c8-44e0-95a4-79c27c82ec16. To use it, click Duplicate on the extractor commands menu and create your own copy.

Step 5: Chaining the details extractor to the links extractor

Now let’s chain the details and links extractors together so the output of the links extractor (1000 links to products) becomes the input to the details extractor.

To create the chain, perform the following steps:

  • On the dashboard, in the left-side navigation pane, select your details extractor.
  • Click the Settings tab.
  • In the Extract from dropdown list, select URLs from another Extractor.
  • In the Selected extractor box, type the first few letters of the name of your links extractor. A dropdown list appears.
  • Select your links extractor.
  • Select the column that contains the URLs to use as the input.

Import.io automatically saves your choices. Your extractor is ready to query all the URLs on next run.

Step 6: Verifying the links extractor has finished running

When you run a chained details extractor, Import.io:

  • Retrieves the latest run of data for the links extractor
  • Retrieves the specified column containing the links
  • Feeds the URLs as input, one by one, to the details extractor
  • Runs the details extractor on each URL

Therefore, allow the links extractor to finish its run before you start the details extractor run. To check that the links extractor has completed its run, perform the following steps:

  • In the left-side navigation pane, select your links extractor.
  • Click the Run History tab.
  • View the status of the latest run. When creating this tutorial, Import.io successfully captured the 1,000 links in just under 25 seconds.
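
The chained run described in this step can be modeled in a few lines of Python. This is only an illustrative sketch of the logic, not Import.io's code: the run dictionaries, the "FINISHED" and "RUNNING" status values, the "link" column name, and the function names are all hypothetical. It shows why an unfinished links run matters: the chain reads only the latest completed output.

```python
def latest_finished_run(runs):
    """Return the most recent finished run, or None if none has finished."""
    finished = [run for run in runs if run["status"] == "FINISHED"]
    return max(finished, key=lambda run: run["started"]) if finished else None


def run_chain(link_runs, url_column, extract_details):
    """Feed each URL from the latest finished links run to the details step."""
    run = latest_finished_run(link_runs)
    if run is None:
        raise RuntimeError("The links extractor has no finished run yet.")
    # Take only the chosen column, skipping rows where it is empty.
    urls = [row[url_column] for row in run["rows"] if row.get(url_column)]
    # Run the details step once per URL, in order.
    return [extract_details(url) for url in urls]


# Hypothetical run history for a links extractor.
link_runs = [
    {"started": 1, "status": "FINISHED",
     "rows": [{"link": "https://example.com/a"}]},
    {"started": 2, "status": "FINISHED",
     "rows": [{"link": "https://example.com/b"}, {"link": "https://example.com/c"}]},
    {"started": 3, "status": "RUNNING", "rows": []},
]

# Stand-in for the details extractor: it just records the URL it was given.
results = run_chain(link_runs, "link", lambda url: {"url": url})
print(results)
```

In this model the run that is still in progress is ignored, so the details step receives the two URLs from the second run rather than the newest data.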

Step 7: Running the details extractor

To run the details extractor on the 1,000 single-item pages, perform the following steps:

  • In the left-side navigation pane, select your details extractor.
  • Click the Settings tab.
  • Click Run URLs. The dashboard switches to the Run History tab and displays the current progress of the run.

    There are 1,000 URLs to process, so it might take a few minutes.
  • When the extractor run completes, click the Download icon for the run to download the data in Excel, CSV, or JSON format.
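
Once downloaded, the CSV file can be processed with standard tools. The following Python sketch assumes the CSV header row uses the column names trained in Step 4; the actual export may contain additional columns, and the sample rows here are made-up placeholders, not real Discogs data.

```python
import csv
import io
from collections import Counter


def load_rows(csv_file):
    """Read an extractor CSV export into a list of dictionaries."""
    return list(csv.DictReader(csv_file))


# Placeholder data standing in for a downloaded file. With a real download,
# use open("your-download.csv", newline="", encoding="utf-8") instead.
sample = io.StringIO(
    "Artist,Album,Label,Format,Country,Genre,Style\n"
    "Artist A,Album A,Label X,Vinyl,UK,Electronic,House\n"
    "Artist B,Album B,Label Y,CD,US,Electronic,Disco\n"
    "Artist C,Album C,Label X,Vinyl,UK,Electronic,House\n"
)

rows = load_rows(sample)
releases_by_country = Counter(row["Country"] for row in rows)
print(len(rows))
print(releases_by_country.most_common())
```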

You can also access this data using the Import.io API, schedule the chain to run periodically against discogs.com, or do both to keep your database up to date automatically.