Import.io User Guide

Extracting from multiple webpages

After you train your extractor to understand the structure of a webpage, you can configure the extractor with a list of additional URLs to gather data from multiple pages of the website. This topic describes different methods for adding URLs to the list.

Note: This topic covers the list of URLs your extractor uses to collect data. Adding URLs to train your extractor is covered in a separate topic.

You can add URLs to your extractor in the following ways. The first two use the Settings tab on the dashboard:

  • Adding URLs in the basic workflow
  • Adding URLs in the chain workflow
  • Uploading URLs through the API

Adding URLs in the basic workflow

The basic workflow extracts data from one or more webpages using a single details extractor. The getting started tutorial demonstrates adding URLs in the basic workflow.

To use the basic workflow, on the dashboard Settings tab, click the Extract from dropdown list and select an explicit list of URLs. The following options appear:

The URL list box displays the list of URLs your extractor uses when you run the extractor.

The following commands pertain to the list:

  • Show URLs – This option toggles between the following filters:
    • Show invalid URLs – This filter limits the list to the URLs that do not follow valid URL syntax.
    • Show all URLs – This filter shows both the valid and invalid URLs.
  • Remove all URLs – This command clears all the URLs from the list.
  • Remove duplicate URLs – This command scans the URL list and removes any duplicate URLs.
  • Download URLs – This command creates and downloads an ASCII text file containing the list of URLs.
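
The invalid-URL filter and the duplicate removal described above can be approximated outside the dashboard, for example to clean a list before pasting it in. The following is a minimal sketch, not Import.io's own logic: the function names are hypothetical, and the validity rule (an http or https scheme, a host, and no whitespace) is an assumption, because this guide does not state the exact rules Import.io applies.

```python
from urllib.parse import urlsplit


def is_valid_url(url: str) -> bool:
    """Rough syntax check (assumed rule): http(s) scheme, a host, no whitespace."""
    if any(ch.isspace() for ch in url):
        return False
    try:
        parts = urlsplit(url)
    except ValueError:
        # urlsplit raises ValueError for some malformed inputs, such as bad IPv6 hosts
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def invalid_urls(urls):
    """Return only the entries that fail the syntax check, in their original order."""
    return [u for u in urls if not is_valid_url(u)]


def remove_duplicates(urls):
    """Drop repeated entries, keeping the first occurrence of each URL."""
    # dict preserves insertion order, so the list order is unchanged
    return list(dict.fromkeys(urls))
```

This sketch treats two URLs as duplicates only when they match character for character. Whether Import.io also normalizes URLs before comparing them is not covered here.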

The following commands pertain to the extractor:

  • Save URLs – This command saves the updated list to your extractor.
  • Run URLs – This command executes your extractor using the most recently saved URL list.
  • Email me when this run finishes – This option determines whether you receive an email when the extractor run completes.

Adding URLs manually with copy and paste

To add URLs to the URL list manually, perform the following steps:

  1. Copy and paste URLs into the URL list box. You can paste multiple URLs at once, provided that each URL is on its own line or the URLs are separated by commas.
  2. Click Show invalid URLs and correct or remove any invalid URLs that Import.io detects.
  3. Click Save URLs. Your extractor is ready to query all the URLs on its next run.
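
To illustrate the separators accepted in step 1, the following sketch splits a pasted block of text on line breaks and commas. It is an approximation for preparing a list yourself; the function name is hypothetical, and the dashboard's actual parsing may differ.

```python
import re


def split_pasted_urls(text: str) -> list[str]:
    """Split pasted text into URLs on commas or line breaks."""
    # Trim surrounding whitespace and drop empty entries left by trailing separators
    return [part.strip() for part in re.split(r"[,\r\n]+", text) if part.strip()]
```

A URL that itself contains a comma would be split in two by this sketch, so place such URLs on their own lines and check the result.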

Adding URLs with the URL generator

Use the URL generator to vary parameters in a URL and quickly produce a list of URLs for your extractor. For example, rather than copying, pasting, and editing each new URL, use the URL generator to apply a range of page numbers to a URL and create a list of paginated URLs. Details for using the URL generator are available as a separate topic.
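
The idea behind generating paginated URLs can be sketched as follows. This is not the URL generator itself: the function name, the {page} placeholder, and the example.com address are assumptions made for illustration.

```python
def generate_page_urls(template: str, first: int, last: int) -> list[str]:
    """Fill the {page} placeholder with each page number from first to last, inclusive."""
    return [template.format(page=number) for number in range(first, last + 1)]


# Hypothetical paginated listing; replace with a URL pattern from your own site
page_urls = generate_page_urls("https://example.com/products?page={page}", 1, 3)
```

The resulting list can be pasted into the URL list box, one URL per line.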

Adding URLs in the chain workflow

The chain workflow uses two extractors in conjunction. First, a links extractor collects multiple URLs from a webpage (such as a list of products). Next, a details extractor visits each URL that the links extractor collected and extracts detailed information from it. The chaining extractors tutorial demonstrates adding URLs in the chain workflow.

To use the chain workflow, perform the following steps:

  1. Create and run a links extractor.
  2. On the dashboard Settings tab, click the Extract from dropdown list and select URLs from another extractor. The following options appear:

The following information specifies the location of the URL list your extractor uses when you run the extractor:

  • Selected extractor – Use this box to specify the links extractor.
  • URL column – Use this box to specify the column of data in the links extractor that contains the URL list.

The following commands pertain to the extractor:

  • Run URLs – This command executes your extractor using the most recently saved URL list.
  • Email me when this run finishes – This option determines whether you receive an email when the extractor run completes.

To include URLs from another extractor, select the links extractor and identify the data column containing the links. Import.io automatically saves your choices, and your extractor is ready to query all the URLs on its next run.
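
Conceptually, the chain workflow reads one column from the links extractor's results and passes each value to the details extractor. The sketch below models that flow with plain Python data. The row format, the function names, and the extract_details callable are all hypothetical stand-ins, not Import.io's implementation.

```python
def urls_from_column(rows, url_column):
    """Collect the non-empty values of one column from the links extractor's rows."""
    urls = []
    for row in rows:
        value = row.get(url_column)
        if value:  # skip rows where the column is missing or empty
            urls.append(value)
    return urls


def run_chain(link_rows, url_column, extract_details):
    """Apply a details step to every URL found in the chosen column."""
    return [extract_details(url) for url in urls_from_column(link_rows, url_column)]
```

Here url_column plays the role of the URL column setting, and link_rows stands in for the output of the selected extractor.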

Uploading URLs through the API

To upload URLs through the Import.io API, use the PUT /extractor/{extractorId}/_attachment/urlList API request. After you upload the URL list, it is accessible through both the dashboard and the API.
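
The following sketch shows one way such a request might be assembled with the Python standard library. Only the HTTP method and the path come from this guide. The base URL, the authentication headers, and the body format (plain text with one URL per line) are assumptions, so confirm them against the API reference before sending anything.

```python
import urllib.request
from urllib.parse import quote


def build_url_list_request(base_url, extractor_id, urls, headers=None):
    """Build (but do not send) a PUT request for an extractor's URL list."""
    # Path taken from this guide; base_url is an assumption supplied by the caller
    endpoint = (
        f"{base_url.rstrip('/')}/extractor/"
        f"{quote(extractor_id, safe='')}/_attachment/urlList"
    )
    # Assumed body format: UTF-8 text with one URL per line
    body = "\n".join(urls).encode("utf-8")
    request_headers = {"Content-Type": "text/plain"}
    # Any authentication headers are hypothetical and must come from the API reference
    request_headers.update(headers or {})
    return urllib.request.Request(
        endpoint, data=body, headers=request_headers, method="PUT"
    )
```

To send the request, pass the returned object to urllib.request.urlopen once the base URL and credentials are confirmed.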

Refer to the Import.io API reference for detailed information regarding the API requests.

Note: The Import.io API is available only with Import.io paid subscriptions.