Import.io User Guide

Retrieving crawl run URLs


Use the following example code to retrieve the URLs queried by an extractor in a given crawl run (extractor run). The examples use the curl and jq commands, but the same requests can be readily adapted to programming languages such as Java, Python, or Perl.

To retrieve the URLs from a given run, perform the following steps:

  • Retrieve the crawl run IDs from the run history of the extractor.
  • Identify the urlListId of the desired run.
  • Retrieve the URL list from the desired run.

Step 1. Retrieving crawl run IDs from the run history

To retrieve a list of IDs for the crawl runs displayed in the Run history tab of the Import.io dashboard, set $EXTRACTOR_ID and $IMPORT_IO_API_KEY to your specific extractor GUID and API key, then use the following API request:

import.io $ curl -s "https://store.import.io/store/crawlrun/_search?_sort=_meta.creationTimestamp&_page=1&_perPage=30&extractorId=$EXTRACTOR_ID&_apikey=$IMPORT_IO_API_KEY" | jq .

 

The JSON output in the API response will look something like this:

{
   "took": 2,
   "timed_out": false,
   "hits": {
     "total": 3,
     "hits": [
       {
         "_type": "CrawlRun",
         "_id": "ab9bb66b-ab40-421b-a083-fc075ff9f24f",
         "_score": 0,
         "fields": {
           "_meta": {
             "timestamp": 1488398705946,
             "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
             "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
             "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
             "creationTimestamp": 1488398699884
           },
           "guid": "ab9bb66b-ab40-421b-a083-fc075ff9f24f",
           "runtimeConfigId": "9f56c8ae-5768-49b1-8fcf-697dc63db379",
           "extractorId": "8560e178-e21d-4fea-b0e4-b65ea4320714",
           "startedAt": 1488398701100,
           "stoppedAt": 1488398705945,
           "totalUrlCount": 1,
           "successUrlCount": 1,
           "failedUrlCount": 0,
           "rowCount": 10,
           "state": "FINISHED",
           "urlListId": "0c2e6446-923e-452f-9356-f68edc8347ff",
           "json": "470d9c46-7b71-4636-8dd0-50ac83539b16",
           "csv": "5529f964-deff-4257-9b25-db7e258f7465",
           "log": "6877cf09-ad47-44f2-94d9-5c63bbec01cc",
           "sample": "fa8612da-2cb5-4b80-8228-0c044c802407"
         }
       },
       {
         "_type": "CrawlRun",
         "_id": "2b9fb92f-2500-4c03-84b5-4683fe9fbb09",
         "_score": 0,
         "fields": {
           "_meta": {
             "timestamp": 1488397872333,
             "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
             "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
             "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
             "creationTimestamp": 1488397837002
           },
           "guid": "2b9fb92f-2500-4c03-84b5-4683fe9fbb09",
           "runtimeConfigId": "de177ec7-90e6-44af-8bc6-520447657a62",
           "extractorId": "8560e178-e21d-4fea-b0e4-b65ea4320714",
           "startedAt": 1488397837851,
           "stoppedAt": 1488397872332,
           "totalUrlCount": 11,
           "successUrlCount": 11,
           "failedUrlCount": 0,
           "rowCount": 220,
           "state": "FINISHED",
           "urlListId": "10aed23c-a9c4-4351-8095-5727a44a02a3",
           "json": "2de1e68c-4993-4940-87a9-5b76528087fd",
           "csv": "f55bf49c-b2d9-43fe-9bd8-7ec141936e28",
           "log": "6198f94e-c451-4306-b706-eafdba05be5b",
           "sample": "b3ee13b0-b716-432d-a4f8-bc2beca90188"
         }
       },
       {
         "_type": "CrawlRun",
         "_id": "2a6dc3e5-ddec-40c0-a2c4-9fbee86b1a90",
         "_score": 0,
         "fields": {
           "_meta": {
             "timestamp": 1487033608913,
             "lastEditorGuid": "d1100850-863b-4e0f-9fa0-5fbcd44db427",
             "ownerGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
             "creatorGuid": "00a451ae-c38d-4752-a329-389b37cfc0aa",
             "creationTimestamp": 1487033441315
           },
           "guid": "2a6dc3e5-ddec-40c0-a2c4-9fbee86b1a90",
           "runtimeConfigId": "de177ec7-90e6-44af-8bc6-520447657a62",
           "extractorId": "8560e178-e21d-4fea-b0e4-b65ea4320714",
           "startedAt": 1487033441838,
           "stoppedAt": 1487033608905,
           "totalUrlCount": 38,
           "successUrlCount": 38,
           "failedUrlCount": 0,
           "rowCount": 748,
           "state": "FINISHED",
           "urlListId": "ce935349-98ff-4ae8-b458-127468df1b41",
           "json": "012b66b8-eb4d-45fc-a3ba-aa96d6c18f97",
           "csv": "f0f5e838-88ed-46b1-9dc0-53a9245d227f",
           "log": "1db579cc-8476-4dae-9226-a1aa06d48a12",
           "sample": "b9c87329-4966-4330-835d-9d31a53bcaec"
         }
       }
     ],
     "max_score": 0
   }
 }
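The full response is verbose, so it can help to reduce it to the fields that matter for this procedure. The following is a minimal sketch, not part of the Import.io API: the function name summarize_runs is hypothetical, it assumes that startedAt is an epoch timestamp in milliseconds (inferred from the 13-digit values above), and it assumes jq 1.5 or later is installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a crawl run search response on stdin and print
# one tab-separated row per run:
#   run GUID, state, start time (UTC), total URL count, urlListId
# Assumption: startedAt is in epoch milliseconds, so it is divided by 1000
# and floored before being formatted as an ISO 8601 date.
summarize_runs() {
  jq -r '
    .hits.hits[].fields
    | [ .guid,
        .state,
        (.startedAt / 1000 | floor | todate),
        .totalUrlCount,
        .urlListId ]
    | @tsv'
}

# Example input: a trimmed copy of the first run from the response above.
summarize_runs <<'JSON'
{"hits": {"hits": [{"fields": {
  "guid": "ab9bb66b-ab40-421b-a083-fc075ff9f24f",
  "startedAt": 1488398701100,
  "totalUrlCount": 1,
  "state": "FINISHED",
  "urlListId": "0c2e6446-923e-452f-9356-f68edc8347ff"
}}]}}
JSON
```

To apply it to a live response, pipe the output of the Step 1 curl command into summarize_runs instead of into jq .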

Step 2. Identifying the ID of the desired URL list

As shown above, each crawl run has its own urlListId, which identifies the list of URLs queried during that run. Locate the urlListId of the desired run and proceed to Step 3.
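If you prefer to select the urlListId with a script rather than by eye, jq can pick it out of the Step 1 response. This is a sketch under stated assumptions: both function names are hypothetical, and "most recent" is taken to mean the run with the largest _meta.creationTimestamp value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the urlListId of the run whose GUID is $1.
# Reads a crawl run search response on stdin; prints nothing if no run matches.
url_list_id_for_run() {
  jq -r --arg guid "$1" '
    .hits.hits[].fields
    | select(.guid == $guid)
    | .urlListId'
}

# Hypothetical helper: print the urlListId of the most recently created run.
# Assumption: "most recent" means the largest _meta.creationTimestamp.
# Prints nothing if the response contains no runs.
latest_url_list_id() {
  jq -r '
    .hits.hits
    | max_by(.fields._meta.creationTimestamp)
    | .fields.urlListId // empty'
}

# Usage (requires a saved Step 1 response; runs.json is a placeholder name):
#   url_list_id_for_run 2a6dc3e5-ddec-40c0-a2c4-9fbee86b1a90 < runs.json
#   latest_url_list_id < runs.json
```

Selecting by GUID is the safer choice when the run of interest is known, because it does not depend on how the response is sorted.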

Step 3. Retrieving the URL list

To retrieve the URLs for a specific crawl run, use the following API request. Set $EXTRACTOR_ID to your extractor GUID, set $IMPORT_IO_API_KEY to your API key, and replace the urlListId in the request URL with the urlListId of the desired run.

This example uses the urlListId of the last crawl run listed in the Step 1 response (ce935349-98ff-4ae8-b458-127468df1b41).

import.io $ curl -s -X GET -H 'Accept-Encoding: gzip' --compressed "https://store.import.io/store/extractor/$EXTRACTOR_ID/_attachment/urlList/ce935349-98ff-4ae8-b458-127468df1b41?_apikey=$IMPORT_IO_API_KEY"

 

The API response returns the following list:

https://www.yelp.com/biz/hello-robin-seattle?start=0
https://www.yelp.com/biz/hello-robin-seattle?start=10
https://www.yelp.com/biz/hello-robin-seattle?start=20
https://www.yelp.com/biz/hello-robin-seattle?start=30
https://www.yelp.com/biz/hello-robin-seattle?start=40
https://www.yelp.com/biz/hello-robin-seattle?start=50
https://www.yelp.com/biz/hello-robin-seattle?start=60
https://www.yelp.com/biz/hello-robin-seattle?start=70
https://www.yelp.com/biz/hello-robin-seattle?start=80
https://www.yelp.com/biz/hello-robin-seattle?start=90
https://www.yelp.com/biz/hello-robin-seattle?start=100
https://www.yelp.com/biz/hello-robin-seattle?start=110
https://www.yelp.com/biz/hello-robin-seattle?start=120
https://www.yelp.com/biz/hello-robin-seattle?start=130
https://www.yelp.com/biz/hello-robin-seattle?start=140
https://www.yelp.com/biz/hello-robin-seattle?start=150
https://www.yelp.com/biz/hello-robin-seattle?start=160
https://www.yelp.com/biz/hello-robin-seattle?start=170
https://www.yelp.com/biz/hello-robin-seattle?start=180
https://www.yelp.com/biz/hello-robin-seattle?start=190
https://www.yelp.com/biz/hello-robin-seattle?start=200
https://www.yelp.com/biz/hello-robin-seattle?start=210
https://www.yelp.com/biz/hello-robin-seattle?start=220
https://www.yelp.com/biz/hello-robin-seattle?start=230
https://www.yelp.com/biz/hello-robin-seattle?start=240
https://www.yelp.com/biz/hello-robin-seattle?start=250
https://www.yelp.com/biz/hello-robin-seattle?start=260
https://www.yelp.com/biz/hello-robin-seattle?start=270
https://www.yelp.com/biz/hello-robin-seattle?start=280
https://www.yelp.com/biz/hello-robin-seattle?start=290
https://www.yelp.com/biz/hello-robin-seattle?start=300
https://www.yelp.com/biz/hello-robin-seattle?start=310
https://www.yelp.com/biz/hello-robin-seattle?start=320
https://www.yelp.com/biz/hello-robin-seattle?start=330
https://www.yelp.com/biz/hello-robin-seattle?start=340
https://www.yelp.com/biz/hello-robin-seattle?start=350
https://www.yelp.com/biz/hello-robin-seattle?start=360
https://www.yelp.com/biz/hello-robin-seattle?start=370
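
As a sanity check, the number of URLs returned can be compared with the totalUrlCount reported for the run in Step 1 (38 for the run used in this example). The following is a minimal sketch rather than an official client: the helper names are hypothetical, the endpoint is copied from the Step 3 request, and --fail is a standard curl option that makes curl exit with a non-zero status on HTTP errors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the Step 3 endpoint.
# Arguments: extractor GUID, urlListId, API key.
url_list_endpoint() {
  local extractor_id="$1" url_list_id="$2" api_key="$3"
  printf 'https://store.import.io/store/extractor/%s/_attachment/urlList/%s?_apikey=%s\n' \
    "$extractor_id" "$url_list_id" "$api_key"
}

# Hypothetical helper: download a URL list and print it, one URL per line.
# Takes the same three arguments as url_list_endpoint.
fetch_url_list() {
  curl -s --fail --compressed "$(url_list_endpoint "$1" "$2" "$3")"
}

# Hypothetical helper: count the non-empty lines on stdin.
# grep exits non-zero when the count is 0, so "|| true" keeps the status clean.
count_urls() {
  grep -c . || true
}

# Usage (performs a network request, so it is left commented out here):
#   fetch_url_list "$EXTRACTOR_ID" ce935349-98ff-4ae8-b458-127468df1b41 \
#     "$IMPORT_IO_API_KEY" | count_urls
```

If the printed count differs from the totalUrlCount of the run, check that the urlListId belongs to the intended run.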