Updated: Bulk extract data using Google Sheets

Regular blog readers will remember my previous batch search webinar, in which I showed you how to use the Google Sheet I created to upload a lot of URLs (or search terms) into an extractor (or connector). It was an extremely popular post and I got lots of comments and questions about it. Well, there is now… a NEW version!

I’ve updated the spreadsheet to incorporate some of the feedback I received and generally improve its performance. So, let me introduce you to Batch Search v0.3.5…

What is batch search?

First of all – for anyone who doesn’t know – a batch search is when you query multiple URLs at once. At the moment, on the data set page, you can add multiple URLs to your Extractor, but you have to do it one at a time, which, if you have 50 or 100 URLs, is a slow process.

Google Sheets to the rescue

Using my Google Sheet (GS) you can input many URLs at once and pass them through any import.io connector or extractor with just the press of a button.

Make a copy

In order to use the Batch Search sheet for yourself, you will first need to make a copy of it in your own Google Drive. Open this Google Sheet and click “File > Make a copy…”

You should now be able to edit the sheet.

Select your Extractor (or Connector)

Next you need to choose the extractor (or connector) that you want to pass URLs through. Simply head over to your My Data page, click the extractor you want and copy the corresponding URL – you can also use the GUID for that extractor (the part of the URL after “id=”).

Then, paste that URL into the source box in the GS and click “Get Source”. The GS will go and fetch all the relevant information for that extractor and display it below. Next, enter all the URLs (or search terms for Connectors) into column A starting in row 20 (one per line) and click “Get results”.
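
For the curious, here is a rough sketch of what a “Get Source” step like this could do behind the scenes. To be clear, this is not the sheet’s actual code: the store endpoint and parameter names are assumptions based on import.io’s legacy connector API.

```typescript
// Hypothetical "Get Source" helpers (Apps Script written in TypeScript;
// @types/google-apps-script provides UrlFetchApp etc.). Endpoint shape
// and parameter names are assumptions, not the sheet's real code.
function parseSourceGuid(sourceInput: string): string {
  // Accept either a full My Data URL containing "id=<guid>" or a bare GUID.
  const match = sourceInput.match(/id=([0-9a-fA-F-]{36})/);
  return match ? match[1] : sourceInput.trim();
}

function fetchSourceInfo(guid: string, user: string, apiKey: string): object {
  const url =
    'https://api.import.io/store/connector/' + guid +
    '?user=' + encodeURIComponent(user) +
    '&apikey=' + encodeURIComponent(apiKey);
  const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
  if (response.getResponseCode() !== 200) {
    throw new Error('Could not read data source: ' + response.getContentText());
  }
  return JSON.parse(response.getContentText());
}
```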

Note: The first time you use the sheet, Google should prompt you to authorize the script to run.

Get data

Once you click “Get results”, the GS will pass each of those URLs through the extractor you chose and display the corresponding data in a new tab of the GS.
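
Conceptually, “Get results” is just a loop over your inputs. Here is a minimal sketch, again assuming the legacy _query endpoint and a “webpage/url” input field rather than the sheet’s real internals:

```typescript
// Minimal sketch of a "Get results" loop (not the sheet's actual code).
// The _query endpoint and the "webpage/url" input name are assumptions
// based on import.io's legacy query API.
function runBatch(guid: string, user: string, apiKey: string, urls: string[]): void {
  // Write all results into a fresh tab, as the sheet does.
  const out = SpreadsheetApp.getActiveSpreadsheet()
    .insertSheet('Results ' + new Date().getTime());
  const endpoint =
    'https://api.import.io/store/connector/' + guid +
    '/_query?_user=' + encodeURIComponent(user) +
    '&_apikey=' + encodeURIComponent(apiKey);
  for (const pageUrl of urls) {
    const response = UrlFetchApp.fetch(endpoint, {
      method: 'post',
      contentType: 'application/json',
      payload: JSON.stringify({ input: { 'webpage/url': pageUrl } }),
      muteHttpExceptions: true,
    });
    const data = JSON.parse(response.getContentText());
    // One row per extracted result, tagged with the URL it came from.
    for (const row of data.results || []) {
      out.appendRow([pageUrl, JSON.stringify(row)]);
    }
  }
}
```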

Is there anything it can’t do?

Batch search is a pretty powerful tool, but it isn’t perfect! Google Apps Script automatically times out after 6 minutes, so if you upload a LOT of URLs the run may be cut off before it finishes.
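
If you do hit the limit, the usual Apps Script pattern is to watch the clock yourself and stop cleanly before Google cuts you off, remembering where you got to so a second run can carry on. A hedged sketch (the helper names here are mine, not the sheet’s):

```typescript
// Sketch of a timeout guard: stop before Apps Script's ~6 minute limit
// and remember where to resume on the next run. processUrl() is a
// hypothetical stand-in for a single extractor query.
const TIME_BUDGET_MS = 5 * 60 * 1000; // stop after ~5 minutes, leaving headroom

function processUrl(url: string): void {
  // one import.io query, as sketched above
}

function runWithTimeBudget(urls: string[]): void {
  const props = PropertiesService.getScriptProperties();
  const start = Date.now();
  let i = Number(props.getProperty('nextIndex') || 0);
  for (; i < urls.length; i++) {
    if (Date.now() - start > TIME_BUDGET_MS) break; // out of time, save progress
    processUrl(urls[i]);
  }
  props.setProperty('nextIndex', String(i)); // run again to continue from here
}
```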


Comments

Hi Andrew, thank you very much for working on this powerful tool. However, I have been having difficulty getting this to work properly with a data set from Amazon (20 item example).

Here is the first query of the data-set:
http://goo.gl/ywLGMP

Here is the second attempt:
http://goo.gl/PJFefj

No matter what I try, the results seem to be hit or miss. Some of the items that crawled successfully in the first query are not in the second, and vice versa. This also appears to happen when I try to add URLs to the extractor in the data view.

Thank you again very much for this tool; please let me know if you have any suggestions.

Hi Liam,

Amazon does something called randomised A/B testing. This basically means that when you land on a page, there’s a 50/50 chance that you either get put on site A or site B. The two sites might look similar, but there are small changes, maybe an ad is placed somewhere else, or a part of the website is a slightly different colour. They do this to constantly optimise their website and make changes so that configurations that sell more products get implemented. You can read more about how it all works here – https://developer.amazon.com/sdk/ab-testing/reference/how-ab-works.html

Code-wise, however, that means the data you are trying to extract is in very different places. In fact, the price you are looking for may be several lines of code beneath where you trained it to be.
So when you send an Extractor query to Amazon, there’s a 50/50 chance that you end up on the same site (code-wise) that the Extractor was trained on.
And it may look even more odd when you send a dozen Extractor queries one after another, as the pages the data is actually extracted from will depend on whether each request ended up in the A or B category.
I hope this clarifies the issue a bit!
If you need any more details feel free to email me at support@import.io
Hao

Thank you, this adds a lot of clarification to what I’ve been noticing when running my tests.

Thank you also for the link, will read into it more.

I’m also struggling to get more than the first page of search results from my connectors using this spreadsheet – the data sets bring back the 100-per-term limit, but using the Google Sheet I get only 10… Any help would be greatly appreciated!
Thanks

Hi Neal, can you share the following: a link to your data source and some example queries? If you don’t want to share those here, send an email to support@import.io with the same details and we will get it looked at immediately. Andrew

It should "just work", even if the pages you are extracting from require JS to be enabled:
1. Build a connector or extractor (to get the extractor to work you will have to enable JS).
2. Test the extractor or connector in the My Data page.
3. Use the extractor or connector in Google Sheets.

I’m having problems getting this to work. I enter everything according to the instructions. On the My Data page, the data from my source URL shows up (there are 4 columns of data). But when I hit the “Get Results” button, I get a sheet that has only two columns: the first column header is webpage/url and the second pageUrl. In the rows for each URL I entered, I get three lines of text in each cell with the following error message:
Query Attempt 1 – IMPORT.IO ERROR: {"errorType":"InputException","error":"Bad url: no protocol: http://www.theguardian.com"}
Query Attempt 2 – IMPORT.IO ERROR: {"errorType":"InputException","error":"Bad url: no protocol: http://www.theguardian.com"}
Query Attempt 3 – IMPORT.IO ERROR: {"errorType":"InputException","error":"Bad url: no protocol: http://www.theguardian.com"}

Any idea of what I’m doing wrong or what is not working correctly?
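
A Java-style “no protocol” error for a URL that clearly starts with http:// usually means there are invisible characters (stray quotes, whitespace, or smart quotes from copy-paste) sitting in front of the protocol. That is only a guess at what is happening here, but a clean-up step along these lines would rule it out:

```typescript
// Hypothetical clean-up for pasted URLs. "Bad url: no protocol" for a
// URL that looks fine often means hidden characters before "http";
// this is a guessed fix, not a confirmed diagnosis.
function cleanUrl(raw: string): string {
  return raw
    .replace(/[\u200B\uFEFF]/g, '') // strip zero-width spaces and BOMs
    .replace(/^["'\u201C\u201D\s]+|["'\u201C\u201D\s]+$/g, ''); // trim quotes and whitespace at the ends
}
```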

Hi Andrew,

I’ve tried this version and it works great. Using this Batch Search makes the extraction process faster and simpler, indeed. Thanks Andrew, what a great tool 🙂

Anyway, today I tried to extract some data, and I got this notification:
http://prntscr.com/5znplf

Would you please give me some help?

Is there a way to just make it automatically get the data… like importxml/importhtml etc.?

importxml does not pick up the data that I need… so

Hi Andrew,
Magic currently comes back with only the current page, with no way to download more pages; could you please help with downloading at least 5-10 pages at once?

Thanks

Incredible, Andrew! Just completed in under 2 hours a job that I estimated would take 20 without import.io automation (a connector and this bulk extract GS).

Thanks!

Hi everyone,

Is there any way of working around the time limit for Google Docs? [I’ve got a few thousand URLs to go through and it keeps on timing out.]
Or any update on building this into import.io?

Hi Gav,
I’m afraid I don’t know of a way around it in GSheets, but we are currently working on a version of this in the import.io app. If you (or anyone else in this thread) are interested in helping us test it out and giving us feedback, send us your email address (hello@import.io) and we’ll send you a link to test it out. 🙂

Hi guys,

Is there a way to extract Name, Last Name, and Company Name from an Excel spreadsheet, or a website I’ve opened with a lot of people’s profiles on it, and sort them into a spreadsheet under their own columns?

Thank you

We need a throttle of sorts: when you query 100s of URLs from the same source, they start denying your requests until a period of time passes, because you look like a denial-of-service attack. Whether it goes into Google Sheets or the app, I think there needs to be a way to limit how fast and how often you connect to the server to get the data.

I have the same issue; many sites banned me for some time because I make too many requests. It would be perfect to add some options like “the number of links to be parsed per second or per query” and a “time delay” between requests (in seconds).
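
For what it’s worth, the delay being asked for is a one-liner in Apps Script. An illustrative sketch (the delay value and helper are mine, not a feature of the sheet):

```typescript
// Illustrative throttle: wait a fixed delay between queries so the
// target site doesn't see a burst of requests. Note Utilities.sleep()
// blocks the script, so it also eats into the 6-minute budget.
const DELAY_BETWEEN_QUERIES_MS = 2000; // illustrative; tune per site

function runThrottled(urls: string[], query: (url: string) => void): void {
  for (const url of urls) {
    query(url);
    Utilities.sleep(DELAY_BETWEEN_QUERIES_MS);
  }
}
```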

Something happened.
Google has now added an Add-ons store, which appears to have broken the import.io script, so the Google Sheet isn’t working anymore 🙁

When I press a button, it says: “Function in script getSource wasn’t found.”

Help!

Hey Andrew,

The script complains "Cannot read Data Source. This does not appear to be either a valid Data Source URL or a Data Source GUID. Check your Data Source and try again. You entered: http://ipr.etsi.org/"
Following that, another warning:
"There appears to be a problem accessing Import.io right now for your source information.

Check our status blog and try again – http://blog.import.io/status-blog/

Exception: Request failed for https://api.import.io/store/connector/undefined?user=315f6e84-8fd6-449f-ae60-6eccfb9a017e&apikey=7C2or4Abyj9Hhk%2BzDPQtDcwby5szRxJksOQk2qy%2FrJkvA1F7C82JG2WcDII3ofwuip3BK16Y8JLShCxwHgcErQ%3D%3D. Returned code: 404. Truncated server response: {"error":"No such bucket.","code":"NOSUCHBUCKET"} (Use the muteHttpExceptions option to examine the full response.)"

Can you help?

This is great, but for certain columns where the data is stored as a comma-delimited list (for example a row that has multiple images), the extractor only returns the first value from the list.

Looking at the data set in import.io, the row’s value is:
"image1.jpg, image2.jpg, image3.jpg".

The extractor, however, returns only "image1.jpg".

Thanks James! Good spot. We will look into fixing that and update you.

Hi,
I’ve got a problem when clicking on "Get Results". I guess I didn’t enter the right thing in the "queries" because what I get is:

Your data source requires the following input fields:

["webpage/url"].

But your queries are mapped to these input fields:

["http://www.letudiant.fr/etudes/annuaire-enseignement-superieur/formation/niveau-apres-un-bac-3-4/diplome-masters-professionnels/page-2.html"%5D.

Click "Get input fields" to get the correct input field names for your selected data source, then try again.

What should I do? Where is "Get input fields"? Thanks for your help.
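
The error message itself points at the fix: the query has to be keyed by the data source’s input field name, not the bare URL. In a legacy import.io query payload, that mapping looks roughly like this (a sketch of the shape, not the sheet’s exact format):

```typescript
// Sketch of the mapping the error is complaining about: the input field
// name ("webpage/url") is the key and the URL is the value. Payload
// shape is an assumption based on the legacy import.io query API.
const query = {
  input: {
    'webpage/url':
      'http://www.letudiant.fr/etudes/annuaire-enseignement-superieur/formation/niveau-apres-un-bac-3-4/diplome-masters-professionnels/page-2.html',
  },
};
```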

Comments are closed.