Our current Python client library is quite complex and can be tricky to get your head around. The good news is that we will soon be bringing out version 2, which is much easier to use and comes with much more help content.
While we are putting the final touches on the new version, I want to take a few moments to talk about some of the core concepts of the client library.
When we write code, the standard assumption is that lines run one after another: the next line starts only once the work on the current line (for example, a function call) has finished. Imagine a “get_data” function in Python that fetches whatever data we need and returns it. If we call it and print the result on the next line, we expect the output to contain all of the data we wanted.
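The synchronous case can be sketched in a few lines. The body of `get_data` here is a made-up placeholder (it just returns a hard-coded row), since the real function is only imaginary:

```python
def get_data(query):
    """Hypothetical stand-in: pretend to fetch rows for `query` and return them."""
    return [{"query": query, "value": 42}]

# Each line finishes before the next one starts,
# so `data` is fully populated by the time we print it.
data = get_data("example query")
print(data)
```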
However, with asynchronous functions, there is one key difference. Instead of waiting for the data to be ready and then returning it, they return immediately and let you know later on when the data is available. If you were to convert the “get_data” function in the above example into an asynchronous function, you would not get the same results: when the code moves on to the next line, your data would not be ready, and the final print would probably write “False”, not the data you were looking for!
Because the function call returns immediately, we need a way for the asynchronous operation to let us know that it has the data available for us. This is done with a callback function. A callback function is a function you define yourself and then give to the client library, which calls it whenever the asynchronous operation has completed.
For example, we can modify our imaginary “get_data” function so that, in addition to a query string, it takes a callback function as an argument. It calls this callback once the data is available, at some point in the future.
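Here is a minimal sketch of that callback pattern. It is not the client library's API: this `get_data` is a hypothetical stand-in that uses `threading.Timer` to simulate a short delay, and `on_data` and `received` are names chosen for illustration.

```python
import threading

def get_data(query, callback):
    """Hypothetical asynchronous stand-in: returns immediately and later
    calls `callback` with the data from a background thread."""
    def work():
        callback([{"query": query, "value": 42}])
    timer = threading.Timer(0.1, work)  # simulate a 0.1 second delay
    timer.start()
    return timer

received = []

def on_data(data):
    # Called by get_data once the data is available.
    received.append(data)
    print("callback got:", data)

timer = get_data("example query", on_data)
# get_data has already returned, but the callback has usually not run yet.
print("get_data returned; data so far:", received)

timer.join()  # only here so the demo waits for the callback before exiting
```

The second print normally shows an empty list, because the callback only fires after the simulated delay.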
Why is import•io querying asynchronous?
When you issue a query to import•io, it is possible for it to be executed synchronously in Python. However, using this method, you can only get one page of data, and query a single source at a time. This is why we came up with our client library: you can get a much higher throughput (more queries in less time), do multiple queries simultaneously, and get multiple pages of results per query.
When you issue a query with the client library, you start a chain of events on the import•io servers which execute your query for you, collect the results and extract the data, then return the data to your client. Because there are so many steps on the server, we use an asynchronous protocol to return data as and when it becomes available.
So in order to get you the best possible performance, and your data as soon as possible, we have made the client asynchronous too.
How do I make the client library synchronous?
Fortunately, if you don’t need the asynchronous features of the client library, or simply want each query to finish before starting the next, we have included a mechanism that lets you wait for results before your code continues.
To do this, we use a class called a latch. A latch is a concept in asynchronous programming that lets us wait for asynchronous code to finish executing before continuing, essentially making the asynchronous operation synchronous. The only downside is that you have to know in advance how many asynchronous operations you are waiting for.
When you create a latch instance, you tell it how many operations you want to wait for. So if you are only issuing one query, you construct it with a value of 1. However, if you have several queries to run, for example 5, you would construct it with a 5.
The next step is to tell the latch when each of the operations you are waiting for has finished. This will almost certainly happen in your callback function, and is done by calling the countdown method on the latch instance.
Finally, you wait for all of the asynchronous operations to complete by calling the latch’s “await” method. This stops your code from executing until all of the queries have been completed.
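The three steps above (construct, count down, wait) can be sketched with a small latch built on `threading.Condition`. This is an illustrative assumption, not the client library's own class: `Latch`, `fake_query` and `on_done` are hypothetical names. The waiting method is called `wait` here because `await` is a reserved keyword from Python 3.7 onwards.

```python
import threading

class Latch:
    """Hypothetical count-down latch; not the client library's own class."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def countdown(self):
        # Mark one operation as finished; wake any waiters when none remain.
        with self._cond:
            self._count -= 1
            if self._count <= 0:
                self._cond.notify_all()

    def wait(self):
        # Block until countdown() has been called `count` times.
        with self._cond:
            while self._count > 0:
                self._cond.wait()

def fake_query(name, callback):
    """Hypothetical asynchronous query: calls back from another thread."""
    threading.Timer(0.05, callback, args=(name,)).start()

latch = Latch(5)  # we are about to issue five queries
finished = []

def on_done(name):
    finished.append(name)
    latch.countdown()  # one fewer operation to wait for

for i in range(5):
    fake_query("query-%d" % i, on_done)

latch.wait()  # returns only after all five callbacks have run
print(sorted(finished))
```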
Below is a fully annotated guide to using the latch mechanism with a single query on our new client library. It also shows how to store the returned data so that you can access it once the query has completed:
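As a stand-in, here is a minimal single-query sketch. It does not use the real client library: `fake_query` is a hypothetical placeholder for an asynchronous query call, and `Latch` is an assumed implementation with a `wait` method (`await` is a reserved keyword in Python 3.7+).

```python
import threading

class Latch:
    """Hypothetical latch, assumed for this sketch only."""

    def __init__(self, count):
        self._count = count
        self._cond = threading.Condition()

    def countdown(self):
        with self._cond:
            self._count -= 1
            self._cond.notify_all()

    def wait(self):
        with self._cond:
            self._cond.wait_for(lambda: self._count <= 0)

def fake_query(query, callback):
    """Hypothetical stand-in for an asynchronous client query."""
    rows = [{"query": query, "row": n} for n in range(3)]
    threading.Timer(0.05, callback, args=(rows,)).start()

# We issue exactly one query, so the latch is constructed with 1.
latch = Latch(1)

# The callback stores results here so they can be read after the wait.
data_rows = []

def callback(rows):
    data_rows.extend(rows)  # keep the returned data
    latch.countdown()       # tell the latch this query has finished

fake_query("example query", callback)

# Block here until the callback has counted the latch down to zero.
latch.wait()

# From this point on it is safe to use the stored data.
print("rows received:", len(data_rows))
```

Storing the rows in a list defined outside the callback is what makes them available to the code that runs after the wait.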