The University of Ottawa Library holds an employee training week every year, giving colleagues the opportunity to share experiences, skills, and insights with one another. I jumped on this opportunity to showcase import.io as a means of creating datasets from website content. The tutorial I developed demonstrated how to create a dataset from the City of Ottawa’s open data catalogue. It’s a really simple example to get users familiar with the functionality of import.io, an easy way to scrape web content via a simple interface and without having to code. In this post I will also demo how to use Open Refine to clean the data captured by import.io and how to visualize it using Google Fusion Tables.
Here’s a brief version of my demo:
Step 1: Get the Data
For this demo, we’ll be using the City of Ottawa’s open data catalogue as the source from which we want to obtain our data. Since we want to gather data from multiple, similar pages on the same website, I opted to use the Crawler.
What’s particularly great is that import.io previews the collected data in the bottom portion of the screen. You can make sure it’s collecting the right information, in the right format.
Once I had uploaded my data, I exported the data as a CSV file and began to think about possible adjustments to the format of the data that would make it more valuable to potential users. I first deleted any unnecessary (for my purposes) columns that are added in the download process, such as _widgetName and _source. Upon viewing the file, I noticed that there were some inconsistencies in the data collected. For example, text files are represented as both .txt and txt, xml is both in all caps and in lowercase, and there is some duplication that I decided to remove.
Step 2: Clean it up
This is where Open Refine comes in, a must-try tool for all kinds of data clean-up. It allows you to make specific changes or to group content and perform bulk updates. A big time saver. Download and install the application to get started. Follow the prompts to create a new project and import the file you want to clean up. For this workshop, I used some of the common transformations available by clicking the down arrow beside the title of the column of interest. Here are some of the transformations I used on the file created with import.io:
- Edit cells > common transformations > to lowercase
- Edit cells > split multi-valued cells (split on blank space and period)
- Edit cells > commons transformations > trim leading and trailing whitespace
- Edit cells > fill down
Of course, the actions you pick will depend on what you want to do with the dataset. Do you want to create a visualization or simply make the clean data available for others to play with? Once you’re done cleaning the data in Open Refine, you can download it in a variety of formats. For the workshop I led for staff I wanted to show as many tools as possible, so I also integrated a quick demo of Google Fusion Tables using the cleaned-up dataset from import.io + Open Refine.
Step 3: Visualize it!
Hop on over to Google Drive and check out this page to learn more about Google Fusion Tables. Click create, connect more apps, and search for and add Fusion Tables. You can then click create and Fusion Tables will be a file type option. Click Fusion Table and upload the file you downloaded from Open Refine. The reason I used the command Edit cells > split multi-valued cells in Open Refine, was because I wanted to create a network graph showing each dataset organized by file type. There’s a great tool for this in Google Fusions Tables that works perfectly with the way we cleaned up the data. To create a network graph with the Fusion Table we created, simply click the red plus sign box on the top panel and select add chart.
This is what you’ll get:
I wanted to show the link between dataset_name and file_types so I selected those options on the left. I also selected color by columns to make the comparisons more prominent.
And that’s all there is to it! You can embed the visualization you created in your website and show it to all your friends by clicking tools > publish. To view the Fusion Table and explore the network visualization I created for this post, click here.
Go Try It Yourself!
The goal of the workshop at the U of Ottawa Library was to get staff familiar with gathering, cleaning, and visualizing data. Using the simple example of the open data catalogue, participants were able to get a sense of what these tools are capable of, get some hands-on experience using them, and hopefully be inspired to explore and test further. The possibilities are endless and the workshop has generated some good discussions and new collaborations. These are all great tools with many possibilities and applications far beyond those demonstrated in this post. So, stop reading and go experiment!
by Catherine McGoveran
Turn the web into data for free
Create your own datasets in minutes, no coding required
Powerful data extraction platform
Point and click interface
Export your data in any format
Unlimited queries and APIs