3 easy ways to get your data into R

If you haven’t heard of R before, you should know that it’s one of the most popular statistical programming languages in the world, used by millions of people. Its open source nature fosters a great community that helps make data analysis accessible to everyone. If you want a better understanding of how R works and its syntax, we recommend taking the free Introduction to R tutorial by DataCamp.

While import.io gives you access to millions of data points, R gives you the means to perform powerful analysis on that data and to turn it into beautiful visualizations. It’s a pretty nifty combo!

In this post, you’ll learn 3 easy ways to get your import.io data into R. This is a beginner tutorial so don’t worry if you’re not that familiar with R or import.io’s advanced features.

Let’s get started!

Let’s get some data

First, you will extract data from r-bloggers, a very popular blog aggregator on R and statistics, using import.io’s Magic tool.

You can do this by going to your import.io account, choosing the Magic tool for extracting data, and providing www.r-bloggers.com as the URL. Click the “extract data” button and automagically you see your data appear: a table of the posts featured on the front page of r-bloggers. Now let’s get this data into R.

Option 1: Importing your import.io CSV file into R

The Magic feature gives you the option to download your table as a CSV. So one way to get your data into R is by

  1. downloading the CSV, and
  2. importing that CSV into R.

Downloading the CSV from import.io is easy: just click the download button and select .csv. Great, now your data is available on your PC!

Next, you need to start R. Most people use R together with RStudio, a powerful interface for R that is free to use.

Once you’ve fired up R, it’s time to write a little bit of code. To import your r-bloggers CSV file into R you just need one function that comes standard with R: read.csv().

Let’s assume you stored your CSV file on your desktop (“~/Desktop/r-bloggers.csv”). In that case, the code to import your CSV would look like this:

r_bloggers_data <- read.csv("~/Desktop/r-bloggers.csv")
head(r_bloggers_data)

What happens here is that read.csv() loads in your CSV file that is stored at “~/Desktop/r-bloggers.csv” and assigns it to the variable r_bloggers_data. You can see this yourself by simply executing the command head(r_bloggers_data) which shows you the first 6 rows of your table. Now you can start doing simple analysis using the variable r_bloggers_data.

If your file is stored somewhere else, just change “~/Desktop/r-bloggers.csv” to the correct directory.
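As a quick sanity check, read.csv() works the same on any CSV source, so here is a minimal, self-contained sketch that simulates a small r-bloggers-style CSV with textConnection(). The column names (title, author) are made up for illustration; yours will match the headers in the file import.io produced.

```r
# Simulate a tiny r-bloggers-style CSV; in practice you'd pass the file path.
csv_text <- "title,author
A post about ggplot2,Alice
Tips for faster loops,Bob"

r_bloggers_data <- read.csv(textConnection(csv_text), stringsAsFactors = FALSE)

str(r_bloggers_data)    # structure: column names and types
nrow(r_bloggers_data)   # number of rows imported
```

Once you can read a toy CSV like this, reading the real file is just a matter of swapping in the path.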

Option 2: Importing your import.io JSON file into R

Magic also offers the option to download your table as JSON. Just click the download button and select .json instead of .csv. But how do you import a JSON file into R?

To do this, you first need to install an R package. Packages in R are simply collections of functions and data used for specific tasks. Installing a package gives you instant access to all the functions and data from that package.

R comes with a standard set of packages that are already installed when you fire up your R session (read.csv() is part of one such standard package, which is why you could use it instantly), but there are many other packages available as well that can make your R life a lot easier.
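If you're curious, base R can show you which packages are attached to your current session and which ones are installed on your machine (the exact output will vary by setup):

```r
# Packages (and other environments) attached to the current session:
search()

# The first few packages installed on this machine:
head(rownames(installed.packages()))
```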

One such package is jsonlite, which allows you to read JSON files (exactly what you need here). Install and load jsonlite using the following code:

install.packages("jsonlite")
library("jsonlite")

The install.packages() function downloads the package and the library() function makes the functions and data inside the jsonlite package available to your R session.
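An optional refinement: installation only needs to happen once per machine, so a common R idiom is to check whether the package is already installed before calling install.packages():

```r
# Install jsonlite only if it's missing, then attach it for this session.
if (!requireNamespace("jsonlite", quietly = TRUE)) {
  install.packages("jsonlite")
}
library("jsonlite")
```

This way you can rerun your script without triggering a fresh download every time.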

Now you are ready to import your r-bloggers JSON file into R:

r_bloggers_data <- fromJSON("~/Desktop/r-bloggers.json")
head(r_bloggers_data)

See the similarity with what you did for the CSV file?

fromJSON(), a function from the jsonlite package, reads in the JSON file that is stored at “~/Desktop/r-bloggers.json” and assigns it to the variable r_bloggers_data. If you now run head(r_bloggers_data), your table appears again, ready for further analysis!

That is the power of packages in R. No matter what you’re thinking of building, it’s likely someone has built something similar in the past using a publicly available package. It’s okay if you don’t know how the jsonlite package or the fromJSON() function works under the hood (we can tell you it’s pretty complicated): you can just call one simple function and your data is available!
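To see fromJSON() in action without any files, you can feed it a JSON string directly. This toy example (the field names are made up) shows that an array of objects is automatically simplified to a data frame, which is exactly what happens with your downloaded file:

```r
library("jsonlite")

# A tiny JSON array of objects, standing in for the downloaded file:
json_text <- '[{"title": "A post",       "author": "Alice"},
               {"title": "Another post", "author": "Bob"}]'

posts <- fromJSON(json_text)
posts   # a 2-row data frame with columns title and author
```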

Option 3: Accessing import.io from within R

The Magic download feature is very handy, but it can become tedious to always go to the application, input the URL, download the data on your desktop and finally import it into R. Wouldn’t it be nice if you could provide your URL straight into R and pull in the data automatically? Of course it would.

You are going to build a solution making use of Blockspring to connect with import.io via R. Setting this up is a bit more complicated compared to the previous solutions, but once it’s done it’s pretty straightforward to use.

You’ll need:

  1. an import.io account and API key, and
  2. a Blockspring account and API key.

Installing the required packages

In this case, you first need to install and load the Blockspring package, which contains functions to access the Blockspring community library (there’s already an import.io one). Since this package is only available via GitHub, you need to use the install_github() function from the devtools package.

install.packages("devtools")
library("devtools")
install_github("blockspring/blockspring.R")
library("blockspring")

Getting the r-bloggers data into R

To get the r-bloggers data into R, you will make use of the blockspringRunParsed() function:

r_bloggers_data <- blockspringRunParsed("extract-data-from-url-importio", list( url = "PROVIDE HERE THE LINK" , include_js = FALSE, text = NULL, maximum_results = 10, connector_version = "PROVIDE HERE YOUR IMPORT.IO KEY" ), list("api_key" = "PROVIDE HERE YOUR BLOCKSPRING KEY"))$params

What you need to provide here as a minimum is (i) the URL “http://www.r-bloggers.com/”, (ii) your import.io key, and (iii) your Blockspring API key. Assigning the output to r_bloggers_data, you get the following:

r_bloggers_data <- blockspringRunParsed("extract-data-from-url-importio", list( url = "http://www.r-bloggers.com/" , include_js = FALSE, text = NULL, maximum_results = 10, connector_version = "85bb4a8fb790482e9458bf3d7643415942b8a3bd3138ae8a0516b9c71d5279482f31775781d66edbcdeae64772d727a26be045d4c7f56c640a6c751054cd7b1691eca78bab335e91db3bf621d9b95ad7" ), list("api_key" = "br_17951_d31440d832d81ae0ddbfc34c0d8897d0f3593322"))$params
head(r_bloggers_data)

Hurray! You managed to get your data into R.

Making your data useful for analysis

Did you check out the structure of r_bloggers_data?

Looks a bit raw and unstructured, doesn’t it? You still need to clean it up.

You can do that with the following code (note that rbind.fill() comes from the plyr package, so you need to install and load that first):

install.packages("plyr")
library("plyr")

r_bloggers_data_clean <- rbind.fill(lapply(r_bloggers_data[[1]], function(f) {
  as.data.frame(Filter(Negate(is.null), f))
}))

Now check out the structure of r_bloggers_data_clean. Looks much better, doesn’t it?

How did we do this? Well, if you look at r_bloggers_data you’ll notice that it is, in fact, a list inside a list inside a list. You’ll also notice that all the relevant information sits in the first list, so that’s the only one we need to select (which is what r_bloggers_data[[1]] does). Next, we apply a homemade function to every element of r_bloggers_data[[1]] (lapply() takes care of applying it to each one), and that function, together with rbind.fill(), turns the result into a clean data.frame.

This code chunk will be the same for every website you crawl, so don’t worry if you’re not fully grasping what is going on in the background.
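To build some intuition anyway, here is what the inner function does to a single extracted record. The toy list below (field names made up for illustration) has one field the crawler couldn’t find, stored as NULL:

```r
# One extracted record, with a missing field (NULL):
f <- list(title = "A post", author = NULL, date = "2015-06-01")

# Drop the NULL entries, then turn what's left into a one-row data frame:
cleaned <- Filter(Negate(is.null), f)
as.data.frame(cleaned)   # columns: title, date
```

rbind.fill() then stacks these one-row data frames, filling in NA wherever a record is missing a column that others have.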

Next steps

Using two fixed API keys and one URL you managed to import your data straight into R. If you want data from another URL, simply rerun the code above with the same API keys and a different URL in the blockspringRunParsed() call. Make sure to try this out!

A logical next step would be to put the above in a wrapper function or package, so that you can simply provide your URL and API keys and get a clean data set back automatically. That way, you wouldn’t even need to understand all the underlying code.
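Such a wrapper could look something like this sketch. The function name import_url() and its argument names are hypothetical; the body just combines the blockspringRunParsed() call and the cleaning step from above, and assumes library("blockspring") and library("plyr") have been run:

```r
# Hypothetical wrapper: one call from URL to clean data frame.
# Assumes the blockspring and plyr packages are installed and loaded.
import_url <- function(url, importio_key, blockspring_key, max_results = 10) {
  raw <- blockspringRunParsed(
    "extract-data-from-url-importio",
    list(url = url, include_js = FALSE, text = NULL,
         maximum_results = max_results, connector_version = importio_key),
    list("api_key" = blockspring_key)
  )$params

  # Same cleaning step as above: drop NULL fields, stack into a data frame.
  rbind.fill(lapply(raw[[1]], function(f) {
    as.data.frame(Filter(Negate(is.null), f))
  }))
}

# Usage (keys are placeholders):
# posts <- import_url("http://www.r-bloggers.com/",
#                     "YOUR_IMPORTIO_KEY", "YOUR_BLOCKSPRING_KEY")
```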

Conclusion

In this post, you had a look at how to get your import.io data into R and prepare it for analysis. However, we did make some simplifications in explaining how everything works and there is still a lot of analysis work to be done from here. If you want to learn more about R and Data Science, check out DataCamp.
