What data science tells us about United’s terrible, horrible, no good, very bad month

By now, most of us have seen the disturbing video. On a United Airlines flight from Chicago to Louisville, Ky., on April 9, Dr. David Dao, a 69-year-old Louisville physician trying to get home to his patients, was dragged from a seat that he had paid for.

A fellow passenger captured the incident on her cellphone camera, and the video immediately went viral, garnering millions of views in just a few hours. Outrage ensued, and it escalated further when United’s chief executive issued a statement that apologized for “having to re-accommodate” the passenger and called Dr. Dao “disruptive and belligerent.”

Days later, on a United flight from Houston to Calgary, a scorpion fell from an overhead bin and stung a passenger on his head as he ate lunch. And not long after that, a beloved giant bunny was found dead in a cargo hold after a transatlantic flight on United.

It was, in short, an epic month of PR nightmares for the airline.

To determine just how bad it really was, we used Import.io to scrape a collection of news stories from April and early May from the New York Times, the Washington Post, the Los Angeles Times, the Chicago Tribune, and the BBC. In all, 152 stories were scraped and then analyzed using natural language processing algorithms in R.

The results of such analysis can offer businesses an accurate look at coverage during a crisis and can then suggest how best to respond from a PR perspective.

Scraping the news sites

The first task was to collect the stories. Because each news site has a different layout, we had to set up an individual scraper for each one. In some cases, like the New York Times, the process was straightforward. For other sites, such as the BBC, whose story pages carry a lot of extra content, we had to write manual XPath expressions to collect only the relevant data.
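To illustrate what a manual XPath selection does, here is a minimal sketch in Python rather than Import.io. The page markup, the class names, and the extract_story helper are all hypothetical, and the standard-library parser used here only handles well-formed markup, so a real scraper would likely need an HTML-aware parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified story page. Real news pages are rarely
# well-formed XML, so this is only a stand-in for illustration.
PAGE = """
<html>
  <body>
    <div class="related"><p>More top stories</p></div>
    <div class="story-body">
      <h1>Airline faces backlash</h1>
      <p>First paragraph of the story.</p>
      <p>Second paragraph of the story.</p>
    </div>
    <div class="share-tools"><p>Share this page</p></div>
  </body>
</html>
"""


def extract_story(page: str) -> dict:
    """Return the headline and body text, ignoring the extra page content."""
    root = ET.fromstring(page.strip())
    # XPath: only <p> elements directly inside the story container.
    paragraphs = root.findall(".//div[@class='story-body']/p")
    headline = root.find(".//div[@class='story-body']/h1")
    return {
        "headline": headline.text.strip() if headline is not None and headline.text else "",
        "text": " ".join(p.text.strip() for p in paragraphs if p.text),
    }


story = extract_story(PAGE)
print(story["headline"])
print(story["text"])
```

Restricting the path to the story container is what keeps the "related" and "share" blocks out of the collected text.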

Using data science to analyze the data

Some of the data had to be wrangled manually to make it ready for analysis. The scraped files were then read into R and bound together into one dataset consisting of 152 observations of six variables.
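The binding step can be sketched as follows, in Python rather than the R used for the analysis. The article does not name the six variables, so the column names in FIELDS, the sample rows, and the helper functions are assumptions made purely for illustration.

```python
import csv
import io

# Hypothetical column names; the article only says there were six variables.
FIELDS = ["outlet", "date", "headline", "author", "url", "text"]

# Stand-ins for two scraped files with hypothetical contents. The second
# has a different column order and lacks an author column.
FILE_A = (
    "outlet,date,headline,author,url,text\n"
    "Outlet A,2017-04-10,Headline one,Reporter one,https://example.com/1,Story text one\n"
    "Outlet A,2017-04-11,Headline two,Reporter two,https://example.com/2,Story text two\n"
)
FILE_B = (
    "date,outlet,headline,url,text\n"
    "2017-04-12,Outlet B, Headline three ,https://example.com/3,Story text three\n"
)


def read_rows(source):
    """Read one scraped file and normalize every row to the same six fields."""
    reader = csv.DictReader(source)
    return [{f: (row.get(f) or "").strip() for f in FIELDS} for row in reader]


def bind_rows(*sources):
    """Stack the rows of all files into a single list of observations."""
    combined = []
    for source in sources:
        combined.extend(read_rows(source))
    return combined


dataset = bind_rows(io.StringIO(FILE_A), io.StringIO(FILE_B))
print(len(dataset), "observations of", len(FIELDS), "variables")
```

Normalizing each row to one fixed set of fields before stacking is a simple way to handle scrapers that return slightly different columns per site.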

We were specifically interested in the sentiments expressed in the news stories. This involved parsing the text for emotion terms and visualizing the results. We used the Tidy package in R to analyze the emotion words in the text. This package implements Saif Mohammad’s NRC Emotion lexicon, which associates words with eight basic emotions: anger, fear, anticipation, trust, surprise, sadness, joy, and disgust.

We began with a simple evaluation of the ratio of emotion terms to total word count by news outlet. The analysis included word tokenization (i.e., separating full texts into single words, or unigrams), the exclusion of numbers and common stop words (e.g., “the,” “and,” “but”), and the use of the get_sentiment function to match the unigrams to emotion terms.
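The three steps described above (tokenizing, dropping numbers and stop words, and matching unigrams against an emotion lexicon) can be sketched in Python. This is not the R code used for the analysis. STOP_WORDS and TOY_LEXICON are tiny made-up stand-ins, not the real stop-word list or the NRC lexicon, and the sketch assumes the share is computed over the words that remain after stop-word removal.

```python
import re
from collections import Counter

# Toy stand-ins, NOT the real NRC lexicon or a real stop-word list.
STOP_WORDS = {"the", "and", "but", "a", "was", "from", "his", "of", "to"}
TOY_LEXICON = {
    "dragged": {"anger", "fear"},
    "outrage": {"anger", "disgust"},
    "apologized": {"trust", "sadness"},
    "doctor": {"trust"},
}

SAMPLE = (
    "The doctor was dragged from his seat and the airline "
    "apologized after 2 days of outrage."
)


def tokenize(text):
    """Lowercase and split into alphabetic unigrams, which also drops numbers."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def emotion_counts(text):
    """Return (count of terms per emotion, share of tokens that are emotion terms)."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter()
    matched = 0
    for token in tokens:
        emotions = TOY_LEXICON.get(token)
        if emotions:
            matched += 1
            counts.update(emotions)  # one word may signal several emotions
    share = matched / len(tokens) if tokens else 0.0
    return counts, share


counts, share = emotion_counts(SAMPLE)
print(dict(counts), f"{share:.0%}")
```

Because a single word can map to more than one emotion, the per-emotion counts can add up to more than the number of matched words.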


This analysis produced the above visualization, which shows that the share of emotion terms varies among the news outlets, with most falling between 20% and 30%. The median percentage of emotion terms was 25.9%.
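As a rough illustration of the arithmetic behind these percentages, the following Python sketch aggregates emotion-term counts and word counts by outlet and then takes the median across outlets. The outlet names and all numbers are invented, and the article does not say whether its median was taken across outlets or across stories, so that choice is an assumption.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (outlet, emotion-term count, total word count) per story.
STORIES = [
    ("Outlet A", 120, 400),
    ("Outlet A", 90, 450),
    ("Outlet B", 50, 250),
    ("Outlet C", 66, 220),
]


def emotion_share_by_outlet(stories):
    """Percentage of words that are emotion terms, pooled per outlet."""
    emotion_total = defaultdict(int)
    word_total = defaultdict(int)
    for outlet, emotion_terms, words in stories:
        emotion_total[outlet] += emotion_terms
        word_total[outlet] += words
    return {o: 100 * emotion_total[o] / word_total[o] for o in word_total}


shares = emotion_share_by_outlet(STORIES)
median_share = median(shares.values())
for outlet, pct in shares.items():
    print(f"{outlet}: {pct:.1f}%")
print(f"median: {median_share:.1f}%")
```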

Average number of words by emotion

We then calculated the average number of words in the corpus for each emotion. Trust had the highest average number of terms and joy the lowest. From this visualization, we can see that trust terms are used far more often than terms for any other emotion. Indeed, trust appears to hold roughly a 2-to-1 ratio over the next most used emotion.
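A per-emotion average of this kind can be sketched in Python as below. The per-story counts are invented for illustration and do not reproduce the article's figures, and the sketch assumes that a story with no terms for an emotion contributes zero to that emotion's mean.

```python
from collections import Counter

EMOTIONS = [
    "anger", "fear", "anticipation", "trust",
    "surprise", "sadness", "joy", "disgust",
]

# Hypothetical per-story emotion-term counts.
STORY_COUNTS = [
    Counter(trust=12, fear=6, anger=5, joy=1),
    Counter(trust=8, fear=4, sadness=3),
    Counter(trust=10, anger=4, anticipation=5, joy=2),
]


def average_by_emotion(story_counts):
    """Mean number of terms per story for each of the eight emotions."""
    totals = Counter()
    for counts in story_counts:
        totals.update(counts)
    n = len(story_counts)
    # Missing emotions count as zero, so every mean uses the same denominator.
    return {emotion: totals[emotion] / n for emotion in EMOTIONS}


averages = average_by_emotion(STORY_COUNTS)
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
for emotion, mean in ranked:
    print(f"{emotion}: {mean:.2f}")
```

Sorting the averages makes it easy to compare the leading emotion with the runner-up, which is the comparison the paragraph above draws.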

Emotion word usage over time

We then calculated the use of emotion terms over time. As we can see from the visualization, the usage of specific emotion terms varies considerably over time. However, in keeping with the previous visualization, we note that trust terms tend to be more highly represented in news coverage, and while it’s somewhat difficult to discern, joy and surprise appear much less frequently.
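Tracking emotion terms over time amounts to grouping the per-story counts by publication date. The Python sketch below shows one way to do that. The dates and counts are hypothetical, and daily granularity is an assumption, since the article does not say how the time axis was binned.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical (publication date, emotion-term counts) pairs, one per story.
RECORDS = [
    (date(2017, 4, 11), Counter(trust=7, fear=3)),
    (date(2017, 4, 10), Counter(trust=9, anger=6)),
    (date(2017, 4, 10), Counter(trust=4, surprise=1)),
]


def daily_emotion_counts(records):
    """Sum emotion-term counts per day and return the days in date order."""
    by_day = defaultdict(Counter)
    for day, counts in records:
        by_day[day].update(counts)
    return dict(sorted(by_day.items()))


series = daily_emotion_counts(RECORDS)
for day, counts in series.items():
    print(day.isoformat(), dict(counts))
```

Each day's Counter can then be plotted as one point per emotion to produce a time series like the one described.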

Two things are notable about this analysis. First, trust-related terms appear far more frequently than terms for any other emotion. This suggests that the event has had a deleterious impact on the trust people have in United. Second, surprise-related terms rank at or near the bottom. This would indicate that, despite the shocking nature of the viral video, people weren’t necessarily surprised that an airline would conduct itself in such a manner.

Understanding these two aspects of the news coverage, then, should point the way toward an appropriate response. First, representatives of the company must say and do things that help to rebuild trust. And second, through its actions and words, it must demonstrate that such an event is not the norm and will never be tolerated again.

Data science techniques, such as sentiment analysis of news stories or other textual data, can be used in a variety of circumstances, from crisis communications research to investor relations research.
