Text mining the presidential debate

This election season has been crazy, and also special for people like me who got to vote for the first time. As a naturalized citizen since late last year, I was very excited to participate in the election and have a voice in choosing the leader of our country. I didn’t start paying attention to the election until January of this year. Once I did, it was very clear to me that Bernie Sanders would be my candidate. As a person who lived in a foreign country for most of my life, I found that most of what he said sounded like common-sense ideas rather than the liberal, left-wing, pie-in-the-sky proposals the mainstream media portrayed his policies to be. After listening to Bernie’s speeches, I went from a person apathetic toward politics to someone who spent hours volunteering for his campaign (despite my busy class schedule and research) and contributed hundreds of dollars to it (I had never made a political contribution before in my life and will likely not do so in the future unless Bernie asks me to). But despite all that, we all know what happened, and I went back to being apathetic toward politics.

My interest picked up a little when I saw everyone talking about the first presidential debate, so I decided to watch it anyway. While watching it, I had the same feeling you get when you realize something is too rotten to be eaten, but you eat it anyway because there is nothing else to eat.

After a few days I completely forgot about the debate, although it remained a major revenue generator for the mainstream media for the next few weeks. Then, just last week, I saw one of the data analysis blogs post the transcript of the first presidential debate, so I got a little curious and started analyzing it.

The main methods I used are text mining and sentiment analysis, which can give you a lot of information about textual data. However, Natural Language Processing (NLP) offers more in-depth techniques that could analyze this data even more extensively.

I separated the transcript into the parts spoken by Clinton and by Trump. First of all, I looked at how often each candidate was interrupted, either by the other candidate or by the moderator. Using a regular expression, I counted every dialogue turn of a candidate that ends with an ellipsis (…) as an interruption. Not surprisingly, Clinton was interrupted by Trump or the moderator about 40 times, while Trump was interrupted by Clinton or the moderator 18 times. Anyone who saw the first debate, sprinkled with “Wrong”, can attest to this observation.
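The ellipsis count described above can be sketched roughly as follows. This is a Python illustration rather than the R code used for the post, and it assumes a transcript with one "SPEAKER: text" turn per line; that format, the sample turns, and the function name are all made up for the example.

```python
import re

# Hypothetical transcript format: one "SPEAKER: text" turn per line.
# These sample turns are invented for illustration, not quoted from the debate.
SAMPLE_TURNS = [
    "CLINTON: Well, I think that...",
    "TRUMP: Wrong.",
    "CLINTON: Let me finish what I was…",
    "HOLT: We need to move on.",
    "TRUMP: I will tell you this...",
    "CLINTON: That is just not accurate.",
]

# A turn counts as interrupted if it ends with "..." or the single character "…".
TRAILING_ELLIPSIS = re.compile(r"(?:\.{3}|…)\s*$")
TURN = re.compile(r"^(?P<speaker>[A-Z]+):\s*(?P<text>.*)$")


def count_interruptions(turns):
    """Return {speaker: number of that speaker's turns ending with an ellipsis}."""
    counts = {}
    for line in turns:
        match = TURN.match(line)
        if match is None:
            continue  # skip lines that are not in "SPEAKER: text" form
        speaker = match.group("speaker")
        counts.setdefault(speaker, 0)
        if TRAILING_ELLIPSIS.search(match.group("text")):
            counts[speaker] += 1
    return counts


interruptions = count_interruptions(SAMPLE_TURNS)
print(interruptions)
```

Note that a trailing ellipsis is only a proxy: it also catches a speaker trailing off on their own, so the counts are best read as approximate.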

Then I went on to do an exploratory analysis of the words used by each candidate via word clouds. A word cloud is an easy way of visualizing textual information, where the size of each word indicates its frequency in the text. Trump’s catchphrase “tremendous” and Clinton’s buzzword “jobs” both carried large weight in the word clouds. The figure below shows the word clouds for both candidates; for clarity, I included only words that occurred at least five times.
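The counting behind a word cloud is simple enough to sketch. The snippet below is a Python illustration, not the R code used for the post; the tiny stop-word list, the sample text, and the function name are assumptions made for the example. It tallies word frequencies and keeps only words seen at least five times, which is the filter described above.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list for illustration; real analyses use a much longer one.
STOP_WORDS = {"the", "a", "and", "to", "of", "we", "our", "will", "have", "is", "in"}


def word_frequencies(text, min_count=5):
    """Count lowercase words, drop stop words, keep those seen at least min_count times."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return {word: n for word, n in counts.items() if n >= min_count}


# Invented sample text, not a quote from the transcript.
sample = "We will bring jobs back. " * 5 + "Jobs and trade. " + "The economy is strong. " * 2
frequencies = word_frequencies(sample, min_count=5)
print(frequencies)
```

A word cloud library then only has to map each count to a font size.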


From the text of the debate, I was also curious to see which states and countries the candidates mentioned. Although this is just an exploratory analysis, it can give some insight into which states each candidate is focusing on and which foreign policy issues they are interested in. The states mentioned are neither deep blue nor ruby red; rather, all the states mentioned by either candidate are swing states. Trump mentioned more of these states than Clinton did: according to my analysis, Clinton mentioned only Minnesota, while Trump mentioned a few more. The two US maps show the states mentioned according to their frequency.
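Counting place mentions comes down to matching a list of names against each candidate's text. Here is a minimal Python sketch (again, not the R code used for the post); the short list of names, the sample sentence, and the function name are assumptions made for the example. The same approach works for country names.

```python
import re
from collections import Counter

# Short illustrative list; a full analysis would use all 50 state names (or a country list).
PLACES = ["Ohio", "Michigan", "Pennsylvania", "North Carolina", "Florida"]


def count_place_mentions(text, places=PLACES):
    """Count whole-phrase, case-insensitive mentions of each place name in text."""
    counts = Counter()
    for place in places:
        pattern = r"\b" + re.escape(place) + r"\b"
        hits = re.findall(pattern, text, flags=re.IGNORECASE)
        if hits:
            counts[place] = len(hits)
    return counts


# Invented sample sentence, not a quote from the transcript.
sample = "Look at Ohio and Michigan. Jobs are leaving Ohio, and North Carolina is next."
mentions = count_place_mentions(sample)
print(mentions.most_common())
```

Plain name matching has blind spots: "Georgia" could be the state or the country, and "Washington" usually means the capital, so ambiguous names need a manual check.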


In terms of foreign countries, Trump mentioned China and Russia 9 times each (Trump’s way of saying China, with an emphasis on “Chi”, is even mocked in Alec Baldwin’s subpar impersonation on SNL). Trump named more countries than Clinton; however, Clinton mentioned Syria and Afghanistan more often than Trump did. Trump also said Mexico six times during the debate. Essentially, Middle Eastern countries and global powers like Russia and China took the front seat compared to the two hundred other countries around the world.


Finally, I did a sentiment analysis of the text spoken by each candidate. Sentiment analysis is a subfield of computational linguistics / natural language processing that tries to understand the emotions expressed in a text based on the connotations of its words. The method I used is based on an emotion lexicon compiled at the National Research Council (NRC) of Canada. It uses the lexicon to score the text on eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise and trust) and two overall sentiments (negative and positive). The bar charts below show the scores for each candidate. The ratio of negative to positive sentiment was much higher for Trump than for Clinton. However, the fear and disgust scores are very similar for both candidates. It is not surprising that Trump scored that high, but for anyone who closely watched the Democratic primary it is also not that surprising that Clinton scored high on fear and disgust. During the primaries, the Clinton campaign’s main platform was scaremongering against Bernie’s campaign, and they are essentially using the same tactic to win the presidential election: spreading alarm about Trump’s America rather than saying what they would do to make the country better.
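To make the lexicon idea concrete, here is a minimal Python sketch of lexicon-based scoring over the ten categories listed above. It is not the R code used for the post, and the four-word lexicon is invented for the example; the real NRC lexicon tags thousands of words. Each word occurrence adds one point to every category it is tagged with.

```python
import re
from collections import Counter

# Toy lexicon invented for this example. It is NOT the NRC lexicon; the real one
# maps thousands of words to the categories below.
TOY_LEXICON = {
    "disaster": {"fear", "sadness", "negative"},
    "threat": {"fear", "anger", "negative"},
    "hope": {"anticipation", "joy", "trust", "positive"},
    "strong": {"trust", "positive"},
}

CATEGORIES = ["anger", "anticipation", "disgust", "fear", "joy",
              "sadness", "surprise", "trust", "negative", "positive"]


def emotion_scores(text, lexicon=TOY_LEXICON):
    """Add 1 to a category for every word occurrence the lexicon tags with it."""
    scores = Counter({category: 0 for category in CATEGORIES})
    for word in re.findall(r"[a-z']+", text.lower()):
        for category in lexicon.get(word, ()):
            scores[category] += 1
    return dict(scores)


# Invented sample sentence, not a quote from the transcript.
scores = emotion_scores("It is a disaster, a real threat, but we have hope.")
print(scores)
```

A negative-to-positive ratio like the one compared above would then be `scores["negative"] / scores["positive"]`, provided the positive count is not zero. Because the method looks at words in isolation, it cannot see negation or sarcasm.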

Looking at the presidential debate transcript gave me more insight into how many tools are out there for analyzing text data and how quickly they are growing these days. Sadly, it also brought home the reality that neither candidate is interested in improving the country, only in slinging mud at the other for political gain. One conclusion was clear: both major-party candidates will be a disaster for the future of this country!

P.S.: I did all the analysis in R. These are the packages I used for text mining and sentiment analysis: tm, wordcloud, and syuzhet. These are the packages I used for plots and rendering the maps: ggplot2, ggmap, maps, rworldmap, and countrycode. In addition to those, I also used stringr, dplyr, plyr and rvest for general-purpose data manipulation.







