Press "Enter" to skip to content

Bigram Analysis of Democratic Debates

This tutorial will mainly focus on ggplot and bigrams, but it does gloss over clustering for a heatmap.

Bigram Heatmap

This project started a while back, tweeting the plots at the beginning of this month. Life happens I suppose. Bought a new bike, had a birthday, yaddayadda. Better late then never?

I want to preface this with the disclaimer that a phrase repeated isn’t inherently good or bad. Emphasis through repetition is sometimes needed to drive a point home.

Required Libraries

First step, as always, is to include the libraries we will be using.

Define Candidates

This appends the candidates from the second debate to the candidates of the first debate.

Reading the Transcripts

I ended up grabbing the transcripts from nbcnews.com and saving as a CSV file after some regex cleaning. I couldn’t find both days on their site anymore, so I am linking to the CNN transcripts

I also kept the transcripts separate just in case I needed to refer to Night One and Night Two separately down the road. Once I read them in I noticed some white space and that I still needed to remove the ‘:’ character.

Now that we have Transcript A and B ready in and using the same column names, we can bind them.

If we wanted to keep our workspace clean, this would be an excellent opportunity to save the transcripts in a list (transcript$A and transcript$B). Using that method would allow you to use transcript$Full <- transcript %>% bind_rows which looks cleaner.

Working with Bigrams

This part might get a little overwhelming, but essentially this chunk of code will

  1. Loop through each individual candidate
    1. Subset the transcript by current candidate
    2. Loop through Dialog of current subset
      1. Return bigrams
    3. Generate frequency table of returned bigrams
    4. Add column for current candidate

The reason we are nesting an lapply instead of collapsing is to prevent the end of a sentence to be used with the beginning of a new sentence (ex: “He fell in. The boy cried” shouldn’t include the bigram “IN_THE”). While generating n-grams on each dialog separately won’t prevent this, it will reduce occurrences.

If you want to further improve upon this code, you could split the dialog by punctuation marks c('?', '!', '.', ';').

If you want to give the results a test, you can use

Pretty cool, right?

Bigrams, Extended

Now that we have everything all nice and segmented, we will be merging everything into one big table bigram_table to plot.

Plotting Originality

Bigram Originality

Cool!

Heatmap

Clustering Bigrams

This next part is going to be a lot of piping, and I am sure someone has a much better way of doing things.

First we going to overwrite bigrams table with a fresh bind_rows call on the bigrams list.

I did the part above a little different than the first time. There are a handful of ways to rename columns of a data frame. Using select is a very nice alternative.

For the next part we will want to figure out what bigrams to use. I am selecting the top 40 used the most cumulatively among all candidates. We only need a vector of the actual grams.

To give that a test we can use

Looks like what we are looking for; let’s move on.

Time to filter bigram_table and convert to a matrix.

Uff-da. Now that all of that is over, we can plot cluster_matrix.

Plotting the Heatmap

Lots of ways to style the heatmap, but I am going with a viridis heatmap and including those frequency within the cells. Sometimes you also want numbers.

Bigram Heatmap

Be First to Comment

    Leave a Reply

    Your email address will not be published. Required fields are marked *