The Name of The Wind, by Patrick Rothfuss, is the first book in The Kingkiller Chronicle trilogy. Since it came out 13 years ago, it has been my favourite book. I recently finished the online book Text Mining with R, by Julia Silge and David Robinson, so I wanted to do some analysis of this wonderful text and put into practice some of the new concepts I picked up, along with the tidyverse.
First I’m going to do some cleaning on the text. Note that I’m going to filter out stop_words
(from the tidytext package) and also remove some custom words that are not significant for my analysis. A neat trick from the book: you can combine a chapter regex with cumsum to number the chapters.
library(tidytext)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, fig.align="center")
knitr::opts_chunk$set(fig.width=12, fig.height=8)
notw <- as_tibble(read_delim("C:/Users/omarl/OneDrive/Escritorio/R/notw.txt",
delim = "\n", col_names = c("text")))
custom_stop_words <- bind_rows(stop_words,
tibble(word = c("id", "im", "youre"),
lexicon = c("custom", "custom", "custom")))
notw_processed <- notw %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter",
ignore_case = TRUE)))) %>%
unnest_tokens(word, text) %>%
mutate(word = str_remove(word, "'")) %>%
filter(!word %in% custom_stop_words$word)
Now we can see the most common words in the book. fct_reorder
is a must-have for this kind of plot.
notw_processed %>%
count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = as_factor(word)) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() +
labs(x = "", title = "Most common words in The Name of The Wind",
subtitle = "Stop words and numbers are excluded")
I have read this book between 7 and 9 times, so I know exactly why “looked”, “hand”, and “eyes” are there: Kvothe, the main character, is in love with Denna, and there are constant references to her.
Let’s look at the most common words in the first 12 chapters. There is also a really handy function for geom_col
with facet_wrap
, called reorder_within
, so each facet of the plot comes out sorted.
notw_processed %>%
filter(chapter < 13) %>%
count(chapter, word) %>%
group_by(chapter) %>%
top_n(10, n) %>%
slice(1:10) %>%
ungroup() %>%
mutate(word = as_factor(word)) %>%
mutate(word = reorder_within(word, n, chapter)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() +
facet_wrap(~chapter, scale = "free") + scale_x_reordered()
We can divide the chapters into those where Kvothe is telling his past history to the Chronicler, and those where Kvothe is speaking in the frame story, which we would call the “present”.
The problem with analyzing single words is that negations and modifiers can completely change their meaning (e.g. “I don’t like”). So I’m going to use the ngrams
value for token
in unnest_tokens
, so I can get groups of two words and check the most common bigrams.
library(ggraph)
library(igraph)
library(varhandle)
notw_bigrams <- notw %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% custom_stop_words$word,
!word2 %in% custom_stop_words$word,
!varhandle::check.numeric(word1),
!varhandle::check.numeric(word2))
bigrams_count <- notw_bigrams %>%
count(word1,word2, sort = TRUE)
bigrams_count %>%
unite(word, word1, word2, sep = " ") %>%
top_n(20) %>%
mutate(word = as_factor(word)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() +
labs(title = "Most common bigrams in The Name of The Wind", x = "",
subtitle = "Stop words and numbers are excluded")
There are some names, like elxa dal
, but also some cool things like deep breath
. If you have read the book, you know why it’s there, as with blue fire
and dark hair
.
So, let’s plot the bigrams as a graph.
bigram_graph <- bigrams_count %>%
head(80) %>%
graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void() + labs(title = "Graph of the most common bigrams in The Name of The Wind",
subtitle = "Numbers and stop words are omitted")
One of the coolest things I learned is that tidytext can do sentiment analysis in a really simple way.
notw_processed %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter) %>%
summarize(sentiment = sum(value)) %>%
ggplot(aes(x = chapter, y = sentiment)) + geom_point(aes(color = sentiment > 0,
size = abs(sentiment))) +
geom_abline(slope = 0, intercept = 0, color = "black") + geom_line(color = "black",
alpha = 0.6) +
theme(legend.position = "None") +
labs(title = "Sentiment progression by Chapter in The Name of The Wind",
subtitle = "AFINN sentiment dataset used") +
scale_x_continuous(breaks = seq(0,100,10)) +
scale_y_continuous(breaks = seq(-250,100, 30))
We have to remember that these are single words, not bigrams. One thing we could do to get a more accurate analysis is take the bigrams for each line, check whether the first word is a negation, and if so multiply the second word’s sentiment value by -1.
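A minimal sketch of that idea follows. Note that negations like “not” are themselves stop words, so we have to re-tokenize the bigrams without the stop-word filter; the list of negation words here is my own assumption, not something from the book.

```r
library(tidyverse)
library(tidytext)

# Hypothetical negation list -- extend as needed
negations <- c("not", "no", "never", "don't", "doesn't", "won't", "can't")

notw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # score the second word of each bigram with AFINN
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  # flip the sign when the word is preceded by a negation
  mutate(value = if_else(word1 %in% negations, -value, value)) %>%
  summarize(sentiment = sum(value))
```

This only handles negations one word back; a window of two or three words would catch phrases like “not very happy”, at the cost of more bookkeeping.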
So far so good! The book is a drama. Let’s check index = 25
notw_processed %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
filter(index == 25) %>%
arrange(value)
## Joining, by = "word"
## # A tibble: 240 x 5
## # Groups: index [1]
## linenumber chapter word value index
## <int> <int> <chr> <dbl> <dbl>
## 1 2000 26 terrible -3 25
## 2 2000 26 kill -3 25
## 3 2001 26 despair -3 25
## 4 2002 26 terrible -3 25
## 5 2002 26 terrible -3 25
## 6 2003 26 died -3 25
## 7 2004 26 terrible -3 25
## 8 2004 26 killed -3 25
## 9 2005 26 despair -3 25
## 10 2005 26 dead -3 25
## # ... with 230 more rows
So, in total, we have a sentiment value of…
notw_processed %>%
inner_join(get_sentiments("afinn")) %>%
summarize(value = sum(value))
## # A tibble: 1 x 1
## value
## <dbl>
## 1 -3319
Using widyr
, we can also count pairs of words that appear in the same chapter and plot them as a graph.
library(widyr)
notw_processed_pairs <- notw_processed %>%
filter(word != "chapter") %>%
pairwise_count(word, chapter, sort = TRUE, upper = FALSE)
set.seed(1234)
notw_processed_pairs %>%
filter(n >=72) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
tidytext also has a bind_tf_idf
function, so we can check the most relevant words in each chapter.
desc_tf_idf <- notw_processed %>%
count(chapter, word, sort = TRUE) %>%
bind_tf_idf(word, chapter, n) %>%
group_by(chapter) %>%
top_n(10, tf_idf) %>%
slice(1:3) %>%
ungroup() %>%
arrange(chapter, desc(tf_idf))
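To actually look at the result, the same reorder_within trick from before works here too. A sketch for the first few chapters (the chapter range is arbitrary):

```r
library(tidyverse)
library(tidytext)

desc_tf_idf %>%
  filter(chapter >= 1, chapter <= 6) %>%
  # sort the bars independently within each facet
  mutate(word = reorder_within(word, tf_idf, chapter)) %>%
  ggplot(aes(x = word, y = tf_idf)) + geom_col() + coord_flip() +
  facet_wrap(~chapter, scales = "free") + scale_x_reordered() +
  labs(x = "", y = "tf-idf", title = "Highest tf-idf words per chapter")
```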
Finally, I’m going to do some LDA topic modeling. Beta is the probability of a word being generated by a given topic.
word_counts <- notw_processed %>%
count(chapter, word, sort = TRUE)
desc_dtm <- word_counts %>%
cast_dtm(chapter, word, n)
library(topicmodels)
desc_lda <- LDA(desc_dtm, k = 4, control = list(seed = 1234))
tidy_lda <- tidy(desc_lda)
tidy_lda
## # A tibble: 47,096 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 tehlu 2.96e- 4
## 2 2 tehlu 4.01e- 4
## 3 3 tehlu 2.35e- 4
## 4 4 tehlu 5.28e- 3
## 5 1 kvothe 1.04e- 2
## 6 2 kvothe 3.28e- 4
## 7 3 kvothe 2.91e- 3
## 8 4 kvothe 4.05e- 4
## 9 1 kote 7.48e- 3
## 10 2 kote 1.09e-100
## # ... with 47,086 more rows
top_terms <- tidy_lda %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_x_reordered() +
labs(title = "Top 10 terms in each LDA topic",
x = NULL, y = expression(beta)) +
facet_wrap(~ topic, ncol = 4, scales = "free") +
theme(axis.text.x = element_text(angle=45, hjust=1))
Remember that k
is a hyperparameter, and in this case I found that k = 4 gave the best results. We can see four clearly distinct topics. The first one is the “present” of Kvothe, the second one is when he’s talking about Denna, the third one is when he’s studying at the University, and the last one is from his childhood. Really impressive.
I can’t recommend those two books enough. Thanks for reading!