The Name of The Wind, by Patrick Rothfuss, is the first book in The Kingkiller Chronicle trilogy. Since it came out 13 years ago, it has been my favourite book. I recently finished the online book Text Mining with R, by Julia Silge and David Robinson, so I wanted to do some analysis of this wonderful text and put into practice some of the new concepts I picked up, along with the tidyverse.
First I’m going to do some cleaning on the text. Note that I’m going to filter out stop_words
(from the tidytext package) and also remove some custom words that are not significant for my analysis. A neat trick from the book: you can combine a chapter regex with cumsum to number the chapters.
library(tidytext)
library(tidyverse)
knitr::opts_chunk$set(echo = TRUE, fig.align="center")
knitr::opts_chunk$set(fig.width=12, fig.height=8)
notw <- as_tibble(read_delim("C:/Users/omarl/OneDrive/Escritorio/R/notw.txt",
delim = "\n", col_names = c("text")))
custom_stop_words <- bind_rows(stop_words,
tibble(word = c("id", "im", "youre"),
lexicon = c("custom", "custom", "custom")))
notw_processed <- notw %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter",
ignore_case = TRUE)))) %>%
unnest_tokens(word, text) %>%
mutate(word = str_remove(word, "'")) %>%
filter(!word %in% custom_stop_words$word)
Now we can see the most common words in the book. fct_reorder
is a must-have for this kind of plot.
notw_processed %>%
count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = as_factor(word)) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() +
labs(x = "", title = "Most common words in The Name of The Wind",
subtitle = "Stop words and numbers are excluded")
I have read this book between 7 and 9 times, so I know exactly why “looked”, “hand”, and “eyes” are there: Kvothe, the main character, is in love with Denna, and there are constant references to her.
Let’s look at the most common words in the first 12 chapters. There is also a really handy function for geom_col
with facet_wrap
, called reorder_within
, so each facet of the plot comes out sorted.
notw_processed %>%
filter(chapter < 13) %>%
count(chapter, word) %>%
group_by(chapter) %>%
top_n(10, n) %>%
slice(1:10) %>%
ungroup() %>%
mutate(word = as_factor(word)) %>%
mutate(word = reorder_within(word, n, chapter)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() +
facet_wrap(~chapter, scale = "free") + scale_x_reordered()
We can divide the chapters into those where Kvothe is telling his past history to the Chronicler, and those where Kvothe is speaking in the frame story, which we would call the “present”.
The problem with analyzing single words is that negations and modifiers can completely change their meaning (e.g. “I don’t like”). So I’m going to use the ngrams
value for token
in unnest_tokens
, so I can get groups of two words and check the most common bigrams.
library(ggraph)
library(igraph)
library(varhandle)
notw_bigrams <- notw %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% custom_stop_words$word,
!word2 %in% custom_stop_words$word,
!varhandle::check.numeric(word1),
!varhandle::check.numeric(word2))
bigrams_count <- notw_bigrams %>%
count(word1,word2, sort = TRUE)
bigrams_count %>%
unite(word, word1, word2, sep = " ") %>%
top_n(20) %>%
mutate(word = as_factor(word)) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + coord_flip() +
labs(title = "Most common bigrams in The Name of The Wind", x = "",
subtitle = "Stop words and numbers are excluded")
There are some names, like elxa dal
, but also some cool things like deep breath
. If you have read the book, you know why it’s there, as with blue fire
and dark hair
.
So, let’s plot the bigrams as a graph.
bigram_graph <- bigrams_count %>%
head(80) %>%
graph_from_data_frame()
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void() + labs(title = "Graph of the most common bigrams in The Name of The Wind",
subtitle = "Numbers and stop words are omitted")
One of the coolest things I learned is that tidytext can do sentiment analysis in a really simple way.
notw_processed %>%
inner_join(get_sentiments("afinn")) %>%
group_by(chapter) %>%
summarize(sentiment = sum(value)) %>%
ggplot(aes(x = chapter, y = sentiment)) + geom_point(aes(color = sentiment > 0,
size = abs(sentiment))) +
geom_abline(slope = 0, intercept = 0, color = "black") + geom_line(color = "black",
alpha = 0.6) +
theme(legend.position = "None") +
labs(title = "Sentiment progression by Chapter in The Name of The Wind",
subtitle = "AFINN sentiment dataset used") +
scale_x_continuous(breaks = seq(0,100,10)) +
scale_y_continuous(breaks = seq(-250,100, 30))
We have to remember that these are single words, not bigrams. One thing we could do to get a more accurate analysis is take the bigrams for each line, check whether the first word is a negation, and if so multiply the second word’s sentiment value by -1.
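A minimal sketch of that idea follows. Note that negations like “not” are themselves stop words, so we have to re-tokenize the bigrams without the stop-word filter; the list of negation words here is my own assumption, not something from the book.

```r
library(tidyverse)
library(tidytext)

# Hypothetical negation list -- extend as needed
negations <- c("not", "no", "never", "don't", "doesn't", "won't", "can't")

notw %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  # score the second word of each bigram with AFINN
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>%
  # flip the sign when the word is preceded by a negation
  mutate(value = if_else(word1 %in% negations, -value, value)) %>%
  summarize(sentiment = sum(value))
```

This only handles negations one word back; a window of two or three words would catch phrases like “not very happy”, at the cost of more bookkeeping.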
So far so good! The book is a drama. Let’s check index = 25
notw_processed %>%
inner_join(get_sentiments("afinn")) %>%
group_by(index = linenumber %/% 80) %>%
filter(index == 25) %>%
arrange(value)
## Joining, by = "word"
## # A tibble: 240 x 5
## # Groups: index [1]
## linenumber chapter word value index
## <int> <int> <chr> <dbl> <dbl>
## 1 2000 26 terrible -3 25
## 2 2000 26 kill -3 25
## 3 2001 26 despair -3 25
## 4 2002 26 terrible -3 25
## 5 2002 26 terrible -3 25
## 6 2003 26 died -3 25
## 7 2004 26 terrible -3 25
## 8 2004 26 killed -3 25
## 9 2005 26 despair -3 25
## 10 2005 26 dead -3 25
## # ... with 230 more rows
So, in total, we have a sentiment value of…
notw_processed %>%
inner_join(get_sentiments("afinn")) %>%
summarize(value = sum(value))
## # A tibble: 1 x 1
## value
## <dbl>
## 1 -3319
Using widyr
, we can also count pairs of words that appear in the same chapter and plot them as a graph.
library(widyr)
notw_processed_pairs <- notw_processed %>%
filter(word != "chapter") %>%
pairwise_count(word, chapter, sort = TRUE, upper = FALSE)
set.seed(1234)
notw_processed_pairs %>%
filter(n >=72) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "darkred") +
geom_node_point(size = 5) +
geom_node_text(aes(label = name), repel = TRUE,
point.padding = unit(0.2, "lines")) +
theme_void()
tidytext also has a bind_tf_idf
function, so we can check the most relevant words in each chapter.
desc_tf_idf <- notw_processed %>%
count(chapter, word, sort = TRUE) %>%
bind_tf_idf(word, chapter, n) %>%
group_by(chapter) %>%
top_n(10, tf_idf) %>%
slice(1:3) %>%
ungroup() %>%
arrange(chapter, desc(tf_idf))
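To actually look at the result, the same reorder_within trick from before works here too. A sketch for the first few chapters (the chapter range is arbitrary):

```r
library(tidyverse)
library(tidytext)

desc_tf_idf %>%
  filter(chapter >= 1, chapter <= 6) %>%
  # sort the bars independently within each facet
  mutate(word = reorder_within(word, tf_idf, chapter)) %>%
  ggplot(aes(x = word, y = tf_idf)) + geom_col() + coord_flip() +
  facet_wrap(~chapter, scales = "free") + scale_x_reordered() +
  labs(x = "", y = "tf-idf", title = "Highest tf-idf words per chapter")
```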
Finally, I’m going to do some LDA topic modeling. Beta is the probability of a word being generated by a given topic.
word_counts <- notw_processed %>%
count(chapter, word, sort = TRUE)
desc_dtm <- word_counts %>%
cast_dtm(chapter, word, n)
library(topicmodels)
desc_lda <- LDA(desc_dtm, k = 4, control = list(seed = 1234))
tidy_lda <- tidy(desc_lda)
tidy_lda
## # A tibble: 47,096 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 tehlu 2.96e- 4
## 2 2 tehlu 4.01e- 4
## 3 3 tehlu 2.35e- 4
## 4 4 tehlu 5.28e- 3
## 5 1 kvothe 1.04e- 2
## 6 2 kvothe 3.28e- 4
## 7 3 kvothe 2.91e- 3
## 8 4 kvothe 4.05e- 4
## 9 1 kote 7.48e- 3
## 10 2 kote 1.09e-100
## # ... with 47,086 more rows
top_terms <- tidy_lda %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
coord_flip() +
scale_x_reordered() +
labs(title = "Top 10 terms in each LDA topic",
x = NULL, y = expression(beta)) +
facet_wrap(~ topic, ncol = 4, scales = "free") +
theme(axis.text.x = element_text(angle=45, hjust=1))
Remember that k
is a hyperparameter, and in this case I found that k = 4 gave the best results. We can see four clearly distinct topics. The first one is the “present” of Kvothe, the second one is when he’s talking about Denna, the third one is when he’s studying at the University, and the last one is from his childhood. Really impressive.
I can’t recommend those two books enough. Thanks for reading!