Daniel Dykiel ISP: Analysis of a Novel Draft
Daniel Dykiel
January 25, 2019
Introduction
For my January ISP (Independent Study Project), I analyzed the draft of a novel I had written.
I wanted to use R Programming tools, such as text mining and sentiment analysis, to gain insight about my novel and writing process. For this analysis, I looked at all the individual words in the novel and disregarded their context. Since I was unfamiliar with R before this ISP, I also had to learn basic syntax and data manipulation strategies.
Setting Up
Libraries Used
library(dplyr)
library(tidytext)
library(ggplot2)
library(wordcloud)
library(textclean)
library(tidyr)
- dplyr was used to manipulate data
- ggplot2 was used for visualizations
- wordcloud was used to make word clouds
- textclean was used to clean the raw text
- tidytext was used to tokenize and format text
- tidyr was used to reshape data
Text Processing
I used a format known as “TidyText.” This meant formatting my data frame so that each word was in its own row. This format made the text easier to analyze.
# Read the draft in line by line and fix the file's encoding
nov_base_text1 <- readLines("TMC.txt")
Encoding(nov_base_text1) <- "latin1"
nov_base_text2 <- textclean::replace_non_ascii(nov_base_text1)
# One row per line of text, keeping the line number for later analysis
nov_df <- dplyr::data_frame(line = 1:1949, text = nov_base_text2)
# Split each line into one word per row (TidyText format)
nov_word <- tidytext::unnest_tokens(nov_df, output = "word", input = text, token = "words")
# Remove stop words, but keep "face" so it survives for the body-parts analysis
my_stop_words <- subset(tidytext::stop_words, word != "face")
nov_word_clean <- dplyr::anti_join(nov_word, my_stop_words, by = c("word" = "word"))
To get the novel text into TidyText format, I had to read the file in line by line, fix the text encoding, split each line into individual words, and remove "stop words" such as "and" that were unnecessary for the analysis.
Most Frequently Used Terms: Bar Graph
I made a graph of the top twenty most commonly used terms. For clarity, I organized this graph so that the most frequently used terms appeared first.
# Count word frequencies and keep the twenty most common terms
common_words <- dplyr::count(nov_word_clean, word, sort = TRUE)
top_20_all <- common_words[1:20, 1:2]
# Bar graph of the top twenty terms, flipped so the most frequent appear at the top
word_graph <-
  ggplot(data = top_20_all, aes(x = reorder(word, n), y = n)) + geom_col(fill = "tan3")
word_graph + labs(title = "Most Frequently Used Terms", x = ("Word"), y = ("Frequency")) + coord_flip()
Unsurprisingly, the bar graph showed that many of the most commonly used words were character names. This graph highlighted to what extent the “Toymaker” was a central character, as his name was by far the most commonly used. Since the story mainly takes place in a toy shop, it also makes sense that “shop” and “doll” were commonly used.
I was surprised to see that “time” and “moment” were among the top 20 most commonly used words, since I hadn’t considered the passage of time to be a major theme of the novel. However, upon reflection, the passage of time does play an important part in the arc of the main characters.
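As a quick check on this, the counts for individual words can be pulled straight from the frequency table built above (a minimal sketch):
dplyr::filter(common_words, word %in% c("time", "moment"))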
Most Frequently Used Terms: Word Clouds
I made several word clouds looking at groups of related words. This allowed me to compare closely associated words to each other, rather than comparing all of the most frequently used terms at once.
Character Names
# Note: after tokenization every row is a single word, so a multi-word entry like "clementine's mother" never matches
char_names <- dplyr::filter(common_words, word %in% c("toymaker", "marie", "clementine", "kenton", "joseph", "stephen", "addy", "rosalind", "eve", "clementine's mother", "benjamin", "gregory", "clemence"))
char_names %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = "pink3", random.order = FALSE))
This analysis gave insight into which characters were the most important. However, this method of analysis wasn't perfect. For instance, "Kenton" appears smaller than "Marie," which implies he is a less central character than she is. In reality, Kenton is the narrator, which makes him an important character.
Since he narrates using “I,” his name appears less frequently than it would if he wasn’t narrating: i.e., instead of saying “Kenton said,” the text would say “I said.”
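One rough way to gauge this effect (a sketch, not part of my original analysis) is to go back to nov_word, the word table from before stop words were removed, which still contains "i," and compare its count with "kenton":
# First-person pronouns stand in for the narrator; compare with his name
dplyr::count(dplyr::filter(nov_word, word %in% c("i", "kenton")), word, sort = TRUE)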
Body Parts
body_parts <- dplyr::filter(common_words, word %in% c("hand", "neck", "throat", "finger", "fingers", "face", "hair", "teeth", "leg", "legs", "arm", "arms", "back", "spine", "eye", "eyes", "voice", "head", "lips", "bones", "bone", "hands", "palm", "palms", "shoulder", "shoulders", "tongue", "skin"))
body_parts %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = alpha("red3", seq(0.4, 1, 0.05)), random.order = FALSE))
Initially, it surprised me to see that "eyes" was such a frequently used term. However, I realized this is partly because there are no good synonyms or other descriptors for eyes. In reality, characters' hands (including "hand," "hands," "palm," and "finger") are by far the most frequently described body parts.
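To back this up, the counts can be grouped and summed (a sketch; the hand_terms and eye_terms groupings are my own, chosen from the filter above):
hand_terms <- c("hand", "hands", "palm", "palms", "finger", "fingers")
eye_terms <- c("eye", "eyes")
body_parts %>%
  dplyr::mutate(group = dplyr::case_when(
    word %in% hand_terms ~ "hands",
    word %in% eye_terms ~ "eyes",
    TRUE ~ "other")) %>%
  dplyr::group_by(group) %>%
  dplyr::summarise(total = sum(n))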
Light
color_list_gray <- c("gray18", "gray73", "gray18", "gray73", "gray73", "gray73", "gray18", "gray73", "gray18", "gray18")
light <- dplyr::filter(common_words, word %in% c("light", "dark", "dim", "pale", "bright", "night", "day", "dawn", "dusk", "nightfall"))
light %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = color_list_gray, ordered.colors = TRUE, random.order = FALSE))
Much of the story takes place at night or at dusk/dawn, which this word cloud confirms. "Light" was used more frequently than I expected. However, rather than describing brightness, as in "a light sky," "light" usually appeared in phrases like "dim light" or referred to the glow of flickering oil lamps.
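One way to check this kind of claim without rereading the draft is to tokenize the text into bigrams (two-word sequences) and look at which word comes directly before "light." This is only a sketch; nov_bigrams is a new object, and I did not use bigrams elsewhere in this project.
nov_bigrams <- tidytext::unnest_tokens(nov_df, output = "bigram", input = text, token = "ngrams", n = 2)
nov_bigrams %>%
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  dplyr::filter(word2 == "light") %>%
  dplyr::count(word1, sort = TRUE)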
Colors
color_list <- c("red3", "black", "gray90", "gray30", "blue3", "burlywood4", "chartreuse4", "gray58", "goldenrod1", "goldenrod4", "darkorchid1", "chocolate4", "darkorange2", "yellow3")
colors <- dplyr::filter(common_words, word %in% c("black", "white", "yellow", "gold", "silver", "bronze", "orange", "red", "blue", "green", "gray", "grey", "brown", "tan", "pink", "purple", "mahogany"))
colors %>%
with(wordcloud(word, n, max.words = 100, min.freq = 1, colors = color_list, ordered.colors = TRUE, random.order = FALSE))
While writing, I imagined this novel having an emotional color palette of dark browns, reds, and oranges. This was partially, but not entirely, reflected in the color word cloud. Sometimes I had to describe specific objects, such as green leaves or black ribbon, that didn't match this palette. This word cloud also doesn't include words associated with color, such as "wood" or "sunset," which imply certain colors without directly stating them.
Sentiment Analysis
The sentiments for this section came from the “bing” lexicon, which categorizes words into a positive or negative binary. As it looks at individual words, it doesn’t take context into consideration.
Most Frequently Used Positive Terms
# Keep only words the bing lexicon marks as positive, then count them
top_positive_words <-
nov_word_clean %>%
inner_join(get_sentiments("bing"), by = "word") %>%
filter(sentiment == "positive") %>%
count(word, sort = TRUE)
graph_positive <- ggplot(data = top_positive_words[1:15, 1:2], aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity", fill = "lightskyblue")
graph_positive + labs(title = "Most Frequent Positive Words", x = ("Word"), y = ("Frequency")) + ylim(NA, 50) + coord_flip()
“Love” and “beautiful,” while marked positive by the bing lexicon, aren’t always used positively in the novel. Additionally, I’m not sure why “sharp” was marked as a positive word.
Most Frequently Used Negative Terms
# Same process for words marked negative
top_negative_words <-
nov_word_clean %>%
inner_join(get_sentiments("bing"), by = "word") %>%
filter(sentiment == "negative") %>%
count(word, sort = TRUE)
graph_negative <- ggplot(data = top_negative_words[1:15, 1:2], aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity", fill = "yellowgreen")
graph_negative + labs(title = "Most Frequent Negative Words", x = ("Word"), y = ("Frequency")) + ylim(NA, 50) + coord_flip()
The bing lexicon doesn't mark "gun" as a negative word, even though it is used negatively in the novel, while the nrc lexicon does. Missing words like this is one of the limitations of committing to a single lexicon.
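This kind of gap can be checked by looking the word up in each lexicon directly (a sketch; get_sentiments("nrc") may prompt a one-time download of the NRC lexicon):
dplyr::filter(get_sentiments("bing"), word == "gun")  # no rows: bing has no entry for "gun"
dplyr::filter(get_sentiments("nrc"), word == "gun")   # includes a "negative" entry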
Sentiment Change Over Time
# Net sentiment (positive minus negative word counts) for every 100 lines of the draft
nov_sentiment <- nov_word_clean %>%
inner_join(get_sentiments("bing"), by = "word") %>%
count(index = line %/% 100, sentiment) %>%
ungroup() %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
(graph_sentiment <- ggplot(nov_sentiment, aes(x = index, y = sentiment)) + geom_col(fill = "thistle4") +
labs(title = "Sentiment Change Over Time", x = ("Index"), y = ("Sentiment")))
This graph shows that the tone of the novel is overwhelmingly negative. Although the most commonly used positive word, "smile," appears more often than the most commonly used negative word, "cold," the novel as a whole uses more negative words than positive ones.
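The overall balance can also be tallied directly (a minimal sketch using the same bing join as above):
nov_word_clean %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)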
The dip around indices 12 and 13 roughly marks the point where one of the major characters dies, which makes sense. However, the milder section around index 4 covers a depressing memory the narrator recollects, so I expected it to be more negative than it is.
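To see why a block scores the way it does, the original lines that fall into it can be pulled back out; a sketch for index 4 (each index covers 100 lines of the draft):
dplyr::filter(nov_df, line %/% 100 == 4)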
Reflection
Issues I ran into
- Cleaning/formatting the data (there was an encoding issue that took a long time to track down)
- Changing between vectors, data frames, etc.
- Downloading packages (I didn’t have Java on my computer, so I couldn’t run the qdap package, which limited some of the types of analysis I could do)
Issues with this analysis
- The text mining strategy I used, which looked at individual words, ignored context
- I wasn't able to stem words (i.e., count words like "Clementine" and "Clementine's" as the same word) without the qdap package; one possible workaround is sketched below
Goals for future text mining
- Working with text mining strategies with tm and qdap, which also requires formatting the novel as a corpus
- Looking at words in context and making graphs of word associations (one possible starting point is sketched after this list)
- Tracking changes in sentiment or word usage over the course of the novel, looking at specific chapters rather than evenly-split sections
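For the word association goal, one possible starting point is the widyr package, which I have not used in this project; the sketch below counts how often pairs of words appear on the same line of the draft (word_pairs is an illustrative name):
# install.packages("widyr")
library(widyr)
word_pairs <- widyr::pairwise_count(nov_word_clean, item = word, feature = line, sort = TRUE)
head(word_pairs)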
Thank you for reading my ISP Report! I enjoyed getting the chance to look at my novel draft in a new way. I hope you found it interesting as well.