A Replicable Method for Gathering, Analysing and Visualising Podcast Episode and Review Data - Part 2
Rate and Review - Part 2
This post is the second in a series of four that present a workflow for gathering, analysing and visualising data about podcasts. It builds directly on Part 1, and it is recommended that you complete that post before attempting this one.
Please feel free to drop me a line if you have any questions about it, or if you would like to discuss ways we can work together.
1: Load Packages and Data
In this section there are some basic data manipulation and organisational tasks to complete, followed by an iterative process of cleaning text. To perform these tasks, we require the following R packages.
library(stringr)
library(dplyr)
library(tm)
library(wordcloud)
library(topicmodels)
library(ggplot2)
library(magrittr)
library(matrixStats)
library(tidytext)
library(tidyverse)
We also need to load the data generated in Part 1, relating to episodes and reviews.
reviews <- readRDS("reviews.rds")
episode_data <- readRDS("episode_data.rds")
In the reviews we have two variables of text from reviewers that may prove useful. Alongside the main body of the review (held in the Review variable), we also have information in the review Title variable. This separation reflects the Apple interface, which asks reviewers to complete both sections.
To create a body of text to analyse, we can paste these variables together into a new variable, review_text.
reviews$review_text <- paste(reviews$Title, reviews$Review, sep = " : ")
The process of text analysis (see below) may remove some rows of data - for example, if a review contains only one or two words, or is made up entirely of emojis. Because we will need to match the results of the NLP analysis back to the original reviews database, it is necessary to give each row a unique ID.
reviews$id <- 1:nrow(reviews)
reviews <- reviews[ , c(14, 1:13)] # move the new id column to the front
2: Create Document Term Matrix
In order to perform computational text analysis, a document term matrix (DTM) needs to be created. This is essentially a large dataframe that splits the text in a given corpus (our 16811 reviews) into individual words and counts the number of times each one occurs. Imagine a dataframe where every row is a review, and the number of columns is determined by how many unique words there are across the entire set of reviews.
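To make this concrete, here is a minimal toy example (the three ‘reviews’ are invented for illustration); our real DTM is built in exactly the same way, just at a much larger scale.
toy_reviews <- c("great show great guest",
                 "love this show",
                 "great music")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(toy_reviews)))
inspect(toy_dtm) # rows = documents, columns = unique terms, cells = counts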
To begin the process of creating this, we must first clean the text of any odd characters, stem words to their roots (so - for example - words like listen, listening, listened and listener are only counted once), convert all text to lower case (likewise, so that Listen and listen do not get counted separately), and remove extraneous elements like space between paragraphs.
story_stem <- str_replace_all(reviews$review_text, "@\\w+", "") # remove @mentions first
story_stem <- str_replace_all(story_stem, "@", "") # then any stray @ characters
story_stem <- stemDocument(story_stem) # stem words to their roots
story_stem <- removePunctuation(story_stem)
story_stem <- tolower(story_stem)
story_stem <- stripWhitespace(story_stem)
story_stem <- gsub("’", '', story_stem) # remove curly apostrophes missed by removePunctuation
story_stem <- gsub("\U0001f525\U0001f525\U0001f525 \U0001f525\U0001f525\U0001f525", '', story_stem) # remove a recurring run of fire emojis
We then add this cleaned text to the reviews database as a new variable, story_stem.
reviews$story_stem <- story_stem
Next, we can create a list of ‘stop words’ - these are commonly occurring words such as the, it, in, and so on, that will appear regularly in the text (thereby skewing analysis) but which won’t, in isolation, reveal much about the content and context of the corpus.
NB: You can add/remove words from this list and re-run this process at any time.
extendedstopwords<-c("a","amp","it'","it", "i", "i'v", "i'm","",
"\U0001f525\U0001f525\U0001f525 \U0001f525\U0001f525\U0001f525",
"httpstcokabfwymm","podcast","httpst…","https…","via","httpstcop7dy6nbt3u",
"httpstcolpyttfjf","httpstcoltblrkmwhj","about","above","across","after",
"again","against","all","almost","alone","along","already","also",
"although","always","am","among","an","and","another","any","anybody",
"anyone","anything","anywhere","are","area","areas","aren't","around",
"as","ask","asked","asking","asks","at","away","b","back","backed",
"backing","backs","be","became","because","become","becomes","been",
"before","began","behind","being","beings","below","best","better",
"between","big","both","but","by","c","came","can","cannot","can't",
"case","cases","certain","certainly","clear","clearly","come","could",
"couldn't","d","did","didn't","differ","different","differently","do",
"does","doesn't","doing","done","don't","down","downed","downing",
"downs","during","e","each","early","either","end","ended","ending",
"ends","enough","even","evenly","ever","every","everybody","everyone",
"everything","everywhere","f","face","faces","fact","facts","far",
"felt","few","find","finds","first","for","four","from","full","fully",
"further","furthered","furthering","furthers","g","gave","general",
"generally","get","gets","give","given","gives","go","going","good",
"goods","got","great","greater","greatest","group","grouped","grouping",
"groups","h","had","hadn't","has","hasn't","have","haven't","having",
"he","he'd","he'll","her","here","here's","hers","herself","he's",
"high","higher","highest","him","himself","his","how","however",
"how's","i","i'd","if","i'll","i'm","important","in","interest",
"interested","interesting","interests","into","is","isn't",
"it","its","it's","itself","i've","j","just","k","keep",
"keeps","kind","knew","know","known","knows","l","large",
"largely","last","later","latest","least","less","let",
"lets","let's","like","likely","long","longer","longest","m",
"made","make","making","man","many","may","me","member",
"members","men","might","more","most","mostly","mr","mrs","much",
"must","mustn't","my","myself","n","necessary","need","needed",
"needing","needs","never","new","newer","newest","next","no",
"nobody","non","noone","nor","not","nothing","now","nowhere",
"number","numbers","o","of","off","often","old","older",
"oldest","on","once","one","only","open","opened","opening",
"opens","or","order","ordered","ordering","orders","other",
"others","ought","our","ours","ourselves","out","over","own",
"p","part","parted","parting","parts","per","perhaps","place",
"places","point","pointed","pointing","points","possible",
"present","presented","presenting","presents","problem",
"problems","put","puts","q","quite","r","rather","really",
"right","room","rooms","s","said","same","saw","say","says",
"second","seconds","see","seem","seemed","seeming","seems",
"sees","several","shall","shan't","she","she'd","she'll",
"she's","should","shouldn't","show","showed","showing",
"shows","side","sides","since","small","smaller",
"smallest","so","some","somebody","someone","something",
"somewhere","state","states","still","such","sure","t",
"take","taken","than","that","that's","the","their",
"theirs","them","themselves","then","there","therefore",
"there's","these","they","they'd","they'll","they're",
"they've","thing","things","think","thinks","this","those",
"though","thought","thoughts","three","through","thus",
"to","today","together","too","took","toward","turn",
"turned","turning","turns","two","u","under","until",
"up","upon","us","use","used","uses","v","very","w",
"want","wanted","wanting","wants","was","wasn't","way",
"ways","we","we'd","well","we'll","wells","went","were",
"we're","weren't","we've","what","what's","when","when's",
"where","where's","whether","which","while","who","whole",
"whom","who's","whose","why","why's","will","with","within",
"without","won't","work","worked","working","works","would",
"wouldn't","x","y","year","years","yes","yet","you","you'd",
"you'll","young","younger","youngest","your","you're",
"yours","yourself","yourselves","you've","z")
## add apostrophe-stripped variants of any contractions in the list (e.g. "don't" -> "dont"),
## since punctuation removal produces these forms in the text
extendedstopwords <- c(extendedstopwords, gsub("'","",grep("'",extendedstopwords,value=T)))
Having cleaned the text and specified words to be removed as part of stopwords, we can now create the DTM.
dtm.control <- list(
  tolower = T,
  removePunctuation = T,
  removeNumbers = T,
  stopwords = c(stopwords("english"), extendedstopwords),
  stemming = T,
  wordLengths = c(3, Inf), # ignore terms shorter than three characters
  weighting = weightTf # weight by raw term frequency
)
dtm <- DocumentTermMatrix(Corpus(VectorSource(story_stem)),
                          control = dtm.control)
dim(dtm)
## [1] 16811 23789
We can see from above that we have 16811 documents in our corpus (the same number of reviews we collected) and a total of 23789 terms.
The majority of these terms will appear very infrequently, and will reveal little about the corpus as a whole. We can therefore remove these sparse terms - here, any term missing from more than 99.9% of documents - reducing the number of terms across the 16811 documents to 1411.
dtm <- removeSparseTerms(dtm,0.999)
dim(dtm) #16811 documents, 1411 terms
## [1] 16811 1411
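If you want to peek at the matrix itself, tm’s inspect function will print a small corner of it (an optional check, not a required step):
inspect(dtm[1:5, 1:10]) # first five documents, first ten terms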
We can now explore which words occur across our corpus.
dtm_matrix <- as.matrix(dtm) # convert the DTM to a plain matrix
freq <- colSums(dtm_matrix) # total occurrences of each term across all documents
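Before plotting, a quick look at the ten most frequent terms is a useful sanity check:
head(sort(freq, decreasing = TRUE), 10)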
The visualisation below shows all the words that occur more than 500 times across the entire corpus. We can see that words such as love, music, episode, etc, appear very regularly - which is what we might expect from listener reviews of podcasts.
wordfrequency <- data.frame(term=names(freq),occurrences=freq)
wordfrequency %>%
  filter(occurrences > 500) %>%
  mutate(term = reorder(term, occurrences)) %>%
  ggplot(aes(x = term, y = occurrences)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_minimal()
Visualising word frequencies in this way is useful for two reasons:
- It enables you to begin gaining a broad understanding of the corpus
- You can iterate through various frequency counts in order to weed out elements of text not removed by the text cleaning above - again, this means things such as emojis, or strange characters. The stopwords and text cleaning provided above were sufficient to clean out the corpus collected for this research, but you may need to check your own corpus. If you find characters that you want to remove, simply add them to the stopwords vector above, recreate the DTM and frequency counts, and check your visualisation again.
For the corpus used in this research, the process of cleaning out odd characters took around 30 minutes and proceeded as follows:
- Create DTM and frequency counts
- Create visualisations where counts are greater than 500, greater than 400, and so on - making your way through the frequency counts iteratively.
- Each time you find an odd/extraneous character, add it to the stopwords vector - repeat the process.
- Eventually, you will work your way down towards the bottom of the list and will have weeded out everything you do not want.
- Be aware that the further you go down the frequency counts, the more words will appear - so your increments will get ever smaller. See the visualisation below, which checks for words that appear 30 or 31 times. This can be time-consuming, but it will produce more useful results in the next section.
wordfrequency <- data.frame(term=names(freq),occurrences=freq)
wordfrequency %>%
  filter(occurrences >= 30 & occurrences <= 31) %>%
  mutate(term = reorder(term, occurrences)) %>%
  ggplot(aes(x = term, y = occurrences)) +
  geom_col(fill = "blue") +
  coord_flip() +
  theme_minimal()
Once you are satisfied that your DTM is free from extraneous elements, proceed to the next section.
3: Topic Modelling
For those unfamiliar with Topic Modelling, a concise overview is provided by Wikipedia:
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: “dog” and “bone” will appear more often in documents about dogs, “cat” and “meow” will appear in documents about cats, and “the” and “is” will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The “topics” produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.
To begin the process, we need to set some parameters.
burnin <- 4000 # Gibbs sampling iterations to discard before recording
iter <- 2000 # iterations to run after the burn-in
thin <- 500 # record every 500th iteration
seed <- list(2003,5,63,100001,765) # seeds for reproducibility, one per start
nstart <- 5 # number of independent runs
best <- TRUE # keep only the run with the highest posterior likelihood
A limitation with Topic Modelling is that you need to specify the number of topics you want to produce before beginning the process. Depending on the size of the corpus, the process can take some time and - if the results are not particularly insightful - you will need to run the process again with a new number of pre-specified topics.
The code below will run the process for you 8 times, producing results for between 3 and 10 topics. You can alter the number of results by amending the range below from 3:10 to something else.
k <- 3:10 ##change to 5:8 if you only want results for 5, 6, 7 and 8 topics
nums <- as.data.frame(k)
Once you have specified a range of topic numbers, run the code below. As with other parts of the script above, this will provide you with progress updates in your console.
for (i in 1:nrow(nums)) {
  start <- Sys.time()
  k <- nums$k[[i]]
  print(paste("Now working through", k, "Topics", sep = " "))
  ldaOut_name <- paste("ldaOut_", nums$k[[i]], sep = "")
  rowTotals <- apply(dtm, 1, sum) # find the sum of words in each document
  empty.rows <- dtm[rowTotals == 0, ]$dimnames[1][[1]] ## remove any documents left with no terms (e.g. emoji-only reviews)
  if (length(empty.rows) > 0) {
    corpus <- story_stem[-as.numeric(empty.rows)]
    dtm <- dtm[rowTotals > 0, ]
  }
  dim(dtm)
  thing <- assign(ldaOut_name, LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, seed = seed, best = best, burnin = burnin, iter = iter, thin = thin)))
  print(paste("Now writing files for", k, "Topics and Terms", sep = " "))
  topics_name <- paste(ldaOut_name, "_topics", sep = "")
  thing2 <- assign(topics_name, as.matrix(topics(thing)))
  topics_file_name <- paste("LDAGibbs", k, "Docs_to_topics.csv", sep = "")
  write.csv(thing2, topics_file_name)
  terms_name <- paste(ldaOut_name, "_terms", sep = "")
  thing3 <- assign(terms_name, as.matrix(terms(thing, 50)))
  terms_file_name <- paste("LDAGibbs", k, "TopicsToTerms.csv", sep = "")
  write.csv(thing3, terms_file_name)
  probability_name <- paste("TopicProbabilities", k, sep = "")
  thing4 <- assign(probability_name, as.data.frame(thing@gamma))
  probability_file_name <- paste(probability_name, ".csv", sep = "")
  write.csv(thing4, probability_file_name)
  end <- Sys.time()
  duration <- end - start
  print(paste("Processing for", k, "Topics completed in", duration))
}
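Before moving on, one optional, rough way to compare the fitted models is to look at their log-likelihoods (a sketch, assuming the ldaOut_3 to ldaOut_10 objects created by the loop above). A higher - that is, less negative - value loosely suggests a better statistical fit, but it should never override your own reading of the topics.
model_loglik <- sapply(mget(paste0("ldaOut_", 3:10)), logLik) # adjust 3:10 to your own range of k
model_loglik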
Once the process completes, you will find three CSV files for each value of k in your working directory:
- Documents to Topics - for each document in the corpus, the topic number (1:k) it is most closely aligned with
- Topic Probabilities - the extent to which a document in the corpus aligns with each topic
- Topics to Terms - the top 50 words associated with each topic.
It is this final document that is most useful to you at this stage. By looking at the results for each iteration of k, you can select the topic allocation that appears to make the most sense to you. Here, the results from our research process are shown for 7 Topics, with the top 10 words in each.
## X Topic.1 Topic.2 Topic.3 Topic.4 Topic.5 Topic.6 Topic.7
## 1 1 episod tiesto music interview listen love dolli
## 2 2 everi tranc song guest eric listen fan
## 3 3 week awesom artist veri realli amaz time
## 4 4 sound plea hear host enjoy thank album
## 5 5 guy life favorit time kpop alway mani
## 6 6 wait excel insight alec fun look stori
## 7 7 download mix recommend talk feel day peopl
## 8 8 world hous heard bring nam absolut job
## 9 9 brilliant mejor ani question funni podcast beauti
## 10 10 track top inspir stop becaus forward tell
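If you prefer to stay in R rather than opening the CSVs, the terms function from topicmodels returns the same information directly from a fitted model - here, the top 10 words from the 7-topic model created by the loop above:
terms(ldaOut_7, 10)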
In the case of the podcast reviews looked at in this research, the team identified 7 Topics as being the best fit. Our notes from our initial analysis of these 7 topics were as follows:
- 1: ROUTINE OF LISTENING / CONNECTION – how people describe connecting with a particular podcast. Listening regularly (every, week, wait, episode, (don’t) miss), becoming a fan of the podcast (subscribe, download), and why (brilliant, quality, star, perfect)
- 2: RELIVING EXPERIENCES / UTILITY (SERVING A PURPOSE AT A GIVEN TIME) / MEANING/CONCEPT (PART 1) – Seems skewed towards dance music with references to club culture (trance, mix, house, club, tiesto, (DJ) set), but there are also lots of superlatives here (wow, awesome, top, super, excellent)
- 3: LEARNING / KNOWLEDGE / CONCEPT/MEANING (PART 2) – Seems to relate primarily to interview format podcasts (music, song, artist) and the knowledge or insight people get from them (discover, insight, inspiring, creative)
- 4: FORMAT – Seems to relate to the conventions of interviews (interview, guest, host, talk, question, conversation), and how these are useful (inform, entertain, knowledge, engaging)
- 5: PLEASURE / SATISFACTION – Why people are engaged by podcasts; general descriptions of enjoyment (enjoy, fun, funny, happy, cool, nice)
- 6: CELEBRITIES / PERSONALITY – Emotional and other responses to listening to podcasts (love, amaze, thank, favourite, mood, energy, joy, crazy, laugh)
- 7: RELATIONSHIP / FANDOM – Mentions of fandom, names of artists here (dolly, cole), mentions of people, person, life, history, mind.
Clearly, this is just the research team’s reading of this particular set of documents (your own interpretation may differ!) but the themes nevertheless appear to be relatively distinct. These results will be tested and explored in more detail in Part 3.
Once we had settled on 7 topics as the best fit, we needed to combine the results with the original dataframe of reviews (in your own work, substitute the number of topics you have arrived at). The script below performs that task.
IMPORTANT: For your own work, amend the numbers and file names in the script below from 7 to the number of topics you have chosen.
topics7 <- read.csv("LDAGibbs7Docs_to_topics.csv")
topic_probs7 <- read.csv("TopicProbabilities7.csv")
topics7$topics <- topics7$V1 # V1 holds the topic number assigned to each document
topicsLDA7 <- topics7$topics
LDA7story_nums <- topics7$X # X holds the original document ids
LDAmatch <- reviews %>%
  filter(id %in% LDA7story_nums)
LDAmatch$LDA7story_nums <- LDA7story_nums
LDAmatch$topicsLDA7 <- topicsLDA7
LDAmatch <- cbind(LDAmatch, topic_probs7)
LDAmatch$X <- NULL
names(LDAmatch) ### look at the columns that topic allocations appear in. For seven topics, they appear in columns 18:24. For six topics, this will be 18:23 - and so on. Amend the following lines accordingly.
LDAmatch$meanLDA <- rowMeans(LDAmatch[, 18:24])
LDAmatch$sdLDA <- rowSds(as.matrix(LDAmatch[, 18:24]))
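If you would rather not rely on column positions, the same two summary columns can be calculated by selecting the topic probability columns by name - a sketch assuming the V1 to V7 names that read.csv produces from the gamma matrix; adjust for your own number of topics:
topic_cols <- paste0("V", 1:7) # change 7 to your chosen number of topics
LDAmatch$meanLDA <- rowMeans(LDAmatch[, topic_cols])
LDAmatch$sdLDA <- rowSds(as.matrix(LDAmatch[, topic_cols]))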
reviews_processed <- LDAmatch
You may recall from earlier that the process of cleaning text and/or Topic Modelling may have removed some reviews because they were either sparsely populated and/or contained only emojis. We can see from below that the original dataset of 16811 reviews has been reduced to 16033. Nevertheless, all 46 podcasts in the original review dataframe remain present.
nrow(reviews)
## [1] 16811
nrow(reviews_processed)
## [1] 16033
length(unique(reviews_processed$pod_id))
## [1] 46
4: Sentiment Analysis
A final piece of NLP processing we can perform on the reviews gathered relates to Sentiment Analysis. This is a process whereby documents are scored positively or negatively based on the appearance of certain words. For example, a document containing the word ‘love’ would attract a positive score. A document containing the word ‘hate’ would receive a negative one.
Sentiment Analysis can often be a fairly blunt instrument, because any document containing the words love and hate would attract a neutral score (love cancelling out hate, and vice versa), and the libraries involved are often poor at nuance. If someone describes a song as being ‘on fire’, it is likely to attract a negative score!
Nevertheless, because we have 1-5 star ratings associated with text reviews, it may be useful at a later stage to compare sentiment scores alongside star ratings, podcasts, and topic allocations.
To perform sentiment analysis, you will need the following packages.
library(syuzhet)
library(scales)
library(reshape2)
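A tiny example illustrates the bluntness described above (the sentences are invented; using the bing method, love should score +1 and hate -1, so the mixed sentence nets out to zero):
get_sentiment(c("I love this podcast",
                "I hate this podcast",
                "I love and hate this podcast"),
              method = "bing")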
Another limitation with SA is that it scores each word within a document. As such, a document with 100 words can achieve a higher (or lower) score than one with 50 words simply because it is longer. By introducing a variable that counts words within documents, we can address this issue when looking at results.
reviews_processed$charsinreview <- nchar(reviews_processed$review_text) # characters per review
reviews_processed$wordsinreview <- sapply(strsplit(reviews_processed$review_text, "\\s+"), length) # words per review
First, we can run the get_nrc_sentiment function. This produces scores for each document in the reviews dataframe according to anger, anticipation, disgust, fear, joy, sadness, surprise and trust. Overall negative and positive scores are also produced.
The code below runs this task and adds the results to the reviews_processed dataframe.
mySentiment <- get_nrc_sentiment(reviews_processed$review_text)
head(mySentiment, 5)
reviews_processed <- cbind(reviews_processed, mySentiment)
head(reviews_processed[ , 29:38]) # the ten new sentiment columns
## anger anticipation disgust fear joy sadness surprise trust negative positive
## 1 0 0 0 0 1 0 0 1 0 1
## 2 0 1 0 0 0 0 0 1 0 0
## 3 0 3 0 0 3 1 0 2 1 4
## 4 1 0 1 0 2 2 0 0 2 3
## 5 0 2 0 0 2 0 2 2 0 4
## 6 0 4 0 0 3 1 2 4 0 4
There are other SA models available in the syuzhet package. The code below will run these and add the results to the dataframe.
syuzhet_sent <- as.data.frame(get_sentiment(reviews_processed$review_text, method = "syuzhet"))
syuzhet_sent <- syuzhet_sent %>%
  rename(syuzhet_sent = `get_sentiment(reviews_processed$review_text, method = "syuzhet")`)
reviews_processed <- cbind(reviews_processed, syuzhet_sent)
bing_sent <- as.data.frame(get_sentiment(reviews_processed$review_text, method = "bing"))
bing_sent <- bing_sent %>%
  rename(bing_sent = `get_sentiment(reviews_processed$review_text, method = "bing")`)
reviews_processed <- cbind(reviews_processed, bing_sent)
afinn_sent <- as.data.frame(get_sentiment(reviews_processed$review_text, method = "afinn"))
afinn_sent <- afinn_sent %>%
  rename(afinn_sent = `get_sentiment(reviews_processed$review_text, method = "afinn")`)
reviews_processed <- cbind(reviews_processed, afinn_sent)
nrc_sent <- as.data.frame(get_sentiment(reviews_processed$review_text, method = "nrc"))
nrc_sent <- nrc_sent %>%
  rename(nrc_sent = `get_sentiment(reviews_processed$review_text, method = "nrc")`)
reviews_processed <- cbind(reviews_processed, nrc_sent)
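The four blocks above differ only in the method argument; if you prefer something more compact, this short loop produces the same four columns (an equivalent sketch, not a change to the method):
for (m in c("syuzhet", "bing", "afinn", "nrc")) {
  reviews_processed[[paste0(m, "_sent")]] <- get_sentiment(reviews_processed$review_text, method = m)
}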
This code produces an average sentiment score across the models.
reviews_processed$sent_ave <- rowMeans(reviews_processed[, 39:42])
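As with the topic probabilities earlier, selecting these columns by name avoids any problems if your column positions differ:
sent_cols <- c("syuzhet_sent", "bing_sent", "afinn_sent", "nrc_sent")
reviews_processed$sent_ave <- rowMeans(reviews_processed[, sent_cols])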
Finally, this code calculates average sentiment scores in relation to the word count of each review.
reviews_processed <- mutate(reviews_processed, sent_by_word = sent_ave/wordsinreview)
The final reviews_processed dataframe contains 44 variables:
- id - a unique id for each review
- podcast - the name of the podcast
- pod_id - the Apple ID
- review_page - the page the review came from (1:10)
- Title - the title of the review
- Author_URL - the review author’s profile URL
- Author_Name - the review author’s screen name
- App_Version - this information is blank
- Rating - the 1-5 Star Rating
- Review - the review text
- date - the date the review was posted
- country - the national store the review was posted in
- review_id - sequential review id per podcast
- review_text - the Title and Review variables combined
- story_stem - the review_text variable after pre-NLP processing
- LDA7story_nums - the ids of reviews processed by Topic Modelling
- topicsLDA7 - the LDA Topic Modelling topics assigned to each review
- V1 - the extent to which the review belongs to Topic 1
- V2 - the extent to which the review belongs to Topic 2
- V3 - the extent to which the review belongs to Topic 3
- V4 - the extent to which the review belongs to Topic 4
- V5 - the extent to which the review belongs to Topic 5
- V6 - the extent to which the review belongs to Topic 6
- V7 - the extent to which the review belongs to Topic 7
- meanLDA - the mean score across V1:V7
- sdLDA - the Standard Deviation across V1:V7
- charsinreview - the number of characters in a review
- wordsinreview - the number of words in a review
- anger - specific sentiment score for the review
- anticipation - specific sentiment score for the review
- disgust - specific sentiment score for the review
- fear - specific sentiment score for the review
- joy - specific sentiment score for the review
- sadness - specific sentiment score for the review
- surprise - specific sentiment score for the review
- trust - specific sentiment score for the review
- negative - total negative score for the review
- positive - total positive score for the review
- syuzhet_sent - the results of the Syuzhet SA algorithm
- bing_sent - the results of the Bing SA algorithm
- afinn_sent - the results of the Afinn SA algorithm
- nrc_sent - the results of the NRC SA algorithm
- sent_ave - the average of syuzhet_sent:nrc_sent
- sent_by_word - sent_ave divided by wordsinreview
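Before writing the data out, you can confirm the structure matches the list above; glimpse from dplyr gives a compact overview:
glimpse(reviews_processed)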
5: Write Out Data
This dataframe can now be written out before proceeding to Part 3 - exploration and analysis of results.
write_rds(reviews_processed, "reviews_processed.rds")