Analysing online review data – Part 2
Previously, we developed a module to take care of getting the review data from Tripadvisor or Yelp in a DataFrame format. Now we want to do some analysis on this data. In this part of the series, we will do some topic modeling using Latent Dirichlet Allocation (LDA) and create a word cloud.
Steps
- Imports
- Silence Deprecation Warnings
- Get The Review Data
- Get A List Of English Stop Words
- Clean And Generate List Representations Of The Reviews
- Look for Bigrams and Trigrams
- Prepare For The LDA Model
- Apply And Visualise The LDA Model
- Create A Word Cloud Of The LDA Model
- Putting It All Together
- Supplementary Material
0-Requirements
import platform
print('Python version: {}'.format(platform.python_version()))
Python version: 3.6.4
1-Imports
First we will import the required libraries that will be useful during the analysis. We will also import the module developed in Part 1. The module can also be found here. Simply take the WebScraper.py file and import it into a project as
import WebScraper
We import the rest of the libraries we need
import nltk # For getting stopwords
nltk.download('stopwords') # Only needs to be run once on the machine
import numpy as np # Not required but may be useful
import pandas as pd # For DataFrames
import gensim # For LDA and finding Bigrams and Trigrams
import WebScraper # The module from Part 1
from wordcloud import WordCloud # For generating a word cloud
import matplotlib.pyplot as plt # General plotting
import pyLDAvis # For visualising the Topics
import pyLDAvis.gensim # For visualising the Topics
import warnings # So that we can override the deprecation warning
2-Silence Deprecation Warnings
It turns out that the pyLDAvis library emits a deprecation warning which is repeated, probably because of a loop within the library. Having hundreds of identical deprecation warnings displayed is not visually pleasing. This is what a deprecation warning looks like
warnings.warn('hey',category = DeprecationWarning)
We can tell the warnings module to ignore this category of warnings. In the demonstration below, the UserWarning is still displayed but the DeprecationWarning is silenced
warnings.filterwarnings("ignore",category=DeprecationWarning)
warnings.warn('hey',category=UserWarning)
warnings.warn('hey',category=DeprecationWarning)
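If we later want deprecation warnings back (for example while debugging), the filter can be reverted using the standard warnings API. This is only for reference; we keep the filter in place for the rest of the article
# Restore the default behaviour for DeprecationWarning (optional, not run in this article)
warnings.filterwarnings("default", category=DeprecationWarning)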
3-Get The Review Data
This is where we use the module created in Part 1. Since we have imported the module into the project, we can create the WebScraper object and gather the data in only a few lines
# Define the urls to the site of interest
url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
# Create the WebScraper object
ms = WebScraper.WebScraper(site='tripadvisor',url1=url1,
url2=url2,increment_string1="-or",increment_string2="",
total_pages=20,increment=10,silent=False)
# Get the review data from all the pages
ms.fullscraper()
# Store the review data
review_data = ms.all_reviews
We can now view the review information in DataFrame form
review_data.head()
4-Get A List Of English Stop Words
Stop words are very common words that carry little to no information about a piece of text in relation to the investigation we are carrying out. In English, these are words such as ‘a’, ‘the’ and ‘and’. The NLTK library we imported already provides a list of English stop words for us to use, which makes analysing the text in our ‘fullreview’ column much cleaner and easier. Let’s obtain the list of stop words and display the first 10
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
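If we wanted to, we could also extend this list with words that are uninformative for our particular reviews. The extra words below are purely illustrative assumptions; the rest of the article keeps the plain NLTK list so that the outputs shown later are unaffected
# Optional: a custom list with extra domain-specific words removed as well (illustrative words only)
custom_stopwords = stopwords + ['also', 'would', 'get']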
5-Clean And Generate List Representations Of The Reviews
We will use the gensim library to clean the ‘fullreview’ column of our DataFrame by removing punctuation. A quick example shows how this can be done using the gensim.utils.simple_preprocess method to clean and represent a review in the form of a list
gensim.utils.simple_preprocess("Where's my dog?",deacc = True)
['where', 'my', 'dog']
The deacc = True option also applies the gensim.utils.deaccent method in order to remove accent characters. Here’s an example of the deaccent method taken from the documentation
gensim.utils.deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
'Sef chomutovskych komunistu dostal postou bily prasek'
Let’s see how this works on a particular review from our DataFrame. The review we want to clean is
review_data.iloc[0,0]
'This is awesome and is everything books etc say about the place. Allow around 2 hours to get around everything and the park. Take plenty of water and wear a hat!'
After cleaning this review, we have
gensim.utils.simple_preprocess(review_data.iloc[0,0],deacc = True)
['this', 'is', 'awesome', 'and', 'is', 'everything', 'books', 'etc', 'say', 'about', 'the', 'place', 'allow', 'around', 'hours', 'to', 'get', 'around', 'everything', 'and', 'the', 'park', 'take', 'plenty', 'of', 'water', 'and', 'wear', 'hat']
We can then remove stopwords from this review/document (notice the removal of ‘this’, ‘is’, ‘and’ and similar words.)
[word for word in gensim.utils.simple_preprocess(review_data.iloc[0,0],deacc = True) if word not in stopwords]
['awesome', 'everything', 'books', 'etc', 'say', 'place', 'allow', 'around', 'hours', 'get', 'around', 'everything', 'park', 'take', 'plenty', 'water', 'wear', 'hat']
Let’s create a function to package this up for us
def cleanDocument(x, stopwords):
    return [word for word in gensim.utils.simple_preprocess(x, deacc=True) if word not in stopwords]
We can then create a new column for the list representation of the document for each document in our DataFrame. Let’s call the column ‘List’
review_data['List'] = review_data['fullreview'].apply(lambda x: cleanDocument(x,stopwords))
review_data.head()
Finally, we have a column which is a list representation of each review with punctuation, accents and stop words removed.
6-Look for Bigrams and Trigrams
Bigrams are pairs of words that often occur together. Similarly, trigrams are three words that frequently occur together. This can be extended to a larger number of words occurring together (n-grams). Some examples include ‘New York’, ‘Text Analysis’, ‘European Union’, ‘bear in mind’ and so on. We can catch these n-grams in a particular text using gensim.models.Phrases
# Create bigrams
bigrams = gensim.models.Phrases(review_data['List'], min_count=3, threshold=50)
bigrams_Phrases= gensim.models.phrases.Phraser(bigrams)
# Create trigrams
trigrams = gensim.models.Phrases(bigrams_Phrases[list(review_data['List'])], min_count=3, threshold=50)
trigram_Phrases = gensim.models.phrases.Phraser(trigrams)
The min_count argument specifies the minimum number of times a bigram (trigram) must appear before it is accepted as a bigram (trigram). The threshold determines how difficult it is for a pair of words to be classified as a bigram. To give a clearer example, suppose we have the following list of reviews
a = []
a.append('new york is amazing')
a.append("Yeah I know, it's all about new york")
a.append("What about the tower in new york?")
a.append("new york is the place to be apparently")
a.append("Some more words and new york some more words")
a.append("I loved the show")
a.append("specially in new york")
We first apply the clean function defined above
a_list = list(map(lambda x: cleanDocument(x,stopwords),a))
We then find the bigrams
bigram = gensim.models.Phrases(a_list,min_count=1, threshold=1)
bigram_phraser = gensim.models.phrases.Phraser(bigram)
Now we can use bigram_phraser to find bigrams in a particular text
# Clean the review
aReview = cleanDocument('Is new york the best place or what?',stopwords)
# Apply bigrams
print(bigram_phraser[aReview])
['new_york', 'best', 'place']
We successfully identified new york as a bigram. We can write a function to do this work for us
def createGrams(ls):
    """
    This function expects a list (or series) of lists of words, each being a list representation of a document.
    It returns a list of bigrams and a list of trigrams relevant to the list given.
    """
    # Create bigrams (i.e. train the bigram detector)
    bigrams = gensim.models.Phrases(ls, min_count=3, threshold=50)
    bigrams_Phrases = gensim.models.phrases.Phraser(bigrams)
    # Create trigrams (i.e. train the trigram detector on the bigrammed text)
    trigrams = gensim.models.Phrases(bigrams_Phrases[list(ls)], min_count=3, threshold=50)
    trigram_Phrases = gensim.models.phrases.Phraser(trigrams)
    # Return each document with bigrams applied, and with trigrams applied
    return [bigrams_Phrases[i] for i in list(ls)], [trigram_Phrases[i] for i in list(ls)]
We can then simply pass the ‘List’ column of our DataFrame to this function
createGrams(review_data['List'])
([['awesome', 'everything', 'books', 'etc', 'say', 'place', 'allow', 'around', 'hours', 'get', 'around', 'everything', 'park', 'take', 'plenty', 'water', 'wear', 'hat', 'spectacular'], ['increadible', 'mosiacs', 'large', 'site', ... ... , 'people', 'would', 'lived', 'centuries', 'ago', 'much', 'see', 'mosaics', 'particular', 'interest', 'whole', 'experience', 'enhanced', 'spring', 'flowers', 'stepping', 'back', 'time']])
While we’re at it, we might as well combine the createGrams function with the cleanDocument function into another function
def cleanAndCreateGrams(ls, stopwords):
    return createGrams(ls.apply(lambda x: cleanDocument(x, stopwords)))[0]
and create a new column with this applied to it
review_data['GramList'] = cleanAndCreateGrams(review_data['fullreview'],stopwords)
review_data.head()
7-Prepare For The LDA Model
Prior to this section, the preparation of the text was about cleaning the documents and transforming them into list representations. The LDA model we will be using in the next section, part of the gensim package, expects a corpus and an id2word dictionary. To create the id2word dictionary, we use the gensim.corpora.Dictionary method, which takes a list of documents in list representation (our ‘GramList’ column created above) and assigns a unique integer id to every word across all the documents. To create the corpus, where each document is represented as a list of (word id, frequency) pairs, we use the id2word.doc2bow method which does what we’re looking for
# Create Dictionary
id2word = gensim.corpora.Dictionary(review_data['GramList'])
# Create Corpus
texts = review_data['GramList']
# Term Frequency in Document
corpus = [id2word.doc2bow(text) for text in texts]
Here, id2word is a dictionary mapping an integer id to a word, and bow stands for Bag Of Words. Each corpus entry is a list of (id, frequency) pairs, so we can recover a word from its id with id2word and read off its frequency in a particular document
print(f"The frequency of '{id2word[1]}' is {corpus[0][1][1]}")
The frequency of 'around' is 2
and here is an entire document in id, frequency representation
texts[100]
['great', 'place', 'walk', 'around', 'see', 'fantastical', 'well', 'preserved', 'floor', 'mosaics', 'little', 'shade', 'avoid', 'midday', 'hot', 'small', 'vending', 'machine', 'area', 'drinks', 'hours', 'saw', 'area', 'fantastic', 'mosaics']
id2word.doc2bow(texts[100])
[(1, 1), (8, 1), (10, 1), (25, 1), (32, 1), (47, 2), (175, 1), (184, 1), (203, 1), (233, 1), (258, 2), (277, 1), (299, 1), (493, 1), (532, 1), (568, 1), (627, 1), (693, 1), (694, 1), (695, 1), (696, 1), (697, 1), (698, 1)]
We can transform the ids back into the original words
# from id to word
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:2]]
[[('allow', 1), ('around', 2), ('awesome', 1), ('books', 1), ('etc', 1), ('everything', 2), ('get', 1), ('hat', 1), ('hours', 1), ('park', 1), ('place', 1), ('plenty', 1), ('say', 1), ('spectacular', 1), ('take', 1), ('water', 1), ('wear', 1)], [('along', 1), ('also', 1), ('anyone', 1), ('bargain', 1), ('beach', 1), ('coral', 1), ('cost', 1), ('euro', 1), ('great', 1), ('increadible', 1), ('large', 1), ('mosiacs', 1), ('must', 1), ('paphos', 2), ('reccoment', 1), ('see', 1), ('site', 1), ('towards', 1), ('views', 1), ('visiting', 1), ('would', 1)]]
The above shows 2 reviews/documents.
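As a quick sanity check we can also go the other way, from a word to its id, using the dictionary’s token2id mapping. A minimal sketch, assuming ‘mosaics’ is in the vocabulary (which it is for this attraction’s reviews)
# Reverse lookup: word -> id, then id -> word again
word_id = id2word.token2id['mosaics']
print(word_id, id2word[word_id])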
8-Apply And Visualise The LDA Model
LDA assumes that each document is composed of a collection of topics with varying probabilities and that each topic is a collection of words with varying probabilities. Now that we have the requirements for running the LDA model (the id2word dictionary and the corpus), let’s go ahead and apply the gensim.models.ldamodel.LdaModel method
# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
id2word=id2word,
num_topics=4,
chunksize=100,
passes=10,
alpha='auto',
per_word_topics=True)
Strictly speaking, the corpus used above is the training data the LDA model will use to estimate the parameters of the Dirichlet distributions inherent in the model. Here, we have specified the number of topics to be 4; this is the number of topics the model will look to extract. The per_word_topics argument specifies that we want a list of the most likely topics for each word.
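Since LDA treats each document as a mixture of topics, we can also inspect that mixture for an individual review once the model is trained. A minimal sketch using the first document in the corpus
# Topic distribution for the first review as (topic id, probability) pairs
print(lda_model.get_document_topics(corpus[0]))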
Due to the way LDA is implemented using prior distributions (in this case it is possible to specify the hyperparameters alpha and eta), the model can be updated with further training data using the lda_model.update method. We can print the key words in the 4 topics to see how much weight each word contributes to a topic
# Print the Keyword in the 4 topics
print(lda_model.print_topics())
[(0, '0.008*"wonderful" + 0.007*"real" + 0.007*"site" + 0.007*"especially" + 0.007*"restored" + 0.007*"visit" + 0.006*"around" + 0.006*"visited" + 0.006*"work" + 0.006*"first"'), (1, '0.027*"mosaics" + 0.022*"site" + 0.014*"ruins" + 0.013*"see" + 0.013*"well" + 0.013*"roman" + 0.012*"interesting" + 0.011*"park" + 0.010*"buildings" + 0.010*"visit"'), (2, '0.021*"mosaics" + 0.019*"see" + 0.017*"history" + 0.016*"must" + 0.016*"good" + 0.016*"site" + 0.016*"well" + 0.016*"place" + 0.016*"visit" + 0.015*"worth"'), (3, '0.039*"mosaics" + 0.028*"visit" + 0.020*"interesting" + 0.018*"paphos" + 0.018*"park" + 0.017*"site" + 0.017*"see" + 0.016*"archaeological" + 0.016*"well" + 0.015*"place"')]
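As mentioned above, the model can be updated with further training data without refitting from scratch. A minimal sketch, assuming more_reviews is a hypothetical list of new reviews that have already been cleaned and tokenised like our ‘GramList’ column
# Convert the new (hypothetical) documents to bag-of-words with the existing dictionary, then update the model
more_corpus = [id2word.doc2bow(doc) for doc in more_reviews]
lda_model.update(more_corpus)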
One way to assess the performance of the model (in particular, whether we have chosen a suitable number of topics) is to use the coherence score
# Compute Coherence Score
coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=review_data['GramList'], dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: {}'.format(coherence_lda))
Coherence Score: 0.2894762136456679
Since we have a way of determining the performance of the model, we can loop through a range of topic numbers and choose the one with the best coherence score
max_coherence_score = 0
best_n_topics = -1
best_model = None
for i in range(2, 6):
    # Build LDA model
    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                id2word=id2word,
                                                num_topics=i,
                                                chunksize=100,
                                                alpha='auto',
                                                per_word_topics=True)
    # Compute Coherence Score
    coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=review_data['GramList'], dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    if max_coherence_score < coherence_lda:
        max_coherence_score = coherence_lda
        best_n_topics = i
        best_model = lda_model
    print('\n The Coherence Score with {} topics is {}'.format(i, coherence_lda))
The Coherence Score with 2 topics is 0.2323484892131155
The Coherence Score with 3 topics is 0.21912753596868087
The Coherence Score with 4 topics is 0.250791284519631
The Coherence Score with 5 topics is 0.2402133610574262
Now that we have the best LDA model, let’s visualise it
# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(best_model, corpus, id2word)
vis
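Outside of a notebook, the same visualisation can be written to a standalone HTML file with pyLDAvis.save_html. The filename below is just an example
# Save the interactive topic visualisation to an HTML file
with open('lda_vis.html', 'w') as f:
    pyLDAvis.save_html(vis, f)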
Let’s consolidate all of this into a function
def ldaModel(x):
    # Create Dictionary
    id2word = gensim.corpora.Dictionary(x)
    # Create Corpus
    texts = x
    # Term Document Frequency
    corpus = [id2word.doc2bow(text) for text in texts]
    max_coherence_score = 0
    best_n_topics = -1
    best_model = None
    for i in range(2, 6):
        # Build LDA model
        lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                                    id2word=id2word,
                                                    num_topics=i,
                                                    random_state=100,
                                                    update_every=1,
                                                    chunksize=100,
                                                    passes=10,
                                                    alpha='auto',
                                                    per_word_topics=True)
        # Compute Coherence Score on the documents passed in (not the global DataFrame)
        coherence_model_lda = gensim.models.CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
        coherence_lda = coherence_model_lda.get_coherence()
        if max_coherence_score < coherence_lda:
            max_coherence_score = coherence_lda
            best_n_topics = i
            best_model = lda_model
        print('\n The Coherence Score with {} topics is {}'.format(i, coherence_lda))
    # Visualize the topics
    pyLDAvis.enable_notebook()
    vis = pyLDAvis.gensim.prepare(best_model, corpus, id2word)
    return best_model, vis
Better yet, let’s create a function that does the cleaning as well as running the LDA model
def ldaFromReviews(x, stopwords):
    cleanedReviewsAsLists = cleanAndCreateGrams(x, stopwords)
    return ldaModel(cleanedReviewsAsLists)
And this is how we would apply it
model,ldavis = ldaFromReviews(review_data['fullreview'],stopwords)
The Coherence Score with 2 topics is 0.21609545591563656
The Coherence Score with 3 topics is 0.3024847517693075
The Coherence Score with 4 topics is 0.2894762136456679
The Coherence Score with 5 topics is 0.2692166142993114
model now contains our trained LDA model and ldavis contains the visualisation.
9-Create A Word Cloud Of The LDA Model
Another way to represent the popular words in a corpus is a word cloud. Let’s create one big dictionary mapping each word to its frequency throughout the corpus
freq_dict = []
[freq_dict.extend(i) for i in corpus[:]]
frequency_dict = dict()
for i, j in freq_dict:
    key = id2word[i]
    if key in frequency_dict:
        frequency_dict[key] += j
    else:
        frequency_dict[key] = j
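An equivalent, slightly more compact way to build the same word-frequency dictionary is with collections.Counter. This is just an alternative sketch; the rest of the article uses the frequency_dict built above
from collections import Counter

frequency_counter = Counter()
for doc in corpus:
    # Each document is a list of (word id, frequency) pairs
    frequency_counter.update({id2word[i]: freq for i, freq in doc})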
Now we can use the wordcloud library to visualise the words and their prevalence
wordcloud = WordCloud(background_color = 'white',
relative_scaling = 1.0
).generate_from_frequencies(frequency_dict)
wordcloud.to_image()
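Since matplotlib was imported earlier, the word cloud can also be displayed (and customised) as a regular figure. A small sketch
# Display the word cloud with matplotlib
plt.figure(figsize=(10, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()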
And of course we can consolidate this into a function which does this for us
def generate_wordcloud_from_freq(frequency_dict):
    """A function to create a wordcloud from a word-frequency dictionary"""
    wordcloud = WordCloud(background_color='white',
                          relative_scaling=1.0
                          ).generate_from_frequencies(frequency_dict)
    return wordcloud

def generate_wordcloud(id2word, corpus):
    """A function to build the word-frequency dictionary from the corpus and create a wordcloud from it"""
    freq_dict = []
    [freq_dict.extend(i) for i in corpus[:]]
    frequency_dict = dict()
    for i, j in freq_dict:
        key = id2word[i]
        if key in frequency_dict:
            frequency_dict[key] += j
        else:
            frequency_dict[key] = j
    return generate_wordcloud_from_freq(frequency_dict)
All we have to do then is to call this function
wc = generate_wordcloud(id2word, corpus)
wc.to_image()
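The word cloud can also be written straight to an image file using the wordcloud library’s to_file method; the filename is an arbitrary example
# Save the word cloud as a PNG
wc.to_file('review_wordcloud.png')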
10-Putting It All Together
At the bottom of this article is a module which incorporates what we’ve seen above (it can also be found on GitHub: https://github.com/TanselArif-21/Topic-Modeling). Here’s a demonstration of the usage:
The main page we’re interested in is https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html
We first import the WebScraper.py module from here or from Part 1 into our script.
import WebScraper
If we click on the next page on Tripadvisor for this url, we see a pattern. The url of the next page is
https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-or10-The_House_of_Dionysus-Paphos_Paphos_District.html
We immediately see 4 parts to the url:
- https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews
- -or
- 10
- -The_House_of_Dionysus-Paphos_Paphos_District.html
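Concretely, each page’s url is built by inserting the increment between the first and last parts. A quick, purely illustrative sketch of the pattern; the WebScraper handles this for us
# The first page is simply the two outer parts joined together
base = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
tail = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
print(base + tail)
# Subsequent pages insert "-or" and a review offset that grows by 10 per page
for offset in [10, 20, 30]:
    print(base + "-or" + str(offset) + tail)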
We can utilise the WebScraper to scrape 20 pages (that’s 200 reviews), incrementing the url by 10 reviews at a time
url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
increment_string1="-or"
total_pages=20
increment=10
myScraper = WebScraper.WebScraper(site='tripadvisor', url1=url1, url2=url2,
                                  increment_string1=increment_string1,
                                  total_pages=total_pages, increment=increment, silent=False)
myScraper.fullscraper()
review_data = myScraper.all_reviews
Now we can import the TopicModeling module from here or using the code in the next section
import TopicModeling
Then we can create the TopicModeling object and the visualisations
myTopicModel = TopicModeling.TopicModeling(review_data)
myTopicModel.ldaFromReviews()
myTopicModel.generate_wordcloud()
The Coherence Score with 2 topics is 0.17248626808884526
The Coherence Score with 3 topics is 0.19218773360003868
The Coherence Score with 4 topics is 0.2017723166654421
The Coherence Score with 5 topics is 0.22960310254410485
The visualisation concerning the different topics can be obtained with myTopicModel.ldavis, and the wordcloud can be displayed with the method myTopicModel.showWordCloud().
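After ldaFromReviews() has run, the fitted gensim model itself is stored on the object as myTopicModel.ldamodel (see the class below), so the usual gensim methods are available. A small sketch
# Inspect the key words of the best model found by the class
print(myTopicModel.ldamodel.print_topics())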
11-Supplementary Material
import nltk
nltk.download('stopwords')
import numpy as np
import pandas as pd
import gensim
from wordcloud import WordCloud
import pyLDAvis
import pyLDAvis.gensim
import matplotlib.pyplot as plt
import warnings
print('Filtering Deprecation Warnings!')
warnings.filterwarnings("ignore",category=DeprecationWarning)
class TopicModeling:
    '''
    This class can be used to carry out LDA and generate word clouds
    for visualisation.
    Example Usage (review_data is a dataframe with a column called 'fullreview'):
    import TopicModeling
    myTopicModel = TopicModeling.TopicModeling(review_data)
    myTopicModel.ldaFromReviews()
    myTopicModel.generate_wordcloud()
    '''

    def __init__(self, df, review_column='fullreview'):
        '''
        Constructor.
        :param df: this is a dataframe with a column containing reviews
        :param review_column: the name of the review column in the passed in df
        '''
        # Get the stopwords
        self.stopwords = nltk.corpus.stopwords.words('english')
        # Attach a copy of the dataframe to this object
        self.df = df.copy()
        # Save the column name to be used for the reviews
        self.review_column = review_column
        # This will be the corpus
        self.corpus = None
        # This will be the ids of the words
        self.id2word = None
    def cleanDocument(self, x):
        '''
        This method takes a document (single review), cleans it and turns
        it into a list of words
        :param x: a document (review) as a string
        '''
        return [word for word in gensim.utils.simple_preprocess(x, deacc=True)
                if word not in self.stopwords]

    def createGrams(self, ls):
        """
        This method expects a list (or series) of lists of words, each being a
        list representation of a document. It returns a list of bigrams and
        a list of trigrams relevant to the list given.
        :param ls: a list (or series) of a list of words
        """
        # Create bigrams (i.e. train the bigrams)
        bigrams = gensim.models.Phrases(ls, min_count=3, threshold=50)
        bigrams_Phrases = gensim.models.phrases.Phraser(bigrams)
        # Create trigrams (i.e. train the trigrams)
        trigrams = gensim.models.Phrases(bigrams_Phrases[list(ls)], min_count=3, threshold=50)
        trigram_Phrases = gensim.models.phrases.Phraser(trigrams)
        # Return each document's list representation while considering n-grams
        return [bigrams_Phrases[i] for i in list(ls)], [trigram_Phrases[i] for i in list(ls)]

    def cleanAndCreateGrams(self, ls):
        '''
        This method takes a list (or series) of list representations of documents and cleans each
        one while finding n-grams (bigrams and trigrams)
        :param ls: a list (or series) of a list of words
        '''
        return self.createGrams(ls.apply(lambda x: self.cleanDocument(x)))[0]

    def prepdf(self):
        '''
        This method prepares the review dataframe attached to this object by cleaning
        each review and transforming it into list representation
        '''
        self.df['prepped'] = self.cleanAndCreateGrams(self.df[self.review_column])
    def ldaModel(self, x=None):
        '''
        This method runs the LDA model on the column containing the reviews in list
        representation. If the reviews column has not already been prepared, this
        method will prepare it. Optionally, the user can feed in an already prepped
        column to run LDA on.
        :param x: a list of lists of words. Each list is expected to have been prepped
        by removing stopwords and finding n-grams
        :returns: a tuple of the best lda model and the visualisation
        '''
        # if x hasn't been provided, use the prepped column of the dataframe attached to this object
        if x is None:
            # if this dataframe has not been prepared, prepare it first
            if 'prepped' not in self.df.columns:
                self.prepdf()
            x = self.df['prepped']
        # Create Dictionary
        self.id2word = gensim.corpora.Dictionary(x)
        # Term Document Frequency
        self.corpus = [self.id2word.doc2bow(text) for text in x]
        # These are to store the performance and the best model
        max_coherence_score = 0
        best_n_topics = -1
        best_model = None
        # Loop through each topic number and check if it has improved the performance
        for i in range(2, 6):
            # Build LDA model
            lda_model = gensim.models.ldamodel.LdaModel(corpus=self.corpus,
                                                        id2word=self.id2word,
                                                        num_topics=i,
                                                        random_state=100,
                                                        update_every=1,
                                                        chunksize=100,
                                                        passes=10,
                                                        alpha='auto',
                                                        per_word_topics=True)
            # Calculate Coherence Score
            coherence_model_lda = gensim.models.CoherenceModel(model=lda_model,
                                                               texts=x, dictionary=self.id2word, coherence='c_v')
            coherence_lda = coherence_model_lda.get_coherence()
            # If this has the best coherence score so far, save it
            if max_coherence_score < coherence_lda:
                max_coherence_score = coherence_lda
                best_n_topics = i
                best_model = lda_model
            # Print progress
            print('\n The Coherence Score with {} topics is {}'.format(i, coherence_lda))
        # Visualize the topics
        pyLDAvis.enable_notebook()
        vis = pyLDAvis.gensim.prepare(best_model, self.corpus, self.id2word)
        return best_model, vis
    def ldaFromReviews(self):
        '''
        A method to run the LDA model on the reviews dataframe. If the dataframe
        has been prepared for the LDA already, the model is directly run. Otherwise
        the dataframe is prepared first. The resulting model and visualisation are
        attached to this object.
        '''
        # If the dataframe hasn't yet been prepped, prep it
        if 'prepped' not in self.df.columns:
            self.prepdf()
        # Save the model and the visualisation to this object
        self.ldamodel, self.ldavis = self.ldaModel()
    def generate_wordcloud_from_freq(self):
        """
        A method to create a wordcloud according to the word frequencies
        attached to this object. Takes into account the stopwords variable
        of this object.
        """
        wordcloud = WordCloud(background_color='white',
                              relative_scaling=1.0,
                              stopwords=self.stopwords
                              ).generate_from_frequencies(self.frequency_dict)
        return wordcloud

    def generate_wordcloud(self):
        '''
        This method gets the frequency dictionary from the corpus that
        has already been formed and creates a wordcloud. The corpus is
        an id-frequency list for each document. The resulting frequency
        dictionary is a word-frequency mapping for the entire corpus.
        '''
        # If there isn't a corpus, run lda
        if self.corpus is None:
            self.ldaFromReviews()
        # Get a frequency list of tuples for each document in the corpus
        self.freq_list = []
        [self.freq_list.extend(i) for i in self.corpus[:]]
        # Now create a single dictionary with word-frequency key value pairs for all docs
        self.frequency_dict = dict()
        for i, j in self.freq_list:
            key = self.id2word[i]
            if key in self.frequency_dict:
                self.frequency_dict[key] += j
            else:
                self.frequency_dict[key] = j
        # Save wordcloud to the object
        self.wordCloud = self.generate_wordcloud_from_freq()

    def showWordCloud(self):
        '''
        A method to display the wordcloud
        '''
        return self.wordCloud.to_image()