Analysing online review data – Part 1
The House of Dionysus
The House of Dionysus is a popular tourist destination in Paphos, Cyprus. It would be interesting to get a sense of how visitors feel about it, and it turns out that the popular review site Tripadvisor holds plenty of reviews for this destination.
Below, we use Python to scrape review information from the web into a dataframe. The end result of this article is a class, given at the end of the article and, equivalently, on github.
Steps
- Pull the web page(s) into Python in the form of HTML
- Find the HTML Tags corresponding to a review
- Find the HTML Tags corresponding to the rating of the review
- Find the HTML Tags corresponding to the date of the review
- Store all the above into a DataFrame to be processed and transformed
- Putting it all together
- Supplementary Material
0-Requirements
import platform
print('Python version: {}'.format(platform.python_version()))
Python version: 3.6.4
The requirements file contents are as follows:
certifi==2019.6.16
chardet==3.0.4
idna==2.8
lxml==4.3.4
numpy==1.16.4
pandas==0.24.2
python-dateutil==2.8.0
pytz==2019.1
requests==2.22.0
six==1.12.0
urllib3==1.25.3
1-Pull the web page(s) into Python in the form of HTML
from lxml import html
import requests
# Get the url of the web page
url = 'https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html'
# Get the request object from the server
page = requests.get(url)
Inspecting the first 100 characters of the content confirms that the content obtained is HTML.
page.content[0:100]
b'<!DOCTYPE html><html lang="en-GB" xmlns:og="http://opengraphprotocol.org/schema/"><head><meta http-e'
Now we can convert the content above to an HTML element object. In particular, this object represents the topmost tag in the document.
# Convert the request content to an html object
top = html.fromstring(page.content)
2-Find the HTML Tags corresponding to a review
Go to the Tripadvisor URL above and inspect the HTML using the F12 key in Chrome. It seems that each review is contained within an HTML tag with class = ‘review-container’. We can use our HTML element object above to get all the tags in the document corresponding to this class, returned as a list.
top.find_class('review-container')
[<Element div at 0x17e72f56458>,
<Element div at 0x17e7331c958>,
<Element div at 0x17e7331ca48>,
<Element div at 0x17e7331ca98>,
<Element div at 0x17e7331cae8>,
<Element div at 0x17e7331cb38>,
<Element div at 0x17e7331cb88>,
<Element div at 0x17e7331cbd8>,
<Element div at 0x17e7331cc28>,
<Element div at 0x17e7331cc78>]
Each element of this list is an HTML element object corresponding to the relevant tag. We can have a look at the content of each review by utilising the methods of the element object.
top.find_class('review-container')[0].text_content()
'\n\n Panayiotis01Nicosia, Cyprus51Reviewed yesterday A trip to the pastIt’s a really beautiful archeological place in Paphos . You can see how old this island is and how beautifulDate of experience: March 2019Thank Panayiotis01 \n\n'
But the ‘review-container’ tag holds the entire review, including the rating, the title and so on. We might want access to the review text itself. Inspecting the site’s HTML further, we see that the actual review is contained in a tag with class = ‘entry’.
top.find_class('review-container')[0].find_class('entry')[0].text_content()
'It’s a really beautiful archeological place in Paphos . You can see how old this island is and how beautiful'
3-Find the HTML Tags corresponding to the rating of the review
Inspecting the site’s HTML further, we can see that the rating of the review is contained in a tag with class = ‘ui_bubble_rating’
top.find_class('review-container')[0].find_class('ui_bubble_rating')[0].text_content()
''
Notice that the actual rating is contained in the class name: a tag with class = ‘ui_bubble_rating bubble_50’ corresponds to a rating of 5 stars, whereas a tag with class = ‘ui_bubble_rating bubble_25’ corresponds to a rating of 2.5 stars. Since the rating is contained within the class name, and each review has exactly one rating, we can extract the rating by processing the text of the ‘review-container’ tag.
'bubble_50"' in str(html.tostring(top.find_class('review-container')[0])).replace('>',' ').split()
True
The above shows that ‘bubble_50"’ is contained in the HTML representation of the review container, so we can assign this review a rating of 5 stars. We can write a function to extract the rating.
def findStars(x):
    # The rating is encoded in the class name, e.g. bubble_45 -> 4.5 stars
    x2 = str(x).replace('>', ' ').split()
    for i in range(1, 11):
        if 'bubble_{}"'.format(i * 5) in x2:
            return i * 0.5
    return 0
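Equivalently, since the class name always takes the form bubble_XX, where XX is ten times the star rating, the whole mapping can be done with a single regular expression. The helper below is a sketch of this idea; find_stars_regex is a hypothetical name, not part of the article's module:

```python
import re

def find_stars_regex(container_html):
    """Extract a rating from a review container's HTML by reading the
    bubble_XX class name, where XX is ten times the star rating."""
    m = re.search(r'bubble_(\d{1,2})"', str(container_html))
    return int(m.group(1)) / 10 if m else 0

find_stars_regex('<span class="ui_bubble_rating bubble_45">')  # -> 4.5
```

Note that the pattern requires the closing quote of the class attribute, so the substring ‘bubble_’ inside ‘ui_bubble_rating’ is not matched by mistake.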
4-Find the HTML Tags corresponding to the date of the review
Inspecting the website, it can be seen that the date of the review is contained in the tag with class = ‘ratingDate’. Extracting the contents of this class gives us a representation of the date the review was posted
top.find_class('review-container')[0].find_class('ratingDate')[0].text_content()
'Reviewed yesterday '
This is a relative date. We can leave the handling of relative dates to the user of the resulting dataframe.
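For illustration only, such handling might look like the sketch below. parse_review_date is a hypothetical helper, and it only covers the formats seen while browsing ('yesterday', 'today' and absolute dates such as '12 March 2019'); anything else is passed through unchanged:

```python
from datetime import date, datetime, timedelta

def parse_review_date(text, today=None):
    """Best-effort conversion of Tripadvisor date strings such as
    'Reviewed yesterday' or 'Reviewed 12 March 2019' to a date object.
    Unrecognised strings are returned unchanged."""
    today = today or date.today()
    # Strip the 'Reviewed' prefix and surrounding whitespace
    text = text.replace('Reviewed', '').strip()
    if text.lower() == 'today':
        return today
    if text.lower() == 'yesterday':
        return today - timedelta(days=1)
    try:
        return datetime.strptime(text, '%d %B %Y').date()
    except ValueError:
        return text
```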
5-Store all the above into a DataFrame to be processed and transformed
In step 2, we obtained a list of HTML element objects corresponding to tags with class = ‘review-container’. We can loop through this list and, for each element, get the review, the rating and the date, then store this information in a dataframe.
import pandas as pd

# Lists for reviews, dates and ratings
reviews = []
dates = []
ratings = []

# For each review container, collect the review, date and rating
for container in top.find_class('review-container'):
    reviews.append(container.find_class('entry')[0].text_content())
    dates.append(container.find_class('ratingDate')[0].text_content())
    ratings.append(findStars(html.tostring(container)))

# Combine into a single dataframe
df_reviews = pd.concat([pd.DataFrame(reviews), pd.DataFrame(dates),
                        pd.DataFrame(ratings)], axis=1)
df_reviews.columns = ['Review', 'Date', 'Rating']
df_reviews
6-Putting it all together
At the bottom of this article is a module which incorporates what we’ve seen above (github: https://github.com/TanselArif-21/WebScraping). The module provides functionality for two sites: Tripadvisor and Yelp. Here’s a demonstration of its usage:
The main page we’re interested in is https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html
We first import this module (which is called WebScraper.py) into our script
import WebScraper
Using the WebScraper class provided, we can create a WebScraper object and read in the review information on this page. We specify which site we are reading (tripadvisor) and whether we would like a silent read (i.e. no diagnostic information)
myScraper = WebScraper.WebScraper("https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html", 'tripadvisor', silent = False)
myScraper.scrape()
The output of the read shows that diagnostics were run to check for consistency between the title, review, date and rating columns. We can retrieve the reviews as a DataFrame using the member attribute ‘reviews’
myScraper.reviews
However, it is hardly useful to just read a single page of the review site. If we click on the next page on Tripadvisor, we see a pattern. The url of the next page is
https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-or10-The_House_of_Dionysus-Paphos_Paphos_District.html.
We immediately see four parts to the url:
- https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews
- -or
- 10
- -The_House_of_Dionysus-Paphos_Paphos_District.html
The first part is static; let’s call it url1. The second part is also static but wasn’t part of the original url; let’s call it increment_string1. The third part looks like an increment, increasing by 10 with each page; let’s call this number the increment. The final part is also static, so let’s call it url2. We can use the WebScraper to read 20 pages (that’s 200 reviews), incrementing by 10 each time.
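To check the pattern concretely before handing it to the WebScraper, the sequence of page urls can be generated directly; this is a throwaway sketch (the first page has no offset, and subsequent pages insert ‘-or10’, ‘-or20’, and so on):

```python
url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"

# First page has no offset; pages 2..20 append '-or<offset>' in steps of 10
urls = [url1 + url2] + [url1 + '-or' + str(i * 10) + url2 for i in range(1, 20)]
```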
url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
increment_string1="-or"
total_pages=20
increment=10
myScraper = WebScraper.WebScraper(site='tripadvisor', url1=url1, url2=url2,
                                  increment_string1=increment_string1,
                                  total_pages=total_pages, increment=increment,
                                  silent=False)
myScraper.fullscraper()
The above diagnostics show 20 pages and progress towards completion. To get the resulting full review DataFrame, we use the all_reviews member
myScraper.all_reviews
The default wait time between requests to a website is 1 second. This can be changed by supplying a value for the seconds_wait argument.
myScraper = WebScraper.WebScraper(site='tripadvisor', url1=url1, url2=url2,
                                  increment_string1="-or", increment_string2="",
                                  total_pages=20, increment=10, silent=False,
                                  seconds_wait=5)
More information on the module can be obtained by utilising the help function to access the docstrings
help(WebScraper.WebScraper)
We can now use this module to obtain review information and run our analyses without worrying about how the data is collected. More in Part 2…
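Once the reviews are in a DataFrame, even one-line pandas aggregations become informative. As a quick illustration, with a toy frame standing in for myScraper.all_reviews:

```python
import pandas as pd

# Toy stand-in for the scraped reviews frame
df = pd.DataFrame({'Rating': [5, 5, 4.5, 3, 5],
                   'Review': ['great', 'lovely', 'good', 'ok', 'superb']})

mean_rating = df['Rating'].mean()                     # average star rating
distribution = df['Rating'].value_counts().sort_index()  # ratings histogram
```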
7-Supplementary Material
from lxml import html
import requests
import pandas as pd
import time


class WebScraper:
    """
    This class aids in retrieving review information such as the review,
    the title of the review, the date of the review and the rating of the
    review. Since each review site's HTML is constructed differently,
    site-specific functions are required.
    """

    def __init__(self, url='', site='', silent=True, url1='', url2='',
                 increment_string1='', increment_string2='',
                 total_pages=1, increment=10, seconds_wait=1):
        """
        Constructor.
        url: the main url of the website to scrape for one-off webscraping
        site: the site the url is on, e.g. tripadvisor
        silent: determines whether diagnostics are to be displayed
        url1: the first part of a series of urls that doesn't change
        url2: the second part of a series of urls that doesn't change
        increment_string1: the first incremental part of the url
        increment_string2: the second incremental part of the url
        total_pages: total number of pages to read
        increment: the amount each page offset increases by
        seconds_wait: wait time between requests in seconds

        Remark: url1 and url2 are the static parts of the urls that do not
        change during incrementation.

        Example:
        url = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html"
        url1 + increment_string1 + increment + increment_string2 + url2 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-or10-The_House_of_Dionysus-Paphos_Paphos_District.html"

        In the above, url is the main page (this can be left blank).
        url1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
        increment_string1 = "-or"
        increment = 10 (this is how much each page increments by)
        increment_string2 = ""
        url2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
        """
        self.url = url
        self.url1 = url1
        self.url2 = url2
        self.first_url = url1 + url2
        self.increment_string1 = increment_string1
        self.increment_string2 = increment_string2
        self.total_pages = total_pages
        self.increment = increment
        self.site = site
        self.seconds_wait = seconds_wait
        self.silent = silent
        self.supported_sites = ['tripadvisor', 'yelp']

    def findStars(self, x):
        """
        This function extracts the rating from the html element.
        x: string representation of the html element
        returns: float. The rating (0 if no rating is found).
        """
        if self.site.lower() == 'tripadvisor':
            # The rating is encoded in a class name, e.g. bubble_45 -> 4.5
            x2 = str(x).replace('>', ' ').split()
            for i in range(1, 11):
                if 'bubble_{}"'.format(i * 5) in x2:
                    return i * 0.5
        elif self.site.lower() == 'yelp':
            # The rating appears as text, e.g. '4.5 star'
            x2 = str(x)
            for i in range(1, 11):
                if '{:.1f} star'.format(i * 0.5) in x2:
                    return i * 0.5
        return 0

    def diagnostics(self, *args):
        """
        This function checks that the lists given as arguments are of
        equal size.
        args: an arbitrary number of lists
        returns: bool. True if all the lists are the same length.
        """
        if not self.silent:
            print('Diagnostics: Checking if dataframes are of equal size...')
            for i in args:
                print('Size: {}'.format(len(i)))
        # Check each list against the length of the first
        first_length = len(args[0])
        for i in args:
            if len(i) != first_length:
                if not self.silent:
                    print('Unequal Sizes!')
                return False
        if not self.silent:
            print('Diagnostics complete!')
        return True

    def scrape(self, url=''):
        """
        This function scrapes the relevant review tags from a website url.
        If a url is provided, it is intended for a one-off read of a
        particular web page. If it is not provided, the url stored on the
        object is used.
        url: a string url
        returns: (DataFrame, bool). The reviews and the success of the read.
        """
        success = False
        # If a url is not provided, get it from the object
        if not url:
            url = self.url
        # These store the actual review components
        reviews_array = []
        ratings_array = []
        titles_array = []
        dates_array = []
        # Get the request object from the server
        page = requests.get(url)
        # Convert the request content to an html object
        top = html.fromstring(page.content)
        # Site-specific html configuration
        if self.site.lower() == 'tripadvisor':
            # Loop through the review containers
            for i in top.find_class('review-container'):
                # The review itself
                reviews_array.append(i.find_class('entry')[0].text_content())
                # The rating is encoded in a class name within the container
                ratings_array.append(self.findStars(html.tostring(i)))
                # The title of the review
                titles_array.append(i.find_class('noQuotes')[0].text_content())
                # The date of the review
                dates_array.append(i.find_class('ratingDate')[0].text_content())
            # Diagnostics
            success = self.diagnostics(ratings_array, reviews_array,
                                       dates_array, titles_array)
        elif self.site.lower() == 'yelp':
            # Loop through the review contents
            for i in top.find_class('review-content'):
                # The review itself
                reviews_array.append(i.find('p').text_content())
                # The rating is encoded in the biz-rating class
                ratings_array.append(
                    self.findStars(html.tostring(i.find_class('biz-rating')[0])))
                # When a review is updated, 'Updated review' is present in
                # the date string, so strip it out
                dates_array.append(i.find_class('rating-qualifier')[0]
                                   .text_content()
                                   .replace('Updated review', '').strip())
            # There are no titles on yelp
            titles_array = reviews_array.copy()
            # Diagnostics
            success = self.diagnostics(ratings_array, reviews_array, dates_array)
        else:
            print('The site {} is not supported'.format(self.site))
            return pd.DataFrame(), False
        # Convert to dataframes
        df_review = pd.DataFrame(reviews_array, columns=['Review'])
        df_ratings = pd.DataFrame(ratings_array, columns=['Rating'])
        df_titles = pd.DataFrame(titles_array, columns=['title'])
        df_reviewdates = pd.DataFrame(dates_array, columns=['date'])
        # Consolidate into a single dataframe
        df_fullreview = pd.concat([df_review, df_titles, df_ratings['Rating'],
                                   df_reviewdates], axis=1)
        df_fullreview.dropna(inplace=True)
        # Combine review and title into a single column
        df_fullreview['fullreview'] = (df_fullreview['Review'] + ' ' +
                                       df_fullreview['title'])
        # Store the reviews on the object
        self.reviews = df_fullreview
        return df_fullreview, success

    def fullscraper(self):
        """
        This function increments the site url to the next page according to
        the update criteria and scrapes that page. The full url of
        subsequent pages is
        url1 + increment_string1 + increment + increment_string2 + url2.
        """
        # A variable to store the success of each read
        success = False
        # Main dataframe
        df = pd.DataFrame()
        # Progress output
        print('Getting reviews 0/{}'.format(self.total_pages))
        # url incrementation differs per website
        if self.site.lower() in self.supported_sites:
            # Keep trying to read the first page until the read succeeds
            while not success:
                df, success = self.scrape(self.first_url)
                if not success:
                    print('Error in reading - Re-reading')
                # Wait between requests
                time.sleep(self.seconds_wait)
            print('Getting reviews 1/{}'.format(self.total_pages))
            # Now loop through each subsequent page and read it
            for i in range(1, self.total_pages):
                # Whenever there is an error in reading a page, we retry
                success = False
                # Compose the url of this page
                url_temp = (self.url1 + self.increment_string1 +
                            str(i * self.increment) +
                            self.increment_string2 + self.url2)
                # Try to read the page until the read succeeds
                while not success:
                    df_temp, success = self.scrape(url_temp)
                    if not success:
                        print('Error in reading - Re-reading')
                    # Wait between requests
                    time.sleep(self.seconds_wait)
                # Build up the dataframe
                df = pd.concat([df, df_temp])
                # Print progress
                print('Getting reviews {}/{}'.format(i + 1, self.total_pages))
            print('Complete!!!')
        # Store the read information on the object
        self.all_reviews = df.reset_index(drop=True)
        return self.all_reviews


if __name__ == '__main__':
    # Single-page usage
    url = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews-The_House_of_Dionysus-Paphos_Paphos_District.html"
    site = 'tripadvisor'
    ms = WebScraper(url, site, silent=False)
    ms.scrape()
    print(ms.reviews)

    # Multi-page usage
    inurl1 = "https://www.tripadvisor.co.uk/Attraction_Review-g190384-d6755801-Reviews"
    inurl2 = "-The_House_of_Dionysus-Paphos_Paphos_District.html"
    ms = WebScraper(site='tripadvisor', url1=inurl1, url2=inurl2,
                    increment_string1="-or", increment_string2="",
                    total_pages=20, increment=10, silent=False)
    ms.fullscraper()
    print(ms.all_reviews)