https://github.com/PhilChodrow/cos_2017/tree/master/3_modeling_and_ml
#### Intro stuff to be loaded in during part 1 ####
setwd("~/cos_2017/3_modeling_and_ml")
# install.packages("tidyverse")
library(tidyverse)
# Read in listings data set, and convert price column to numeric
listings = read.csv("../data/listings.csv")
listings$price = as.numeric(gsub('\\$|,','',as.character(listings$price)))
# Read in the reviews data set, making sure to set stringsAsFactors=FALSE
reviews = read.csv("../data/reviews.csv", stringsAsFactors = FALSE)
View the data from reviews.csv. What does each row represent?
head(reviews, 3)
## listing_id id date reviewer_id reviewer_name
## 1 1178162 4724140 2013-05-21 4298113 Olivier
## 2 1178162 4869189 2013-05-29 6452964 Charlotte
## 3 1178162 5003196 2013-06-06 6449554 Sebastian
## comments
## 1 My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, all place was clean. Islam managed pretty well our arrival, even if it was last minute ;) i do recommand this place to any airbnb user :)
## 2 Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was away
## 3 We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Even though Islam himself was not there everything was prepared for our arrival. The airport T Station is only a 5-10 min walk away. The only little issue was that all the people in the house had to share one bathroom. But it was not really a problem and it worked out fine. We would recommend Islams place for a stay in Boston.
Display the top 10 most reviewed Airbnb listings using the reviews
data frame. Are the counts consistent with the data in the listings
data frame?
sort(listings$number_of_reviews, decreasing = TRUE)[1:10]
sort(table(reviews$listing_id), decreasing = TRUE)[1:10]
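The same check can also be written with dplyr verbs, which makes it easier to line up listing IDs with their review counts. This is a sketch using only columns already present in the two data frames.
# dplyr alternative (sketch): top 10 listings by review count in each data frame
listings %>%
  select(id, number_of_reviews) %>%
  arrange(desc(number_of_reviews)) %>%
  head(10)
reviews %>%
  count(listing_id, sort = TRUE) %>%
  head(10)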
Later on, we will want to merge the two data frames: listings and reviews. The ID variables that we will use for this merge operation are listings$id and reviews$listing_id. It is important to understand the data structure before performing the analysis. Both data frames have 2829 unique listings with >= 1 review - let’s confirm this fact.
length(unique(listings$id))
nrow(filter(listings, number_of_reviews>0))
length(unique(reviews$listing_id))
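Counting unique IDs only shows that the totals agree. A quick sketch with setdiff() (not part of the original walkthrough) confirms that the ID sets themselves match:
# Both differences should be empty if the two data frames cover the same listings
reviewed_ids = filter(listings, number_of_reviews > 0)$id
length(setdiff(reviewed_ids, unique(reviews$listing_id)))
length(setdiff(unique(reviews$listing_id), reviewed_ids))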
We will take review_scores_rating as the dependent variable that we are trying to predict. This is the average customer rating of the Airbnb listing, on a scale of 0-100. Plot a simple histogram of review_scores_rating, and count the number of non-NA values.
hist(listings$review_scores_rating)
sum(!is.na(listings$review_scores_rating))
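Since the tidyverse is already loaded, the same histogram can also be drawn with ggplot2 (a sketch; the binwidth of 5 is an arbitrary choice):
ggplot(listings, aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 5, na.rm = TRUE)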
Next, create a new data frame with just the review scores data from listings.csv. Filter out rows with review_scores_rating = NA.
listings_scores = listings %>%
  filter(number_of_reviews > 0) %>%
  select("LISTING_ID" = id, "RATING" = review_scores_rating) %>%
  filter(!is.na(RATING))
str(listings_scores)
## 'data.frame': 2772 obs. of 2 variables:
## $ LISTING_ID: int 3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 12023024 ...
## $ RATING : int 94 98 100 99 100 90 96 96 94 80 ...
Exercise 2.1: Writing a simple function in R
The syntax for writing the function f(x) = x^2 is
f <- function(x){
  return(x*x)
}
Write a function to convert the listing rating from a scale of 0-100 to (“Terrible”,“Low”,“Mid”,“High”,“Perfect”).
Given an integer input rating from 0-100, the function should output:
“Perfect” if rating = 100
“High” if 95 <= rating <= 99
“Mid” if 90 <= rating <= 94
“Low” if 80 <= rating <= 89
“Terrible” if rating <= 79
For example: convert_rating(64) should output “Terrible”
Solution:
convert_rating <- function(rating){
  if(rating == 100){
    return("Perfect")
  } else if(rating >= 95){
    return("High")
  } else if(rating >= 90){
    return("Mid")
  } else if(rating >= 80){
    return("Low")
  } else {
    return("Terrible")
  }
}
Test a few values to make sure that the function works
convert_rating(100)
convert_rating(98)
convert_rating(90)
convert_rating(82)
convert_rating(46)
To apply the convert_rating() function to each element of a vector, we need to “vectorize” the function first. Avoid using for-loops in R whenever possible, as they tend to be slow.
v_convert_rating <- Vectorize(convert_rating, c("rating"))
# Test a few values to make sure that the function works.
v_convert_rating(c(100,32,87))
## [1] "Perfect" "Terrible" "Low"
Compute the new column using a mutate() call.
listings_scores <- listings_scores %>%
  mutate(RATING_TEXT = v_convert_rating(RATING))
# Take a look
table(listings_scores$RATING_TEXT)
##
## High Low Mid Perfect Terrible
## 754 545 670 628 175
These groupings look relatively well-balanced, which is desirable. For the NLP task, we will try to predict this rating category based upon the text data from reviews.csv.
Let’s go back to the reviews data frame.
str(reviews)
## 'data.frame': 68275 obs. of 6 variables:
## $ listing_id : int 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 ...
## $ id : int 4724140 4869189 5003196 5150351 5171140 5198929 6702817 6873023 7646702 8094418 ...
## $ date : chr "2013-05-21" "2013-05-29" "2013-06-06" "2013-06-15" ...
## $ reviewer_id : int 4298113 6452964 6449554 2215611 6848427 6663826 8099222 7671888 8197342 9040491 ...
## $ reviewer_name: chr "Olivier" "Charlotte" "Sebastian" "Marine" ...
## $ comments : chr "My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, al"| __truncated__ "Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was "| __truncated__ "We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Ev"| __truncated__ "The room was nice and clean and so were the commodities. Very close to the airport metro station and located in quite safe area"| __truncated__ ...
Currently, we have a data frame with 68275 rows. We would like to have a data frame with 2829 rows - one per listing. We can use the group_by() and summarize() functions to transform the data frame in this way.
reviews_by_listing = reviews %>%
  select(listing_id, comments) %>%
  group_by(listing_id) %>%
  summarize(all_comments = paste(comments, collapse = " "))
# Check out the updated data frame - 2829 rows.
str(reviews_by_listing)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2829 obs. of 2 variables:
## $ listing_id : int 3353 5506 6695 6976 8792 9273 9765 9824 9855 9857 ...
## $ all_comments: chr "Very friendly and helpful. Convenient location. The location is great as it's right next to the Green T stop and there are many"| __truncated__ "Terry's Hotel Alterntv in Boston was a perfect place to stay for myself and my partner. We mixed our trip with business and pl"| __truncated__ "Terry's apartment is beautifully decorated and feels like a home away from home. Kudos especially to the kitchen, which was rea"| __truncated__ "A Wonderful, pleasant, and charming host. The bed is very comfortable and the room is nice. Travel wise one is 15 minutes by t"| __truncated__ ...
View the first listing’s comments.
reviews_by_listing$all_comments[1]
Observations? What are some problems that we might run into with bag-of-words?
Natural Language Processing slides
Now, we are ready to construct the Bag-of-Words with Airbnb reviews for the prediction task. The following step-by-step procedure for building the Bag-of-Words is adapted from MIT EdX - Analytics Edge, Lecture 8.
Step 0: Install and load two packages for pre-processing:
# install.packages("tm")
library(tm)
# install.packages("SnowballC")
library(SnowballC)
Step 1: Convert reviewer comments to a corpus, automatically processing escape characters like “\n”.
Step 2: Change all the text to lower case, and convert to “PlainTextDocument” (required after the tolower step).
Step 3: Remove all punctuation.
Step 4: Remove stop words (this step may take a minute).
Step 5: Stem our document.
corpus = Corpus(VectorSource(reviews_by_listing$all_comments)) %>%
  tm_map(tolower) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)
# Take a look
strwrap(corpus[[1]])[1:3]
## [1] "friend help conveni locat locat great right next green t stop mani"
## [2] "food place nearbi room small fit queen mattress almost noth els"
## [3] "day use room night work host friend help giuesepp cordial friend"
# Take a look at tm's stopwords:
stopwords("english")[1:100]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
Step 6: Create a word count matrix (rows are reviews, columns are words).
frequencies = DocumentTermMatrix(corpus)
# Take a look
frequencies
## <<DocumentTermMatrix (documents: 2829, terms: 45645)>>
## Non-/sparse entries: 838822/128290883
## Sparsity : 99%
## Maximal term length: 365
## Weighting : term frequency (tf)
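To peek at a corner of the matrix, tm's inspect() can be used on a subset (a sketch; the row and column ranges are arbitrary):
inspect(frequencies[1:5, 1:5])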
Step 7: Account for sparsity.
# Use findFreqTerms to get a feeling for which words appear the most.
# Words that appear at least 10000 times:
findFreqTerms(frequencies, lowfreq=10000)
## [1] "also" "apart" "boston" "clean"
## [5] "close" "comfort" "definit" "easi"
## [9] "everyth" "get" "good" "great"
## [13] "help" "home" "host" "hous"
## [17] "locat" "love" "need" "neighborhood"
## [21] "nice" "perfect" "place" "realli"
## [25] "recommend" "room" "stay" "time"
## [29] "walk" "well"
# Not all 45645 terms will be useful to us. Might as well get rid of some of them - why?
# Solution: only keep terms that appear in x% or more of the listing documents.
# Here x = 5%, i.e. terms appearing in at least 142 of the 2829 documents.
sparse = removeSparseTerms(frequencies, 0.95)
# How many did we keep? (1136 terms, compared to 45645 previously)
sparse
## <<DocumentTermMatrix (documents: 2829, terms: 1136)>>
## Non-/sparse entries: 597260/2616484
## Sparsity : 81%
## Maximal term length: 13
## Weighting : term frequency (tf)
# colnames(sparse)
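To see which terms survived the sparsity filter, one quick check (a sketch) is to sort the column sums of the term matrix:
# Ten most frequent terms remaining after removeSparseTerms()
sort(colSums(as.matrix(sparse)), decreasing = TRUE)[1:10]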
Step 8: Create data frame.
commentsTM = as.data.frame(as.matrix(sparse))
# View data frame (rows are reviews, columns are words)
str(commentsTM, list.len=10)
## 'data.frame': 2829 obs. of 1136 variables:
## $ 100 : num 0 0 0 0 1 0 1 0 0 0 ...
## $ 1015 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ 1520 : num 0 2 0 0 0 0 0 0 0 0 ...
## $ 2nd : num 0 0 0 0 0 0 0 0 0 4 ...
## $ 3rd : num 0 2 0 0 0 0 0 2 0 0 ...
## $ 510 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ abl : num 1 1 2 3 0 3 0 2 0 0 ...
## $ absolut : num 0 0 2 2 0 2 0 2 0 1 ...
## $ accept : num 0 0 0 0 0 0 0 1 0 0 ...
## $ access : num 3 5 8 6 0 2 1 0 0 0 ...
## [list output truncated]
# Drop columns that include numbers
commentsTM = commentsTM[,!grepl("[0-9]",names(commentsTM))]
We have finished building the term frequency data frame commentsTM. Next, we need to merge the two data frames commentsTM (features) and listings_scores (labels) before we can run our machine learning algorithms on this data. This will be a full join by LISTING_ID.
# Add our dependent variable:
commentsTM$LISTING_ID = reviews_by_listing$listing_id
commentsTM = full_join(listings_scores, commentsTM)
## Joining, by = "LISTING_ID"
# Remove all rows with NA's
commentsTM = na.omit(commentsTM)
# View the first few data frame columns
# Note: Column names corresponding to word frequencies are lowercase,
# while all other column names are uppercase.
names(commentsTM)[1:10]
## [1] "LISTING_ID" "RATING" "RATING_TEXT" "abl" "absolut"
## [6] "accept" "access" "accommod" "accomod" "accueil"
Exercise 2.2: Your own Bag-of-Words
Following steps 0-8, build a Bag-of-Words data frame on the listings description data. Add price as the dependent variable, name it “PRICE”, remove the rows with price = NA, and move this column to the front of the new data frame. (Hint: Construct the term-matrix listingsTM by modifying the NLP code in the file bonus.R.)
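A sketch of one possible approach, assuming listings$description holds the listing text; it mirrors steps 0-8 above rather than the contents of bonus.R, so treat it as a starting point only.
# Build a term matrix from the listing descriptions (sketch)
corpusDesc = Corpus(VectorSource(listings$description)) %>%
  tm_map(tolower) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)
listingsTM = DocumentTermMatrix(corpusDesc) %>%
  removeSparseTerms(0.95) %>%
  as.matrix() %>%
  as.data.frame()
listingsTM = listingsTM[, !grepl("[0-9]", names(listingsTM))]
# Add PRICE as the dependent variable, drop rows with missing price, and move PRICE to the front
listingsTM$PRICE = listings$price
listingsTM = listingsTM %>%
  filter(!is.na(PRICE)) %>%
  select(PRICE, everything())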
Up to here, we have just pre-processed and prepared our data. Now, we are ready to build models.
Next, we will use our Bag-of-Words to build a CART model to predict review scores. How could a model like this be useful in practice? We will follow the same cross-validation procedure as we did before to select the cp parameter for our CART model. The only difference is that now our features will be word counts, and our predictions will be the discrete values: (“Terrible”,“Low”,“Mid”,“High”,“Perfect”).
To begin, convert RATING_TEXT to a factor variable, and set the order of the level values so that they appear properly in our confusion matrix.
commentsTM$RATING_TEXT = commentsTM$RATING_TEXT %>%
  as.factor() %>%
  ordered(levels = c("Terrible", "Low", "Mid", "High", "Perfect"))
str(commentsTM$RATING_TEXT)
## Ord.factor w/ 5 levels "Terrible"<"Low"<..: 3 4 5 4 5 3 4 4 3 2 ...
Split the data into training and testing sets.
# install.packages("caTools")
library(caTools)
set.seed(123)
spl = sample.split(commentsTM$RATING_TEXT, SplitRatio = 0.7)
commentsTrain = subset(commentsTM, spl==TRUE)
commentsTest = subset(commentsTM, spl==FALSE)
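sample.split() stratifies on the outcome variable, so both sets should have roughly the same class proportions; a quick check (sketch):
round(prop.table(table(commentsTrain$RATING_TEXT)), 3)
round(prop.table(table(commentsTest$RATING_TEXT)), 3)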
Let’s use CART! Why CART?
# install.packages("rpart")
library(rpart)
# install.packages("rpart.plot")
library(rpart.plot)
First, train the model using the default parameter values (cp=0.01). Of course, we cannot include RATING or LISTING_ID as predictive variables - why not?
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
                     data = commentsTrain,
                     method = "class")
# Display decision tree. Does it make intuitive sense?
prp(commentsCART)
Next, let’s perform cross-validation on our Bag-of-Words CART model to tune our choice for cp. Useful resource for cross-validation of cp in rpart: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
Step 1: Begin by constructing a tree with a small cp parameter
set.seed(2222)
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
                     data = commentsTrain,
                     cp = 0.001,
                     method = "class")
Step 2: View the cross-validated error vs. cp
# In the `printcp()` table:
# "nsplit" = number of splits in tree
# "rel error" = scaled training error
# "xerror" = scaled cross-validation error
# "xstd" = standard deviation of xerror
printcp(commentsCART)
##
## Classification tree:
## rpart(formula = RATING_TEXT ~ . - RATING - LISTING_ID, data = commentsTrain,
## method = "class", cp = 0.001)
##
## Variables actually used in tree construction:
## [1] absolut access accommod advertis afford air
## [7] although anoth awesom bathroom beauti believ
## [13] beyond boston clean close code comfort
## [19] communic deal deck definit differ dirti
## [25] drink end enough even ever everyth
## [31] exact expect first found four futur
## [37] get good gracious great high home
## [43] host howev immacul late let locat
## [49] move negat never night offer one
## [55] overal person phone pick price profession
## [61] provid quaint question receiv recommend said
## [67] second short smaller smell spacious stay
## [73] stop take thank thought train transit
## [79] transport travel trip une visit weekend
## [85] well will wonder
##
## Root node error: 1413/1941 = 0.72798
##
## n= 1941
##
## CP nsplit rel error xerror xstd
## 1 0.2016985 0 1.00000 1.00000 0.013875
## 2 0.0368011 1 0.79830 0.81246 0.015327
## 3 0.0113234 2 0.76150 0.77707 0.015455
## 4 0.0084926 5 0.72753 0.78273 0.015437
## 5 0.0077849 6 0.71904 0.77424 0.015463
## 6 0.0070771 11 0.68011 0.77070 0.015473
## 7 0.0068412 12 0.67304 0.76929 0.015477
## 8 0.0049540 15 0.65251 0.75796 0.015506
## 9 0.0046001 17 0.64260 0.76362 0.015492
## 10 0.0042463 20 0.62845 0.75442 0.015514
## 11 0.0038924 26 0.60297 0.75513 0.015513
## 12 0.0035386 38 0.55343 0.75938 0.015503
## 13 0.0031847 50 0.51026 0.76079 0.015499
## 14 0.0028309 52 0.50389 0.76008 0.015501
## 15 0.0025950 57 0.48974 0.75655 0.015509
## 16 0.0024770 60 0.48195 0.75584 0.015511
## 17 0.0021231 62 0.47700 0.76929 0.015477
## 18 0.0018401 72 0.45577 0.77141 0.015471
## 19 0.0017693 78 0.44444 0.77424 0.015463
## 20 0.0014154 83 0.43454 0.77495 0.015461
## 21 0.0012385 100 0.41047 0.77565 0.015459
## 22 0.0011795 104 0.40552 0.77353 0.015465
## 23 0.0010616 107 0.40198 0.77565 0.015459
## 24 0.0010000 109 0.39986 0.78202 0.015439
# In the `plotcp()` plot:
# size of tree = (number of splits in tree) + 1
# dashed line occurs at 1 std. dev. above the minimum xerror
# Rule of Thumb: select the model size which first
# goes below the dotted line
plotcp(commentsCART)
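The rule of thumb can also be applied programmatically from the tree's cptable (a sketch, not part of the original script). Based on the table above, this lands close to the cp = 0.007 used in Step 3 below.
# One-standard-error rule: smallest tree whose xerror is within one xstd of the minimum
cpTable = commentsCART$cptable
minRow = which.min(cpTable[, "xerror"])
threshold = cpTable[minRow, "xerror"] + cpTable[minRow, "xstd"]
bestRow = min(which(cpTable[, "xerror"] <= threshold))
cpTable[bestRow, "CP"]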
Step 3: Prune the tree, and take a look
commentsCART = prune(commentsCART,cp=0.007)
prp(commentsCART)
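To close the loop, here is a sketch (not in the original script) of evaluating the pruned tree on the held-out test set, using the confusion matrix mentioned earlier.
# Predict rating categories on the test set and tabulate the confusion matrix
predTest = predict(commentsCART, newdata = commentsTest, type = "class")
confMat = table(actual = commentsTest$RATING_TEXT, predicted = predTest)
confMat
# Overall test-set accuracy
sum(diag(confMat)) / nrow(commentsTest)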