https://github.com/PhilChodrow/cos_2017/tree/master/3_modeling_and_ml

#### Intro stuff to be loaded in during part 1 ####
setwd("~/cos_2017/3_modeling_and_ml")
# install.packages("tidyverse")
library(tidyverse)

# Read in listings data set, and convert price column to numeric
listings = read.csv("../data/listings.csv")
listings$price = as.numeric(gsub('\\$|,','',as.character(listings$price)))

# Read in the reviews data set, making sure to set stringsAsFactors=FALSE
reviews = read.csv("../data/reviews.csv", stringsAsFactors = FALSE)

Natural Language Processing

View the data from reviews.csv. What does each row represent?

head(reviews, 3)
##   listing_id      id       date reviewer_id reviewer_name
## 1    1178162 4724140 2013-05-21     4298113       Olivier
## 2    1178162 4869189 2013-05-29     6452964     Charlotte
## 3    1178162 5003196 2013-06-06     6449554     Sebastian
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                 comments
## 1                                                                                                                                                                                                 My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, all place was clean. Islam managed pretty well our arrival, even if it was last minute ;) i do recommand this place to any airbnb user :)
## 2                                                                                                                                                                                                                                                                                                                                    Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was away
## 3 We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Even though Islam himself was not there everything was prepared for our arrival. The airport T Station is only a 5-10 min walk away. The only little issue was that all the people in the house had to share one bathroom. But it was not really a problem and it worked out fine. We would recommend Islams place for a stay in Boston.

Display the top 10 most reviewed Airbnb listings using the reviews data frame. Are the counts consistent with the data in the listings data frame?

sort(listings$number_of_reviews, decreasing = TRUE)[1:10]
sort(table(reviews$listing_id), decreasing = TRUE)[1:10]
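
One way to check consistency directly (a minimal sketch - recall that id in listings corresponds to listing_id in reviews):

# Compare review counts from both sources for the ten most-reviewed listings
top_counts = sort(table(reviews$listing_id), decreasing = TRUE)[1:10]
listings %>%
  filter(id %in% names(top_counts)) %>%
  select(id, number_of_reviews)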

Later on, we will want to merge the two data frames, listings and reviews. The ID variables that we will use for this merge operation are listings$id and reviews$listing_id. It is important to understand the data structure before performing the analysis: both data frames contain the same 2829 unique listings with at least one review - let’s confirm this fact.

length(unique(listings$id))
nrow(filter(listings, number_of_reviews>0))
length(unique(reviews$listing_id))

We will take review_scores_rating as the dependent variable that we are trying to predict. This is the average customer rating of the Airbnb listing, on a scale of 0-100. Plot a simple histogram of review_scores_rating, and count the number of non-NA values.

hist(listings$review_scores_rating)
sum(!is.na(listings$review_scores_rating))

Next, create a new data frame with just the review scores data from listings.csv. Filter out rows where review_scores_rating is NA.

listings_scores = listings %>% 
  filter(number_of_reviews > 0) %>%
  select("LISTING_ID"=id, "RATING"=review_scores_rating) %>%
  filter(!is.na(RATING))
str(listings_scores)
## 'data.frame':    2772 obs. of  2 variables:
##  $ LISTING_ID: int  3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 12023024 ...
##  $ RATING    : int  94 98 100 99 100 90 96 96 94 80 ...

Exercise 2.1: Writing a simple function in R
The syntax for writing the function f(x) = x^2 is

f <- function(x){
  return(x*x)
}

Write a function to convert the listing rating from a scale of 0-100 to (“Terrible”,“Low”,“Mid”,“High”,“Perfect”).
Given an integer input rating from 0-100, the function should output:
“Perfect” if rating = 100
“High” if 95 <= rating <= 99
“Mid” if 90 <= rating <= 94
“Low” if 80 <= rating <= 89
“Terrible” if rating <= 79
For example: convert_rating(64) should output “Terrible”

Solution:

convert_rating <- function(rating){
  if(rating == 100){
    return("Perfect")
  }else if(rating >= 95){
    return("High")
  }else if(rating >= 90){
    return("Mid")
  }else if(rating >= 80){
    return("Low")
  }else{
    return("Terrible")
  }
}

Test a few values to make sure that the function works.

convert_rating(100)
convert_rating(98)
convert_rating(90)
convert_rating(82)
convert_rating(46)

To apply the convert_rating() function to each element of a vector, we first need to “vectorize” it. Avoid explicit for-loops in R whenever possible - vectorized operations are generally much faster.

v_convert_rating <- Vectorize(convert_rating, c("rating"))
# Test a few values to make sure that the function works.
v_convert_rating(c(100,32,87))
## [1] "Perfect"  "Terrible" "Low"

Compute the new column using a mutate call.

listings_scores <- listings_scores %>%
  mutate(RATING_TEXT = v_convert_rating(RATING))

# Take a look
table(listings_scores$RATING_TEXT)
## 
##     High      Low      Mid  Perfect Terrible 
##      754      545      670      628      175

These groupings look relatively well-balanced, which is desirable. For the NLP task, we will try to predict this rating category based upon the text data from reviews.csv.
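
To put a number on “relatively well-balanced”, a quick sketch of the class proportions:

# Share of listings in each rating category
round(prop.table(table(listings_scores$RATING_TEXT)), 2)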

Let’s go back to the reviews data frame.

str(reviews)
## 'data.frame':    68275 obs. of  6 variables:
##  $ listing_id   : int  1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 ...
##  $ id           : int  4724140 4869189 5003196 5150351 5171140 5198929 6702817 6873023 7646702 8094418 ...
##  $ date         : chr  "2013-05-21" "2013-05-29" "2013-06-06" "2013-06-15" ...
##  $ reviewer_id  : int  4298113 6452964 6449554 2215611 6848427 6663826 8099222 7671888 8197342 9040491 ...
##  $ reviewer_name: chr  "Olivier" "Charlotte" "Sebastian" "Marine" ...
##  $ comments     : chr  "My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, al"| __truncated__ "Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was "| __truncated__ "We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Ev"| __truncated__ "The room was nice and clean and so were the commodities. Very close to the airport metro station and located in quite safe area"| __truncated__ ...

Currently, we have a data frame with 68275 rows. We would like to have a data frame with 2829 rows - one per listing. We can use the group_by() and summarize() functions to transform the data frame in this way.

reviews_by_listing = reviews %>%
  select(listing_id,comments) %>%
  group_by(listing_id) %>%
  summarize(all_comments=paste(comments,collapse=" "))

# Check out the updated data frame - 2829 rows.
str(reviews_by_listing)
## Classes 'tbl_df', 'tbl' and 'data.frame':    2829 obs. of  2 variables:
##  $ listing_id  : int  3353 5506 6695 6976 8792 9273 9765 9824 9855 9857 ...
##  $ all_comments: chr  "Very friendly and helpful. Convenient location. The location is great as it's right next to the Green T stop and there are many"| __truncated__ "Terry's Hotel Alterntv in Boston was a perfect place to stay for myself and my partner.  We mixed our trip with business and pl"| __truncated__ "Terry's apartment is beautifully decorated and feels like a home away from home. Kudos especially to the kitchen, which was rea"| __truncated__ "A Wonderful, pleasant, and charming host.  The bed is very comfortable and the room is nice. Travel wise one is 15 minutes by t"| __truncated__ ...

View the first listing’s comments.

reviews_by_listing$all_comments[1]

Observations? What are some problems that we might run into with bag-of-words?

Natural Language Processing slides

Now, we are ready to construct the Bag-of-Words with Airbnb reviews for the prediction task. The following step-by-step procedure for building the Bag-of-Words is adapted from MIT EdX - Analytics Edge, Lecture 8.

Step 0: Install and load two packages for pre-processing:

# install.packages("tm")
library(tm)
# install.packages("SnowballC")
library(SnowballC)

Step 1: Convert reviewer comments to a corpus, automatically processing escape characters like “\n”.
Step 2: Change all the text to lower case, and convert to “PlainTextDocument” (required after applying tolower()).
Step 3: Remove all punctuation.
Step 4: Remove stop words (this step may take a minute).
Step 5: Stem our document.

corpus = Corpus(VectorSource(reviews_by_listing$all_comments)) %>%
  tm_map(tolower) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)

# Take a look
strwrap(corpus[[1]])[1:3]
## [1] "friend help conveni locat locat great right next green t stop mani"
## [2] "food place nearbi room small fit queen mattress almost noth els"   
## [3] "day use room night work host friend help giuesepp cordial friend"
# Take a look at tm's stopwords:
stopwords("english")[1:100]
##   [1] "i"          "me"         "my"         "myself"     "we"        
##   [6] "our"        "ours"       "ourselves"  "you"        "your"      
##  [11] "yours"      "yourself"   "yourselves" "he"         "him"       
##  [16] "his"        "himself"    "she"        "her"        "hers"      
##  [21] "herself"    "it"         "its"        "itself"     "they"      
##  [26] "them"       "their"      "theirs"     "themselves" "what"      
##  [31] "which"      "who"        "whom"       "this"       "that"      
##  [36] "these"      "those"      "am"         "is"         "are"       
##  [41] "was"        "were"       "be"         "been"       "being"     
##  [46] "have"       "has"        "had"        "having"     "do"        
##  [51] "does"       "did"        "doing"      "would"      "should"    
##  [56] "could"      "ought"      "i'm"        "you're"     "he's"      
##  [61] "she's"      "it's"       "we're"      "they're"    "i've"      
##  [66] "you've"     "we've"      "they've"    "i'd"        "you'd"     
##  [71] "he'd"       "she'd"      "we'd"       "they'd"     "i'll"      
##  [76] "you'll"     "he'll"      "she'll"     "we'll"      "they'll"   
##  [81] "isn't"      "aren't"     "wasn't"     "weren't"    "hasn't"    
##  [86] "haven't"    "hadn't"     "doesn't"    "don't"      "didn't"    
##  [91] "won't"      "wouldn't"   "shan't"     "shouldn't"  "can't"     
##  [96] "cannot"     "couldn't"   "mustn't"    "let's"      "that's"

Step 6: Create a word count matrix (rows are reviews, columns are words).

frequencies = DocumentTermMatrix(corpus)

# Take a look
frequencies
## <<DocumentTermMatrix (documents: 2829, terms: 45645)>>
## Non-/sparse entries: 838822/128290883
## Sparsity           : 99%
## Maximal term length: 365
## Weighting          : term frequency (tf)
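
To peek at the raw counts in one corner of this matrix, tm's inspect() accepts a sub-matrix (a small sketch):

# View counts for the first 5 documents and first 5 terms
inspect(frequencies[1:5, 1:5])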

Step 7: Account for sparsity.

# Use findFreqTerms to get a feeling for which words appear the most.
# Words that appear at least 10000 times:
findFreqTerms(frequencies, lowfreq=10000)
##  [1] "also"         "apart"        "boston"       "clean"       
##  [5] "close"        "comfort"      "definit"      "easi"        
##  [9] "everyth"      "get"          "good"         "great"       
## [13] "help"         "home"         "host"         "hous"        
## [17] "locat"        "love"         "need"         "neighborhood"
## [21] "nice"         "perfect"      "place"        "realli"      
## [25] "recommend"    "room"         "stay"         "time"        
## [29] "walk"         "well"
# Not all 45645 terms will be useful to us. Might as well get rid of some of them - why?
# Solution: only keep terms that appear in x% or more of the reviews
# 5% or more (142 or more)
sparse = removeSparseTerms(frequencies, 0.95)

# How many did we keep? (1136 terms, compared to 45645 previously)
sparse
## <<DocumentTermMatrix (documents: 2829, terms: 1136)>>
## Non-/sparse entries: 597260/2616484
## Sparsity           : 81%
## Maximal term length: 13
## Weighting          : term frequency (tf)
# colnames(sparse)

Step 8: Create data frame.

commentsTM = as.data.frame(as.matrix(sparse))

# View data frame (rows are reviews, columns are words)
str(commentsTM, list.len=10)
## 'data.frame':    2829 obs. of  1136 variables:
##  $ 100          : num  0 0 0 0 1 0 1 0 0 0 ...
##  $ 1015         : num  0 0 0 0 0 1 0 0 0 0 ...
##  $ 1520         : num  0 2 0 0 0 0 0 0 0 0 ...
##  $ 2nd          : num  0 0 0 0 0 0 0 0 0 4 ...
##  $ 3rd          : num  0 2 0 0 0 0 0 2 0 0 ...
##  $ 510          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ abl          : num  1 1 2 3 0 3 0 2 0 0 ...
##  $ absolut      : num  0 0 2 2 0 2 0 2 0 1 ...
##  $ accept       : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ access       : num  3 5 8 6 0 2 1 0 0 0 ...
##   [list output truncated]
# Drop columns that include numbers
commentsTM = commentsTM[,!grepl("[0-9]",names(commentsTM))]

We have finished building the term frequency data frame commentsTM. Next, we need to merge the two data frames commentsTM (features) and listings_scores (labels) before we can run our machine learning algorithms on this data. This will be a full-join by LISTING_ID.

# Add our dependent variable:
commentsTM$LISTING_ID = reviews_by_listing$listing_id
commentsTM = full_join(listings_scores, commentsTM)
## Joining, by = "LISTING_ID"
# Remove all rows with NA's
commentsTM = na.omit(commentsTM)

# View the first few data frame columns
# Note: Column names corresponding to word frequencies are lowercase,
# while all other column names are uppercase.
names(commentsTM)[1:10]
##  [1] "LISTING_ID"  "RATING"      "RATING_TEXT" "abl"         "absolut"    
##  [6] "accept"      "access"      "accommod"    "accomod"     "accueil"

Exercise 2.2: Your own Bag-of-Words
Following steps 0-8, build a Bag-of-Words data frame on the listings description data. Add price as the dependent variable, name it “PRICE”, remove the rows with price=NA, and move this column to the front of the new data frame. (Hint: Construct the term-matrix listingsTM by modifying the NLP code in the file bonus.R.)
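
One possible solution sketch, reusing Steps 0-8 on listings$description (this assumes the description column from the course data; the reference code in bonus.R may differ in details):

# Build the corpus and term matrix from listing descriptions
descCorpus = Corpus(VectorSource(listings$description)) %>%
  tm_map(tolower) %>%
  tm_map(PlainTextDocument) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stemDocument)
descSparse = removeSparseTerms(DocumentTermMatrix(descCorpus), 0.95)
listingsTM = as.data.frame(as.matrix(descSparse))
listingsTM = listingsTM[,!grepl("[0-9]",names(listingsTM))]
# Add the dependent variable, drop rows with missing price, move PRICE to the front
listingsTM$PRICE = listings$price
listingsTM = listingsTM %>%
  filter(!is.na(PRICE)) %>%
  select(PRICE, everything())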

Up to here, we have just pre-processed and prepared our data. Now, we are ready to build models.

Building a CART model using Bag-of-Words

Next, we will use our Bag-of-Words to build a CART model to predict review scores. How could a model like this be useful in practice? We will follow the same cross-validation procedure as we did before to select the cp parameter for our CART model. The only difference is that now our features will be word counts, and our predictions will be the discrete categories (“Terrible”, “Low”, “Mid”, “High”, “Perfect”).

To begin, convert RATING_TEXT to a factor variable, and set the order of the level values so that they appear properly in our confusion matrix.

commentsTM$RATING_TEXT = commentsTM$RATING_TEXT %>%
  as.factor() %>%
  ordered(levels=c("Terrible","Low","Mid","High","Perfect"))
str(commentsTM$RATING_TEXT)
##  Ord.factor w/ 5 levels "Terrible"<"Low"<..: 3 4 5 4 5 3 4 4 3 2 ...

Split the data into training and testing sets.

# install.packages("caTools")
library(caTools)
set.seed(123)
spl = sample.split(commentsTM$RATING_TEXT, SplitRatio = 0.7)
commentsTrain = subset(commentsTM, spl==TRUE)
commentsTest = subset(commentsTM, spl==FALSE)

Let’s use CART! Why CART?

# install.packages("rpart")
library(rpart)
# install.packages("rpart.plot")
library(rpart.plot)

First, train the model using the default parameter value (cp=0.01). Of course, we cannot include RATING or LISTING_ID as predictive variables - why not?

commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
                     data=commentsTrain,
                     method="class")
# Display decision tree.  Does it make intuitive sense?
prp(commentsCART)

Next, let’s perform cross-validation on our Bag-of-Words CART model to tune our choice for cp. Useful resource for cross-validation of cp in rpart: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

Step 1: Begin by constructing a tree with a small cp parameter

set.seed(2222)
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID, 
                     data=commentsTrain,
                     cp=0.001,
                     method="class")

Step 2: View the cross-validated error vs. cp

# In the `printcp()` table:
# "nsplit"    = number of splits in tree
# "rel error" = scaled training error
# "xerror"    = scaled cross-validation error
# "xstd"      = standard deviation of xerror
printcp(commentsCART)
## 
## Classification tree:
## rpart(formula = RATING_TEXT ~ . - RATING - LISTING_ID, data = commentsTrain, 
##     method = "class", cp = 0.001)
## 
## Variables actually used in tree construction:
##  [1] absolut    access     accommod   advertis   afford     air       
##  [7] although   anoth      awesom     bathroom   beauti     believ    
## [13] beyond     boston     clean      close      code       comfort   
## [19] communic   deal       deck       definit    differ     dirti     
## [25] drink      end        enough     even       ever       everyth   
## [31] exact      expect     first      found      four       futur     
## [37] get        good       gracious   great      high       home      
## [43] host       howev      immacul    late       let        locat     
## [49] move       negat      never      night      offer      one       
## [55] overal     person     phone      pick       price      profession
## [61] provid     quaint     question   receiv     recommend  said      
## [67] second     short      smaller    smell      spacious   stay      
## [73] stop       take       thank      thought    train      transit   
## [79] transport  travel     trip       une        visit      weekend   
## [85] well       will       wonder    
## 
## Root node error: 1413/1941 = 0.72798
## 
## n= 1941 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.2016985      0   1.00000 1.00000 0.013875
## 2  0.0368011      1   0.79830 0.81246 0.015327
## 3  0.0113234      2   0.76150 0.77707 0.015455
## 4  0.0084926      5   0.72753 0.78273 0.015437
## 5  0.0077849      6   0.71904 0.77424 0.015463
## 6  0.0070771     11   0.68011 0.77070 0.015473
## 7  0.0068412     12   0.67304 0.76929 0.015477
## 8  0.0049540     15   0.65251 0.75796 0.015506
## 9  0.0046001     17   0.64260 0.76362 0.015492
## 10 0.0042463     20   0.62845 0.75442 0.015514
## 11 0.0038924     26   0.60297 0.75513 0.015513
## 12 0.0035386     38   0.55343 0.75938 0.015503
## 13 0.0031847     50   0.51026 0.76079 0.015499
## 14 0.0028309     52   0.50389 0.76008 0.015501
## 15 0.0025950     57   0.48974 0.75655 0.015509
## 16 0.0024770     60   0.48195 0.75584 0.015511
## 17 0.0021231     62   0.47700 0.76929 0.015477
## 18 0.0018401     72   0.45577 0.77141 0.015471
## 19 0.0017693     78   0.44444 0.77424 0.015463
## 20 0.0014154     83   0.43454 0.77495 0.015461
## 21 0.0012385    100   0.41047 0.77565 0.015459
## 22 0.0011795    104   0.40552 0.77353 0.015465
## 23 0.0010616    107   0.40198 0.77565 0.015459
## 24 0.0010000    109   0.39986 0.78202 0.015439
# In the `plotcp()` plot:
# size of tree = (number of splits in tree) + 1
# dashed line occurs at 1 std. dev. above the minimum xerror
# Rule of Thumb: select the model size which first
#                goes below the dotted line
plotcp(commentsCART)
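
The same rule of thumb can be read off programmatically from the model's cptable (a sketch of the one-standard-error rule; exact values depend on the seed):

# Pick the simplest tree whose xerror falls below min(xerror) + its xstd
cpTable = commentsCART$cptable
cutoff = min(cpTable[,"xerror"]) + cpTable[which.min(cpTable[,"xerror"]), "xstd"]
bestCP = cpTable[which(cpTable[,"xerror"] < cutoff)[1], "CP"]
bestCP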

Step 3: Prune the tree, and take a look

commentsCART = prune(commentsCART,cp=0.007)
prp(commentsCART)
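
From here, a natural check (a sketch, beyond the original walkthrough) is out-of-sample performance on commentsTest; the ordered factor levels set earlier make the confusion matrix read from Terrible to Perfect:

# Predict rating categories on the test set and tabulate the confusion matrix
predTest = predict(commentsCART, newdata = commentsTest, type = "class")
table(commentsTest$RATING_TEXT, predTest)
# Overall out-of-sample accuracy
mean(predTest == commentsTest$RATING_TEXT)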