https://github.com/PhilChodrow/cos_2017/tree/master/3_modeling_and_ml
#### Intro stuff to be loaded in during part 1 ####
setwd("~/cos_2017/3_modeling_and_ml")
# install.packages("tidyverse")
library(tidyverse)
# Read in listings data set, and convert price column to numeric
listings = read.csv("../data/listings.csv")
listings$price = as.numeric(gsub('\\$|,','',as.character(listings$price)))
# Read in the reviews data set, making sure to set stringsAsFactors=FALSE
reviews = read.csv("../data/reviews.csv", stringsAsFactors = FALSE)
View the data from reviews.csv
. What does each row represent?
head(reviews, 3)
## listing_id id date reviewer_id reviewer_name
## 1 1178162 4724140 2013-05-21 4298113 Olivier
## 2 1178162 4869189 2013-05-29 6452964 Charlotte
## 3 1178162 5003196 2013-06-06 6449554 Sebastian
## comments
## 1 My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, all place was clean. Islam managed pretty well our arrival, even if it was last minute ;) i do recommand this place to any airbnb user :)
## 2 Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was away
## 3 We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Even though Islam himself was not there everything was prepared for our arrival. The airport T Station is only a 5-10 min walk away. The only little issue was that all the people in the house had to share one bathroom. But it was not really a problem and it worked out fine. We would recommend Islams place for a stay in Boston.
Display the top 10 most reviewed Airbnb listings using the reviews
data frame. Are the counts consistent with the data in the listings
data frame?
sort(listings$number_of_reviews, decreasing = TRUE)[1:10]
sort(table(reviews$listing_id), decreasing = TRUE)[1:10]
Later on, we will want to merge the two data frames: listings
and reviews
. The ID variables that we will use for this merge operation are listings$id
and reviews$listing_id
. It is important to understand the data structure before performing the analysis. Both data frames have 2829 unique listings with >= 1 review - let’s confirm this fact.
length(unique(listings$id))
nrow(filter(listings, number_of_reviews>0))
length(unique(reviews$listing_id))
We will take review_scores_rating
as the dependent variable that we are trying to predict. This is the average customer rating of the Airbnb listing, on a scale of 0-100. Plot a simple histogram of review_scores_rating
, and count the number of values != NA.
hist(listings$review_scores_rating)
sum(!is.na(listings$review_scores_rating))
Next, create a new data frame with just the review scores data from listings.csv
. Filter out rows with review_scores_rating
=NA.
listings_scores = listings %>%
filter(number_of_reviews > 0) %>%
select("LISTING_ID"=id, "RATING"=review_scores_rating) %>%
filter(!is.na(RATING))
str(listings_scores)
## 'data.frame': 2772 obs. of 2 variables:
## $ LISTING_ID: int 3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 12023024 ...
## $ RATING : int 94 98 100 99 100 90 96 96 94 80 ...
Exercise 2.1: Writing a simple function in R
The syntax for writing the function f(x) = x^2 is
f <- function(x){
return(x*x)
}
Write a function to convert the listing rating from a scale of 0-100 to (“Terrible”,“Low”,“Mid”,“High”,“Perfect”).
Given an integer input rating from 0-100, the function should output:
“Perfect” if rating = 100
“High” if 95 <= rating < 99
“Mid” if 90 <= rating < 94
“Low” if 80 <= rating < 89
“Terrible” if rating <= 79
For example: convert_rating(64) should output “Terrible”
Solution:
convert_rating <- function(rating){
if(rating == 100){
return("Perfect")
}else if(rating >= 95){
return("High")
}else if(rating >= 90){
return("Mid")
}else if(rating >= 80){
return("Low")
}else{
return("Terrible")
}
}
Test a few values to make sure that the function works
convert_rating(100)
convert_rating(98)
convert_rating(90)
convert_rating(82)
convert_rating(46)
To apply the convert_rating()
function to each element in an array, we need to “vectorize” the function first. Avoid using for-loops in R whenever possible because those are slow.
v_convert_rating <- Vectorize(convert_rating, c("rating"))
# Test a few values to make sure that the function works.
v_convert_rating(c(100,32,87))
## [1] "Perfect" "Terrible" "Low"
Compute the new column using a mutate call.
listings_scores <- listings_scores %>%
mutate(RATING_TEXT = v_convert_rating(RATING))
# Take a look
table(listings_scores$RATING_TEXT)
##
## High Low Mid Perfect Terrible
## 754 545 670 628 175
These groupings look relatively well-balanced, which is desirable. For the NLP task, we will try to predict this rating category based upon the text data from reviews.csv
.
Let’s go back to reviews data frame.
str(reviews)
## 'data.frame': 68275 obs. of 6 variables:
## $ listing_id : int 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 1178162 ...
## $ id : int 4724140 4869189 5003196 5150351 5171140 5198929 6702817 6873023 7646702 8094418 ...
## $ date : chr "2013-05-21" "2013-05-29" "2013-06-06" "2013-06-15" ...
## $ reviewer_id : int 4298113 6452964 6449554 2215611 6848427 6663826 8099222 7671888 8197342 9040491 ...
## $ reviewer_name: chr "Olivier" "Charlotte" "Sebastian" "Marine" ...
## $ comments : chr "My stay at islam's place was really cool! Good location, 5min away from subway, then 10min from downtown. The room was nice, al"| __truncated__ "Great location for both airport and city - great amenities in the house: Plus Islam was always very helpful even though he was "| __truncated__ "We really enjoyed our stay at Islams house. From the outside the house didn't look so inviting but the inside was very nice! Ev"| __truncated__ "The room was nice and clean and so were the commodities. Very close to the airport metro station and located in quite safe area"| __truncated__ ...
Currently, we have a data frame with 68275 rows. We would like to have a data frame with 2829 rows - one per each listing. We can use the group_by()
and summarize()
functions to transform the data frame in this way.
reviews_by_listing = reviews %>%
select(listing_id,comments) %>%
group_by(listing_id) %>%
summarize(all_comments=paste(comments,collapse=" "))
# Check out the updated data frame - 2829 rows.
str(reviews_by_listing)
## Classes 'tbl_df', 'tbl' and 'data.frame': 2829 obs. of 2 variables:
## $ listing_id : int 3353 5506 6695 6976 8792 9273 9765 9824 9855 9857 ...
## $ all_comments: chr "Very friendly and helpful. Convenient location. The location is great as it's right next to the Green T stop and there are many"| __truncated__ "Terry's Hotel Alterntv in Boston was a perfect place to stay for myself and my partner. We mixed our trip with business and pl"| __truncated__ "Terry's apartment is beautifully decorated and feels like a home away from home. Kudos especially to the kitchen, which was rea"| __truncated__ "A Wonderful, pleasant, and charming host. The bed is very comfortable and the room is nice. Travel wise one is 15 minutes by t"| __truncated__ ...
View the first listing’s comments.
reviews_by_listing$all_comments[1]
Observations? What are some problems that we might run into with bag-of-words?
Natural Language Processing slides
Now, we are ready to construct the Bag-of-Words with Airbnb reviews for the prediction task. The following step-by-step procedure for building the Bag-of-Words is adapted from MIT EdX - Analytics Edge, Lecture 8.
Step 0: Install and load two packages for pre-processing:
# install.packages("tm")
library(tm)
# install.packages("SnowballC")
library(SnowballC)
Step 1: Convert reviewer comments to a corpus, automatically processing escape characters like “\n”.
Step 2: Change all the text to lower case, and convert to “PlainTextDocument” (required after to_lower function).
Step 3: Remove all punctuation.
Step 4: Remove stop words (this step may take a minute).
Step 5: Stem our document.
corpus = Corpus(VectorSource(reviews_by_listing$all_comments)) %>%
tm_map(tolower) %>%
tm_map(PlainTextDocument) %>%
tm_map(removePunctuation) %>%
tm_map(removeWords, stopwords("english")) %>%
tm_map(stemDocument)
# Take a look
strwrap(corpus[[1]])[1:3]
## [1] "friend help conveni locat locat great right next green t stop mani"
## [2] "food place nearbi room small fit queen mattress almost noth els"
## [3] "day use room night work host friend help giuesepp cordial friend"
# Take a look at tm's stopwords:
stopwords("english")[1:100]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
## [21] "herself" "it" "its" "itself" "they"
## [26] "them" "their" "theirs" "themselves" "what"
## [31] "which" "who" "whom" "this" "that"
## [36] "these" "those" "am" "is" "are"
## [41] "was" "were" "be" "been" "being"
## [46] "have" "has" "had" "having" "do"
## [51] "does" "did" "doing" "would" "should"
## [56] "could" "ought" "i'm" "you're" "he's"
## [61] "she's" "it's" "we're" "they're" "i've"
## [66] "you've" "we've" "they've" "i'd" "you'd"
## [71] "he'd" "she'd" "we'd" "they'd" "i'll"
## [76] "you'll" "he'll" "she'll" "we'll" "they'll"
## [81] "isn't" "aren't" "wasn't" "weren't" "hasn't"
## [86] "haven't" "hadn't" "doesn't" "don't" "didn't"
## [91] "won't" "wouldn't" "shan't" "shouldn't" "can't"
## [96] "cannot" "couldn't" "mustn't" "let's" "that's"
Step 6: Create a word count matrix (rows are reviews, columns are words).
frequencies = DocumentTermMatrix(corpus)
# Take a look
frequencies
## <<DocumentTermMatrix (documents: 2829, terms: 45645)>>
## Non-/sparse entries: 838822/128290883
## Sparsity : 99%
## Maximal term length: 365
## Weighting : term frequency (tf)
Step 7: Account for sparsity.
# Use findFreqTerms to get a feeling for which words appear the most.
# Words that appear at least 10000 times:
findFreqTerms(frequencies, lowfreq=10000)
## [1] "also" "apart" "boston" "clean"
## [5] "close" "comfort" "definit" "easi"
## [9] "everyth" "get" "good" "great"
## [13] "help" "home" "host" "hous"
## [17] "locat" "love" "need" "neighborhood"
## [21] "nice" "perfect" "place" "realli"
## [25] "recommend" "room" "stay" "time"
## [29] "walk" "well"
# All 45645 terms will not be useful to us. Might as well get rid of some of them - why?
# Solution: only keep terms that appear in x% or more of the reviews
# 5% or more (142 or more)
sparse = removeSparseTerms(frequencies, 0.95)
# How many did we keep? (1136 terms, compared to 45645 previously)
sparse
## <<DocumentTermMatrix (documents: 2829, terms: 1136)>>
## Non-/sparse entries: 597260/2616484
## Sparsity : 81%
## Maximal term length: 13
## Weighting : term frequency (tf)
# colnames(sparse)
Step 8: Create data frame.
commentsTM = as.data.frame(as.matrix(sparse))
# View data frame (rows are reviews, columns are words)
str(commentsTM, list.len=10)
## 'data.frame': 2829 obs. of 1136 variables:
## $ 100 : num 0 0 0 0 1 0 1 0 0 0 ...
## $ 1015 : num 0 0 0 0 0 1 0 0 0 0 ...
## $ 1520 : num 0 2 0 0 0 0 0 0 0 0 ...
## $ 2nd : num 0 0 0 0 0 0 0 0 0 4 ...
## $ 3rd : num 0 2 0 0 0 0 0 2 0 0 ...
## $ 510 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ abl : num 1 1 2 3 0 3 0 2 0 0 ...
## $ absolut : num 0 0 2 2 0 2 0 2 0 1 ...
## $ accept : num 0 0 0 0 0 0 0 1 0 0 ...
## $ access : num 3 5 8 6 0 2 1 0 0 0 ...
## [list output truncated]
# Drop columns that include numbers
commentsTM = commentsTM[,!grepl("[0-9]",names(commentsTM))]
We have finished building the term frequency data frame commentsTM
. Next, we need to merge the two data frames commentsTM
(features) and listings_scores
(labels) before we can run our machine learning algorithms on this data. This will be a full-join by LISTING_ID
.
# Add our dependent variable:
commentsTM$LISTING_ID = reviews_by_listing$listing_id
commentsTM = full_join(listings_scores, commentsTM)
## Joining, by = "LISTING_ID"
# Remove all rows with NA's
commentsTM = na.omit(commentsTM)
# View the first few data frame columns
# Note: Column names corresponding to word frequencies are lowercase,
# while all other column names are uppercase.
names(commentsTM)[1:10]
## [1] "LISTING_ID" "RATING" "RATING_TEXT" "abl" "absolut"
## [6] "accept" "access" "accommod" "accomod" "accueil"
Exercise 2.2: Your own Bag-of-Words Following steps 0-8, build a Bag-of-Words data frame on the listings description data. Add price as the dependent variable, name it “PRICE”, remove the rows with price=NA, and move this column to the front of the new data frame. (Hint: Construct the term-matrix listingsTM by modifying the NLP code in the file bonus.R.)
Up to here, we have just pre-processed and prepared our data. Now, we are ready to build models.
Next, we will use our Bag-of-Words to build a CART model to predict review scores. How could a model like this be useful in practice? We will follow the same cross-validation procedure as we did before to select the cp parameter for our CART model. The only difference is that now our features will be word counts, and our predictions will be the discrete values: (“Terrible”,“Low”,“Mid”,“High”,“Perfect”)
To begin, convert RATING_TEXT
to a factor variable, and set the order of the level values so that they appear properly in our confusion matrix.
commentsTM$RATING_TEXT = commentsTM$RATING_TEXT %>%
as.factor() %>%
ordered(levels=c("Terrible","Low","Mid","High","Perfect"))
str(commentsTM$RATING_TEXT)
## Ord.factor w/ 5 levels "Terrible"<"Low"<..: 3 4 5 4 5 3 4 4 3 2 ...
Split data into training and testing sets
# install.packages("caTools")
library(caTools)
set.seed(123)
spl = sample.split(commentsTM$RATING_TEXT, SplitRatio = 0.7)
commentsTrain = subset(commentsTM, spl==TRUE)
commentsTest = subset(commentsTM, spl==FALSE)
Let’s use CART! Why CART?
# install.packages("rpart")
library(rpart)
# install.packages("rpart.plot")
library(rpart.plot)
First, train the model using the default parameter values (cp=0.01) Of course, we cannot include RATING
or LISTING_ID
as predictive variables - why not?
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
data=commentsTrain,
method="class")
# Display decision tree. Does it make intuitive sense?
prp(commentsCART)
Next, let’s perform cross-validation on our Bag-of-Words CART model to tune our choice for cp. Useful resource for cross-validation of cp in rpart: https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf
Step 1: Begin by constructing a tree with a small cp parameter
set.seed(2222)
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
data=commentsTrain,
cp=0.001,
method="class")
Step 2: View the cross-validated error vs. cp
# In the `printcp()` table:
# "nsplit" = number of splits in tree
# "rel error" = scaled training error
# "xerror" = scaled cross-validation error
# "xstd" = standard deviation of xerror
printcp(commentsCART)
##
## Classification tree:
## rpart(formula = RATING_TEXT ~ . - RATING - LISTING_ID, data = commentsTrain,
## method = "class", cp = 0.001)
##
## Variables actually used in tree construction:
## [1] absolut access accommod advertis afford air
## [7] although anoth awesom bathroom beauti believ
## [13] beyond boston clean close code comfort
## [19] communic deal deck definit differ dirti
## [25] drink end enough even ever everyth
## [31] exact expect first found four futur
## [37] get good gracious great high home
## [43] host howev immacul late let locat
## [49] move negat never night offer one
## [55] overal person phone pick price profession
## [61] provid quaint question receiv recommend said
## [67] second short smaller smell spacious stay
## [73] stop take thank thought train transit
## [79] transport travel trip une visit weekend
## [85] well will wonder
##
## Root node error: 1413/1941 = 0.72798
##
## n= 1941
##
## CP nsplit rel error xerror xstd
## 1 0.2016985 0 1.00000 1.00000 0.013875
## 2 0.0368011 1 0.79830 0.81246 0.015327
## 3 0.0113234 2 0.76150 0.77707 0.015455
## 4 0.0084926 5 0.72753 0.78273 0.015437
## 5 0.0077849 6 0.71904 0.77424 0.015463
## 6 0.0070771 11 0.68011 0.77070 0.015473
## 7 0.0068412 12 0.67304 0.76929 0.015477
## 8 0.0049540 15 0.65251 0.75796 0.015506
## 9 0.0046001 17 0.64260 0.76362 0.015492
## 10 0.0042463 20 0.62845 0.75442 0.015514
## 11 0.0038924 26 0.60297 0.75513 0.015513
## 12 0.0035386 38 0.55343 0.75938 0.015503
## 13 0.0031847 50 0.51026 0.76079 0.015499
## 14 0.0028309 52 0.50389 0.76008 0.015501
## 15 0.0025950 57 0.48974 0.75655 0.015509
## 16 0.0024770 60 0.48195 0.75584 0.015511
## 17 0.0021231 62 0.47700 0.76929 0.015477
## 18 0.0018401 72 0.45577 0.77141 0.015471
## 19 0.0017693 78 0.44444 0.77424 0.015463
## 20 0.0014154 83 0.43454 0.77495 0.015461
## 21 0.0012385 100 0.41047 0.77565 0.015459
## 22 0.0011795 104 0.40552 0.77353 0.015465
## 23 0.0010616 107 0.40198 0.77565 0.015459
## 24 0.0010000 109 0.39986 0.78202 0.015439
# In the `plotcp()` plot:
# size of tree = (number of splits in tree) + 1
# dashed line occurs at 1 std. dev. above the minimum xerror
# Rule of Thumb: select the model size which first
# goes below the dotted line
plotcp(commentsCART)
Step 3: Prune the tree, and take a look
commentsCART = prune(commentsCART,cp=0.007)
prp(commentsCART)
Step 4: Evaluate model in-sample and out-of-sample accuracy, using a confusion matrix (because this is a classification problem).
# CART on training set:
PredictCARTTrain = predict(commentsCART, type="class")
confusionMatrixTrain = table(commentsTrain$RATING_TEXT, PredictCARTTrain)
confusionMatrixTrain
## PredictCARTTrain
## Terrible Low Mid High Perfect
## Terrible 37 37 7 10 31
## Low 11 180 44 51 96
## Mid 4 128 93 166 78
## High 1 72 34 392 29
## Perfect 2 33 29 88 288
# Accuracy?
sum(diag(confusionMatrixTrain))/nrow(commentsTrain)
## [1] 0.5100464
# Predictions on test set
PredictCART = predict(commentsCART, newdata=commentsTest, type="class")
confusionMatrix = table(commentsTest$RATING_TEXT, PredictCART)
confusionMatrix
## PredictCART
## Terrible Low Mid High Perfect
## Terrible 4 18 6 2 23
## Low 5 74 17 29 38
## Mid 3 62 32 75 29
## High 1 38 21 150 16
## Perfect 0 11 18 35 124
# Accuracy?
sum(diag(confusionMatrix))/nrow(commentsTest)
## [1] 0.4620939
Question: How much worse would we have done if we didn’t use cross-validation, and just stuck with the default cp value (0.01)?
Step 5: Compare model to the baseline.
# Most frequent response variable in training set is "High"
# => Baseline accuracy is 0.2720
table(commentsTest$RATING_TEXT)["High"]/nrow(commentsTest)
## High
## 0.2719615
Can we improve the accuracy of our model in any way? Let’s try adding a few more features from the listings
data frame.
str(listings)
more_features = listings %>%
select(LISTING_ID=id, SUPER_HOST=host_is_superhost,
RESPONSE_TIME=host_response_time,
PRICE=price)
commentsTM = full_join(more_features,commentsTM)
str(commentsTM,list.len=10)
Rerun the CART model with the following code, and check the out-of-sample performance. Does it improve? Why or why not?
set.seed(123)
spl = sample.split(commentsTM$RATING_TEXT, SplitRatio = 0.7)
commentsTrain = subset(commentsTM, spl==TRUE)
commentsTest = subset(commentsTM, spl==FALSE)
commentsCART = rpart(RATING_TEXT ~ . - RATING - LISTING_ID,
data=commentsTrain,
method="class")
prp(commentsCART)
# CART on training set
PredictCARTTrain = predict(commentsCART, type="class")
confusionMatrixTrain = table(commentsTrain$RATING_TEXT, PredictCARTTrain)
# Accuracy?
sum(diag(confusionMatrixTrain))/nrow(commentsTrain)
# Predictions on test set
PredictCART = predict(commentsCART, newdata=commentsTest, type="class")
confusionMatrix = table(commentsTest$RATING_TEXT, PredictCART)
# Accuracy?
sum(diag(confusionMatrix))/nrow(commentsTest)
Exercise 2.3: Bag-of-Words + LASSO
Using the Bag-of-Words constructed in Exercise 2.3, build a LASSO model to predict price based upon listing descriptions only. (Hint: To build the LASSO model, follow the instructions in part 1 of this module.)
k-means Clustering slides
Thus far, our machine learning task has been to predict labels, which were either continuous-valued (for regression) or discrete-valued (for classification). To do this, we input to the ML algorithms some known (feature, label) examples (the training set), and the ML algorithm outputs a function which enables us to make predictions for some unknown (feature, ?) examples (the testing set). This problem setup is known as Supervised Learning.
Next, we consider Unsupervised Learning, where we are not given labelled examples, and we simply run ML algorithms on (feature) data, with the purpose of finding interesting structure and patterns in the data. Let’s run one of the widely-used unsupervised learning algorithms, k-means clustering, on the listings
data frame to explore the Airbnb data set.
# First, let's look at help page for the function `k-means()`:
help(kmeans)
# View the data.frame `listings`:
str(listings, list.len=5)
## 'data.frame': 3585 obs. of 95 variables:
## $ id : int 12147973 3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 ...
## $ listing_url : Factor w/ 3585 levels "https://www.airbnb.com/rooms/10004575",..: 462 1876 2737 1261 2918 543 2402 1830 2890 3196 ...
## $ scrape_id : num 2.02e+13 2.02e+13 2.02e+13 2.02e+13 2.02e+13 ...
## $ last_scraped : Factor w/ 1 level "2016-09-07": 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 3504 levels " Cozy Spot in Boston's Little Italy",..: 3188 1052 2215 3041 1179 2544 2338 7 344 2489 ...
## [list output truncated]
Let’s create a new data.frame listings_numeric
which has the subset of columns that we wish to cluster on. For the k-means()
function, all of these columns must be numeric.
listings_numeric = listings %>% select(id,latitude,longitude,
accommodates, bathrooms,
bedrooms, review_scores_rating,
price) %>%
na.omit()
str(listings_numeric)
## 'data.frame': 2753 obs. of 8 variables:
## $ id : int 3075044 6976 1436513 7651065 12386020 5706985 2843445 753446 849408 12023024 ...
## $ latitude : num 42.3 42.3 42.3 42.3 42.3 ...
## $ longitude : num -71.1 -71.1 -71.1 -71.1 -71.1 ...
## $ accommodates : int 2 2 4 2 2 3 2 2 5 2 ...
## $ bathrooms : num 1 1 1 1.5 1 1 2 1 1 1 ...
## $ bedrooms : int 1 1 1 1 1 1 1 1 2 1 ...
## $ review_scores_rating: int 94 98 100 99 100 90 96 96 94 80 ...
## $ price : num 65 65 75 79 75 100 75 58 229 60 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:832] 1 19 33 37 54 55 57 66 72 74 ...
## .. ..- attr(*, "names")= chr [1:832] "1" "19" "33" "37" ...
Next, run the k-means algorithm on the numeric data.frame, with k = 5
cluster centroids:
# k-means clustering
set.seed(1234)
kmeans_clust = kmeans(listings_numeric[,-1:-3],5, iter=1000, nstart=100)
kmeans_clust
Look at the average values within the clusters. What are the characteristics of these 5 groups? How many listings are in each cluster?
kmeans_clust$centers
## accommodates bathrooms bedrooms review_scores_rating price
## 1 2.105873 1.160050 0.9875931 91.00331 80.53515
## 2 3.197987 1.121924 1.1174497 92.54586 170.09284
## 3 4.289109 1.306931 1.7287129 92.82574 274.25941
## 4 6.468750 2.328125 3.0312500 96.53125 712.96875
## 5 5.876106 1.823009 2.5044248 92.34513 426.60177
table(kmeans_clust$cluster)
##
## 1 2 3 4 5
## 1209 894 505 32 113
Finally, let’s display the clusters geographically using the (latitude, longitude) data. First, use ggmap
to obtain a map of Boston. Adapted from https://www.r-bloggers.com/drug-crime-density-in-boston/
# install.packages("ggmap")
# devtools::install_github("dkahle/ggmap")
library(ggmap)
## Google Maps API Terms of Service: http://developers.google.com/maps/terms.
## Please cite ggmap if you use it: see citation("ggmap") for details.
# requires internet connection
bos_plot=ggmap(get_map('Boston, Massachusetts',zoom=13,source='google',maptype='terrain'))
## Source : https://maps.googleapis.com/maps/api/staticmap?center=Boston,+Massachusetts&zoom=13&size=640x640&scale=2&maptype=terrain&language=en-EN
## Source : https://maps.googleapis.com/maps/api/geocode/json?address=Boston%2C%20Massachusetts
bos_plot
To get good color schemes, use RColorbrewer
.
# install.packages("RColorbrewer")
library(RColorBrewer)
display.brewer.all()
Plot maps and Airbnb locations using the ggplot
syntax.
bos_plot +
geom_point(data=listings_numeric,aes(x=longitude,y=latitude,colour=factor(kmeans_clust$cluster)),
alpha=.5,size=1) +
xlab("Longitude") + ylab("Latitude") +
scale_colour_brewer("Cluster", palette = "Set1")
Can you see where the clusters are? Also, what is the proper number of clusters? We will revisit this in the next session, because it requires some more advanced tidyverse tools. Stay tuned!
In this module, we have covered examples of machine learning methods for linear regression, LASSO, CART, and k-means. This is just the tip of the iceberg. There are tons more machine learning methods which can be easily implemented in R. We provide some bonus R code for random forest, regularized logistic regression, and SVM applied to the Airbnb data set in the file bonus.R.