Stacking Models To Predict Airbnb Prices In Manhattan (NYC)
Machine Learning
R
Author
Mathias Steilen
Published
October 7, 2022
Abstract
The purpose of this post is to demonstrate the usefulness of the package stacks in blending individual machine learning models together into a linear combination of them (using LASSO regression), often increasing final model performance.
What are we looking at?
The data comes from Kaggle, the biggest online platform for machine learning enthusiasts hosting datasets and competitions around data science. More precisely, the data set was part of season 1 episode 5 of SLICED, a data science competition streamed by Nick Wan and Meg Risdal on Twitch.
This dataset covers the prices of Airbnb listings in New York City and serves as the example on which the stacking workflow is demonstrated.
Data Cleaning
Firstly, I load the data. The first file is used for training, whereas the holdout will only be used to submit out-of-sample predictions, as it does not contain the target variable. The training data is fairly large, holding information on 34,226 listings with 15 predictors and the response variable, the listing's price per night.
There is considerable missing data in the reviews per month (double) and last review (date) columns, and little missing data in the host name (character) and listing name (character) columns. All missing values can likely be imputed, provided they are missing at random.
The target variable distributions for missing and non-missing values in the two columns with considerable missingness look comparable and do not exhibit statistically significant differences. Imputing these values should therefore not pose a problem, though this conclusion has to be taken with a grain of salt: it is not possible to determine for sure from observed data whether a variable is missing at random; it can merely be assumed.
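The comparison described above can be sketched as follows. This is a minimal, hypothetical check, assuming a data frame `data` with the columns described earlier; the post's actual plots may differ.

```r
# Compare the price distribution for rows with and without a value
# in reviews_per_month (assumes `data` as loaded above)
library(dplyr)

data %>%
  mutate(missing = is.na(reviews_per_month)) %>%
  group_by(missing) %>%
  summarise(n = n(),
            mean_price = mean(price),
            median_price = median(price))

# A two-sample test gives a rough check on the difference in means
t.test(price ~ is.na(reviews_per_month), data = data)
```

Note that such a test only checks the observed target; it cannot prove the values are missing at random.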
The next step is to walk through the available predictors and understand relations to the target variable. Below, every variable is briefly looked at and presented, enabling a better understanding of the complete training data.
If you are just interested in how to build a model stack with stacks, feel free to skip this part and continue at Building And Training The Stacked Model.
id: unique identifier
The unique identifier column for each listing is a random number and should not hold any predictive power.
cor(data$price, data$id)
[1] 0.009927947
name: name of the listing
The name variable contains the title of the listing, which will be useful for tokenisation at a later stage.
data %>% count(name, sort = TRUE)
# A tibble: 33,705 × 2
name n
<chr> <int>
1 Hillside Hotel 15
2 New york Multi-unit building 11
3 Home away from home 10
4 Private room 10
5 Private Room 9
6 <NA> 9
7 Artsy Private BR in Fort Greene Cumberland 8
8 Loft Suite @ The Box House Hotel 8
9 Brooklyn Apartment 6
10 Harlem Gem 6
# ℹ 33,695 more rows
host_id: unique identifier for the host of the listing
The unique identifier column for each host is a random number and, in theory, should not hold any predictive power. However, with multiple listings per host and some hosts specialising in luxury apartments or affordable housing, for instance, the variable might carry additional signal. Therefore, it will still be included in the recipe.
cor(data$price, data$host_id)
[1] 0.01191348
host_name: name of the host
The host name variable contains the first names of the hosts. There is likely no deterministic component to them, except for known hosts like Blueground, who specialise in renting out furnished apartments for longer periods. Therefore, it will not be used as a predictive variable, as the relation might be too fragile for the out-of-sample application of the model.
neighbourhood_group: borough where the listing is located (e.g., “Manhattan”)
There are only five neighbourhood groups. However, there are very large differences in the price distributions between them, so they will be particularly useful as nominal predictors.
neighbourhood: neighborhood where the listing is located (e.g., “East Harlem”)
Neighbourhoods, similarly to neighbourhood groups, are useful indicators for prices of Airbnbs in NYC. However, due to the high cardinality, they will have to be lumped together in order not to exceed the memory limits of my machine.
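One way to lump rare levels together is `fct_lump_n()` from forcats. This is a sketch, not necessarily the post's exact approach (a recipe step such as `step_other()` would achieve the same inside the modelling pipeline); the threshold of 50 levels is an assumption.

```r
# Keep the 50 most frequent neighbourhoods, pool the rest into "Other"
library(forcats)
library(dplyr)

data %>%
  mutate(neighbourhood = fct_lump_n(neighbourhood, n = 50)) %>%
  count(neighbourhood, sort = TRUE)
```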
data %>%
  group_by(neighbourhood) %>%
  summarise(n = n(),
            mean_price = mean(price)) %>%
  mutate(neighbourhood = paste0(neighbourhood, " (N=", n, ")") %>%
           fct_reorder(mean_price)) %>%
  slice_max(order_by = n, n = 25) %>%
  ggplot(aes(mean_price, neighbourhood, size = n)) +
  geom_point(colour = "midnightblue") +
  labs(title = "Mean NYC Airbnb Prices By Neighbourhood",
       subtitle = "Only the top 25 most frequent neighbourhoods are shown.",
       x = "Mean Price Per Night",
       y = NULL,
       size = "Observation Count:") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_size_continuous(labels = scales::comma_format(), range = c(2, 5)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
latitude: latitude of the listing location
Latitude and longitude give information about the location of the listings. This enables me to make a map. Again, it looks like Manhattan is the most expensive place to rent an Airbnb, which will be useful for the model.
data %>%
  ggplot(aes(longitude, latitude, z = price)) +
  stat_bin_hex(bins = 100) +
  labs(title = "Hexagon Plot Of Log NYC Airbnb Prices",
       subtitle = "Longitude and latitude are binned into 100 hexagons.",
       x = NULL,
       y = NULL,
       fill = NULL) +
  coord_equal() +
  scale_alpha_continuous(range = c(0, 1), trans = "log") +
  scale_fill_gradient(low = "azure3", high = "red") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
longitude: longitude of the listing location
See above.
room_type: type of room (‘Entire home/apt’, ‘Private room’, or ‘Shared room’)
There are three room types. Entire homes/apartments are, unsurprisingly, the most expensive. Given the disparity between the groups, these nominal predictors will be able to explain a lot of variance in the target variable.
data %>%
  group_by(room_type) %>%
  summarise(n = n(),
            mean_price = mean(price)) %>%
  mutate(room_type = paste0(room_type, " (N=", n, ")") %>%
           fct_reorder(mean_price)) %>%
  slice_max(order_by = n, n = 25) %>%
  ggplot(aes(mean_price, room_type, size = n)) +
  geom_point(colour = "midnightblue") +
  labs(title = "Mean NYC Airbnb Prices By Room Type",
       subtitle = NULL,
       x = "Mean Price Per Night",
       y = NULL,
       size = "Observation Count:") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_size_continuous(labels = scales::comma_format(), range = c(2, 5)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
The combination of neighbourhood group and room type looks interesting for prediction. Staten Island apartments are relatively expensive compared to private rooms, while shared rooms in the Bronx rank relatively higher than the other two categories.
data %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(mean_price = mean(price),
            n = n()) %>%
  arrange(-mean_price) %>%
  ggplot(aes(neighbourhood_group %>% fct_reorder(mean_price),
             mean_price, fill = neighbourhood_group)) +
  geom_col() +
  facet_wrap(~ room_type) +
  labs(title = "Mean NYC Airbnb Prices By Room Type and Neighbourhood Group",
       subtitle = NULL,
       y = "Mean Price Per Night",
       x = "Neighbourhood Group") +
  scale_y_continuous(labels = scales::dollar_format()) +
  ggsci::scale_fill_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1))
price: cost for one night booking of the listing
The target is log-normally distributed, so a log transform would be appropriate for a linear model, for instance. Since SLICED uses RMSLE as its evaluation metric, the target variable will be log-transformed for the XGBoost model as well, so that the built-in RMSE metric of the tidymodels framework can be applied directly (a decision of convenience in this case).
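The equivalence behind this convenience can be verified in a few lines: RMSLE on the raw prices is exactly RMSE computed on \(\log(\text{price} + 1)\). The toy prices below are made up for illustration.

```r
# RMSLE on the raw target equals RMSE on the log(price + 1) scale
price_true <- c(50, 120, 300)
price_pred <- c(60, 100, 280)

rmsle       <- sqrt(mean((log(price_pred + 1) - log(price_true + 1))^2))
rmse_on_log <- sqrt(mean((log1p(price_pred) - log1p(price_true))^2))

all.equal(rmsle, rmse_on_log)  # TRUE
```

This is why minimising RMSE on the transformed target also minimises RMSLE on the original scale.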
minimum_nights: minimum number of nights required to book the listing
Creating 100 bins for minimum nights, taking the average price per night by percentile and plotting the correlation in a scatter plot reveals that minimum nights might have a positive effect on listing price. This likely has to do with apartments being more expensive than private rooms and also more likely to carry a contractual minimum-stay requirement.
data %>%
  mutate(minimum_nights_ntile = ntile(minimum_nights, n = 100)) %>%
  group_by(minimum_nights_ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(minimum_nights_ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Minimum Nights And NYC Airbnb Price",
       subtitle = "Minimum nights have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Minimum Nights Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
number_of_reviews: number of reviews the listing has
The number of reviews shows a slightly negative effect.
data %>%
  mutate(ntile = ntile(number_of_reviews, n = 100)) %>%
  group_by(ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Number Of Reviews And NYC Airbnb Price",
       subtitle = "Number of reviews have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Number of Reviews Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
last_review: date the last review of the listing was made
This variable will probably generate more value for the model if I transform it into days since last review, measured from today. This has to be done in the recipe, so that the same transformation is applied to the holdout set as well.
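Such a transformation could look roughly like the sketch below. The step and column names (`days_since_review`, the median imputation) are assumptions for illustration, not necessarily the post's exact recipe.

```r
# Hypothetical recipe step: convert last_review into days elapsed,
# then impute the remaining missing values with the median
library(recipes)

rec <- recipe(price ~ ., data = dt_train) %>%
  step_mutate(days_since_review =
                as.numeric(Sys.Date() - as.Date(last_review))) %>%
  step_impute_median(days_since_review)
```

Because the step lives in the recipe, it is computed identically on the training folds, the test set, and the final holdout.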
There does not seem to be any relation between days since last review and price. If anything, more recently reviewed Airbnbs have slightly higher prices, but the confidence bands do not indicate a statistically significant effect.
data %>%
  mutate(ntile = ntile(last_review, n = 100)) %>%
  group_by(ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Days Since Last Review And NYC Airbnb Price",
       subtitle = "Days since last review have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Days Since Last Review Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
reviews_per_month: number of reviews the listing gets per month on average
There seems to be a slightly negative relation between reviews per month and price.
data %>%
  mutate(ntile = ntile(reviews_per_month, n = 100)) %>%
  group_by(ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Reviews Per Month And NYC Airbnb Price",
       subtitle = "Reviews per month have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Reviews Per Month Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
calculated_host_listings_count: number of listings the host has
It seems like hosts with the highest number of listings, that is, professional hosts, have higher room prices on average, though that might be a non-significant outlier.
The binned averages show a non-linear downward trend and a sudden increase in the highest ten percent for both entire homes and private rooms. This trend, however, is reversed for shared rooms. This means that hosts with exceptionally many listings likely have expensive homes or private rooms, but cheaper shared rooms. I wonder whether the first two reflect a dominant position of larger firms in the market, or whether it has something to do with the quality of the offerings. For the shared rooms, I wonder whether it has something to do with subsidised housing or larger firms specialising in affordable rooms and owning entire buildings of them. Either way, very interesting.
data %>%
  mutate(ntile = ntile(calculated_host_listings_count, n = 100)) %>%
  group_by(ntile, room_type) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price, colour = room_type)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Correlation Of Listings Per Host And NYC Airbnb Price",
       subtitle = "Listings per host have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Listings Per Host (Percentiles)",
       colour = "Room Type") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  ggsci::scale_colour_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
availability_365: number of days out of the year the listing is available
Again, it is very interesting to observe an interaction between room type and another variable. For private rooms, availability throughout the year has virtually no effect on price, whereas more available shared rooms are usually cheaper and more available entire apartments and houses are the most expensive. I believe the latter might be due to location, as most touristy places are available all year round, therefore attracting people with concentrated spending power over shorter time periods. These offerings also have to compensate for the time they stand empty when no tourists are around.
Anyway, this being speculation, let’s get into the thick of it and continue with correlation analysis for the numeric predictors.
data %>%
  mutate(ntile = ntile(availability_365, n = 100)) %>%
  group_by(ntile, room_type) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price, colour = room_type)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Correlation Of Availability And NYC Airbnb Price",
       subtitle = "Availability per year has been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Availability Per Year (Percentiles)",
       colour = "Room Type") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  ggsci::scale_colour_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
Proceeding with correlations
Let’s take a look at the correlation matrix from the GGally package to gauge relations of numeric predictors with the target variable.
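A correlation matrix like the one referenced here could be produced roughly as follows; this is a sketch assuming the GGally package is installed, and the `label_round` setting is an assumption for readability.

```r
# Pairwise correlations of all numeric columns (assumes `data` as above)
library(dplyr)
library(GGally)

data %>%
  select(where(is.numeric)) %>%
  ggcorr(label = TRUE, label_round = 2)
```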
It does not look like the numeric variables are good predictors of the price in this setting. However, since correlation only captures linear associations, a non-linear machine learning model may well find relationships that are hidden right now.
Proceeding to look at variance inflation factors:
data %>%
  select_if(is.numeric) %>%
  lm(formula = price ~ .) %>%
  vif()
id host_id
2.164903 1.654317
latitude longitude
1.011668 1.067277
minimum_nights number_of_reviews
1.039035 2.322328
reviews_per_month calculated_host_listings_count
2.297050 1.082071
availability_365
1.141782
VIFs are below 10 for all numeric variables, so multicollinearity does not pose a problem that would need to be addressed here.
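As a reminder of what these numbers mean, the variance inflation factor for predictor \(j\) is

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]

where \(R_j^2\) is the coefficient of determination from regressing predictor \(j\) on all other predictors. A VIF of 2.32 for the number of reviews thus corresponds to \(R_j^2 \approx 0.57\), well below the level usually considered problematic (a VIF of 10 corresponds to \(R_j^2 = 0.9\)).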
With the ideas gathered from the exploratory data analysis, I can now proceed with building the model.
Building And Training The Stacked Model
First, the data is split into training and testing sets. Also, three-fold cross-validation is employed for a reliable calculation of performance metrics while keeping run time manageable.
dt_split <- data %>%
  mutate(price = log(price + 1)) %>%
  initial_split(strata = "price")

dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 3, strata = "price")
The recipe in the tidymodels framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.
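Drawing on the ideas from the exploratory analysis, such a recipe could look roughly like this. All step choices below (the days-since-review feature, median imputation, level lumping via `step_other`, tokenisation of `name` via textrecipes) are assumptions sketching what the post describes, not its verbatim code.

```r
# Hypothetical recipe combining the feature engineering discussed above
library(tidymodels)
library(textrecipes)

rec <- recipe(price ~ ., data = dt_train) %>%
  update_role(id, new_role = "id") %>%                       # keep id out of the predictors
  step_mutate(days_since_review =
                as.numeric(Sys.Date() - as.Date(last_review))) %>%
  step_rm(last_review, host_name) %>%                        # drop raw date and host names
  step_impute_median(all_numeric_predictors()) %>%
  step_other(neighbourhood, threshold = 0.01) %>%            # lump rare neighbourhoods
  step_tokenize(name) %>%                                    # tokenise the listing title
  step_tokenfilter(name, max_tokens = 100) %>%
  step_tf(name) %>%
  step_dummy(all_nominal_predictors())
```

Because every step is estimated on the training data only, the same transformations carry over to the test set and holdout without leakage.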
In the model specification, the variable importance measure can be set; in this case it is calculated based on impurity. Proceeding with setting up the workflow:
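A boosted-tree specification and workflow could be sketched as below. The tuned hyperparameters and grid size are assumptions; the `control_stack_grid()` control is what later allows the tuning results to be used as stacking candidates.

```r
# Hypothetical XGBoost specification, workflow, and tuning setup
library(tidymodels)
library(stacks)

gb_spec <- boost_tree(trees = 1000,
                      tree_depth = tune(),
                      learn_rate = tune(),
                      mtry = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

gb_wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(gb_spec)

gb_tune <- gb_wflow %>%
  tune_grid(resamples = folds,
            grid = 10,
            control = control_stack_grid())  # saves predictions for stacking
```

An analogous specification with `linear_reg(penalty = tune(), mixture = tune())` and the glmnet engine would produce the elastic net candidates (`en_tune`).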
en_tune %>%
  show_best(metric = "rsq") %>%
  transmute(model = "Elastic Net", .metric, mean, n, std_err)
# A tibble: 5 × 5
model .metric mean n std_err
<chr> <chr> <dbl> <int> <dbl>
1 Elastic Net rsq 0.532 3 0.00559
2 Elastic Net rsq 0.532 3 0.00557
3 Elastic Net rsq 0.532 3 0.00544
4 Elastic Net rsq 0.532 3 0.00544
5 Elastic Net rsq 0.532 3 0.00554
Before creating a stacked model, let’s take a look at the variable importance within both individual models.
gb_final_wflow <- gb_wflow %>%
  finalize_workflow(select_best(gb_tune, metric = "rmse"))

gb_final_fit <- gb_final_wflow %>%
  last_fit(dt_split)

gb_final_fit %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>%
  vi() %>%
  slice_max(order_by = Importance, n = 20) %>%
  ggplot(aes(Importance, reorder(Variable, Importance))) +
  geom_col(fill = "midnightblue", colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the most important predictors are shown.",
       y = "Predictor",
       x = "Relative Variable Importance") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
For the XGBoost model, the type of room as well as the location, especially information about Manhattan, were important for predicting price. This shows how important the inclusion of nominal predictors was for model performance.
For the elastic net, interestingly, the neighbourhoods were very decisive. Only the 30 most important variables are shown, and most of them contain information on geographic location from the neighbourhood variable. Furthermore, most of the important variables negatively impact price.
With both these individual tuning results, a blended (“stacked”) model can easily be built with the stacks package.
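The stacking step itself is a short pipeline; the sketch below assumes the tuning results were saved with `control_stack_grid()` as shown earlier. `blend_predictions()` fits the LASSO meta-learner that assigns each candidate its weight.

```r
# Blend the candidates from both tuning results into one model stack
library(stacks)

blended_gb_en <- stacks() %>%
  add_candidates(gb_tune) %>%
  add_candidates(en_tune) %>%
  blend_predictions()

blended_gb_en  # prints the selected candidates and their weights
```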
The stacks package builds a model that additively blends the predictions of the previously trained candidate models. The optimisation for this is built into the package and produced the output above. Interestingly, no elastic net candidate was chosen; instead, a linear combination of XGBoost models is selected. To proceed with predictions on the final holdout set, the stack is now fitted on the training data.
blended_gb_en <- blended_gb_en %>%
  fit_members()
Evaluating Model Performance On The Training Data
Using the fitted model to predict and evaluate on the test set:
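A sketch of this evaluation, using the object names introduced above (after `fit_members()` has been run, a model stack can be used with `predict()` like any fitted model):

```r
# Predict on the test split and compute R-squared on the log scale
library(dplyr)
library(yardstick)

blended_gb_en %>%
  predict(dt_test) %>%
  bind_cols(dt_test %>% select(price)) %>%
  rsq(truth = price, estimate = .pred)
```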
# A tibble: 1 × 4
.metric .estimator .estimate model
<chr> <chr> <dbl> <chr>
1 rsq standard 0.614 Stacked Model
Success! The blended model stack attained an \(R^2\) slightly higher than the individual XGBoost model on the test data set. This goes to show how stacking individual models can give the final predictions an additional edge.
This model ranks 5th of 30 on the SLICED competition leaderboard, which I believe speaks volumes about the power of XGBoost and stacking in competition settings, given that this post adds no new features and does no extensive tuning.
In conclusion, the models fit the data fairly well, even though the predictions are not highly impressive in absolute terms. To make them more accurate, more listing-level information would have been useful, for instance room size, capacity, proxies for luxuriousness, review details, and information on amenities.
I hope this post has been interesting to you. In case of constructive feedback or if you want to exchange about this or a related topic, feel free to reach out.