Stacking Models To Predict Airbnb Prices In Manhattan (NYC)
Machine Learning
R
Author
Mathias Steilen
Published
October 7, 2022
Abstract
The purpose of this post is to demonstrate the usefulness of the package stacks in blending individual machine learning models together into a linear combination of them (using LASSO regression), often increasing final model performance.
What are we looking at?
The data comes from Kaggle, the biggest online platform for machine learning enthusiasts hosting datasets and competitions around data science. More precisely, the data set was part of season 1 episode 5 of SLICED, a data science competition streamed by Nick Wan and Meg Risdal on Twitch.
This dataset covers the prices of Airbnb listings in New York City and serves as the example on which the stacking workflow is demonstrated.
Data Cleaning
Firstly, I load the data. The first file is used for training, whereas the holdout will only be used to submit out-of-sample predictions, as it does not contain the target variable. The training data is fairly large, holding information on 34,226 listings with 15 predictors and the response variable, the listing's price per night.
There is considerable missing data in the reviews per month (double) and last review (date) columns, and little missing data in the host name (character) and listing name (character) columns. All missing values can likely be imputed, provided they are missing at random.
The target variable distributions for missing and non-missing values in the two columns with considerable missingness look comparable and do not exhibit statistically significant differences. Imputing these values should therefore not pose a problem, though this conclusion has to be taken with a grain of salt: it is not possible to determine for sure from observed data whether a variable is missing at random; it can merely be assumed.
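The comparison described above can be sketched as follows. This is a minimal, hypothetical check, assuming a data frame `data` with the columns described earlier; the post's actual plots may differ.

```r
# Compare the price distribution for rows with and without a value
# in reviews_per_month (assumes `data` as loaded above)
library(dplyr)

data %>%
  mutate(missing = is.na(reviews_per_month)) %>%
  group_by(missing) %>%
  summarise(n = n(),
            mean_price = mean(price),
            median_price = median(price))

# A two-sample test gives a rough check on the difference in means
t.test(price ~ is.na(reviews_per_month), data = data)
```

Note that such a test only checks the observed target; it cannot prove the values are missing at random.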
The next step is to walk through the available predictors and understand relations to the target variable. Below, every variable is briefly looked at and presented, enabling a better understanding of the complete training data.
If you are just interested in how to build a model stack with stacks, feel free to skip this part and continue at Building And Training The Stacked Model.
id: unique identifier
The unique identifier column for each listing is a random number and should not hold any predictive power.
cor(data$price, data$id)
[1] 0.009927947
name: name of the listing
The name variable contains the title of the listing, which will be useful for tokenisation at a later stage.
data %>% count(name, sort = TRUE)
# A tibble: 33,705 × 2
name n
<chr> <int>
1 Hillside Hotel 15
2 New york Multi-unit building 11
3 Home away from home 10
4 Private room 10
5 Private Room 9
6 <NA> 9
7 Artsy Private BR in Fort Greene Cumberland 8
8 Loft Suite @ The Box House Hotel 8
9 Brooklyn Apartment 6
10 Harlem Gem 6
# ℹ 33,695 more rows
host_id: unique identifier for the host of the listing
The unique identifier column for each host is a random number and, in theory, should not hold any predictive power. However, with multiple listings per host and some hosts specialising in luxury apartments or affordable housing, for instance, the variable might carry additional signal. Therefore, it will still be included in the recipe.
cor(data$price, data$host_id)
[1] 0.01191348
host_name: name of the host
The host name variable contains the first names of the hosts. There is likely no deterministic component to them, except for known hosts like Blueground, who specialise in renting out furnished apartments for longer periods. Therefore, it will not be used as a predictive variable, as the relation might be too fragile for the out-of-sample application of the model.
neighbourhood_group: borough where the listing is located (e.g., “Manhattan”)
There are only five neighbourhood groups. However, there are very large differences in the price distributions between them, so they will be particularly useful as nominal predictors.
neighbourhood: neighborhood where the listing is located (e.g., “East Harlem”)
Neighbourhoods, similarly to neighbourhood groups, are useful indicators for prices of Airbnbs in NYC. However, due to the high cardinality, they will have to be lumped together in order not to exceed the memory limits of my machine.
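One way to lump rare levels together is `fct_lump_n()` from forcats. This is a sketch, not necessarily the post's exact approach (a recipe step such as `step_other()` would achieve the same inside the modelling pipeline); the threshold of 50 levels is an assumption.

```r
# Keep the 50 most frequent neighbourhoods, pool the rest into "Other"
library(forcats)
library(dplyr)

data %>%
  mutate(neighbourhood = fct_lump_n(neighbourhood, n = 50)) %>%
  count(neighbourhood, sort = TRUE)
```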
data %>%
  group_by(neighbourhood) %>%
  summarise(n = n(),
            mean_price = mean(price)) %>%
  mutate(neighbourhood = paste0(neighbourhood, " (N=", n, ")") %>%
           fct_reorder(mean_price)) %>%
  slice_max(order_by = n, n = 25) %>%
  ggplot(aes(mean_price, neighbourhood, size = n)) +
  geom_point(colour = "midnightblue") +
  labs(title = "Mean NYC Airbnb Prices By Neighbourhood",
       subtitle = "Only the top 25 most frequent neighbourhoods are shown.",
       x = "Mean Price Per Night",
       y = NULL,
       size = "Observation Count:") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_size_continuous(labels = scales::comma_format(), range = c(2, 5)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
latitude: latitude of the listing location
Latitude and longitude give information about the location of the listings. This enables me to make a map. Again, it looks like Manhattan is the most expensive place to rent an Airbnb, which will be useful for the model.
data %>%
  ggplot(aes(longitude, latitude, z = price)) +
  stat_bin_hex(bins = 100) +
  labs(title = "Hexagon Plot Of Log NYC Airbnb Prices",
       subtitle = "Longitude and latitude are binned into 100 hexagons.",
       x = NULL,
       y = NULL,
       fill = NULL) +
  coord_equal() +
  scale_alpha_continuous(range = c(0, 1), trans = "log") +
  scale_fill_gradient(low = "azure3", high = "red") +
  theme_void() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
longitude: longitude of the listing location
See above.
room_type: type of room (‘Entire home/apt’, ‘Private room’, or ‘Shared room’)
There are three room types. Entire homes/apartments are, unsurprisingly, the most expensive. Given the disparity between the groups, these nominal predictors will be able to explain a lot of variance in the target variable.
data %>%
  group_by(room_type) %>%
  summarise(n = n(),
            mean_price = mean(price)) %>%
  mutate(room_type = paste0(room_type, " (N=", n, ")") %>%
           fct_reorder(mean_price)) %>%
  slice_max(order_by = n, n = 25) %>%
  ggplot(aes(mean_price, room_type, size = n)) +
  geom_point(colour = "midnightblue") +
  labs(title = "Mean NYC Airbnb Prices By Room Type",
       subtitle = NULL,
       x = "Mean Price Per Night",
       y = NULL,
       size = "Observation Count:") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_size_continuous(labels = scales::comma_format(), range = c(2, 5)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
The combination of neighbourhood group and room type looks interesting for prediction. Staten Island apartments are relatively expensive compared to private rooms, while shared rooms in the Bronx rank relatively higher than the other two categories.
data %>%
  group_by(neighbourhood_group, room_type) %>%
  summarise(mean_price = mean(price),
            n = n()) %>%
  arrange(-mean_price) %>%
  ggplot(aes(neighbourhood_group %>% fct_reorder(mean_price),
             mean_price, fill = neighbourhood_group)) +
  geom_col() +
  facet_wrap(~ room_type) +
  labs(title = "Mean NYC Airbnb Prices By Room Type and Neighbourhood Group",
       subtitle = NULL,
       y = "Mean Price Per Night",
       x = "Neighbourhood Group") +
  scale_y_continuous(labels = scales::dollar_format()) +
  ggsci::scale_fill_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1))
price: cost for one night booking of the listing
The target is log-normally distributed, so a log transform would be appropriate for a linear model, for instance. Since SLICED uses RMSLE as its evaluation metric, the target variable will be log-transformed for the XGBoost model as well, so that the built-in RMSE metric of the tidymodels framework can be applied directly (a decision of convenience in this case).
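The equivalence behind this convenience can be verified in a few lines: RMSLE on the raw prices is exactly RMSE computed on \(\log(\text{price} + 1)\). The toy prices below are made up for illustration.

```r
# RMSLE on the raw target equals RMSE on the log(price + 1) scale
price_true <- c(50, 120, 300)
price_pred <- c(60, 100, 280)

rmsle       <- sqrt(mean((log(price_pred + 1) - log(price_true + 1))^2))
rmse_on_log <- sqrt(mean((log1p(price_pred) - log1p(price_true))^2))

all.equal(rmsle, rmse_on_log)  # TRUE
```

This is why minimising RMSE on the transformed target also minimises RMSLE on the original scale.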
minimum_nights: minimum number of nights required to book the listing
Creating 100 bins for minimum nights, taking the average price per night by percentile and plotting the correlation in a scatter plot reveals that minimum nights might have a positive effect on listing price. This likely has to do with apartments being more expensive than private rooms and also more likely to carry a contractual minimum-stay requirement.
data %>%
  mutate(minimum_nights_ntile = ntile(minimum_nights, n = 100)) %>%
  group_by(minimum_nights_ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(minimum_nights_ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Minimum Nights And NYC Airbnb Price",
       subtitle = "Minimum nights have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Minimum Nights Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
number_of_reviews: number of reviews the listing has
The number of reviews shows a slightly negative effect.
data %>%
  mutate(ntile = ntile(number_of_reviews, n = 100)) %>%
  group_by(ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Number Of Reviews And NYC Airbnb Price",
       subtitle = "Number of reviews have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Number of Reviews Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
last_review: date the last review of the listing was made
This variable will probably generate more value for the model if I transform it into days since last review, measured from today. This has to be done in the recipe, so that the same transformation is applied to the holdout set as well.
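Such a transformation could look roughly like the sketch below. The step and column names (`days_since_review`, the median imputation) are assumptions for illustration, not necessarily the post's exact recipe.

```r
# Hypothetical recipe step: convert last_review into days elapsed,
# then impute the remaining missing values with the median
library(recipes)

rec <- recipe(price ~ ., data = dt_train) %>%
  step_mutate(days_since_review =
                as.numeric(Sys.Date() - as.Date(last_review))) %>%
  step_impute_median(days_since_review)
```

Because the step lives in the recipe, it is computed identically on the training folds, the test set, and the final holdout.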
There does not seem to be any relation between days since last review and price. If anything, more recently reviewed Airbnbs have slightly higher prices, but the confidence bands do not indicate a statistically significant effect.
data %>%
  mutate(ntile = ntile(last_review, n = 100)) %>%
  group_by(ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Days Since Last Review And NYC Airbnb Price",
       subtitle = "Days since last review have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Days Since Last Review Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
reviews_per_month: number of reviews the listing gets per month on average
There seems to be a slightly negative relation between reviews per month and price.
data %>%
  mutate(ntile = ntile(reviews_per_month, n = 100)) %>%
  group_by(ntile) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price)) +
  geom_point() +
  geom_smooth(method = "loess") +
  labs(title = "Correlation Of Reviews Per Month And NYC Airbnb Price",
       subtitle = "Reviews per month have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Reviews Per Month Percentiles") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "none")
calculated_host_listings_count: number of listings the host has
It seems like hosts with the highest number of listings, that is, professional hosts, have higher room prices on average, though that might be a non-significant outlier.
The binned averages show a non-linear downward trend and a sudden increase in the highest ten percent for both entire homes and private rooms. This trend, however, is reversed for shared rooms. This means that hosts with exceptionally many listings likely have expensive homes or private rooms, but cheaper shared rooms. I wonder whether the first two reflect a dominant position of larger firms in the market, or whether it has something to do with the quality of the offerings. For the shared rooms, I wonder whether it has something to do with subsidised housing or larger firms specialising in affordable rooms and owning entire buildings of them. Either way, very interesting.
data %>%
  mutate(ntile = ntile(calculated_host_listings_count, n = 100)) %>%
  group_by(ntile, room_type) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price, colour = room_type)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Correlation Of Listings Per Host And NYC Airbnb Price",
       subtitle = "Listings per host have been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Listings Per Host (Percentiles)",
       colour = "Room Type") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  ggsci::scale_colour_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
availability_365: number of days out of the year the listing is available
Again, it is very interesting to observe an interaction between room type and another variable. For private rooms, availability throughout the year has virtually no effect on price, whereas more available shared rooms are usually cheaper and more available entire apartments and houses are the most expensive. I believe the latter might be due to location, as most touristy places are available all year round, therefore attracting people with concentrated spending power over shorter time periods. These offerings also have to compensate for the time they stand empty when no tourists are around.
Anyway, this being speculation, let’s get into the thick of it and continue with correlation analysis for the numeric predictors.
data %>%
  mutate(ntile = ntile(availability_365, n = 100)) %>%
  group_by(ntile, room_type) %>%
  summarise(mean_price = mean(price)) %>%
  ggplot(aes(ntile, mean_price, colour = room_type)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Correlation Of Availability And NYC Airbnb Price",
       subtitle = "Availability per year has been binned into 100 percentiles and averaged.",
       y = "Mean Price Per Night",
       x = "Availability Per Year (Percentiles)",
       colour = "Room Type") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::dollar_format()) +
  ggsci::scale_colour_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
Proceeding with correlations
Let’s take a look at the correlation matrix from the GGally package to gauge relations of numeric predictors with the target variable.
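A correlation matrix like the one referenced here could be produced roughly as follows; this is a sketch assuming the GGally package is installed, and the `label_round` setting is an assumption for readability.

```r
# Pairwise correlations of all numeric columns (assumes `data` as above)
library(dplyr)
library(GGally)

data %>%
  select(where(is.numeric)) %>%
  ggcorr(label = TRUE, label_round = 2)
```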
It does not look like the numeric variables are good predictors of the price in this setting. However, since correlation only captures linear associations, a non-linear machine learning model may well find relationships that are hidden right now.
Proceeding to look at variance inflation factors:
data %>%
  select_if(is.numeric) %>%
  lm(formula = price ~ .) %>%
  vif()
id host_id
2.164903 1.654317
latitude longitude
1.011668 1.067277
minimum_nights number_of_reviews
1.039035 2.322328
reviews_per_month calculated_host_listings_count
2.297050 1.082071
availability_365
1.141782
VIFs are below 10 for all numeric variables, so multicollinearity does not pose a problem that would need to be addressed here.
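As a reminder of what these numbers mean, the variance inflation factor for predictor \(j\) is

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}, \]

where \(R_j^2\) is the coefficient of determination from regressing predictor \(j\) on all other predictors. A VIF of 2.32 for the number of reviews thus corresponds to \(R_j^2 \approx 0.57\), well below the level usually considered problematic (a VIF of 10 corresponds to \(R_j^2 = 0.9\)).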
With the ideas gathered from the exploratory data analysis, I can now proceed with building the model.
Building And Training The Stacked Model
First, the data is split into training and testing sets. Also, three-fold cross-validation is employed for a reliable calculation of performance metrics while keeping run time manageable.
dt_split <- data %>%
  mutate(price = log(price + 1)) %>%
  initial_split(strata = "price")

dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 3, strata = "price")
The recipe in the tidymodels framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.
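Drawing on the ideas from the exploratory analysis, such a recipe could look roughly like this. All step choices below (the days-since-review feature, median imputation, level lumping via `step_other`, tokenisation of `name` via textrecipes) are assumptions sketching what the post describes, not its verbatim code.

```r
# Hypothetical recipe combining the feature engineering discussed above
library(tidymodels)
library(textrecipes)

rec <- recipe(price ~ ., data = dt_train) %>%
  update_role(id, new_role = "id") %>%                       # keep id out of the predictors
  step_mutate(days_since_review =
                as.numeric(Sys.Date() - as.Date(last_review))) %>%
  step_rm(last_review, host_name) %>%                        # drop raw date and host names
  step_impute_median(all_numeric_predictors()) %>%
  step_other(neighbourhood, threshold = 0.01) %>%            # lump rare neighbourhoods
  step_tokenize(name) %>%                                    # tokenise the listing title
  step_tokenfilter(name, max_tokens = 100) %>%
  step_tf(name) %>%
  step_dummy(all_nominal_predictors())
```

Because every step is estimated on the training data only, the same transformations carry over to the test set and holdout without leakage.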
In the model specification, the variable importance measure can be set; in this case it is calculated based on impurity. Proceeding with setting up the workflow:
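A boosted-tree specification and workflow could be sketched as below. The tuned hyperparameters and grid size are assumptions; the `control_stack_grid()` control is what later allows the tuning results to be used as stacking candidates.

```r
# Hypothetical XGBoost specification, workflow, and tuning setup
library(tidymodels)
library(stacks)

gb_spec <- boost_tree(trees = 1000,
                      tree_depth = tune(),
                      learn_rate = tune(),
                      mtry = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

gb_wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(gb_spec)

gb_tune <- gb_wflow %>%
  tune_grid(resamples = folds,
            grid = 10,
            control = control_stack_grid())  # saves predictions for stacking
```

An analogous specification with `linear_reg(penalty = tune(), mixture = tune())` and the glmnet engine would produce the elastic net candidates (`en_tune`).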
en_tune %>%
  show_best(metric = "rsq") %>%
  transmute(model = "Elastic Net", .metric, mean, n, std_err)
# A tibble: 5 × 5
model .metric mean n std_err
<chr> <chr> <dbl> <int> <dbl>
1 Elastic Net rsq 0.532 3 0.00559
2 Elastic Net rsq 0.532 3 0.00557
3 Elastic Net rsq 0.532 3 0.00544
4 Elastic Net rsq 0.532 3 0.00544
5 Elastic Net rsq 0.532 3 0.00554
Before creating a stacked model, let’s take a look at the variable importance within both individual models.
gb_final_wflow <- gb_wflow %>%
  finalize_workflow(select_best(gb_tune, metric = "rmse"))

gb_final_fit <- gb_final_wflow %>%
  last_fit(dt_split)

gb_final_fit %>%
  pluck(".workflow", 1) %>%
  extract_fit_parsnip() %>%
  vi() %>%
  slice_max(order_by = Importance, n = 20) %>%
  ggplot(aes(Importance, reorder(Variable, Importance))) +
  geom_col(fill = "midnightblue", colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the most important predictors are shown.",
       y = "Predictor",
       x = "Relative Variable Importance") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
For the XGBoost model, the type of room as well as the location, especially information about Manhattan, were important for predicting price. This shows how important the inclusion of nominal predictors was for model performance.
For the elastic net, interestingly, the neighbourhoods were very decisive. Only the 30 most important variables are shown, and most of them contain information on geographic location from the neighbourhood variable. Furthermore, most of the important variables negatively impact price.
With both these individual tuning results, a blended (“stacked”) model can easily be built with the stacks package.
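The stacking step itself is a short pipeline; the sketch below assumes the tuning results were saved with `control_stack_grid()` as shown earlier. `blend_predictions()` fits the LASSO meta-learner that assigns each candidate its weight.

```r
# Blend the candidates from both tuning results into one model stack
library(stacks)

blended_gb_en <- stacks() %>%
  add_candidates(gb_tune) %>%
  add_candidates(en_tune) %>%
  blend_predictions()

blended_gb_en  # prints the selected candidates and their weights
```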
The stacks package builds a model that additively blends the predictions of the previously trained candidate models. The optimisation for this is built into the package and produced the output above. Interestingly, no elastic net candidate was chosen; instead, a linear combination of XGBoost models is selected. To proceed with predictions on the final holdout set, the stack is now fitted on the training data.
blended_gb_en <- blended_gb_en %>%
  fit_members()
Evaluating Model Performance On The Training Data
Using the fitted model to predict and evaluate on the test set:
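A sketch of this evaluation, using the object names introduced above (after `fit_members()` has been run, a model stack can be used with `predict()` like any fitted model):

```r
# Predict on the test split and compute R-squared on the log scale
library(dplyr)
library(yardstick)

blended_gb_en %>%
  predict(dt_test) %>%
  bind_cols(dt_test %>% select(price)) %>%
  rsq(truth = price, estimate = .pred)
```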
# A tibble: 1 × 4
.metric .estimator .estimate model
<chr> <chr> <dbl> <chr>
1 rsq standard 0.614 Stacked Model
Success! The blended model stack attained an \(R^2\) slightly higher than the individual XGBoost model on the test data set. This goes to show how stacking individual models can give the final predictions an additional edge.
This model ranks 5th of 30 on the SLICED competition leaderboard, which I believe speaks volumes about the power of XGBoost and stacking in competition settings, given that this post adds no new features and does no extensive tuning.
In conclusion, the models fit the data fairly well, even though the predictions are not highly impressive in absolute terms. To make them more accurate, more listing-level information would have been useful, for instance room size, capacity, proxies for luxuriousness, review details, and information on amenities.
I hope this post has been interesting to you. In case of constructive feedback or if you want to exchange about this or a related topic, feel free to reach out.