Mathias Steilen - Deploying A Model Predicting Used Car Prices With Shiny

Abstract

The data for this post was scraped from a Swiss online platform. The goal of this post is to demonstrate a deployed model with Shiny, which allows users to enter information about their car and get back a prediction for the price in Swiss francs. A bonus is using tokenisation for extracting predictive information from the natural language title of each listing.

Shiny Application

The final result of this blog post can be interacted with below:

knitr::include_app("https://mathiassteilen.shinyapps.io/car_price_prediction_xgboost/", height = "600px")

Description of this project

The data for this post was scraped from the online platform Anibis. The goal of this post is to demonstrate how models can be deployed using Shiny. Specifically, the example above shows a simple UI, where users can enter information about their car and get back a prediction for the price in Swiss francs. Alternatively, I could have also allowed users to use the app at scale by submitting variables via a CSV, or by creating a RESTful API. The rest of the post below shows the scraping and model creation process.

Limitations of the model

Only 25% of all cars on Anibis were scraped, as it was sufficient for proof of concept. It would be no problem to extend the sample to a full 100% of all cars on Anibis and to make the model training more thorough in order to achieve higher accuracy. Additionally, there is no information on previous owners, potential previous accidents and the general state of the car with regard to maintenance. These are crucial pieces of information for a model predicting price. However, in this case, the goal was to show how a model could be hosted using Shiny, therefore the achieved \(R^2\) of around 75% is sufficient.

Data Scraping

Requesting sublinks to individual listings from site

pages <- 1:6500

start_urls <- paste0("https://www.anibis.ch/de/c/autos-autozubehoer-occasionen-neuwagen?pi=",
                     pages) %>% 
  as_tibble() %>% 
  rename(url = value)

start_urls

# Fetch subpages from listings from each overview page

listings <- start_urls %>%
  mutate(subpages = map(url, function(.x) {
    return(
      GET(.x, timeout(10)) %>% 
        read_html(.) %>%
        html_nodes("[class ='sc-1yo7ctu-0 bRDNul']") %>%
        html_attr('href') %>%
        as_tibble() %>%
        rename(subpage = value) %>%
        mutate(subpage = paste0("https://www.anibis.ch", subpage))
    )
  }))

listings

# Extract subpage urls and clean as tibble

subpage_urls <- listings %>% 
  select(subpages) %>% 
  unnest(subpages)

subpage_urls

# Read in html from each subpage (Danger of timeout here)

subpage_urls <- subpage_urls %>% 
  mutate(subpage_content = map(subpage, function(.x) {
    return(GET(.x, timeout(20)) %>% 
             read_html(.))
  }))

subpage_urls

Using for-loop to store full html in memory

tmp <- subpage_urls %>% 
  sample_n(40000)

subpages_content <- vector(mode = "list", length = nrow(tmp))

for (x in 1:nrow(tmp)){
  
  url_tmp <- tmp[x, "subpage"] %>% 
    pull()
  
  tryCatch(
    subpages_content[[x]] <- url_tmp %>%
      GET(., timeout(90)) %>% 
      read_html(.),
    error = function(e){NA}
  )
  
  print(paste("Link", x, "retrieved"))
  
}

subpage_content <- tibble(listing = subpages_content)

subpage_content <- subpage_content %>% 
  mutate(is_null = map(listing, is.null)) %>% 
  unnest(is_null) %>% 
  filter(is_null == FALSE) %>% 
  select(-is_null)

Extract text from scraped content into a tibble:

cars_raw <- tibble(
  listing_no = 1:nrow(subpage_content),
  listing = subpage_content %>% pull(),
  header = map(listing, function(.x){
      return(.x %>% 
        html_nodes(".fauvte") %>%
        html_text())
      }),
  content = map(listing, function(.x){
      return(.x %>% 
        html_nodes(".goTXZq") %>%
        html_text())
      }),
  price = map(listing, function(.x){
      return(.x %>% 
        html_node(".knSuBJ") %>%
        html_text())
      }),
  model_simple = map(listing, function(.x){
      return(.x %>% 
        html_node(".jOflgH .sc-258i13-0") %>%
        html_text())
      })
  ) %>%
  select(-listing) %>% 
  unnest(everything()) %>% 
  pivot_wider(values_from = content, names_from = header)

Data Cleaning

cars <- cars_raw %>% 
  mutate(across(c(price, Kilometer), ~ gsub("'", "", .x)),
         across(c(price, Kilometer, Baujahr,
                  `Leistung (PS)`), ~ parse_number(.x)),
         across(c(Marke, Modell, Getriebeart, Treibstoff,
                  Aufbau, Aussenfarbe, Antrieb), ~ trimws(.x)),
         across(`Letzte Änderung`, ~ lubridate::dmy(.x)),
         across(`Ab MFK`, ~ case_when(is.na(.x) ~ "no", .x == "" ~ "yes"))) %>% 
  rename(full_name = model_simple, brand = Marke, model = Modell,
         mileage_km = Kilometer, year = Baujahr, transmission = Getriebeart,
         fuel = Treibstoff, body_type = Aufbau, horsepower = `Leistung (PS)`,
         colour = Aussenfarbe, last_edited = `Letzte Änderung`,
         id = Inseratnummer, drive = Antrieb, mfk = `Ab MFK`,
         price_chf = price) %>% 
  filter(`Art des Inserats` == "Angebot",
         mileage_km < 1e6,
         price_chf < 1e6) %>% 
  select(- c(`Art des Inserats`))

Exploratory Data Analysis (EDA)

Before training a model, I will explore the clean data set and gauge relations between variables. Let’s first take a look at the frequency of categorical predictors. I lumped levels beyond \(N=15\) together, so that the y axis can be read properly.

cars %>% 
  select(where(is.character)) %>% 
  select(-c(full_name, model)) %>% 
  pivot_longer(everything()) %>% 
  drop_na() %>% 
  group_by(name) %>% 
  mutate(value = fct_lump(value, n = 15)) %>% 
  count(value) %>% 
  mutate(value = reorder_within(value, n, name)) %>% 
  ggplot(aes(n, value)) +
  geom_col(fill = "midnightblue", alpha = 0.8)  +
  facet_wrap(~ name, scales = "free", ncol = 4) +
  labs(title = "Frequency Of Used Car Properties on anibis.ch",
       subtitle = "Sample Size = 30,504 | Data as of 09/22",
       y = NULL,
       x = "Count") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_reordered() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"),
        panel.grid.major.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Unsurprisingly, most common brands are German car brands like VW, Mercedes, Audi and BMW. Next up, let’s inspect the numerical variables:

cars %>% 
  select(where(is.numeric)) %>% 
  select(-c(listing_no, id)) %>% 
  pivot_longer(everything()) %>%
  drop_na() %>% 
  ggplot(aes(value)) +
  stat_ecdf() +
  facet_wrap(~ name, scales = "free") +
  labs(title = "Cumulative Distribution Of Used Car Characteristics on anibis.ch",
       subtitle = "Sample Size = 30,504 | Data as of 09/22",
       y = NULL,
       x = NULL) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"),
        panel.grid.major.y = element_blank())

More interesting plots:

cars %>% 
  filter(year > 1995) %>% 
  count(transmission, year) %>% 
  drop_na() %>% 
  pivot_wider(values_from = n, names_from = transmission) %>% 
  mutate(pct_automatic = Automatik/(Automatik + Handschaltung)) %>% 
  ggplot(aes(year, pct_automatic)) +
  geom_area(alpha = 0.8, fill = "midnightblue") +
  labs(title = "Percentage of cars with automatic transmission by construction year",
       subtitle = "Sample size: 29,284 / Scraped from anibis.ch in 09/2022",
       y = "Percentage Automatic Cars",
       x = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

cars %>% 
  group_by(year) %>% 
  summarise(mean_hp = mean(horsepower, na.rm = T),
            median_hp = median(horsepower, na.rm = T)) %>% 
  filter(year > 1995) %>% 
  pivot_longer(-year) %>% 
  ggplot(aes(year, value, colour = name)) +
  geom_line(size = 1) +
  labs(title = "Horsepower of used cars by construction year",
       subtitle = "Sample size: 29,284 | Scraped from anibis.ch in 09/2022",
       y = "Horsepower",
       x = NULL,
       colour = NULL) +
  scale_y_continuous(labels = scales::comma_format()) +
  ggsci::scale_colour_futurama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

cars %>% 
  select(brand, horsepower) %>% 
  drop_na() %>%
  filter(horsepower > 0) %>% 
  mutate(brand = fct_lump(brand, n = 20)) %>% 
  add_count(brand) %>% 
  mutate(brand = paste0(brand, " (N=", n, ")"),
         brand = fct_reorder(brand, horsepower, .desc = TRUE)) %>% 
  ggplot(aes(horsepower, brand)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Horsepower distribution of used cars by brand",
       subtitle = "Sample Size: N = 29,610 | as of 09/22 | Scraped from anibis.ch in 09/2022",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " HP")) +
  coord_cartesian(xlim = c(0,750)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

cars %>% 
  select(brand, price_chf) %>% 
  drop_na() %>%
  filter(price_chf > 0) %>% 
  mutate(brand = fct_lump(brand, n = 20)) %>% 
  add_count(brand) %>% 
  mutate(brand = paste0(brand, " (N=", n, ")"),
         brand = fct_reorder(brand, price_chf, .desc = TRUE)) %>% 
  ggplot(aes(price_chf, brand)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Price distribution of used cars by brand",
       subtitle = "Sample Size: N = 29,610 | as of 09/22 | Scraped from anibis.ch in 09/2022",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF")) +
  coord_cartesian(xlim = c(0,3e5)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

cars %>% 
  select(year, price_chf, brand) %>% 
  drop_na() %>% 
  mutate(age = 2023 - year) %>% 
  filter(age < 20,
         age > 0,
         fct_lump(brand, n = 12) != "Other") %>% 
  group_by(brand, age) %>% 
  summarise(median_price = median(price_chf)) %>% 
  mutate(change = median_price/first(median_price)) %>% 
  select(age, brand, change) %>% 
  ggplot(aes(age, change, colour = brand)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.5, lty = "dashed", colour = "grey50") +
  facet_wrap(~ brand) +
  expand_limits(y = 0) +
  labs(title = "Used Car Price Change By Age on anibis.ch",
       subtitle = "sample size: n = 18,733 | as of 09/22 | Age 1 is 100%",
       x = "Vehicle Age",
       y = "Price (as %) compared to age = 1") +
  scale_y_continuous(labels = percent_format()) +
  ggsci::scale_colour_futurama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"),
        legend.position = "none")

cars %>%
  select(brand, body_type) %>% 
  drop_na() %>% 
  group_by(brand = fct_lump(brand, n = 10)) %>% 
  count(body_type) %>% 
  filter(n > 15) %>%  
  ggplot(aes(y = body_type %>% reorder_within(by = n, within = brand),
             x = n,
             fill = brand)) +
  geom_col() +
  facet_wrap(~ brand, scales = "free") +
  labs(title = "Body Types By Brand on anibis.ch",
       subtitle = "sample size: n = 29,989 | as of 09/22 | showing most frequent body types and brand",
       x = "Count",
       y = NULL) +
  scale_y_reordered() +
  scale_fill_manual(values = MetBrewer::met.brewer(name = "VanGogh1",
                                                   n = 12)) +
  guides(fill = "none") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

cars %>% 
  select(body_type, price_chf) %>% 
  drop_na() %>% 
  add_count(body_type) %>% 
  mutate(body_type = paste0(body_type, " (N=", n, ")"),
         body_type = fct_reorder(body_type, price_chf, .desc = TRUE)) %>% 
  ggplot(aes(price_chf, body_type)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Prices of used cars by body type",
       subtitle = "Sample Size: N = 30,065 | as of 09/22 | Scraped from anibis.ch in 09/2022",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF"),
                     breaks = seq(0, 250000, 50000)) +
  coord_cartesian(xlim = c(0,250000)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

set.seed(12)

cars %>% 
  sample_n(1000) %>% 
  mutate(age = 2022 - year) %>%
  select(price_chf, mileage_km, horsepower, age) %>% 
  drop_na() %>% 
  pivot_longer(-price_chf) %>% 
  ggplot(aes(value, price_chf)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = F) +
  labs(title = "Relation of Age, Horsepower and Mileage with Price") +
  facet_wrap(~ name, scales = "free") +
  scale_y_continuous(labels = comma_format(suffix = " CHF")) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

cars %>% 
  select(model, price_chf) %>% 
  drop_na() %>% 
  unnest_tokens(input = model, output = "tokens", token = "words") %>% 
  add_count(tokens, sort = T) %>% 
  group_by(tokens) %>% 
  summarise(median_price = median(price_chf),
            n = last(n)) %>% 
  filter(n > 1000) %>% 
  ggplot(aes(median_price, 
             tokens %>% fct_reorder(median_price))) +
  geom_col(fill = "midnightblue") +
  labs(title = "Components of car model names explain variance in prices",
       subtitle = "Sample Size: N = 30,496 | as of 09/22 | Scraped from anibis.ch in 09/2022\nModel names are tokenised by words. Only tokens with frequency N > 1000 are shown.",
       x = "Median Price associated with Token",
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_reordered() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10, 
                                     colour = "grey50"))

Fitting A Model

First, the data is split into training and testing sets. Also, five-fold cross validation is employed for reliable calculation of performance metrics, bearing in mind time efficiency.

dt_split <- cars %>% 
  mutate(across(where(is.character), as.factor)) %>% 
  initial_split()

dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 5)

The recipe in the tidymodels framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.

model_rec <- recipe(price_chf ~ .,
                 data = dt_train) %>%
  step_rm(listing_no, full_name, last_edited, id) %>% 
  step_mutate(horsepower = ifelse(horsepower == 0, NA, horsepower)) %>% 
  step_impute_median(all_numeric_predictors()) %>% 
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>% 
  step_tokenize(model) %>% 
  step_stopwords(model) %>% 
  step_tokenfilter(model, max_tokens = 500) %>% 
  step_tf(model) %>%
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_other(all_nominal_predictors(), threshold = 0.005) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)

Setting up the model specifications with tuning options for hyperparameters:

gb_spec <- 
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("regression")

Setting up the model workflow:

gb_wflow <- workflow() %>% 
  add_recipe(model_rec,
             blueprint = hardhat::default_recipe_blueprint(
               allow_novel_levels = TRUE
             )) %>% 
  add_model(gb_spec)

Setting up a space-filling design grid for time-efficient hyperparameter tuning:

gb_grid <- 
  grid_latin_hypercube(
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), dt_train),
    learn_rate(),
    size = 30
  )

Tuning the hyperparameters with parallel processing:

# Gradient Boosting
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list=ls(name=env), pos=env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

gb_tune <- tune_grid(object = gb_wflow,
                     resamples = folds,
                     grid = gb_grid,
                     control = control_grid(save_pred = TRUE,
                                            save_workflow = TRUE))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time

write_rds(gb_tune, "C:/Users/mathi/OneDrive/R/Data Visualisation/Used Cars Web Scraping/gb_tune.rds")

Looking at the tuning results reveals that the model captures strong signal in the predictors, as the \(R^2\) is fairly high. Though, as mentioned in the introduction, crucial variables are missing.

gb_tune %>% 
  show_best(metric = "rsq") %>% 
  transmute(model = "Gradient Boosting", .metric, mean, n, std_err)

# A tibble: 5 × 5
  model             .metric  mean     n std_err
  <chr>             <chr>   <dbl> <int>   <dbl>
1 Gradient Boosting rsq     0.726     5 0.00762
2 Gradient Boosting rsq     0.709     5 0.00849
3 Gradient Boosting rsq     0.665     5 0.00951
4 Gradient Boosting rsq     0.648     5 0.00975
5 Gradient Boosting rsq     0.573     5 0.0135

Fitting the best model from training on the entire training data:

gb_final_fit <- gb_wflow %>%
  finalize_workflow(select_best(gb_tune, metric = "rmse")) %>% 
  last_fit(dt_split)

Evaluating Model Performance On The Training Data

gb_final_fit %>% 
  augment(dt_test) %>% 
  ggplot(aes(price_chf, .pred)) +
  geom_point(alpha = 0.1, colour = "midnightblue") + 
  geom_abline(colour = "grey50", lty = "dashed") +
  labs(title = "Out-Of-Sample Fit",
       subtitle = NULL,
       y = "Prediction",
       x = "Truth") +
  scale_x_continuous(labels = comma_format(suffix = " CHF")) +
  scale_y_continuous(labels = comma_format(suffix = " CHF")) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")

gb_final_fit %>% 
  augment(dt_test) %>% 
  select(brand, year, price_chf, .pred) %>% 
  drop_na() %>% 
  filter(year > 1990) %>% 
  mutate(brand = fct_lump(brand, n = 11)) %>% 
  ggplot(aes(price_chf/1000, .pred/1000, colour = year)) +
  geom_point(alpha = 0.5, size = 0.1) +
  geom_abline(colour = "grey50", lty = "dashed") +
  facet_wrap(~ brand, scales = "free") +
  labs(title = "Out-of-sample fit by brand",
       subtitle = "",
       y = "Prediction",
       x = "Truth",
       colour = "Construction Year") +
  scale_x_continuous(labels = scales::comma_format(suffix = "k CHF")) +
  scale_y_continuous(labels = scales::comma_format(suffix = "k CHF")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))

The final fit is not highly impressive, as expected with highly important information for used car prices. Also, it can be seen that the fit is better for some brands than others.

This final model can be saved using the bundle package (Silge, Couch, Yan & Kuhn, 2022) and then read into the Shiny application, where it can make predictions on new data. As you probably have seen, this is what I have done, so you can interact with the final model in the application right at the start of this blog post.

I hope this has been interesting to you. In case of constructive feedback or if you want to exchange about this or a related topic, feel free to reach out. Thank you for reading.

A work by Mathias Steilen