Shiny Application
The final result of this blog post can be interacted with below:

knitr::include_app("https://mathiassteilen.shinyapps.io/car_price_prediction_xgboost/",
                   height = "600px")
Description of this project
The data for this post was scraped from the online platform Anibis. The goal of this post is to demonstrate how models can be deployed with Shiny. Specifically, the example above shows a simple UI where users can enter information about their car and get back a price prediction in Swiss francs. Alternatively, I could also have let users use the app at scale by submitting variables via a CSV upload, or by creating a RESTful API. The rest of the post shows the scraping and model creation process.
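As an aside, here is a minimal sketch of what the RESTful API alternative could look like with the plumber package. The file name and the bundle-based restore are assumptions of this sketch, not part of the deployed app:

# plumber.R - hypothetical sketch of a prediction endpoint
library(plumber)
library(bundle)
library(tidymodels)

# Assumption: the fitted workflow was bundled and saved to this file
model <- unbundle(readr::read_rds("car_price_model.rds"))

#* Predict the price in CHF for one or more cars sent as a JSON body
#* @post /predict
function(req) {
  new_cars <- as.data.frame(req$body)  # plumber parses the JSON body
  predict(model, new_data = new_cars)
}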
Limitations of the model
Only about 25% of all cars on Anibis were scraped, as this was sufficient for a proof of concept. It would be no problem to extend the sample to all cars on Anibis and to make the model training more thorough in order to achieve higher accuracy. Additionally, there is no information on previous owners, potential past accidents, or the general maintenance state of the cars. These are crucial pieces of information for a model predicting price. However, since the goal here was to show how a model can be hosted with Shiny, the achieved \(R^2\) of around 75% is sufficient.
Data Scraping
Requesting sublinks to individual listings from the site
pages <- 1:6500

start_urls <- paste0("https://www.anibis.ch/de/c/autos-autozubehoer-occasionen-neuwagen?pi=",
                     pages) %>%
  as_tibble() %>%
  rename(url = value)

start_urls
# Fetch subpages from listings from each overview page
listings <- start_urls %>%
  mutate(subpages = map(url, function(.x) {
    return(
      GET(.x, timeout(10)) %>%
        read_html(.) %>%
        html_nodes("[class ='sc-1yo7ctu-0 bRDNul']") %>%
        html_attr('href') %>%
        as_tibble() %>%
        rename(subpage = value) %>%
        mutate(subpage = paste0("https://www.anibis.ch", subpage))
    )
  }))

listings
# Extract subpage urls and clean as tibble
subpage_urls <- listings %>%
  select(subpages) %>%
  unnest(subpages)

subpage_urls
# Read in html from each subpage (Danger of timeout here)
subpage_urls <- subpage_urls %>%
  mutate(subpage_content = map(subpage, function(.x) {
    return(GET(.x, timeout(20)) %>%
             read_html(.))
  }))

subpage_urls
Using a for-loop to store the full HTML in memory
tmp <- subpage_urls %>%
  sample_n(40000)

subpages_content <- vector(mode = "list", length = nrow(tmp))

for (x in 1:nrow(tmp)){
  url_tmp <- tmp[x, "subpage"] %>%
    pull()

  tryCatch(
    subpages_content[[x]] <- url_tmp %>%
      GET(., timeout(90)) %>%
      read_html(.),
    error = function(e){NA}
  )

  print(paste("Link", x, "retrieved"))
}

subpage_content <- tibble(listing = subpages_content)

subpage_content <- subpage_content %>%
  mutate(is_null = map(listing, is.null)) %>%
  unnest(is_null) %>%
  filter(is_null == FALSE) %>%
  select(-is_null)
Extract text from scraped content into a tibble:
cars_raw <- tibble(
  listing_no = 1:nrow(subpage_content),
  listing = subpage_content %>% pull(),
  header = map(listing, function(.x){
    return(.x %>%
             html_nodes(".fauvte") %>%
             html_text())
  }),
  content = map(listing, function(.x){
    return(.x %>%
             html_nodes(".goTXZq") %>%
             html_text())
  }),
  price = map(listing, function(.x){
    return(.x %>%
             html_node(".knSuBJ") %>%
             html_text())
  }),
  model_simple = map(listing, function(.x){
    return(.x %>%
             html_node(".jOflgH .sc-258i13-0") %>%
             html_text())
  })
) %>%
  select(-listing) %>%
  unnest(everything()) %>%
  pivot_wider(values_from = content, names_from = header)
Data Cleaning
cars <- cars_raw %>%
  mutate(across(c(price, Kilometer), ~ gsub("'", "", .x)),
         across(c(price, Kilometer, Baujahr,
                  `Leistung (PS)`), ~ parse_number(.x)),
         across(c(Marke, Modell, Getriebeart, Treibstoff,
                  Aufbau, Aussenfarbe, Antrieb), ~ trimws(.x)),
         across(`Letzte Änderung`, ~ lubridate::dmy(.x)),
         across(`Ab MFK`, ~ case_when(is.na(.x) ~ "no", .x == "" ~ "yes"))) %>%
  rename(full_name = model_simple, brand = Marke, model = Modell,
         mileage_km = Kilometer, year = Baujahr, transmission = Getriebeart,
         fuel = Treibstoff, body_type = Aufbau, horsepower = `Leistung (PS)`,
         colour = Aussenfarbe, last_edited = `Letzte Änderung`,
         id = Inseratnummer, drive = Antrieb, mfk = `Ab MFK`,
         price_chf = price) %>%
  filter(`Art des Inserats` == "Angebot",
         mileage_km < 1e6,
         price_chf < 1e6) %>%
  select(-c(`Art des Inserats`))
Exploratory Data Analysis (EDA)
Before training a model, I will explore the cleaned data set and gauge the relations between variables. Let's first take a look at the frequency of categorical predictors. I lumped together all levels beyond the \(N = 15\) most frequent ones so that the y-axis remains readable.
cars %>%
  select(where(is.character)) %>%
  select(-c(full_name, model)) %>%
  pivot_longer(everything()) %>%
  drop_na() %>%
  group_by(name) %>%
  mutate(value = fct_lump(value, n = 15)) %>%
  count(value) %>%
  mutate(value = reorder_within(value, n, name)) %>%
  ggplot(aes(n, value)) +
  geom_col(fill = "midnightblue", alpha = 0.8) +
  facet_wrap(~ name, scales = "free", ncol = 4) +
  labs(title = "Frequency Of Used Car Properties on anibis.ch",
       subtitle = "Sample Size = 30,504 | Data as of 09/22",
       y = NULL,
       x = "Count") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_reordered() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        panel.grid.major.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Unsurprisingly, the most common brands are German ones like VW, Mercedes, Audi and BMW. Next up, let's inspect the numerical variables:
cars %>%
  select(where(is.numeric)) %>%
  select(-c(listing_no, id)) %>%
  pivot_longer(everything()) %>%
  drop_na() %>%
  ggplot(aes(value)) +
  stat_ecdf() +
  facet_wrap(~ name, scales = "free") +
  labs(title = "Cumulative Distribution Of Used Car Characteristics on anibis.ch",
       subtitle = "Sample Size = 30,504 | Data as of 09/22",
       y = NULL,
       x = NULL) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        panel.grid.major.y = element_blank())
More interesting plots:
cars %>%
  filter(year > 1995) %>%
  count(transmission, year) %>%
  drop_na() %>%
  pivot_wider(values_from = n, names_from = transmission) %>%
  mutate(pct_automatic = Automatik/(Automatik + Handschaltung)) %>%
  ggplot(aes(year, pct_automatic)) +
  geom_area(alpha = 0.8, fill = "midnightblue") +
  labs(title = "Percentage of cars with automatic transmission by construction year",
       subtitle = "Sample size: 29,284 / Scraped from anibis.ch in 09/2022",
       y = "Percentage Automatic Cars",
       x = NULL) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
cars %>%
  group_by(year) %>%
  summarise(mean_hp = mean(horsepower, na.rm = T),
            median_hp = median(horsepower, na.rm = T)) %>%
  filter(year > 1995) %>%
  pivot_longer(-year) %>%
  ggplot(aes(year, value, colour = name)) +
  geom_line(size = 1) +
  labs(title = "Horsepower of used cars by construction year",
       subtitle = "Sample size: 29,284 | Scraped from anibis.ch in 09/2022",
       y = "Horsepower",
       x = NULL,
       colour = NULL) +
  scale_y_continuous(labels = scales::comma_format()) +
  ggsci::scale_colour_futurama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
cars %>%
  select(brand, horsepower) %>%
  drop_na() %>%
  filter(horsepower > 0) %>%
  mutate(brand = fct_lump(brand, n = 20)) %>%
  add_count(brand) %>%
  mutate(brand = paste0(brand, " (N=", n, ")"),
         brand = fct_reorder(brand, horsepower, .desc = TRUE)) %>%
  ggplot(aes(horsepower, brand)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Horsepower distribution of used cars by brand",
       subtitle = "Sample Size: N = 29,610 | as of 09/22 | Scraped from anibis.ch in 09/2022",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " HP")) +
  coord_cartesian(xlim = c(0, 750)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
cars %>%
  select(brand, price_chf) %>%
  drop_na() %>%
  filter(price_chf > 0) %>%
  mutate(brand = fct_lump(brand, n = 20)) %>%
  add_count(brand) %>%
  mutate(brand = paste0(brand, " (N=", n, ")"),
         brand = fct_reorder(brand, price_chf, .desc = TRUE)) %>%
  ggplot(aes(price_chf, brand)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Price distribution of used cars by brand",
       subtitle = "Sample Size: N = 29,610 | as of 09/22 | Scraped from anibis.ch in 09/2022",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF")) +
  coord_cartesian(xlim = c(0, 3e5)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
cars %>%
  select(year, price_chf, brand) %>%
  drop_na() %>%
  mutate(age = 2023 - year) %>%
  filter(age < 20,
         age > 0,
         fct_lump(brand, n = 12) != "Other") %>%
  group_by(brand, age) %>%
  summarise(median_price = median(price_chf)) %>%
  mutate(change = median_price/first(median_price)) %>%
  select(age, brand, change) %>%
  ggplot(aes(age, change, colour = brand)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.5, lty = "dashed", colour = "grey50") +
  facet_wrap(~ brand) +
  expand_limits(y = 0) +
  labs(title = "Used Car Price Change By Age on anibis.ch",
       subtitle = "sample size: n = 18,733 | as of 09/22 | Age 1 is 100%",
       x = "Vehicle Age",
       y = "Price (as %) compared to age = 1") +
  scale_y_continuous(labels = percent_format()) +
  ggsci::scale_colour_futurama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        legend.position = "none")
cars %>%
  select(brand, body_type) %>%
  drop_na() %>%
  group_by(brand = fct_lump(brand, n = 10)) %>%
  count(body_type) %>%
  filter(n > 15) %>%
  ggplot(aes(y = body_type %>% reorder_within(by = n, within = brand),
             x = n,
             fill = brand)) +
  geom_col() +
  facet_wrap(~ brand, scales = "free") +
  labs(title = "Body Types By Brand on anibis.ch",
       subtitle = "sample size: n = 29,989 | as of 09/22 | showing most frequent body types and brand",
       x = "Count",
       y = NULL) +
  scale_y_reordered() +
  scale_fill_manual(values = MetBrewer::met.brewer(name = "VanGogh1",
                                                   n = 12)) +
  guides(fill = "none") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
cars %>%
  select(body_type, price_chf) %>%
  drop_na() %>%
  add_count(body_type) %>%
  mutate(body_type = paste0(body_type, " (N=", n, ")"),
         body_type = fct_reorder(body_type, price_chf, .desc = TRUE)) %>%
  ggplot(aes(price_chf, body_type)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Prices of used cars by body type",
       subtitle = "Sample Size: N = 30,065 | as of 09/22 | Scraped from anibis.ch in 09/2022",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF"),
                     breaks = seq(0, 250000, 50000)) +
  coord_cartesian(xlim = c(0, 250000)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
set.seed(12)
cars %>%
  sample_n(1000) %>%
  mutate(age = 2022 - year) %>%
  select(price_chf, mileage_km, horsepower, age) %>%
  drop_na() %>%
  pivot_longer(-price_chf) %>%
  ggplot(aes(value, price_chf)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = F) +
  labs(title = "Relation of Age, Horsepower and Mileage with Price") +
  facet_wrap(~ name, scales = "free") +
  scale_y_continuous(labels = comma_format(suffix = " CHF")) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
cars %>%
  select(model, price_chf) %>%
  drop_na() %>%
  unnest_tokens(input = model, output = "tokens", token = "words") %>%
  add_count(tokens, sort = T) %>%
  group_by(tokens) %>%
  summarise(median_price = median(price_chf),
            n = last(n)) %>%
  filter(n > 1000) %>%
  ggplot(aes(median_price,
             tokens %>% fct_reorder(median_price))) +
  geom_col(fill = "midnightblue") +
  labs(title = "Components of car model names explain variance in prices",
       subtitle = "Sample Size: N = 30,496 | as of 09/22 | Scraped from anibis.ch in 09/2022\nModel names are tokenised by words. Only tokens with frequency N > 1000 are shown.",
       x = "Median Price associated with Token",
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_reordered() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
Fitting A Model
First, the data is split into training and testing sets. Five-fold cross-validation is employed on the training set for reliable calculation of performance metrics while keeping training time manageable.
dt_split <- cars %>%
  mutate(across(where(is.character), as.factor)) %>%
  initial_split()

dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 5)
A recipe from the tidymodels framework makes it straightforward to bundle all feature engineering into one object, preventing data leakage from the test set and applying the same steps uniformly to the holdout in the final fit.
model_rec <- recipe(price_chf ~ .,
                    data = dt_train) %>%
  step_rm(listing_no, full_name, last_edited, id) %>%
  step_mutate(horsepower = ifelse(horsepower == 0, NA, horsepower)) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_tokenize(model) %>%
  step_stopwords(model) %>%
  step_tokenfilter(model, max_tokens = 500) %>%
  step_tf(model) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.005) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE)
Setting up the model specifications with tuning options for hyperparameters:
gb_spec <-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("regression")
Setting up the model workflow:
gb_wflow <- workflow() %>%
  add_recipe(model_rec,
             blueprint = hardhat::default_recipe_blueprint(
               allow_novel_levels = TRUE
             )) %>%
  add_model(gb_spec)
Setting up a space-filling design grid for time-efficient hyperparameter tuning:
gb_grid <-
  grid_latin_hypercube(
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), dt_train),
    learn_rate(),
    size = 30
  )
Tuning the hyperparameters with parallel processing:
# Gradient Boosting
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

gb_tune <- tune_grid(object = gb_wflow,
                     resamples = folds,
                     grid = gb_grid,
                     control = control_grid(save_pred = TRUE,
                                            save_workflow = TRUE))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time
write_rds(gb_tune, "C:/Users/mathi/OneDrive/R/Data Visualisation/Used Cars Web Scraping/gb_tune.rds")
Looking at the tuning results reveals that the model captures a strong signal in the predictors, as the \(R^2\) is fairly high. However, as mentioned in the introduction, crucial variables are missing.
gb_tune %>%
  show_best(metric = "rsq") %>%
  transmute(model = "Gradient Boosting", .metric, mean, n, std_err)
# A tibble: 5 × 5
model .metric mean n std_err
<chr> <chr> <dbl> <int> <dbl>
1 Gradient Boosting rsq 0.726 5 0.00762
2 Gradient Boosting rsq 0.709 5 0.00849
3 Gradient Boosting rsq 0.665 5 0.00951
4 Gradient Boosting rsq 0.648 5 0.00975
5 Gradient Boosting rsq 0.573 5 0.0135
Fitting the best model from tuning on the entire training data:
gb_final_fit <- gb_wflow %>%
  finalize_workflow(select_best(gb_tune, metric = "rmse")) %>%
  last_fit(dt_split)
Evaluating Model Performance On The Testing Data
gb_final_fit %>%
  augment(dt_test) %>%
  ggplot(aes(price_chf, .pred)) +
  geom_point(alpha = 0.1, colour = "midnightblue") +
  geom_abline(colour = "grey50", lty = "dashed") +
  labs(title = "Out-Of-Sample Fit",
       subtitle = NULL,
       y = "Prediction",
       x = "Truth") +
  scale_x_continuous(labels = comma_format(suffix = " CHF")) +
  scale_y_continuous(labels = comma_format(suffix = " CHF")) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
gb_final_fit %>%
  augment(dt_test) %>%
  select(brand, year, price_chf, .pred) %>%
  drop_na() %>%
  filter(year > 1990) %>%
  mutate(brand = fct_lump(brand, n = 11)) %>%
  ggplot(aes(price_chf/1000, .pred/1000, colour = year)) +
  geom_point(alpha = 0.5, size = 0.1) +
  geom_abline(colour = "grey50", lty = "dashed") +
  facet_wrap(~ brand, scales = "free") +
  labs(title = "Out-of-sample fit by brand",
       subtitle = "",
       y = "Prediction",
       x = "Truth",
       colour = "Construction Year") +
  scale_x_continuous(labels = scales::comma_format(suffix = "k CHF")) +
  scale_y_continuous(labels = scales::comma_format(suffix = "k CHF")) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
The final fit is not highly impressive, as expected given that highly important determinants of used car prices are missing from the data. It can also be seen that the fit is better for some brands than for others.
This final model can be saved using the bundle package (Silge, Couch, Yan & Kuhn, 2022) and then read into the Shiny application, where it makes predictions on new data. As you have probably seen, this is exactly what I did, so you can interact with the final model in the application right at the start of this blog post.
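As a rough sketch of that round trip (the file name and the tibble of user inputs are placeholders for illustration, not the app's actual code):

library(bundle)

# Pull the fitted workflow out of the last_fit() object and bundle it
fitted_wflow <- tune::extract_workflow(gb_final_fit)
readr::write_rds(bundle(fitted_wflow), "car_price_model.rds")

# Inside the Shiny app's server function:
model <- unbundle(readr::read_rds("car_price_model.rds"))
predict(model, new_data = new_car_tibble)  # new_car_tibble holds the user's inputs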
I hope this has been interesting to you. If you have constructive feedback or would like to discuss this or a related topic, feel free to reach out. Thank you for reading.
A work by Mathias Steilen