knitr::include_app("https://mathiassteilen.shinyapps.io/car_price_prediction_xgboost/", height = "600px")Shiny Application
The final result of this blog post can be interacted with below:
Description of this project
The data for this post was scraped from the online platform Anibis. The goal of this post is to demonstrate how models can be deployed using Shiny. Specifically, the example above shows a simple UI, where users can enter information about their car and get back a prediction for the price in Swiss francs. Alternatively, I could have also allowed users to use the app at scale by submitting variables via a CSV, or by creating a RESTful API. The rest of the post below shows the scraping and model creation process.
Limitations of the model
Only 25% of all cars on Anibis were scraped, as it was sufficient for proof of concept. It would be no problem to extend the sample to a full 100% of all cars on Anibis and to make the model training more thorough in order to achieve higher accuracy. Additionally, there is no information on previous owners, potential previous accidents and the general state of the car with regard to maintenance. These are crucial pieces of information for a model predicting price. However, in this case, the goal was to show how a model could be hosted using Shiny, therefore the achieved \(R^2\) of around 75% is sufficient.
Data Scraping
Requesting sublinks to individual listings from site
pages <- 1:6500
start_urls <- paste0("https://www.anibis.ch/de/c/autos-autozubehoer-occasionen-neuwagen?pi=",
pages) %>%
as_tibble() %>%
rename(url = value)
start_urls
# Fetch subpages from listings from each overview page
listings <- start_urls %>%
mutate(subpages = map(url, function(.x) {
return(
GET(.x, timeout(10)) %>%
read_html(.) %>%
html_nodes("[class ='sc-1yo7ctu-0 bRDNul']") %>%
html_attr('href') %>%
as_tibble() %>%
rename(subpage = value) %>%
mutate(subpage = paste0("https://www.anibis.ch", subpage))
)
}))
listings
# Extract subpage urls and clean as tibble
subpage_urls <- listings %>%
select(subpages) %>%
unnest(subpages)
subpage_urls
# Read in html from each subpage (Danger of timeout here)
subpage_urls <- subpage_urls %>%
mutate(subpage_content = map(subpage, function(.x) {
return(GET(.x, timeout(20)) %>%
read_html(.))
}))
subpage_urlsUsing for-loop to store full html in memory
tmp <- subpage_urls %>%
sample_n(40000)
subpages_content <- vector(mode = "list", length = nrow(tmp))
for (x in 1:nrow(tmp)){
url_tmp <- tmp[x, "subpage"] %>%
pull()
tryCatch(
subpages_content[[x]] <- url_tmp %>%
GET(., timeout(90)) %>%
read_html(.),
error = function(e){NA}
)
print(paste("Link", x, "retrieved"))
}
subpage_content <- tibble(listing = subpages_content)
subpage_content <- subpage_content %>%
mutate(is_null = map(listing, is.null)) %>%
unnest(is_null) %>%
filter(is_null == FALSE) %>%
select(-is_null)Extract text from scraped content into a tibble:
cars_raw <- tibble(
listing_no = 1:nrow(subpage_content),
listing = subpage_content %>% pull(),
header = map(listing, function(.x){
return(.x %>%
html_nodes(".fauvte") %>%
html_text())
}),
content = map(listing, function(.x){
return(.x %>%
html_nodes(".goTXZq") %>%
html_text())
}),
price = map(listing, function(.x){
return(.x %>%
html_node(".knSuBJ") %>%
html_text())
}),
model_simple = map(listing, function(.x){
return(.x %>%
html_node(".jOflgH .sc-258i13-0") %>%
html_text())
})
) %>%
select(-listing) %>%
unnest(everything()) %>%
pivot_wider(values_from = content, names_from = header)Data Cleaning
cars <- cars_raw %>%
mutate(across(c(price, Kilometer), ~ gsub("'", "", .x)),
across(c(price, Kilometer, Baujahr,
`Leistung (PS)`), ~ parse_number(.x)),
across(c(Marke, Modell, Getriebeart, Treibstoff,
Aufbau, Aussenfarbe, Antrieb), ~ trimws(.x)),
across(`Letzte Änderung`, ~ lubridate::dmy(.x)),
across(`Ab MFK`, ~ case_when(is.na(.x) ~ "no", .x == "" ~ "yes"))) %>%
rename(full_name = model_simple, brand = Marke, model = Modell,
mileage_km = Kilometer, year = Baujahr, transmission = Getriebeart,
fuel = Treibstoff, body_type = Aufbau, horsepower = `Leistung (PS)`,
colour = Aussenfarbe, last_edited = `Letzte Änderung`,
id = Inseratnummer, drive = Antrieb, mfk = `Ab MFK`,
price_chf = price) %>%
filter(`Art des Inserats` == "Angebot",
mileage_km < 1e6,
price_chf < 1e6) %>%
select(- c(`Art des Inserats`))Exploratory Data Analysis (EDA)
Before training a model, I will explore the clean data set and gauge relations between variables. Let’s first take a look at the frequency of categorical predictors. I lumped levels beyond \(N=15\) together, so that the y axis can be read properly.
cars %>%
select(where(is.character)) %>%
select(-c(full_name, model)) %>%
pivot_longer(everything()) %>%
drop_na() %>%
group_by(name) %>%
mutate(value = fct_lump(value, n = 15)) %>%
count(value) %>%
mutate(value = reorder_within(value, n, name)) %>%
ggplot(aes(n, value)) +
geom_col(fill = "midnightblue", alpha = 0.8) +
facet_wrap(~ name, scales = "free", ncol = 4) +
labs(title = "Frequency Of Used Car Properties on anibis.ch",
subtitle = "Sample Size = 30,504 | Data as of 09/22",
y = NULL,
x = "Count") +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_reordered() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"),
panel.grid.major.y = element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))Unsurprisingly, most common brands are German car brands like VW, Mercedes, Audi and BMW. Next up, let’s inspect the numerical variables:
cars %>%
select(where(is.numeric)) %>%
select(-c(listing_no, id)) %>%
pivot_longer(everything()) %>%
drop_na() %>%
ggplot(aes(value)) +
stat_ecdf() +
facet_wrap(~ name, scales = "free") +
labs(title = "Cumulative Distribution Of Used Car Characteristics on anibis.ch",
subtitle = "Sample Size = 30,504 | Data as of 09/22",
y = NULL,
x = NULL) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"),
panel.grid.major.y = element_blank())More interesting plots:
cars %>%
filter(year > 1995) %>%
count(transmission, year) %>%
drop_na() %>%
pivot_wider(values_from = n, names_from = transmission) %>%
mutate(pct_automatic = Automatik/(Automatik + Handschaltung)) %>%
ggplot(aes(year, pct_automatic)) +
geom_area(alpha = 0.8, fill = "midnightblue") +
labs(title = "Percentage of cars with automatic transmission by construction year",
subtitle = "Sample size: 29,284 / Scraped from anibis.ch in 09/2022",
y = "Percentage Automatic Cars",
x = NULL) +
scale_y_continuous(labels = scales::percent_format()) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))cars %>%
group_by(year) %>%
summarise(mean_hp = mean(horsepower, na.rm = T),
median_hp = median(horsepower, na.rm = T)) %>%
filter(year > 1995) %>%
pivot_longer(-year) %>%
ggplot(aes(year, value, colour = name)) +
geom_line(size = 1) +
labs(title = "Horsepower of used cars by construction year",
subtitle = "Sample size: 29,284 | Scraped from anibis.ch in 09/2022",
y = "Horsepower",
x = NULL,
colour = NULL) +
scale_y_continuous(labels = scales::comma_format()) +
ggsci::scale_colour_futurama() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))cars %>%
select(brand, horsepower) %>%
drop_na() %>%
filter(horsepower > 0) %>%
mutate(brand = fct_lump(brand, n = 20)) %>%
add_count(brand) %>%
mutate(brand = paste0(brand, " (N=", n, ")"),
brand = fct_reorder(brand, horsepower, .desc = TRUE)) %>%
ggplot(aes(horsepower, brand)) +
geom_boxplot(outlier.colour = NA) +
labs(title = "Horsepower distribution of used cars by brand",
subtitle = "Sample Size: N = 29,610 | as of 09/22 | Scraped from anibis.ch in 09/2022",
x = NULL,
y = NULL) +
scale_x_continuous(labels = scales::comma_format(suffix = " HP")) +
coord_cartesian(xlim = c(0,750)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))cars %>%
select(brand, price_chf) %>%
drop_na() %>%
filter(price_chf > 0) %>%
mutate(brand = fct_lump(brand, n = 20)) %>%
add_count(brand) %>%
mutate(brand = paste0(brand, " (N=", n, ")"),
brand = fct_reorder(brand, price_chf, .desc = TRUE)) %>%
ggplot(aes(price_chf, brand)) +
geom_boxplot(outlier.colour = NA) +
labs(title = "Price distribution of used cars by brand",
subtitle = "Sample Size: N = 29,610 | as of 09/22 | Scraped from anibis.ch in 09/2022",
x = NULL,
y = NULL) +
scale_x_continuous(labels = scales::comma_format(suffix = " CHF")) +
coord_cartesian(xlim = c(0,3e5)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))cars %>%
select(year, price_chf, brand) %>%
drop_na() %>%
mutate(age = 2023 - year) %>%
filter(age < 20,
age > 0,
fct_lump(brand, n = 12) != "Other") %>%
group_by(brand, age) %>%
summarise(median_price = median(price_chf)) %>%
mutate(change = median_price/first(median_price)) %>%
select(age, brand, change) %>%
ggplot(aes(age, change, colour = brand)) +
geom_point() +
geom_line() +
geom_hline(yintercept = 0.5, lty = "dashed", colour = "grey50") +
facet_wrap(~ brand) +
expand_limits(y = 0) +
labs(title = "Used Car Price Change By Age on anibis.ch",
subtitle = "sample size: n = 18,733 | as of 09/22 | Age 1 is 100%",
x = "Vehicle Age",
y = "Price (as %) compared to age = 1") +
scale_y_continuous(labels = percent_format()) +
ggsci::scale_colour_futurama() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"),
legend.position = "none")cars %>%
select(brand, body_type) %>%
drop_na() %>%
group_by(brand = fct_lump(brand, n = 10)) %>%
count(body_type) %>%
filter(n > 15) %>%
ggplot(aes(y = body_type %>% reorder_within(by = n, within = brand),
x = n,
fill = brand)) +
geom_col() +
facet_wrap(~ brand, scales = "free") +
labs(title = "Body Types By Brand on anibis.ch",
subtitle = "sample size: n = 29,989 | as of 09/22 | showing most frequent body types and brand",
x = "Count",
y = NULL) +
scale_y_reordered() +
scale_fill_manual(values = MetBrewer::met.brewer(name = "VanGogh1",
n = 12)) +
guides(fill = "none") +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))cars %>%
select(body_type, price_chf) %>%
drop_na() %>%
add_count(body_type) %>%
mutate(body_type = paste0(body_type, " (N=", n, ")"),
body_type = fct_reorder(body_type, price_chf, .desc = TRUE)) %>%
ggplot(aes(price_chf, body_type)) +
geom_boxplot(outlier.colour = NA) +
labs(title = "Prices of used cars by body type",
subtitle = "Sample Size: N = 30,065 | as of 09/22 | Scraped from anibis.ch in 09/2022",
x = NULL,
y = NULL) +
scale_x_continuous(labels = scales::comma_format(suffix = " CHF"),
breaks = seq(0, 250000, 50000)) +
coord_cartesian(xlim = c(0,250000)) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))set.seed(12)
cars %>%
sample_n(1000) %>%
mutate(age = 2022 - year) %>%
select(price_chf, mileage_km, horsepower, age) %>%
drop_na() %>%
pivot_longer(-price_chf) %>%
ggplot(aes(value, price_chf)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = F) +
labs(title = "Relation of Age, Horsepower and Mileage with Price") +
facet_wrap(~ name, scales = "free") +
scale_y_continuous(labels = comma_format(suffix = " CHF")) +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))cars %>%
select(model, price_chf) %>%
drop_na() %>%
unnest_tokens(input = model, output = "tokens", token = "words") %>%
add_count(tokens, sort = T) %>%
group_by(tokens) %>%
summarise(median_price = median(price_chf),
n = last(n)) %>%
filter(n > 1000) %>%
ggplot(aes(median_price,
tokens %>% fct_reorder(median_price))) +
geom_col(fill = "midnightblue") +
labs(title = "Components of car model names explain variance in prices",
subtitle = "Sample Size: N = 30,496 | as of 09/22 | Scraped from anibis.ch in 09/2022\nModel names are tokenised by words. Only tokens with frequency N > 1000 are shown.",
x = "Median Price associated with Token",
y = NULL) +
scale_x_continuous(labels = scales::comma_format()) +
scale_y_reordered() +
theme_bw() +
theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(face = "italic", size = 10,
colour = "grey50"))Fitting A Model
First, the data is split into training and testing sets. Also, five-fold cross validation is employed for reliable calculation of performance metrics, bearing in mind time efficiency.
dt_split <- cars %>%
mutate(across(where(is.character), as.factor)) %>%
initial_split()
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
folds <- vfold_cv(dt_train, v = 5)The recipe in the tidymodels framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.
model_rec <- recipe(price_chf ~ .,
data = dt_train) %>%
step_rm(listing_no, full_name, last_edited, id) %>%
step_mutate(horsepower = ifelse(horsepower == 0, NA, horsepower)) %>%
step_impute_median(all_numeric_predictors()) %>%
step_novel(all_nominal_predictors()) %>%
step_unknown(all_nominal_predictors()) %>%
step_tokenize(model) %>%
step_stopwords(model) %>%
step_tokenfilter(model, max_tokens = 500) %>%
step_tf(model) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors()) %>%
step_other(all_nominal_predictors(), threshold = 0.005) %>%
step_dummy(all_nominal_predictors(), one_hot = TRUE)Setting up the model specifications with tuning options for hyperparameters:
gb_spec <-
boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost", importance = "impurity") %>%
set_mode("regression")Setting up the model workflow:
gb_wflow <- workflow() %>%
add_recipe(model_rec,
blueprint = hardhat::default_recipe_blueprint(
allow_novel_levels = TRUE
)) %>%
add_model(gb_spec)Setting up a space-filling design grid for time-efficient hyperparameter tuning:
gb_grid <-
grid_latin_hypercube(
tree_depth(),
min_n(),
loss_reduction(),
sample_size = sample_prop(),
finalize(mtry(), dt_train),
learn_rate(),
size = 30
)Tuning the hyperparameters with parallel processing:
# Gradient Boosting
start_time = Sys.time()
unregister_dopar <- function() {
env <- foreach:::.foreachGlobals
rm(list=ls(name=env), pos=env)
}
cl <- makePSOCKcluster(6)
registerDoParallel(cl)
gb_tune <- tune_grid(object = gb_wflow,
resamples = folds,
grid = gb_grid,
control = control_grid(save_pred = TRUE,
save_workflow = TRUE))
stopCluster(cl)
unregister_dopar()
end_time = Sys.time()
end_time - start_timewrite_rds(gb_tune, "C:/Users/mathi/OneDrive/R/Data Visualisation/Used Cars Web Scraping/gb_tune.rds")Looking at the tuning results reveals that the model captures strong signal in the predictors, as the \(R^2\) is fairly high. Though, as mentioned in the introduction, crucial variables are missing.
gb_tune %>%
show_best(metric = "rsq") %>%
transmute(model = "Gradient Boosting", .metric, mean, n, std_err)# A tibble: 5 × 5
model .metric mean n std_err
<chr> <chr> <dbl> <int> <dbl>
1 Gradient Boosting rsq 0.726 5 0.00762
2 Gradient Boosting rsq 0.709 5 0.00849
3 Gradient Boosting rsq 0.665 5 0.00951
4 Gradient Boosting rsq 0.648 5 0.00975
5 Gradient Boosting rsq 0.573 5 0.0135
Fitting the best model from training on the entire training data:
gb_final_fit <- gb_wflow %>%
finalize_workflow(select_best(gb_tune, metric = "rmse")) %>%
last_fit(dt_split)Evaluating Model Performance On The Training Data
gb_final_fit %>%
augment(dt_test) %>%
ggplot(aes(price_chf, .pred)) +
geom_point(alpha = 0.1, colour = "midnightblue") +
geom_abline(colour = "grey50", lty = "dashed") +
labs(title = "Out-Of-Sample Fit",
subtitle = NULL,
y = "Prediction",
x = "Truth") +
scale_x_continuous(labels = comma_format(suffix = " CHF")) +
scale_y_continuous(labels = comma_format(suffix = " CHF")) +
theme_light() +
theme(plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"),
legend.position = "bottom")gb_final_fit %>%
augment(dt_test) %>%
select(brand, year, price_chf, .pred) %>%
drop_na() %>%
filter(year > 1990) %>%
mutate(brand = fct_lump(brand, n = 11)) %>%
ggplot(aes(price_chf/1000, .pred/1000, colour = year)) +
geom_point(alpha = 0.5, size = 0.1) +
geom_abline(colour = "grey50", lty = "dashed") +
facet_wrap(~ brand, scales = "free") +
labs(title = "Out-of-sample fit by brand",
subtitle = "",
y = "Prediction",
x = "Truth",
colour = "Construction Year") +
scale_x_continuous(labels = scales::comma_format(suffix = "k CHF")) +
scale_y_continuous(labels = scales::comma_format(suffix = "k CHF")) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
plot.title = element_text(face = "bold", size = 12),
plot.subtitle = element_text(face = "italic", colour = "grey50"))The final fit is not highly impressive, as expected with highly important information for used car prices. Also, it can be seen that the fit is better for some brands than others.
This final model can be saved using the bundle package (Silge, Couch, Yan & Kuhn, 2022) and then read into the Shiny application, where it can make predictions on new data. As you probably have seen, this is what I have done, so you can interact with the final model in the application right at the start of this blog post.
I hope this has been interesting to you. In case of constructive feedback or if you want to exchange about this or a related topic, feel free to reach out. Thank you for reading.
A work by Mathias Steilen