Description of the Data
This blog post is about creating your own data sets by scraping websites with rvest. Specifically, I will scrape all available used motorcycles from the online platform Anibis, explore the data, draw conclusions and train a model on the prices.
Data Scraping
Firstly, let's analyse the website. Searching for motorcycles yields overview pages with 20 sub-page links each (two shown below). My plan is to first collect the links to all roughly 24,000 of these sub-pages and store them for later use. This will result in a tibble with one link for each motorcycle available on the website.
Specifically, by inspecting the source code of the website, I found the class needed to get to the hyperlinks hidden behind the title of each motorcycle sub-page. First, I constructed the links of the overview pages by varying the page number from 1 to 1,200. Then, I mapped them onto a function with a request timeout, in order to prevent errors caused by Anibis' servers cutting the connection after too many requests.
# Store main page URLs
pages <- 1:1200

start_urls <- paste0("https://www.anibis.ch/de/c/motorrad-velo-motorraeder?pi=",
                     pages) %>%
  as_tibble() %>%
  rename(url = value)
# Fetch subpage links from the listings on each overview page
listings <- start_urls %>%
  mutate(subpages = map(url, function(.x) {
    return(
      GET(.x, timeout(10)) %>%
        read_html(.) %>%
        html_nodes("[class ='sc-1yo7ctu-0 bRDNul']") %>%
        html_attr('href') %>%
        as_tibble() %>%
        rename(subpage = value) %>%
        mutate(subpage = paste0("https://www.anibis.ch", subpage))
    )
  }))
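A note on rate limiting: timeout(10) only caps how long a single request may take. If the server starts rejecting rapid-fire requests, a short pause between calls also helps. A minimal sketch of how the mapped function could be slowed down (the one-second delay is my own assumption, not tuned):

# Hypothetical variant: pause briefly before each request
polite_get <- function(.x, pause = 1) {
  Sys.sleep(pause)  # wait before hitting the server again
  GET(.x, timeout(10)) %>%
    read_html()
}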
# Extract subpage urls and clean as tibble
subpage_urls <- listings %>%
  select(subpages) %>%
  unnest(subpages)
Now that I have scraped and unnested the links to all sub-pages, I will load the entire content of each sub-page into memory using a for loop, again wrapped in a timeout exception as before. This part takes the longest to run, at a couple of hours.
subpages_content <- vector(mode = "list", length = nrow(subpage_urls))

for (x in 1:nrow(subpage_urls)){
  url_tmp <- subpage_urls[x, "subpage"] %>%
    pull()

  tryCatch(
    subpages_content[[x]] <- url_tmp %>%
      GET(., timeout(90)) %>%
      read_html(.),
    error = function(e){NA}
  )

  print(paste("Link", x, "retrieved"))
}

subpage_content <- tibble(listing = subpages_content)
Having fetched the entire HTML of each sub-listing, I can now extract the interesting information: I use the SelectorGadget to find the right CSS selectors and then map rvest functions onto the stored xml objects in the previously generated tibble, in order to get the information in character format.
motorcycles_raw <- tibble(
  listing_no = 1:nrow(subpage_content),
  listing = subpage_content %>% pull(),
  header = map(listing, function(.x){
    return(.x %>%
             html_nodes(".fauvte") %>%
             html_text())
  }),
  content = map(listing, function(.x){
    return(.x %>%
             html_nodes(".goTXZq") %>%
             html_text())
  }),
  price = map(listing, function(.x){
    return(.x %>%
             html_node(".knSuBJ") %>%
             html_text())
  })
) %>%
  select(-listing) %>%
  unnest(everything()) %>%
  pivot_wider(values_from = content, names_from = header)
After some unnesting, I am left with the raw data:
Rows: 23,974
Columns: 17
$ listing_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
$ price <chr> "CHF 6'500.– / Verhandlungspreis", "CHF 5'500.– / V…
$ Art <chr> "Gegenstand", "Gegenstand", "Gegenstand", "Gegensta…
$ Zustand <chr> "Neuwertig", "Geringe Gebrauchsspuren", "Geringe Ge…
$ `Hubraum (Ccm)` <chr> "500 Cm3", "800 Cm3", "600 Cm3", "1200 Cm3", "50 Cm…
$ Baujahr <dbl> 2014, 2014, 2013, 1989, 2013, 2019, 2019, NA, 2017,…
$ Marke <chr> "YAMAHA", "YAMAHA", "YAMAHA", "YAMAHA", "BETA", "PO…
$ Modell <chr> "450 wrf", "Mt-07", NA, "FJ 1200", "50 RR", "RZR 10…
$ Kilometer <chr> "5'660 km", "41'000 km", "20'000 km", "7'156 km", "…
$ Getriebeart <chr> "Schaltgetriebe", NA, "Schaltgetriebe", "Schaltgetr…
$ Treibstoff <chr> "Benzin", "Benzin", "Benzin", "Benzin", "2-Takt-Gem…
$ Aussenfarbe <chr> "Blau", "Rot", "Schwarz", "Grau", "Rot", NA, "Orang…
$ `Art des Inserats` <chr> "Angebot", "Angebot", "Angebot", "Angebot", "Angebo…
$ `Letzte Änderung` <chr> "05.09.2022", "05.09.2022", "05.09.2022", "05.09.20…
$ Inseratnummer <dbl> 44818239, 44818276, 44818264, 44818210, 44818257, 4…
$ `Ab MFK` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ Antrieb <chr> NA, NA, NA, "Kette", NA, NA, NA, NA, "Kette", "Hint…
Data Cleaning
As most of the columns are not clean yet, I'll continue by parsing numbers, trimming surrounding whitespace, interpreting missing values and renaming variables.
motorcycles <- motorcycles_raw %>%
  mutate(across(c(price, Kilometer), ~ gsub("'", "", .x)),
         across(c(price, Kilometer, `Hubraum (Ccm)`), ~ parse_number(.x)),
         across(c(Marke, Modell, Getriebeart, Treibstoff, Aussenfarbe,
                  Antrieb, Art), ~ trimws(.x)),
         across(`Letzte Änderung`, ~ lubridate::dmy(.x)),
         across(`Ab MFK`, ~ case_when(is.na(.x) ~ "no", .x == "" ~ "yes"))) %>%
  rename(brand = Marke, model = Modell, mileage_km = Kilometer,
         year = Baujahr, transmission = Getriebeart, fuel = Treibstoff,
         colour = Aussenfarbe, last_edited = `Letzte Änderung`,
         id = Inseratnummer, drive = Antrieb, mfk = `Ab MFK`,
         price_chf = price, type = Art, condition = Zustand,
         displacement = `Hubraum (Ccm)`) %>%
  filter(`Art des Inserats` == "Angebot",
         type %in% c("Gegenstand", NA),
         mileage_km < 1e6,
         price_chf < 1e6) %>%
  select(-c(`Art des Inserats`, type))
With that, the clean data set is ready and I have all the available motorcycles on Anibis as of September 2022 loaded into memory, ready to be visualised and modelled.
glimpse(motorcycles)
Rows: 22,432
Columns: 15
$ listing_no <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 16, 17, 18, 20…
$ price_chf <dbl> 6500, 5500, 4000, 1500, 2400, 29000, 5000, 1100, 5300, 10…
$ condition <chr> "Neuwertig", "Geringe Gebrauchsspuren", "Geringe Gebrauch…
$ displacement <dbl> 500, 800, 600, 1200, 50, 1000, 125, 800, 350, 750, 750, 1…
$ year <dbl> 2014, 2014, 2013, 1989, 2013, 2019, 2019, NA, 2017, 2021,…
$ brand <chr> "YAMAHA", "YAMAHA", "YAMAHA", "YAMAHA", "BETA", "POLARIS"…
$ model <chr> "450 wrf", "Mt-07", NA, "FJ 1200", "50 RR", "RZR 1000 tur…
$ mileage_km <dbl> 5660, 41000, 20000, 7156, 14500, 2700, 3600, 60000, 8100,…
$ transmission <chr> "Schaltgetriebe", NA, "Schaltgetriebe", "Schaltgetriebe",…
$ fuel <chr> "Benzin", "Benzin", "Benzin", "Benzin", "2-Takt-Gemisch",…
$ colour <chr> "Blau", "Rot", "Schwarz", "Grau", "Rot", NA, "Orange", NA…
$ last_edited <date> 2022-09-05, 2022-09-05, 2022-09-05, 2022-09-05, 2022-09-…
$ id <dbl> 44818239, 44818276, 44818264, 44818210, 44818257, 4481803…
$ mfk <chr> "no", "no", "yes", "no", "no", "yes", "no", "no", "no", "…
$ drive <chr> NA, NA, NA, "Kette", NA, NA, NA, NA, "Kette", "Hinterrada…
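Since the scrape itself takes hours, it is worth persisting the clean data set to disk at this point, so the analysis can be re-run without re-scraping. A one-liner along these lines would do (the file path is my own, hypothetical choice):

# Cache the clean data set so the scrape doesn't have to be repeated
saveRDS(motorcycles, "data/motorcycles_2022-09.rds")
# Later: motorcycles <- readRDS("data/motorcycles_2022-09.rds")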
Exploratory Data Analysis (EDA)
Before training a model, I will explore the clean data set and gauge relations between variables. Let's first take a look at the frequency of the categorical predictors. I lumped together all levels beyond the \(N=15\) most frequent ones, so that the y axis remains readable.
motorcycles %>%
  select(where(is.character)) %>%
  select(-c(model)) %>%
  pivot_longer(everything()) %>%
  drop_na() %>%
  group_by(name) %>%
  mutate(value = fct_lump(value, n = 15)) %>%
  count(value) %>%
  mutate(value = reorder_within(value, n, name)) %>%
  ggplot(aes(n, value)) +
  geom_col(fill = "midnightblue", alpha = 0.8) +
  facet_wrap(~ name, scales = "free", ncol = 4) +
  labs(title = "Frequency Of Used Motorcycle Properties on anibis.ch",
       subtitle = "Sample Size = 22432 | Data as of 09/22",
       y = NULL,
       x = "Count") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_reordered() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        panel.grid.major.y = element_blank(),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The most common brands are Yamaha, Honda, Kawasaki, Harley Davidson and BMW. Most motorcycles are black, grey or silver - neutral colours. Most motorcycles on Anibis are in used condition. Virtually all run on petrol rather than diesel and have a manual gearbox. There is also quite a significant number of motorcycles without MFK (the Swiss roadworthiness test).
Next up, let’s inspect the numerical variables:
motorcycles %>%
  select(where(is.numeric)) %>%
  select(-c(listing_no, id)) %>%
  pivot_longer(everything()) %>%
  drop_na() %>%
  ggplot(aes(value)) +
  stat_ecdf() +
  facet_wrap(~ name, scales = "free") +
  labs(title = "Cumulative Distribution Of Used Motorcycle Characteristics on anibis.ch",
       subtitle = "Sample Size = 22432 | Data as of 09/22",
       y = NULL,
       x = NULL) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        panel.grid.major.y = element_blank())
Displacement is surprisingly close to uniformly distributed, though very high values are a bit less frequent and medium-sized motorcycles are most common. The steep steps almost make it look like a categorical variable. Mileage and price suffer from large outliers, which in the former case might be faulty data. Price would benefit from a log scale. Lastly, most motorcycles (about 90%) are not older than 20 years.
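As a quick sanity check (my own addition, not part of the original analysis), plotting the price ECDF on a log scale makes the distribution much easier to read:

motorcycles %>%
  drop_na(price_chf) %>%
  ggplot(aes(price_chf)) +
  stat_ecdf() +
  scale_x_log10(labels = scales::comma_format(suffix = " CHF")) +
  labs(x = "Price (log scale)", y = "Cumulative share") +
  theme_bw()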
Some Interesting Plots
Before digging into relations with the target variable and training a model, let's first draw a few charts purely for fun.
motorcycles %>%
  select(brand, displacement) %>%
  drop_na() %>%
  mutate(brand = fct_lump(brand, n = 15),
         brand = fct_reorder(brand, displacement, .desc = TRUE)) %>%
  ggplot(aes(displacement, brand)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Displacement Of Used Motorcycles on anibis.ch",
       subtitle = "Sample Size: N = 5987 | as of 09/22 | excluding outliers",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " ccm")) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
Interestingly, the less frequent brands lumped into "Other" sell smaller motorcycles; these brands likely target the budget range. Chopper makers have huge displacements, central European producers have large median displacements, and Asian producers like Honda sit a bit lower.
motorcycles %>%
  select(brand, price_chf) %>%
  drop_na() %>%
  mutate(brand = fct_lump(brand, n = 20),
         brand = fct_reorder(brand, price_chf, .desc = TRUE)) %>%
  ggplot(aes(price_chf, brand)) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Used Motorcycle Prices on anibis.ch",
       subtitle = "sample size: n = 21242 | as of 09/22 | excluding outliers",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF")) +
  coord_cartesian(xlim = c(0, 40000)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
There are quite substantial price differences between motorcycle producers. Choppers are very expensive, while some lesser-known budget options have medians below 5,000 CHF.
motorcycles %>%
  select(year, price_chf, brand) %>%
  drop_na() %>%
  mutate(age = 2023 - year) %>%
  filter(age < 20,
         age > 0,
         fct_lump(brand, n = 10) != "Other") %>%
  group_by(brand, age) %>%
  summarise(median_price = median(price_chf)) %>%
  mutate(change = median_price/first(median_price)) %>%
  select(age, brand, change) %>%
  ggplot(aes(age, change, colour = brand)) +
  geom_point() +
  geom_line() +
  geom_hline(yintercept = 0.5, lty = "dashed", colour = "grey50") +
  facet_wrap(~ brand) +
  expand_limits(y = 0) +
  labs(title = "Used Motorcycle Price Change By Age on anibis.ch",
       subtitle = "sample size: n = 21242 | as of 09/22 | Age 1 is 100%",
       x = "Vehicle Age",
       y = "Price (as %) compared to age = 1") +
  scale_y_continuous(labels = percent_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"),
        legend.position = "none")
This chart depicts the value of motorcycles of a given brand at each age up to 20 years. It is not possible to draw conclusions about the depreciation of a defined model group, given the heterogeneity of the available models in each age class across brands. However, assuming flat or linearly increasing median prices across each brand's offering, we can approximately gauge price stability: the later a curve crosses the 50% line, the more stable the prices. Harley Davidson, Kawasaki, Yamaha and Honda hold their value very well. In contrast, BMW, Aprilia and Ducati seem to suffer greater price declines with age.
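To make the 50% crossing explicit, one could compute a rough "price half-life" per brand - the first age at which the median price falls below half of its age-1 value. A sketch under the same filters as the chart above (my own addition; brands that never cross return Inf):

motorcycles %>%
  select(year, price_chf, brand) %>%
  drop_na() %>%
  mutate(age = 2023 - year) %>%
  filter(age > 0, age < 20, fct_lump(brand, n = 10) != "Other") %>%
  group_by(brand, age) %>%
  summarise(median_price = median(price_chf), .groups = "drop_last") %>%
  arrange(brand, age) %>%
  mutate(change = median_price / first(median_price)) %>%
  summarise(half_life_age = suppressWarnings(min(age[change < 0.5])))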
motorcycles %>%
  select(condition, price_chf) %>%
  drop_na() %>%
  ggplot(aes(price_chf, condition %>% fct_reorder(price_chf))) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Used Motorcycle Prices on anibis.ch",
       subtitle = "sample size: n = 21242 | as of 09/22 | excluding outliers",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF")) +
  coord_cartesian(xlim = c(0, 21000)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
There is not much to say about this one, except that new motorcycles cost more than used ones and worse condition implies lower prices, which is intuitive.
motorcycles %>%
  add_count(colour) %>%
  filter(n > 500) %>%
  select(colour, price_chf, n) %>%
  drop_na() %>%
  mutate(colour = paste0(colour, " (n=", n, ")")) %>%
  ggplot(aes(price_chf, colour %>% fct_reorder(price_chf))) +
  geom_boxplot(outlier.colour = NA) +
  labs(title = "Used Motorcycle Prices on anibis.ch",
       subtitle = "sample size: n = 21242 | as of 09/22 | excluding outliers",
       x = NULL,
       y = NULL) +
  scale_x_continuous(labels = scales::comma_format(suffix = " CHF")) +
  coord_cartesian(xlim = c(0, 35000)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
Interestingly, there is a relationship between the price and the colour of a motorcycle.
motorcycles %>%
  select(brand, colour) %>%
  drop_na() %>%
  group_by(brand = fct_lump(brand, n = 10)) %>%
  count(colour) %>%
  filter(n > 15) %>%
  ggplot(aes(y = colour %>% reorder_within(by = n, within = brand),
             x = n)) +
  geom_col() +
  facet_wrap(~ brand, scales = "free") +
  labs(title = "Motorcycle Colours By Brand on anibis.ch",
       subtitle = "sample size: n = 21242 | as of 09/22 | showing most frequent colours and brands",
       x = "Count",
       y = NULL) +
  scale_y_reordered() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
Colours reveal the design identities of certain brands: Kawasaki Green, Ducati Red, Suzuki Blue and KTM Orange.
Relations With The Target Variable
Let's now inspect the relation of the predictors with the target variable. Intuitively, it makes sense that mileage, age and displacement have the strongest relation with price. Additionally, as seen before, condition and brand also play an important role. Let's look at these three numeric predictors, excluding outliers.
set.seed(12)
motorcycles %>%
  sample_n(1000) %>%
  mutate(age = 2022 - year) %>%
  select(price_chf, mileage_km, displacement, age) %>%
  drop_na() %>%
  filter(mileage_km < 100000,
         age < 40) %>%
  pivot_longer(-price_chf) %>%
  ggplot(aes(value, price_chf)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = F) +
  facet_wrap(~ name, scales = "free") +
  scale_y_continuous(labels = comma_format(suffix = " CHF")) +
  theme_bw()
Clearly, the intuition holds. However, not all of the variation is explained and the relationships are not as strict as expected. Presumably, there is a lot of information in the brands as well as the model names. Let's look at that. Extracting the model specifications from the titles by tokenisation and then removing punctuation and numbers, I calculate the median price for the 30 most frequent model-name tokens. What becomes apparent is the disparity in median prices between model names. Clearly, tokenisation is a way for the model to achieve greater accuracy.
motorcycles %>%
  unnest_tokens(input = model, output = "tokens") %>%
  select(tokens, price_chf) %>%
  drop_na() %>%
  mutate(tokens = tokens %>%
           tm::removeNumbers() %>%
           tm::removePunctuation()) %>%
  filter(tokens != "") %>%
  mutate(tokens = fct_lump(tokens, n = 30)) %>%
  group_by(tokens) %>%
  summarize(median_price = median(price_chf),
            n = n()) %>%
  ungroup() %>%
  ggplot(aes(median_price, tokens %>% fct_reorder(median_price))) +
  geom_col(fill = "midnightblue") +
  labs(title = "Motorcycle Model Names and Median Prices on anibis.ch",
       subtitle = "sample size: n = 21242 | as of 09/22 | showing 30 most frequent tokens",
       x = "Median Price",
       y = NULL) +
  scale_x_continuous(labels = comma_format(suffix = " CHF")) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", size = 10,
                                     colour = "grey50"))
Fitting A Model
First, the data is split into training and testing sets. Additionally, five-fold cross-validation is employed for reliable calculation of performance metrics, bearing time efficiency in mind.
motorcycles <- motorcycles %>%
  mutate(across(where(is.character), as.factor))

dt_split <- initial_split(motorcycles)

dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 5)
A recipe in the tidymodels framework makes it very straightforward to bundle all feature engineering steps in one place, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit.
model_rec <- recipe(price_chf ~ .,
                    data = dt_train) %>%
  step_rm(listing_no, id, last_edited) %>%
  step_impute_median(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>%
  step_tokenize(model) %>%
  step_stopwords(model) %>%
  step_tokenfilter(model, max_tokens = 300) %>%
  step_tf(model) %>%
  step_novel(all_nominal_predictors()) %>%
  step_other(all_nominal_predictors(), threshold = 0.01) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors())
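To sanity-check what the recipe produces before tuning, it can be prepped and baked on the training data - a quick check I'd recommend, though it is not part of the original post:

# Inspect the engineered feature matrix the models will actually see
model_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()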
Setting up the model specifications with tuning options for hyperparameters:
gb_spec <-
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("regression")

en_spec <- linear_reg(penalty = tune(),
                      mixture = tune()) %>%
  set_engine("glmnet")
Specifying the workflows by adding the model specifications and the recipes:
gb_wflow <- workflow() %>%
  add_recipe(model_rec) %>%
  add_model(gb_spec)

en_wflow <- workflow() %>%
  add_recipe(model_rec) %>%
  add_model(en_spec)
Setting up the space-filling design tuning grids for hyperparameters of both models:
gb_grid <-
  grid_latin_hypercube(
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), dt_train),
    learn_rate(),
    size = 20
  )

en_grid <-
  grid_latin_hypercube(
    penalty(),
    mixture(),
    size = 50
  )
Tuning hyperparameters:
# Gradient Boosting
start_time = Sys.time()

# Clean up any parallel backend registrations left behind by foreach
unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

gb_tune <- tune_grid(object = gb_wflow,
                     resamples = folds,
                     grid = gb_grid,
                     control = control_grid(save_pred = TRUE,
                                            save_workflow = TRUE))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time
Time difference of 7.960418 mins
# Elastic Net
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

en_tune <- tune_grid(object = en_wflow,
                     resamples = folds,
                     grid = en_grid,
                     control = control_grid(save_pred = TRUE,
                                            save_workflow = TRUE))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time
Time difference of 1.449482 mins
Looking at the tuning results reveals that the models capture strong signal in the predictors, as the \(R^2\) is fairly high. In this case, though, gradient boosting works better than the elastic net.
gb_tune %>%
  show_best(metric = "rsq") %>%
  transmute(model = "Gradient Boosting", .metric, mean, n, std_err)
# A tibble: 5 × 5
model .metric mean n std_err
<chr> <chr> <dbl> <int> <dbl>
1 Gradient Boosting rsq 0.586 5 0.0174
2 Gradient Boosting rsq 0.564 5 0.0173
3 Gradient Boosting rsq 0.559 5 0.0186
4 Gradient Boosting rsq 0.497 5 0.0156
5 Gradient Boosting rsq 0.485 5 0.0152
en_tune %>%
  show_best(metric = "rsq") %>%
  transmute(model = "Elastic Net", .metric, mean, n, std_err)
# A tibble: 5 × 5
model .metric mean n std_err
<chr> <chr> <dbl> <int> <dbl>
1 Elastic Net rsq 0.534 5 0.0227
2 Elastic Net rsq 0.534 5 0.0227
3 Elastic Net rsq 0.534 5 0.0227
4 Elastic Net rsq 0.534 5 0.0227
5 Elastic Net rsq 0.534 5 0.0227
gb_final_wflow <- gb_wflow %>%
  finalize_workflow(select_best(gb_tune, metric = "rmse"))

gb_final_fit <- gb_final_wflow %>%
  last_fit(dt_split)
gb_final_fit %>%
  extract_workflow() %>%
  extract_fit_parsnip() %>%
  vi() %>%
  slice_max(order_by = Importance, n = 20) %>%
  ggplot(aes(Importance, reorder(Variable, Importance))) +
  geom_col(fill = "midnightblue", colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the most important predictors are shown.",
       y = "Predictor",
       x = "Relative Variable Importance") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
en_final_wflow <- en_wflow %>%
  finalize_workflow(select_best(en_tune, metric = "rsq"))

en_final_fit <- en_final_wflow %>%
  last_fit(dt_split)
en_final_fit %>%
  extract_workflow() %>%
  extract_fit_parsnip() %>%
  vi() %>%
  slice_max(order_by = Importance, n = 30) %>%
  mutate(Importance = ifelse(Sign == "NEG", Importance * -1, Importance)) %>%
  ggplot(aes(Importance, reorder(Variable, Importance),
             fill = Sign)) +
  geom_col(colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the most important predictors are shown.",
       y = "Predictor",
       x = "Coefficient") +
  ggsci::scale_fill_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")
From the variable importance, it becomes clear that
- brand names are important for prices;
- information on the model name is vital for prediction;
- newer models tend to be more expensive;
- high mileage decreases the price;
- displacement and price correlate positively.
Evaluating Model Performance On The Holdout Data
The out-of-sample fit is fairly good. Naturally, with prices being set by humans, some error must remain even for comparable bikes. Notably, the model fails to predict prices at the far right of the price distribution. Assuming that these motorcycle models are rare and potentially occur for the first time in the holdout set, it is natural that the model is not very precise on them.
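For reference, a minimal sketch of how the holdout fit can be inspected with tidymodels (my own addition; coord_obs_pred() keeps both axes on the same scale):

# Holdout metrics and predicted-vs-actual plot for the final fit
gb_final_fit %>%
  collect_metrics()

gb_final_fit %>%
  collect_predictions() %>%
  ggplot(aes(price_chf, .pred)) +
  geom_point(alpha = 0.3) +
  geom_abline(lty = "dashed", colour = "grey50") +
  coord_obs_pred() +
  labs(x = "Actual Price (CHF)", y = "Predicted Price (CHF)") +
  theme_bw()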
I hope this post has been interesting to you. If you have constructive feedback or want to discuss this or a related topic, feel free to reach out.
Thank you for reading.
A work by Mathias Steilen