Mathias Steilen - Using Classification Models and Gain Curves For Profit Maximisation

Abstract

This post shows how customer churn classification models can be set up, trained and used to create financial value within organisations, specifically using gain curves and classification thresholds.

If you are only interested in the application to a business setting, feel free to use the table of contents to skip to the last section.

What are we looking at?

The data comes from Kaggle, the biggest online platform for machine learning enthusiasts hosting datasets and competitions around data science. More precisely, the data set was part of season 1 episode 7 of SLICED, a data science competition streamed by Nick Wan and Meg Risdal on Twitch.

This dataset is about retail bank customer churn. The purpose of this post is to demonstrate how classification models can be set up, trained and used to create financial value within organisations, specifically using gain curves.

Data Cleaning

Firstly, I start by loading the data. The first file is the one to be used for training, whereas the holdout will only be used for submission of out-of-sample predictions, as it doesn’t contain the target variable. The training data is rather small, but still acceptable, containing information on 7,088 customers with 14 predictors and the binary response variable of their churn status.

names(data)

 [1] "id"                       "attrition_flag"          
 [3] "customer_age"             "gender"                  
 [5] "education_level"          "income_category"         
 [7] "total_relationship_count" "months_inactive_12_mon"  
 [9] "credit_limit"             "total_revolving_bal"     
[11] "total_amt_chng_q4_q1"     "total_trans_amt"         
[13] "total_trans_ct"           "total_ct_chng_q4_q1"     
[15] "avg_utilization_ratio"

It is nice to see that there is no missing data in the variables. Therefore, no imputation steps will have to be done in the model recipe.

colMeans(is.na(data)) %>% 
  tidy() %>% 
  rename(pct = x) %>% 
  mutate(names = fct_reorder(names, pct)) %>% 
  ggplot(aes(pct, names)) +
  geom_col(fill = "midnightblue") +
  labs(title = "Missing Data In Variables",
       subtitle = "Percent missingness calculated for each column",
       y = NULL,
       x = NULL) +
  scale_x_continuous(labels = scales::percent_format()) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))

Converting the necessary columns to factors:

data <- data %>% 
  mutate(across(c(where(is.character),attrition_flag), as.factor))

holdout <- holdout %>% 
  mutate(across(c(where(is.character)), as.factor))

Exploratory Data Analysis

The next step is to walk through the available predictors and understand relations to the target variable. Below, every variable is briefly looked at and presented, enabling a better understanding of the complete training data.

‘attrition_flag’: whether the customer is churned (0 = no; 1 = yes)

Firstly, the target variable must be inspected to judge class imbalance, which will potentially have to be rectified, as well as releveling, if the factor levels are the wrong way around.

data %>% 
  count(attrition_flag) %>% 
  mutate(pct = n/sum(n))

# A tibble: 2 × 3
  attrition_flag     n   pct
  <fct>          <int> <dbl>
1 0               5956 0.840
2 1               1132 0.160

There is a class imbalance in the target variable in favour of the negative case. In order to avoid bias in the final model, the classes will be rebalanced in the recipe with step_smote, which creates new cases for the minority class with nearest neighbours.

levels(data$attrition_flag)

[1] "0" "1"

As suspected, the negative case is case 1. Tidymodels treats the first factor level in binary classifications as the positive case, so these will have to be switched around.

data <- data %>% 
  mutate(attrition_flag = fct_rev(attrition_flag))

levels(data$attrition_flag)

[1] "1" "0"

Now we should be good to go.

‘id’: unique identifier for the customer

The ID column stores a unique integer for each customer, so there should not be predictive power in it that holds for new customers.

‘customer_age’: age of the customer

The age data looks almost normally distributed, with some peaks at both ends. Colouring by gender reveals that both groups are approximately distributed the same.

data %>% 
  ggplot(aes(customer_age, fill = gender)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.4) +
  labs(title = "Distribution Of Customer Age",
       subtitle = NULL,
       y = "Frequency",
       x = NULL) +
  scale_y_continuous(labels = scales::comma_format()) +
  ggsci::scale_fill_jama() +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

At this stage, I’ll make a function to summarise the probability of churning by group for later use.

summarise_churn <- function(tbl, category){
  tbl %>% 
    group_by({{ category }}) %>% 
    summarise(n = n(),
              prob_churning = mean(attrition_flag == 1)) %>% 
    arrange(-prob_churning)
}

data %>% 
  group_by(customer_age = round(customer_age/5)*5) %>% 
  summarise_churn(customer_age) %>% 
  filter(n > 30) %>% 
  ggplot(aes(customer_age, prob_churning, size = n)) +
  geom_point(colour = "midnightblue") + 
  labs(title = "Relation: Customer Age And Probability Of Churning",
       subtitle = NULL,
       y = "Probability of Churning",
       x = "Customer Age") +
  scale_y_continuous(labels = scales::percent_format()) +
  ggsci::scale_fill_jama() +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

Rounding customer ages to the nearest five and plotting it against the observed probability of churning reveals a non-linear pattern, which shows an increasing trend towards the age of 55, which reverses beyond that threshold.

‘gender’: gender of the customer

Gender groups are distributed evenly.

data %>% 
  group_by(gender) %>% 
  summarise(n = n(),
            prob_churning = mean(attrition_flag == 1))

# A tibble: 2 × 3
  gender     n prob_churning
  <fct>  <int>         <dbl>
1 F       3714         0.174
2 M       3374         0.144

The observed data shows women churning at higher rates than men, though the absolute difference is rather small.

‘education_level’: education level of the customer

data %>% 
  summarise_churn(education_level)

# A tibble: 7 × 3
  education_level     n prob_churning
  <fct>           <int>         <dbl>
1 Doctorate         326         0.212
2 Post-Graduate     343         0.195
3 Unknown          1083         0.165
4 Uneducated       1038         0.160
5 Graduate         2212         0.156
6 High School      1378         0.154
7 College           708         0.134

Though there is a class imbalance, the churn rates for people with higher degrees, that is PhDs or Master’s, are higher.

‘income_category’: income range of the customer

data %>% 
  summarise_churn(income_category)

# A tibble: 6 × 3
  income_category     n prob_churning
  <fct>           <int>         <dbl>
1 $120K +           521         0.177
2 Unknown           772         0.174
3 Less than $40K   2467         0.171
4 $80K - $120K     1081         0.155
5 $40K - $60K      1223         0.154
6 $60K - $80K      1024         0.126

The churn probabilities of the income range does not show a linear trend. High earners and low earners are the most likely to churn whereas the midfield is lowest.

‘total_relationship_count’: number of relationships

I am not entirely sure what “relationship” means here because there is no further documentation on it. Assuming the variable describes dependents and plotting it against the target variable probabilities for each bucket, it appears like there is a negative correlation. Hence, a higher number of dependents is associated with a lower probability of changing banks.

data %>% 
  ggplot(aes(total_relationship_count)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.6,
                 fill = "midnightblue", colour = "white") +
  labs(title = "Distribution Of Relationship Count",
       subtitle = NULL,
       y = "Frequency",
       x = NULL) +
  scale_y_continuous(labels = scales::comma_format()) +
  ggsci::scale_fill_jama() +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

data %>% 
  summarise_churn(total_relationship_count) %>% 
  ggplot(aes(y = prob_churning, x = total_relationship_count, size = n)) +
  geom_point(colour = "midnightblue") +
  expand_limits(x = 0, y = 0) +
  labs(title = "Correlation Of Relationship Count And Churn Probability",
       subtitle = NULL,
       y = "Churn Probability",
       x = "Total Relationships") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘months_inactive_12_mon’: number of months the customer is inactive in the past 12 months

Interestingly, customers inactivity is positively correlated with churning up until 4 months, where the trend reverses. This has to be taken with a grain of salt, as the number of observations below 1 month and above 4 months is sparse, so it might well be noise. We’ll see, how important this variable turns out to be in classifying customers correctly.

data %>% 
  summarise_churn(months_inactive_12_mon) %>% 
  ggplot(aes(y = prob_churning, x = months_inactive_12_mon, size = n)) +
  geom_point(colour = "midnightblue") +
  expand_limits(x = 0) +
  labs(title = "Correlation Of Months Inactive And Churn Probability",
       subtitle = NULL,
       y = "Churn Rate") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘credit_limit’: the customer’s credit limit

It looks like low credit limits lead to higher churn probabilities than higher credit limits, though the relation does not look linear.

data %>% 
  mutate(credit_limit = round(credit_limit/500)*500) %>% 
  summarise_churn(credit_limit) %>% 
  ggplot(aes(credit_limit, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  geom_smooth(se = F, method = "loess") +
  labs(title = "Correlation Of Credit Limit And Churn Probability",
       subtitle = "X variable was rounded to the nearest USD 500",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘total_revolving_bal’: the customer’s total revolving balance

Low total revolving balance seems to correlate highly with churning, as well as the highest percentiles. Seen the other way around, the area excluding customers on the very far tail ends of the revolving balances exhibits very low levels of churning, which will most likely lead to the variable being useful in the models.

data %>% 
  group_by(total_revolving_bal = round(total_revolving_bal/100)*100) %>% 
  summarise_churn(total_revolving_bal) %>% 
  ggplot(aes(y = prob_churning, x = total_revolving_bal, size = n)) +
  geom_point(colour = "midnightblue") +
  expand_limits(x = 0) +
  labs(title = "Correlation Of Total Revolving Balance And Churn Probability",
       subtitle = "Total revolving balances were rounded to the nearest 100",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘total_amt_chng_q4_q1’: the amount the balance changed from Q4 to Q1

Negative changes of customer balances, that is values below 100%, are strongly associated with churning. The change is very likely a byproduct of the changing of banks process, where customers shift their assets to another bank before closing the accounts. Therefore, strong declines in the balances are often a strong indicator of the intention to leave the bank. Vice-versa, for clients with strong increases of their bank balance, e.g. 1.5 to 3 fold, the churning rates are very low. This variable will also be of high importance for the algorithms.

data %>% 
  mutate(total_amt_chng_q4_q1 = round(total_amt_chng_q4_q1/0.1)*0.1) %>% 
  summarise_churn(total_amt_chng_q4_q1) %>% 
  ggplot(aes(total_amt_chng_q4_q1, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  geom_smooth(se = F, method = "loess") +
  labs(title = "Correlation Of Balance Change And Churn Probability",
       subtitle = "X variable was rounded to the nearest 10%.",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘total_trans_amt’: the value of all the customer’s transactions in the period

The shape of this relation with the target variable looks very peculiar. Up until USD 7,500, the data is a mess and no clear direction is discernible. Then, leading up to USD 10,000, the churning rates are unbelievably high and immediately plummet to virtually zero beyond that threshold. I am unsure whether this has to with my own ignorance about a phenomenon in the banking world that leads to this pattern, but I would almost suggest a closer look at the source of this variable, in order to exclude data errors. In any case, it will be included, but I am tempted to refrain from using it for predictions.

data %>% 
  mutate(total_trans_amt = round(total_trans_amt/500)*500) %>% 
  summarise_churn(total_trans_amt) %>% 
  ggplot(aes(total_trans_amt, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  geom_smooth(se = F, method = "loess") +
  labs(title = "Correlation Of Total Transaction Amounts And Churn Probability",
       subtitle = "X variable was rounded to the nearest USD 500.",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘total_trans_ct’: the number of all of the customer’s transactions

Now this variable shows a less confusing pattern. Customers below about 60 total transactions exhibit very high attrition rates as opposed to other customers. They use their cards/accounts a little, but not very often. From the transaction amount and the transaction count, it will be possible to generate an average transaction size, which will potentially reveal some additional insights.

data %>% 
  mutate(total_trans_ct = round(total_trans_ct/5)*5) %>% 
  summarise_churn(total_trans_ct) %>% 
  ggplot(aes(total_trans_ct, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  labs(title = "Correlation Of Total Transaction Count And Churn Probability",
       subtitle = "X variable was rounded to the nearest 5.",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘total_ct_chng_q4_q1’: the difference in number of the customer’s transactions from Q4 to Q1

Customers with strongly decreasing total transaction count are most likely to churn. This variable shows a similar patterns as the change in account balances, which is an indication of shifting the use to a competitor’s offering.

data %>% 
  mutate(total_ct_chng_q4_q1 = round(total_ct_chng_q4_q1/0.2)*0.2) %>% 
  summarise_churn(total_ct_chng_q4_q1) %>% 
  ggplot(aes(total_ct_chng_q4_q1, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  labs(title = "Correlation Of Total Transaction Count And Churn Probability",
       subtitle = "X variable was rounded to the nearest 20%.",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

‘avg_utilization_ratio’: the customer’s average utilization ratio during the period

Customers with virtually no utilisation are likely to change banks, which also holds for customers with very high utilisation. Everything in between is moderate.

data %>% 
  mutate(avg_utilization_ratio = round(avg_utilization_ratio/0.1)*0.1) %>% 
  summarise_churn(avg_utilization_ratio) %>% 
  ggplot(aes(avg_utilization_ratio, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  labs(title = "Correlation Of Total Transaction Count And Churn Probability",
       subtitle = "X variable was rounded to the nearest 20%.",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::percent_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

avg_transaction_size: The average size of each transaction over the observed period

Using total transaction count, this variable can be created. From the looks of it, there is a non-linear relation with the target variable, but it remains to be seen whether the variable will actually be useful in the model.

data %>% 
  mutate(avg_transaction_size = total_trans_amt/total_trans_ct) %>% 
  mutate(avg_transaction_size = round(avg_transaction_size/5)*5) %>% 
  summarise_churn(avg_transaction_size) %>% 
  ggplot(aes(avg_transaction_size, prob_churning)) +
  geom_point(aes(size = n), colour = "midnightblue") +
  labs(title = "Correlation Of Total Transaction Count And Churn Probability",
       subtitle = "X variable was rounded to the nearest 20%.",
       y = "Churn Rate") +
  scale_x_continuous(labels = scales::dollar_format()) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_size_continuous(labels = scales::comma_format()) +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "right")

Having investigated the data set in closer detail and having learnt a bit about the relation of the variables with the target, the models can now be specified and accordingly trained.

Building And Training The Models

First, the data is split into training and testing sets. Also, three-fold cross validation is employed for reliable calculation of performance metrics, bearing in mind time efficiency.

dt_split <- data %>% 
  initial_split(strata = attrition_flag)

dt_train <- training(dt_split)
dt_test <- testing(dt_split)

folds <- vfold_cv(dt_train, v = 3, strata = attrition_flag)

The recipe in the tidymodels framework makes it very straightforward to include all feature engineering in one step, preventing data leakage from the test set and uniformly applying the same steps to the holdout in the final fit. As mentioned in the EDA, the class imbalance will be dealt with using step_smote from the themis package. It samples new observations for the minority class using nearest neighbours and is very simple and fast to put to use.

gb_rec <- recipe(attrition_flag ~ .,
                 data = dt_train) %>%
  step_rm(id) %>%
  step_mutate(avg_transaction_size = total_trans_amt/total_trans_ct) %>% 
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_zv(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_smote(attrition_flag, skip = TRUE)

knn_rec <- recipe(attrition_flag ~ .,
                  data = dt_train) %>%
  step_rm(id) %>%
  step_mutate(avg_transaction_size = total_trans_amt/total_trans_ct) %>% 
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_zv(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_smote(attrition_flag, skip = TRUE)

log_rec <- recipe(attrition_flag ~ .,
                  data = dt_train) %>%
  step_rm(id) %>%
  step_mutate(avg_transaction_size = total_trans_amt/total_trans_ct) %>% 
  step_novel(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_numeric_predictors()) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_smote(attrition_flag, skip = TRUE)

Setting up the model specifications with tuning options for hyperparameters:

gb_spec <- 
  boost_tree(
    trees = 1000,
    tree_depth = tune(),
    min_n = tune(),
    loss_reduction = tune(),
    sample_size = tune(),
    mtry = tune(),
    learn_rate = tune()
  ) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("classification")

knn_spec <- nearest_neighbor(neighbors = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")

log_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet") %>% 
  set_mode("classification")

In the model specification, you can specify the variable importance, which is calculated based on impurity in this case. Proceeding with setting up the workflow:

gb_wflow <- 
  workflow() %>% 
  add_recipe(gb_rec) %>% 
  add_model(gb_spec)

knn_wflow <-
  workflow() %>% 
  add_recipe(knn_rec) %>% 
  add_model(knn_spec)

log_wflow <- workflow() %>% 
  add_recipe(
    log_rec,
    blueprint = hardhat::default_recipe_blueprint(allow_novel_levels = TRUE)
  ) %>% 
  add_model(log_spec)

Next up is setting up a space-filling design for time-efficient hyperparameter tuning. The latter is not required for KNN, as there is only hyperparameter to be tuned in this case.

gb_grid <- 
  grid_latin_hypercube(
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), dt_train),
    learn_rate(),
    size = 50
  )

knn_grid <- 
  grid_regular(neighbors(), levels = 10)

Now, the hyperparameters can be trained with parallel computing in order to utilise more available computing power.

# Gradient Boosting
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list=ls(name=env), pos=env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

gb_tune <- tune_grid(object = gb_wflow,
                     resamples = folds,
                     grid = gb_grid,
                     control = control_grid(save_pred = TRUE,
                                            save_workflow = TRUE),
                     metrics = metric_set(roc_auc, accuracy, sensitivity, 
                                          specificity, mn_log_loss))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time

# KNN
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list=ls(name=env), pos=env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

knn_tune <- tune_grid(object = knn_wflow,
                      resamples = folds,
                      grid = knn_grid,
                      control = control_grid(save_pred = TRUE,
                                             save_workflow = TRUE),
                      metrics = metric_set(roc_auc, accuracy, sensitivity, 
                                           specificity, mn_log_loss))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time

# Logistic Regression
start_time = Sys.time()

unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list=ls(name=env), pos=env)
}

cl <- makePSOCKcluster(6)
registerDoParallel(cl)

log_tune <- tune_grid(object = log_wflow,
                      resamples = folds,
                      grid = grid_latin_hypercube(mixture(), penalty(), 
                                                  size = 50),
                      control = control_grid(save_pred = TRUE,
                                             save_workflow = TRUE),
                      metrics = metric_set(roc_auc, accuracy, sensitivity, 
                                           specificity, mn_log_loss))

stopCluster(cl)
unregister_dopar()

end_time = Sys.time()
end_time - start_time

gb_tune %>% 
  show_best(metric = "sensitivity") %>%
  transmute(model = "XGBoost", .metric, mean, n, std_err) %>% 
  head(3)

# A tibble: 3 × 5
  model   .metric      mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 XGBoost sensitivity 1         3 0      
2 XGBoost sensitivity 0.986     3 0.00204
3 XGBoost sensitivity 0.953     3 0.00623

gb_tune %>% 
  show_best(metric = "specificity") %>%
  transmute(model = "XGBoost", .metric, mean, n, std_err) %>% 
  head(3)

# A tibble: 3 × 5
  model   .metric      mean     n std_err
  <chr>   <chr>       <dbl> <int>   <dbl>
1 XGBoost specificity 0.976     3 0.00404
2 XGBoost specificity 0.964     3 0.00437
3 XGBoost specificity 0.955     3 0.00331

knn_tune %>% 
  show_best(metric = "sensitivity") %>%
  transmute(model = "KNN", .metric, mean, n, std_err) %>% 
  head(3)

# A tibble: 3 × 5
  model .metric      mean     n std_err
  <chr> <chr>       <dbl> <int>   <dbl>
1 KNN   sensitivity 0.718     3  0.0337
2 KNN   sensitivity 0.715     3  0.0371
3 KNN   sensitivity 0.711     3  0.0356

knn_tune %>% 
  show_best(metric = "specificity") %>%
  transmute(model = "KNN", .metric, mean, n, std_err) %>% 
  head(3)

# A tibble: 3 × 5
  model .metric      mean     n std_err
  <chr> <chr>       <dbl> <int>   <dbl>
1 KNN   specificity 0.910     3 0.00641
2 KNN   specificity 0.910     3 0.00641
3 KNN   specificity 0.910     3 0.00641

The initial training results on the training data are promising. The best XGBoost models showed near perfect sensitivity and specificity. KNN was able to get very good specificity, however it lacked on the sensitivity front and falsely identified some churned customers as non-churned. Overall, the XGBoost model stood out as the better model, individually.

bind_rows(
  gb_tune %>% 
    collect_metrics() %>% 
    filter(!.metric %in% c("mn_log_loss")) %>% 
    mutate(model = "XGBoost"),
  knn_tune %>% 
    collect_metrics() %>% 
    filter(!.metric %in% c("mn_log_loss")) %>% 
    mutate(model = "KNN"),
  log_tune %>% 
    collect_metrics() %>% 
    filter(!.metric %in% c("mn_log_loss")) %>% 
    mutate(model = "Logistic Regression")
) %>% 
  ggplot(aes(.metric, mean, colour = .config)) +
  geom_jitter(show.legend = F, width = 0.2, alpha = 0.4) +
  facet_wrap(~ model) +
  labs(title = "Hyperparameter Tuning Results",
       subtitle = "Colours indicate different parameter combinations",
       x = NULL,
       y = NULL) +
  scale_y_continuous(labels = scales::percent_format(),
                     limits = c(0,1)) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))

Let’s take a look at the variable importance within the XGBoost model. Unfortunately, KNN does not provide a variable importance metric.

gb_final_fit <- gb_wflow %>%
  finalize_workflow(select_best(gb_tune, metric = "roc_auc")) %>% 
  last_fit(dt_split)

gb_final_fit %>%
  pluck(".workflow", 1) %>% 
  extract_fit_parsnip() %>% 
  vi() %>%
  slice_max(order_by = Importance, n = 100) %>% 
  ggplot(aes(Importance, reorder(Variable, Importance))) +
  geom_col(fill = "midnightblue", colour = "white") +
  labs(title = "Variable Importance",
       subtitle = NULL,
       y = "Predictor",
       x = "Relative Variable Importance") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))

log_final_fit <- log_wflow %>% 
  finalize_workflow(select_best(log_tune, metric = "roc_auc")) %>% 
  last_fit(dt_split)

log_final_fit %>%
  extract_workflow() %>% 
  extract_fit_parsnip() %>%
  vi() %>% 
  slice_max(order_by = Importance, n = 30) %>% 
  mutate(Importance = ifelse(Sign == "NEG", Importance * -1, Importance)) %>% 
  ggplot(aes(Importance, reorder(Variable, Importance),
             fill = Sign)) +
  geom_col(colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the most important predictors are shown.",
       y = "Predictor",
       x = "Coefficient") +
  ggsci::scale_fill_jama() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"),
        legend.position = "bottom")

It now becomes clear that the numeric variables were more important than the nominal ones, like income bracket or education status. The total transaction amount, count as well as average transaction size were highly important. Equally, the change in the latter constitutes an important metric as well. Surprisingly, the relationship count is highly important. From the EDA, we know that people with more dependents are less likely to churn, even though it wasn’t clear that the variable was going to be as important at this stage. Another surprise is that customer age is equally important as average utilisation ratio. Then, lastly, the nominal predictors come in last with virtually no importance.

With both these individual tuning results, a blended (“stacked”) model could easily be built with the stacks package. In this blog post, the focus will be the business/real life application however, therefore this won’t be implemented this time. If you are interested in seeing how the implementation works in R, feel free to check out the Airbnb price prediction blog post.

Evaluating Model Performance On The Training Data

Using the fitted models to predict and evaluate on the test set, which was held out from the model tuning and training to prevent data leakage:

multi_metric <- metric_set(accuracy, sensitivity, specificity)

bind_rows(
  gb_final_fit %>% 
    extract_workflow() %>% 
    predict(dt_test) %>% 
    bind_cols(dt_test %>% select(attrition_flag)) %>% 
    multi_metric(estimate = .pred_class, truth = attrition_flag),
  gb_final_fit %>% 
    extract_workflow() %>% 
    predict(dt_test) %>% 
    bind_cols(dt_test %>% select(attrition_flag)) %>% 
    mutate(.pred_class = .pred_class %>% as.character %>% as.numeric) %>% 
    roc_auc(attrition_flag, .pred_class)
)

# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.960
2 sensitivity binary         0.859
3 specificity binary         0.980
4 roc_auc     binary         0.919

As we have a class imbalance in the out-of-sample data, the accuracy should not be considered a good metric to gauge model performance, as a model predicting the negative class only would lead to very high accuracy, but terrible sensitivity and ROC AUC.

Let’s make this an example:

dt_test %>% 
  transmute(attrition_flag = factor(attrition_flag, levels = c(0,1)), 
            all_negative = factor(0, levels = c(0,1))) %>%
  multi_metric(truth = attrition_flag, 
               estimate = all_negative)

# A tibble: 3 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.840
2 sensitivity binary         1    
3 specificity binary         0

bind_rows(
  dt_test %>% 
    transmute(attrition_flag = factor(attrition_flag, levels = c(0,1)), 
              all_negative = factor(0, levels = c(0,1))) %>%
    multi_metric(truth = attrition_flag, 
                 estimate = all_negative),
  dt_test %>% 
    transmute(attrition_flag = factor(attrition_flag, levels = c(0,1)), 
              all_negative = 0) %>%
    roc_auc(attrition_flag, all_negative) 
)

# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.840
2 sensitivity binary         1    
3 specificity binary         0    
4 roc_auc     binary         0.5

As mentioned before, this “model” only predicts “does not churn” for all clients. The specificity is great, because all clients that did not churn, were predicted to not churn. The sensitivity, however, is terrible, because not a single churned client was predicted correctly. The ROC AUC curve lies on the identity line, thus giving the value 0.5, which indicates that the model is not better than random guessing.

The confusion matrix of the better model can now be visualised, like in the chart below. Especially for a binary classification, this chart is easy to interpret and tells us a lot about what is going on. The model predicts both classes very accurately, but misses a higher fraction of the positive class. The minority class being quite rare and the training data set not being extremely large, this is not surprising. In general, the model performance can be considered very good.

gb_final_fit %>% 
  extract_workflow() %>% 
  predict(dt_test) %>% 
  bind_cols(dt_test %>% select(attrition_flag)) %>% 
  conf_mat(truth = attrition_flag, estimate = .pred_class) %>% 
  autoplot() +
  labs(title = "Confusion Matrix Of The OOS Prediction") +
  theme(plot.title = element_text(face = "bold", size = 12))

Maximising Business Profit by Using Classification Thresholds and Gain Curves

From the EDA as well as variable importance, we have learned about the most important variables to predict bank customer churn. Transaction amounts and frequency, as well as personal traits, such as relationships and age are of high importance.

Potential use in a real life setting

The real question following the above is: How can we use any of the above to create business value? That’s were the gain curve comes in.

Classification models work by assigning probabilities of each class to any individual observation. For instance, one specific customer might be categorised with 75% probability of churning and 25% probability of not churning. In that case, the model would predict the customer to churn, as the probability is higher than 50%. This can be seen below. The XGBoost model predicted around 70% probability of the customer churning and around 30% of them not churning based on the available variables.

gb_final_fit %>% 
  extract_workflow() %>% 
  augment(dt_test) %>% 
  filter(round(.pred_1, 1) == 0.7) %>% 
  head(1) %>% 
  glimpse()

Rows: 1
Columns: 18
$ id                       <dbl> 9718
$ attrition_flag           <fct> 0
$ customer_age             <dbl> 48
$ gender                   <fct> F
$ education_level          <fct> High School
$ income_category          <fct> $40K - $60K
$ total_relationship_count <dbl> 5
$ months_inactive_12_mon   <dbl> 3
$ credit_limit             <dbl> 3871
$ total_revolving_bal      <dbl> 2055
$ total_amt_chng_q4_q1     <dbl> 0.824
$ total_trans_amt          <dbl> 5480
$ total_trans_ct           <dbl> 58
$ total_ct_chng_q4_q1      <dbl> 0.933
$ avg_utilization_ratio    <dbl> 0.531
$ .pred_class              <fct> 1
$ .pred_1                  <dbl> 0.6876283
$ .pred_0                  <dbl> 0.3123717

In a business setting, a logical consequence for a model as reliable as this one predicting a customer to churn would be to target said customer with a retention programme, for instance via selected benefits only attributed to customers at risk of churning (e.g. coupons, discounts etc.). However, it would likely not be economically viable to target all customers that have a greater than 50% chance of churning, as it would most likely be a waste of resources. If a customer has only a 50.00001% chance of churning, according to the model, and assuming that the model is right, then the customer should not deserve a discount equal to the one given to a customer with >90% probability of leaving. After all, there is an around 50% chance that the first might not intend to leave in the first place. Then the discount would be wasted.

The business only wants to give costly retention programmes to customers that are at a high risk of leaving, not to the ones who were more likely going to stay anyway.

Therefore, businesses must find a threshold: Where do you set the minimum probability proposed by the model to classify a customer as at risk of churning? There exists an inherent trade-off in wanting to prevent customers from leaving the business, and not wanting to accumulate costs giving out retention programmes to many clients, who were not at high risk of leaving. As an example, the bank might decide that it targets the top 25% of customers with highest probability. This can be visualised with a gain curve, which comes with the tidymodels package in R:

gb_final_fit %>% 
  collect_predictions() %>%
  gain_curve(truth = attrition_flag, .pred_1) %>% 
  autoplot() +
  geom_vline(xintercept = 25, lty = "dashed", colour = "grey50")

The curve can be read the following way: If targeting the x customers with highest probability of leaving, how many % y do we get right of the customers who will actually leave? In the case of targeting the 25% of customers with highest modelled probability of leaving, we would get close to but not 100% of the customers with an intention of leaving right. The curve being very close to the upper edge of the grey area indicates that the underlying model works very well.

The gain curve is really useful and quick to make in R, however it only says “target the 25% of customers with highest probabilities”. This being dependent on the customers that predictions are being made on, I wanted to create a function that says “target only customers that have a probability x of leaving or higher”. This can be seen here (Code for the function call on the chart can be inspected by clicking the “Code” button to the right below):

case_counts <- function(probs){
  
  gb_final_fit %>% 
    collect_predictions() %>% 
    mutate(
      .pred_thr = ifelse(.pred_1 > probs, 1, 0),
      case = case_when(
        .pred_thr == 1 & attrition_flag == 1 ~ "TP",
        .pred_thr == 1 & attrition_flag == 0 ~ "FP",
        .pred_thr == 0 & attrition_flag == 1 ~ "FN",
        .pred_thr == 0 & attrition_flag == 0 ~ "TN"
      )) %>% 
    count(case)  
  
}

threshold_curve <- function(probs){
  
  probs %>% 
    as_tibble() %>% 
    rename(threshold = value) %>%
    mutate(counts = map(threshold, case_counts)) %>%
    unnest(counts) %>%
    pivot_wider(values_from = n, names_from = case) %>% 
    mutate(across(everything(), ~ replace(., is.na(.), 0)),
           sensitivity = TP/(TP + FN),
           specificity = TN/(TN + FP))
  
}

threshold_curve(seq(0, 1, 0.01)) %>% 
  ggplot(aes(threshold, sensitivity)) +
  geom_line() +
  geom_segment(aes(y = 1, x = 0, yend = 0, xend = 1),
               size = 0.25, colour = "grey50", lty = "dashed") +
  geom_ribbon(aes(ymin = 1-threshold, ymax = sensitivity), alpha = 0.1) +
  labs(
    title = "What % of churned customers is correctly targeted?",
    subtitle = "Threshold: Modelled Probability > x to classify positively",
    y = "sensitivity (true positive)"
  ) +
  scale_x_continuous(labels = scales::percent_format(),
                     breaks = seq(0,1,0.1)) +
  scale_y_continuous(labels = scales::percent_format(),
                     breaks = seq(0,1,0.1)) +
  theme_light() +
  theme(plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(face = "italic", size = 12,
                                     colour = "grey50"))

This curve now demonstrates the real trade off of setting the classification threshold better. At 50%, the default threshold for the model to classify clients above 50% as will churn and below 50% as will not churn has a higher sensitivity. However, at the same time, we are classifying more clients on an absolute level as will churn. Therefore, there are likely also more clients in our predicted will churn class, that are not actually going to leave us. Remember, we didn’t want to spend additional money on them, as they are not going to leave.

Therefore, while we, as the business, want to maximise the number of churning customers we target with retention programmes, we also want to minimise the non-churning customers we wrongly give the coupons/discounts.

This can be seen below: With an increasing threshold, i.e. the “stricter” we make our model, the fewer % of actually non-churning customers we target with coupons that were designed for the customers at risk of churning.

threshold_curve(seq(0, 1, 0.01)) %>% 
  ggplot(aes(threshold, 1-specificity)) +
  geom_line() +
  geom_segment(aes(y = 1, x = 0, yend = 0, xend = 1),
               size = 0.25, colour = "grey50", lty = "dashed") +
  geom_ribbon(aes(ymin = 1-threshold, ymax = 1-specificity), alpha = 0.1) +
  labs(
    title = "What % of non-churning customers is wrongly targeted?",
    subtitle = "Threshold: Modelled Probability > x to classify positively",
    y = NULL
  ) +
  scale_x_continuous(labels = scales::percent_format(),
                     breaks = seq(0,1,0.1)) +
  scale_y_continuous(labels = scales::percent_format(),
                     breaks = seq(0,1,0.1)) +
  theme_light() +
  theme(plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(face = "italic", size = 12,
                                     colour = "grey50"))

Enough theory, how can this create value?

With this curve, we can look at an actual business case: Let’s make some quick and dirty assumptions about profit generated per customers.

Let’s say, a regular, non-churning customer generates USD 100 of profit for us.
We are going to give out a discount of 50% to customers we believe will churn in the next period. It is effective, but not perfectly effective, so only 50% of those customers, who were going to leave, stay after getting the discount. The others still leave and leave us with USD \(0\).
Customers who leave us do not spend any money anymore, so we get USD \(0\) from them.

In model terms this implies:

TP = True Positive: We predicted the customer leaves, we gave out a 50% voucher. 50% of them stay and pay USD 50, the rest leaves. Our profit from this group is \(N_{TP}*50*0.5\).
FP = False Positive: We predicted the customers leaves, but they weren’t planning on leaving. We gave them a 50% discount, all of them stay and our profit from this group is \(N_{FP}*50\).
TN = True Negative: We predicted the customer is not going to leave, they actually didn’t leave. We like those customers because of their loyalty and because they give us the most money, namely \(N_{TN}*100\).
FN = False Negatives: We predicted the customer is not going to leave, but they actually left. These are bad, because we didn’t target them with a voucher. Ouch: The profit from this group is \(0\).

Now I can go ahead and write a function, which counts our TP, FP, TN and FN and calculates the profit based on the sum of all of the four points above, for each threshold we could use in our model.

profit_curve <- function(probs){
  
  probs %>% 
    as_tibble() %>% 
    rename(threshold = value) %>%
    mutate(counts = map(threshold, case_counts)) %>%
    unnest(counts) %>%
    pivot_wider(values_from = n, names_from = case) %>% 
    mutate(
      across(everything(), ~ replace(., is.na(.), 0)),
      profit_without_model = FP * 100 + TP * 0 + TN * 100 + FN * 0,
      profit_with_model = FP * 50 + TP * (50*0.5) + TN * 100 + FN * 0,
      value_add = profit_with_model - profit_without_model,
      sign = ifelse(value_add > 0, "positive", "negative")
    )
  
}

profit_curve(probs = seq(0.01, 1, 0.01)) %>% 
  ggplot(aes(threshold, profit_with_model)) +
  geom_line(colour = "grey50", size = 0.4) + 
  geom_point(aes(colour = sign, group = 1)) +
  geom_curve(aes(x = 0.95, y = 140000, xend = 0.92, yend = 152000),
             arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
             color = "gray20", curvature = 0.1) +
  geom_curve(aes(x = 0.4, y = 142000, xend = 0.5, yend = 132000),
             arrow = arrow(length = unit(0.08, "inch")), size = 0.5,
             color = "gray20", curvature = -0.2) +
  annotate("text", x = 0.35, y = 144000, 
           label = "Normally, classification models use 50%\nas the threshold to predict the one\nor the other class",
           size = 3) +
  annotate("text", x = 0.77, y = 137500, 
           label = "In this bank customer churn model\nhowever, profit is maximised when only\nsaying a customer will churn if they show\nmore than 90% forecasted probability",
           size = 3) +
  labs(colour = "Value-add of the \nmodel compared \nto using no model:",
       y = "Profit",
       x = "Classification Threshold", 
       title = "Forecasted Annual Profit Depending On Classification Threshold",
       subtitle = "Optimal binary classification threshold with profit maximisation in mind\nis not at 50%, but closer to 90%.") +
  scale_y_continuous(labels = dollar_format()) +
  scale_x_continuous(labels = percent_format(), 
                     breaks = seq(0, 1, 0.1)) +
  scale_colour_manual(values = c("firebrick", "dodgerblue")) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(face = "italic", colour = "grey50",
                                     size = 12))

This is the final product! We can now see that the profit is indeed worse at the default of 50% than for higher values. The optimal level for this bank and this type of discount is 90%, as it maximises the total profit. From the colouring, we can see that the baseline revenue without the model is a little under USD 150,000 and that the model generates value for all thresholds with blue coloured points and destroys value for thresholds with red colouring. At the maximisation point, this particular model increases profits by about 3% compared to not using a model at all. This number is of course not carved in stone, it will move with the type of discount and the variance in model performance in future periods.

Concluding remarks

In the above blog post, we have seen that varying and optimising the classification threshold for binary classification models can be crucial for organisational value creation. In cases like these, simple visualisations with data from the predictive models, such as scatter plots and line charts, can be highly beneficial for strategic and financial decision making.

I hope this post has been interesting to you. In case of constructive feedback or if you want to exchange about this or a related topic, feel free to reach out.

Thank you for reading.

A work by Mathias Steilen