{tidymodels} workflow with Bayesian optimisation

R
Machine learning

I’ve been collecting a few notes on using the tidymodels workflow for modelling, and thought it might be worth sharing them here. More for personal reference than anything, but someone might find my ramblings useful!

Published

May 23, 2020

For demonstration purposes I’m going to use the credit_data dataset from the modeldata package. This is a fairly simple data set that doesn’t require too much wrangling to get going.

library(modeldata)
library(tidymodels)
library(tidyverse)
library(doParallel)
library(probably)
library(gt)

data("credit_data")
glimpse(credit_data)
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, good, bad, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15, 33, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, owner,…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36, 18, …
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37, 21, …
$ Marital   <fct> married, widow, married, single, single, married, married, s…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, no, yes…
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixed, fix…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75, 35, …
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 121, 19…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3000, 5…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 2000…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200, 1150,…
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 1468, 15…

First, I’ll split the sample into training and testing sets using rsample, and also create a v-fold cross-validation set for use in the tuning step later.

set.seed(122)

credit_data <- credit_data %>%
  drop_na(Status)

# initial split
split <- initial_split(credit_data, prop = 0.75, strata = "Status")

# train/test sets
train <- training(split)
test <- testing(split)

# cross validation set
folds <- vfold_cv(train, v = 5)

I’ve dropped NAs in the outcome as these caused problems in later steps…

Next, I’ll set up the recipe using recipes. As well as defining the formula for the model, I’ve created two preprocessing steps:

  1. Impute the missing values.
  2. One-hot encode the factor variables in the data.

rec <- recipe(Status ~ ., data = train) %>%
  step_impute_bag(Home, Marital, Job, Income, Assets, Debt) %>%
  step_dummy(Home, Marital, Records, Job, one_hot = TRUE)

Now I’ll prepare the model specification using parsnip. Here I’m using an xgboost model and specifying the parameters I want to tune with Bayesian optimisation. The mtry parameter requires one additional step to finalise the range of possible values, because it depends on the number of predictors in the data and a suitable range to test can’t be estimated without that information.

mod <- boost_tree(
  trees = 1000,
  min_n = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")


params <- parameters(mod) %>%
  finalize(train)
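
If you want to check what range finalize() settled on for mtry, the parameters object is a tibble with an object list-column that can be inspected directly (a quick sketch):

# sketch: inspect the finalised range chosen for mtry
params %>%
  filter(id == "mtry") %>%
  pull(object)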

In this step I’ve bundled all the above steps into a workflow. This avoids the need to use the juice, bake and prep functions (which I never quite got my head around…!).

xgboost_wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
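
For contrast, this is roughly what the manual preprocessing route looks like, i.e. the one the workflow lets us skip (just a sketch: prep() estimates the recipe steps on the training data, and juice()/bake() apply them):

# sketch: the prep/juice/bake route that the workflow handles for us
rec_prepped <- prep(rec, training = train)
train_processed <- juice(rec_prepped)                  # processed training set
test_processed  <- bake(rec_prepped, new_data = test)  # same steps applied to the test set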

Now we are ready to OPTIMISE. I’m going to use an iterative search through Bayesian optimisation to predict which parameters to try next (as opposed to a grid search, where we need to specify parameter values in advance).
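
For comparison, the grid-search equivalent would look something like this (a sketch using a space-filling grid from dials; the grid size here is arbitrary):

# sketch: grid search alternative with a pre-specified space-filling grid
grid <- grid_latin_hypercube(params, size = 30)

tuned_grid <- tune_grid(
  object = xgboost_wflow,
  resamples = folds,
  grid = grid,
  metrics = metric_set(pr_auc)
)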

First I set up some additional workers so the resamples can be run in parallel, and then use the tune_bayes() function to set up the tuning. Here I’ve decided to start from 10 initial results, run up to 30 search iterations, optimise for pr_auc, and stop early if there is no improvement after 10 iterations.

options(tidymodels.dark = TRUE)

cl <- makePSOCKcluster(8)
registerDoParallel(cl)

tuned <- tune_bayes(
  object = xgboost_wflow,
  resamples = folds,
  param_info = params,
  iter = 30,
  metrics = metric_set(pr_auc),
  initial = 10,
  control = control_bayes(
    verbose = TRUE,
    no_improve = 10,
    seed = 123
  )
)

The options(tidymodels.dark = TRUE) setting makes the logging text much clearer if you’re using a dark theme in RStudio.

After a little while, we’re done! The top combinations were:

show_best(tuned, "pr_auc") %>% 
  select(1:7, 11) %>% 
  gt()
 mtry  min_n  tree_depth   learn_rate  loss_reduction  sample_size  .metric     std_err
    8     13           1  0.080393032    3.731578e+00    0.4536060   pr_auc  0.01687497
    9     17           6  0.018723128    2.609860e-04    0.5941547   pr_auc  0.01547112
   11     19           2  0.063308769    7.647379e-07    0.6607560   pr_auc  0.01278103
    7      8           7  0.002946107    2.545899e-07    0.5553494   pr_auc  0.01094428
   11     18          10  0.053917413    3.882588e-02    0.4801454   pr_auc  0.01518299

In the full show_best() output, mean is the average pr_auc across the resamples and std_err is its standard error.
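
To see the full results (including that mean column) or how the search progressed, collect_metrics() and autoplot() both work on the tuning object (a quick sketch):

# all candidate results, with the mean pr_auc and standard error per candidate
collect_metrics(tuned)

# pr_auc estimates across the Bayesian optimisation iterations
autoplot(tuned, type = "performance")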

Now we can create our final model using these parameters.

xgboost_wkflow_tuned <- finalize_workflow(
  xgboost_wflow,
  select_best(tuned, "pr_auc")
)

Finally we can fit the tuned workflow on the full training set and evaluate it on the test set using last_fit().

final_res <- last_fit(
  xgboost_wkflow_tuned,
  split
)

#stopCluster(cl)

With our model in hand, we can collect its test-set predictions to evaluate performance.

preds <- final_res %>% 
  collect_predictions()
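
last_fit() also computes a default set of metrics on the test set, which we can pull out with collect_metrics() (a quick sketch):

# default test-set metrics calculated by last_fit()
final_res %>% 
  collect_metrics()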

Now we can use the yardstick package to evaluate how the model performed.

conf_mat(preds, Status, .pred_class)
          Truth
Prediction bad good
      bad  167   76
      good 147  724
preds %>%
  gain_curve(Status, .pred_bad) %>%
  autoplot()

preds %>%
  pr_curve(Status, .pred_bad) %>%
  autoplot()

The autoplot() feature is great for quickly visualising these curves.
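
Since pr_auc was the metric we tuned on, it’s also worth computing it on the test predictions (a sketch, assuming “bad” is the first factor level and therefore yardstick’s default event of interest):

# area under the precision-recall curve on the test set
pr_auc(preds, Status, .pred_bad)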

Hmmm… that confusion matrix doesn’t look too great. Quite a large number of “good” predictions were actually “bad” (false negatives). Maybe we can improve the class predictions by using probably.

Here we use threshold_perf() to evaluate different probability thresholds for making our class predictions. One method to determine the “best” cut point is the J-index (sensitivity + specificity - 1, which reaches its maximum value of 1 when there are no false positives and no false negatives).

threshold_data <- preds %>%
  threshold_perf(Status, .pred_bad, thresholds = seq(0.2, 1, by = 0.0025))


max_j_index_threshold <- threshold_data %>%
  filter(.metric == "j_index") %>%
  filter(.estimate == max(.estimate)) %>%
  pull(.threshold)


preds_new <- preds %>% 
  mutate(new_class_pred = factor(ifelse(.pred_bad >= max_j_index_threshold, "bad", "good"), 
                                 levels = c("bad", "good"))) 
  

max_j_index_threshold
[1] 0.2375
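
A quick way to see how the different metrics trade off as the threshold moves is to plot threshold_data, marking the chosen cut point (a sketch):

# sketch: metric estimates across candidate thresholds, with the J-index cut point marked
threshold_data %>%
  filter(.metric != "distance") %>%
  ggplot(aes(x = .threshold, y = .estimate, colour = .metric)) +
  geom_line() +
  geom_vline(xintercept = max_j_index_threshold, linetype = "dashed")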

Now we have a new class prediction based on our new threshold of 0.2375 vs the default threshold of 0.50. We can compare performance across a range of binary classification metrics by calling summary() on the conf_mat() object for both the old and new predicted classes.

summary(conf_mat(preds, Status, .pred_class)) %>%
  select(-.estimator) %>%
  rename(old_threshold = .estimate) %>%
  bind_cols(.,
            summary(conf_mat(preds_new, Status, new_class_pred)) %>%
              select(.estimate) %>%
              rename(new_threshold = .estimate)) %>%
  gt() %>%
  fmt_number(columns = c(2, 3),
             decimals = 3) %>%
  tab_style(
    style = cell_fill(color = "indianred3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold < old_threshold)
  )  %>%
  tab_style(
    style = cell_fill(color = "springgreen3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold > old_threshold)
  ) %>% 
  cols_align(align = "center")
.metric               old_threshold  new_threshold
accuracy                      0.800          0.736
kap                           0.469          0.447
sens                          0.532          0.825
spec                          0.905          0.701
ppv                           0.687          0.520
npv                           0.831          0.911
mcc                           0.476          0.476
j_index                       0.437          0.526
bal_accuracy                  0.718          0.763
detection_prevalence          0.218          0.447
precision                     0.687          0.520
recall                        0.532          0.825
f_meas                        0.600          0.638

Lowering the threshold to 0.2375 seems to have had a positive impact on quite a few of the binary classification metrics. There is always going to be a trade-off between maximising some metrics over others, and the right balance will of course depend on what you are trying to achieve with your model.

…and that is a very quick tour of tidymodels! There are obviously some additional steps you would want to carry out in the “real” world. You’d probably want to compare a range of different models and maybe do some additional feature engineering based on the data you have, but the code above is a good starting point for a tidymodels-orientated workflow.

There are many other features in tidymodels and I’ve barely scratched the surface here!

Thanks for reading!