{tidymodels} workflow with Bayesian optimisation

R
Machine learning

I’ve been collecting a few notes on using the tidymodels workflow for modelling, and thought it might be worth sharing them here. More for personal reference than anything, but someone might find my ramblings useful!

Published

May 23, 2020

For demonstration purposes I’m going to use the credit_data dataset from the modeldata package. This is a fairly simple data set that doesn’t require too much wrangling to get going.

library(modeldata)
library(tidymodels)
library(tidyverse)
library(doParallel)
library(probably)
library(gt)

data("credit_data")
glimpse(credit_data)
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, good, bad, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, 0, 15, 33, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, parents, owner,…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 60, 36, 18, …
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 30, 37, 21, …
$ Marital   <fct> married, widow, married, single, single, married, married, s…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, no, no, yes…
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed, fixed, fix…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 75, 75, 35, …
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 125, 121, 19…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 4000, 3000, 5…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0, 0, 0, 2000…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, 1200, 1150,…
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 1957, 1468, 15…

First, I’ll split the sample into training and testing sets using rsample, and also create a v-fold cross-validation set for use in the tuning step later.

set.seed(122)

credit_data <- credit_data %>%
  drop_na(Status)

# initial split
split <- initial_split(credit_data, prop = 0.75, strata = "Status")

# train/test sets
train <- training(split)
test <- testing(split)

# cross validation set
folds <- vfold_cv(train, v = 5)

I’ve dropped NAs in the outcome as these caused problems in later steps…

Next, I’ll set up the recipe using recipes. As well as defining the formula for the model, I’ve created two preprocessing steps:

  1. Impute the missing values.
  2. One-hot encode the factor variables in the data.

rec <- recipe(Status ~ ., data = train) %>%
  step_impute_bag(Home, Marital, Job, Income, Assets, Debt) %>%
  step_dummy(Home, Marital, Records, Job, one_hot = TRUE)

Now I’ll prepare the model specification using parsnip. Here I’m using an xgboost model and specifying the parameters I want to tune with Bayesian optimisation. The mtry parameter requires one additional step to finalise the range of possible values, because it depends on the number of predictors in the data and a suitable range to test can’t be estimated without that information.

mod <- boost_tree(
  trees = 1000,
  min_n = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")


params <- parameters(mod) %>%
  finalize(train)
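
If you want to check what range finalize() settled on for mtry, the parameters object is a tibble with an object list-column that can be inspected directly (a quick sketch):

# sketch: inspect the finalised range chosen for mtry
params %>%
  filter(id == "mtry") %>%
  pull(object)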

In this step I’ve bundled all the above steps into a workflow. This avoids the need to use the juice, bake and prep functions (which I never quite got my head around…!).

xgboost_wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
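
For contrast, this is roughly what the manual preprocessing route looks like, i.e. the one the workflow lets us skip (just a sketch: prep() estimates the recipe steps on the training data, and juice()/bake() apply them):

# sketch: the prep/juice/bake route that the workflow handles for us
rec_prepped <- prep(rec, training = train)
train_processed <- juice(rec_prepped)                  # processed training set
test_processed  <- bake(rec_prepped, new_data = test)  # same steps applied to the test set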

Now we are ready to OPTIMISE. I’m going to use an iterative search through Bayesian optimisation to predict which parameters to try next (as opposed to a grid search, where we need to specify parameter values in advance).
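
For comparison, the grid-search equivalent would look something like this (a sketch using a space-filling grid from dials; the grid size here is arbitrary):

# sketch: grid search alternative with a pre-specified space-filling grid
grid <- grid_latin_hypercube(params, size = 30)

tuned_grid <- tune_grid(
  object = xgboost_wflow,
  resamples = folds,
  grid = grid,
  metrics = metric_set(pr_auc)
)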

First I set up some additional workers so the resamples can be run in parallel, and then use the tune_bayes() function to set up the tuning. Here I’ve decided to start from 10 initial results, run up to 30 search iterations, optimise for pr_auc, and stop early if there is no improvement after 10 iterations.

options(tidymodels.dark = TRUE)

cl <- makePSOCKcluster(8)
registerDoParallel(cl)

tuned <- tune_bayes(
  object = xgboost_wflow,
  resamples = folds,
  param_info = params,
  iter = 30,
  metrics = metric_set(pr_auc),
  initial = 10,
  control = control_bayes(
    verbose = TRUE,
    no_improve = 10,
    seed = 123
  )
)

The options(tidymodels.dark = TRUE) setting makes the logging text much clearer if you’re using a dark theme in RStudio.

After a little while, we’re done! The top combinations were:

show_best(tuned, "pr_auc") %>% 
  select(1:7, 11) %>% 
  gt()
 mtry  min_n  tree_depth   learn_rate  loss_reduction  sample_size  .metric     std_err
    8     13           1  0.080393032    3.731578e+00    0.4536060   pr_auc  0.01687497
    9     17           6  0.018723128    2.609860e-04    0.5941547   pr_auc  0.01547112
   11     19           2  0.063308769    7.647379e-07    0.6607560   pr_auc  0.01278103
    7      8           7  0.002946107    2.545899e-07    0.5553494   pr_auc  0.01094428
   11     18          10  0.053917413    3.882588e-02    0.4801454   pr_auc  0.01518299

In the full show_best() output, mean is the average pr_auc across the resamples and std_err is its standard error.
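
To see the full results (including that mean column) or how the search progressed, collect_metrics() and autoplot() both work on the tuning object (a quick sketch):

# all candidate results, with the mean pr_auc and standard error per candidate
collect_metrics(tuned)

# pr_auc estimates across the Bayesian optimisation iterations
autoplot(tuned, type = "performance")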

Now we can create our final model using these parameters.

xgboost_wkflow_tuned <- finalize_workflow(
  xgboost_wflow,
  select_best(tuned, "pr_auc")
)

Finally we can fit the tuned workflow on the full training set and evaluate it on the test set using last_fit().

final_res <- last_fit(
  xgboost_wkflow_tuned,
  split
)

#stopCluster(cl)

With our model in hand, we can collect its test-set predictions to evaluate performance.

preds <- final_res %>% 
  collect_predictions()
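
last_fit() also computes a default set of metrics on the test set, which we can pull out with collect_metrics() (a quick sketch):

# default test-set metrics calculated by last_fit()
final_res %>% 
  collect_metrics()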

Now we can use the yardstick package to evaluate how the model performed.

conf_mat(preds, Status, .pred_class)
          Truth
Prediction bad good
      bad  167   76
      good 147  724
preds %>%
  gain_curve(Status, .pred_bad) %>%
  autoplot()

preds %>%
  pr_curve(Status, .pred_bad) %>%
  autoplot()

The autoplot() feature is great for quickly visualising these curves.
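
Since pr_auc was the metric we tuned on, it’s also worth computing it on the test predictions (a sketch, assuming “bad” is the first factor level and therefore yardstick’s default event of interest):

# area under the precision-recall curve on the test set
pr_auc(preds, Status, .pred_bad)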

Hmmm… that confusion matrix doesn’t look too great. Quite a large number of “good” predictions were actually “bad” (false negatives). Maybe we can improve the class predictions by using probably.

Here we use threshold_perf() to evaluate different probability thresholds for making our class predictions. One method to determine the “best” cut point is the J-index (sensitivity + specificity - 1, which reaches its maximum value of 1 when there are no false positives and no false negatives).

threshold_data <- preds %>%
  threshold_perf(Status, .pred_bad, thresholds = seq(0.2, 1, by = 0.0025))


max_j_index_threshold <- threshold_data %>%
  filter(.metric == "j_index") %>%
  filter(.estimate == max(.estimate)) %>%
  pull(.threshold)


preds_new <- preds %>% 
  mutate(new_class_pred = factor(ifelse(.pred_bad >= max_j_index_threshold, "bad", "good"), 
                                 levels = c("bad", "good"))) 
  

max_j_index_threshold
[1] 0.2375
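
A quick way to see how the different metrics trade off as the threshold moves is to plot threshold_data, marking the chosen cut point (a sketch):

# sketch: metric estimates across candidate thresholds, with the J-index cut point marked
threshold_data %>%
  filter(.metric != "distance") %>%
  ggplot(aes(x = .threshold, y = .estimate, colour = .metric)) +
  geom_line() +
  geom_vline(xintercept = max_j_index_threshold, linetype = "dashed")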

Now we have a new class prediction based on our new threshold of 0.2375 vs the default threshold of 0.50. We can compare performance across a range of binary classification metrics by calling summary() on the conf_mat() object for both the old and new predicted classes.

summary(conf_mat(preds, Status, .pred_class)) %>%
  select(-.estimator) %>%
  rename(old_threshold = .estimate) %>%
  bind_cols(.,
            summary(conf_mat(preds_new, Status, new_class_pred)) %>%
              select(.estimate) %>%
              rename(new_threshold = .estimate)) %>%
  gt() %>%
  fmt_number(columns = c(2, 3),
             decimals = 3) %>%
  tab_style(
    style = cell_fill(color = "indianred3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold < old_threshold)
  )  %>%
  tab_style(
    style = cell_fill(color = "springgreen3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold > old_threshold)
  ) %>% 
  cols_align(align = "center")
.metric               old_threshold  new_threshold
accuracy                      0.800          0.736
kap                           0.469          0.447
sens                          0.532          0.825
spec                          0.905          0.701
ppv                           0.687          0.520
npv                           0.831          0.911
mcc                           0.476          0.476
j_index                       0.437          0.526
bal_accuracy                  0.718          0.763
detection_prevalence          0.218          0.447
precision                     0.687          0.520
recall                        0.532          0.825
f_meas                        0.600          0.638

Lowering the threshold to 0.2375 seems to have had a positive impact on quite a few of the binary classification metrics. There is always going to be a trade-off between maximising some metrics over others, and the right balance will of course depend on what you are trying to achieve with your model.

…and that is a very quick tour of tidymodels! There are obviously some additional steps you would want to carry out in the “real” world. You’d probably want to compare a range of different models and maybe do some additional feature engineering based on the data you have, but the code above is a good starting point for a tidymodels-orientated workflow.

There are many other features in tidymodels and I’ve barely scratched the surface here!

Thanks for reading!