Intro
When evaluating machine learning algorithms through resampling, it is preferable that each train/test partition be a representative subset of the whole data set. This post covers three ways to achieve such reliable resampling procedures:
- Stratified resampling for classification problems where each train/test split maintains the target class distribution of the original data set.
- Block resampling where a grouping factor determines which observations should be together in train/test splits.
- Custom resampling using predefined and manually created folds for the train/test splits.
Prerequisites
We load the most important packages for this post.
library(mlr3verse)
library(mlbench)
library(data.table)
We initialize the random number generator with a fixed seed for reproducibility.
set.seed(7832)
Stratified resampling
In classification tasks, the ratio of the target class distribution should be similar in each train/test split, which is achieved by stratification. This is particularly useful in the case of imbalanced classes and small data sets.
Stratification can also be performed with respect to explanatory categorical variables to ensure that all subgroups are represented in all training and test sets.
In mlr3, each Task has a slot $col_roles. This slot shows the general roles certain features will have throughout different stages of the machine learning process. At the very least, the $col_roles slot shows which variables will be used as features and which as the target. However, the $col_roles slot can be more diverse, and some variables might even serve multiple roles. We can specify the variable used for stratification in task$col_roles$stratum. This is illustrated in the following example using the german_credit data:
= tsk("german_credit")
task_gc $col_roles task_gc
$feature
[1] "age" "amount" "credit_history" "duration"
[5] "employment_duration" "foreign_worker" "housing" "installment_rate"
[9] "job" "number_credits" "other_debtors" "other_installment_plans"
[13] "people_liable" "personal_status_sex" "present_residence" "property"
[17] "purpose" "savings" "status" "telephone"
$target
[1] "credit_risk"
$name
character(0)
$order
character(0)
$stratum
character(0)
$group
character(0)
$weight
character(0)
We use the target column credit_risk to specify stratification with respect to the target variable:
task_gc$col_roles$stratum = "credit_risk"
# alternatively: task_gc$col_roles$stratum = task_gc$col_roles$target
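Instead of writing to task_gc$col_roles directly, newer mlr3 releases also provide the $set_col_roles() method, which adds or removes individual roles without touching the others. A minimal sketch, assuming your mlr3 version ships this method:
# minimal sketch, assuming $set_col_roles() is available in your mlr3 version
task_tmp = tsk("german_credit")
task_tmp$set_col_roles("credit_risk", add_to = "stratum")
task_tmp$col_roles$stratum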
After specifying task$col_roles$stratum, the active binding task$strata will show the number of observations in each group and the corresponding row ids:
task_gc$strata
N row_id
1: 700 1,3,4,6,7,8,...
2: 300 2, 5,10,11,12,14,...
Specify 3-fold cross validation and instantiate the resampling on the task:
= rsmp("cv", folds = 3)
cv3 $instantiate(task_gc)
cv3$instance cv3
row_id fold
1: 7 1
2: 8 1
3: 9 1
4: 17 1
5: 22 1
---
996: 959 3
997: 967 3
998: 980 3
999: 984 3
1000: 999 3
Check if the target class distribution is similar in each fold:
dt = merge(cv3$instance, task_gc$data()[, row_id := .I], by = "row_id")
dt[, .(class_ratio = sum(credit_risk == "bad") /
  sum(credit_risk == "good")), by = fold]
fold class_ratio
1: 2 0.4291845
2: 3 0.4291845
3: 1 0.4273504
And compare it with the target class distribution from the whole data set:
dt[, .(class_ratio = sum(credit_risk == "bad") / sum(credit_risk == "good"))]
class_ratio
1: 0.4285714
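For contrast, here is a sketch of the same check without stratification; on a fresh task without the stratum role, the per-fold ratios will typically scatter more around the overall value of 0.4286:
# sketch: unstratified 3-fold CV for comparison (fresh task, no stratum role)
task_unstrat = tsk("german_credit")
cv3_unstrat = rsmp("cv", folds = 3)
cv3_unstrat$instantiate(task_unstrat)
dt = merge(cv3_unstrat$instance, task_unstrat$data()[, row_id := .I], by = "row_id")
dt[, .(class_ratio = sum(credit_risk == "bad") /
  sum(credit_risk == "good")), by = fold]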
Note that the variable used for stratification does not necessarily have to be the target class. In fact, multiple categorical features can be used for stratification to maintain their frequency distribution in each fold:
$col_roles$stratum = c("housing", "telephone")
task_gc$strata task_gc
N row_id
1: 280 1,13,20,21,26,30,...
2: 433 2, 3, 7, 9,10,14,...
3: 47 4, 5, 45, 76,134,192,...
4: 61 6,19,37,55,63,69,...
5: 63 8, 48, 60, 72, 96,100,...
6: 116 11,12,15,22,23,28,...
To check whether stratification based on multiple categorical variables works, we need to instantiate the CV folds again, as we changed the variables used for stratification:
cv3$instantiate(task_gc)
cv3$instance
row_id fold
1: 13 1
2: 21 1
3: 31 1
4: 33 1
5: 43 1
---
996: 945 3
997: 973 3
998: 974 3
999: 986 3
1000: 993 3
Again, we check the relative frequency of observations in each group (combination of housing and telephone) across all folds:
dt = merge(cv3$instance, task_gc$data()[, row_id := .I], by = "row_id")
dt = dt[, .(freq = .N), by = list(fold, housing, telephone)]
dt = dcast(dt, housing + telephone ~ fold)
Using 'freq' as value column. Use 'value.var' to override
dt[, c(3:5) := lapply(.SD, function(x) x / sum(x)), .SDcols = 3:5]
dt
housing telephone 1 2 3
1: for free no 0.11607143 0.11711712 0.11480363
2: for free yes (under customer name) 0.06250000 0.06306306 0.06344411
3: rent no 0.43154762 0.43243243 0.43504532
4: rent yes (under customer name) 0.27976190 0.27927928 0.28096677
5: own no 0.04761905 0.04804805 0.04531722
6: own yes (under customer name) 0.06250000 0.06006006 0.06042296
And compare it with the relative frequency from the whole data set:
task_gc$data()[, .(freq = .N / max(.I)),
  by = list(housing, telephone)
][order(housing, telephone), ]
housing telephone freq
1: for free no 0.11681772
2: for free yes (under customer name) 0.06415479
3: rent no 0.43300000
4: rent yes (under customer name) 0.28084253
5: own no 0.04895833
6: own yes (under customer name) 0.06106106
It is evident that in each fold the combinations of housing and telephone have similar frequencies, which also coincide with the frequencies from the whole data set.
Block resampling
An additional concern when specifying resampling is respecting the natural grouping of the data. Blocking refers to the situation where subsets of observations belong together and must not be separated during resampling. Hence, for one train/test set pair the entire block is either in the training set or in the test set.
The following example is based on the BreastCancer data set from the mlbench package:
data(BreastCancer, package = "mlbench")
task_bc = as_task_classif(BreastCancer, target = "Class", positive = "malignant")
In the BreastCancer data set, for example, several observations share the same “Id” (sample code number), which implies that these are samples taken from the same patient at different times.
# Count how many Ids actually occur more than once
sum(table(BreastCancer$Id) > 1)
[1] 46
There are 46 Ids with more than one observation (row).
The model trained on this data set will be used to predict the cancer status of new patients. Hence, we have to make sure that each Id occurs in exactly one fold, so that all observations with the same Id are used either for training or for evaluating the model. This way, we get less biased performance estimates via k-fold cross validation. The following example illustrates block cross validation, which can be achieved by specifying a blocking factor in the task$col_roles$group slot:
# Use Id column as block factor
$col_roles$group = "Id"
task_bc# Remove Id from feature
# task_bc$col_roles$feature = setdiff(task_bc$col_roles$feature, "Id")
= rsmp("cv", folds = 5)
cv5 set.seed(123)
$instantiate(task_bc)
cv5$instance cv5
row_id fold
1: 1016277 1
2: 1044572 1
3: 1049815 1
4: 1050718 1
5: 1054590 1
---
641: 1369821 5
642: 1371026 5
643: 1371920 5
644: 714039 5
645: 841769 5
In this case, the row_id column of the cv5$instance slot refers to values of the grouping variable Id. Additionally, the number of rows of cv5$instance is the same as the number of unique groups:
all(cv5$instance$row_id %in% BreastCancer$Id)
[1] TRUE
nrow(cv5$instance) == length(unique(BreastCancer$Id))
[1] TRUE
If the specified blocking groups are respected, each Id appears in exactly one fold. To inspect whether blocking was successful when generating the folds, we count how often each Id appears in a specific fold and print the Ids that appear in more than one fold:
dt = merge(task_bc$data(), cv5$instance, by.x = "Id", by.y = "row_id")
dt = dt[, .(unique_folds = length(unique(fold))), by = Id]
dt[unique_folds > 1, ]
Empty data.table (0 rows and 2 cols): Id,unique_folds
As expected, the table is empty, as no Id is present in more than one fold.
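The instantiated blocked resampling can then be passed to resample() for performance estimation like any other resampling. A sketch, assuming the rpart learner shipped with mlr3 and the classification error measure; we uncomment the feature removal from above so the Id column is not used as a predictor:
# sketch: estimate performance with the blocked CV, assuming lrn("classif.rpart")
task_bc$col_roles$feature = setdiff(task_bc$col_roles$feature, "Id")
rr = resample(task_bc, lrn("classif.rpart"), cv5)
rr$aggregate(msr("classif.ce"))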
Resampling with predefined folds
In some use cases, it might be necessary to use predefined folds. When using k-fold cross validation without repetition, this can be achieved by manually creating a column that denotes the folds and assigning it to the task$col_roles$group slot. First, we create a vector that contains 5 predefined folds:
folds = sample(rep(1:5, length.out = nrow(BreastCancer)),
  size = nrow(BreastCancer),
  replace = FALSE
)
head(folds, 20)
[1] 2 2 4 1 5 2 5 3 1 5 4 3 3 4 5 3 3 5 2 4
table(folds)
folds
1 2 3 4 5
140 140 140 140 139
This vector is now added to the data set and will be used as a grouping factor, just as when defining block resampling:
task_bc = TaskClassif$new(
  id = "BreastCancer",
  backend = data.frame(BreastCancer, foldIds = as.factor(folds)),
  target = "Class",
  positive = "malignant"
)
task_bc$col_roles$group = "foldIds"
# Remove "foldIds" from features
# task_bc$col_roles$feature = setdiff(task_bc$col_roles$feature, "foldIds")
We now instantiate a 5-fold CV that will respect the predefined folds:
= rsmp("cv", folds = 5)
cv5 $instantiate(task_bc)
cv5$instance cv5
row_id fold
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
Since we have only five predefined folds, the cv5$instance data table has five rows and shows which of our foldIds values (contained in the row_id column) belongs to which instantiated fold. To check whether the predefined groups are respected, we count how often each foldIds value appears in a specific fold:
dt = merge(task_bc$data(), cv5$instance, by.x = "foldIds", by.y = "row_id")
dt[, .(unique_folds = length(unique(fold))), by = foldIds]
foldIds unique_folds
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
There are five groups, and each foldIds value appears in exactly one fold. This means that each instantiated fold corresponds to one of the predefined folds.
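As an aside, newer mlr3 versions also provide a dedicated custom_cv resampling that consumes a fold assignment directly, without the detour over the group role. A hedged sketch, assuming rsmp("custom_cv") and its f argument are available in your version:
# hedged sketch, assuming rsmp("custom_cv") accepts a fold factor via `f`
task_tmp = as_task_classif(BreastCancer, target = "Class", positive = "malignant")
custom_cv = rsmp("custom_cv")
custom_cv$instantiate(task_tmp, f = as.factor(folds))
custom_cv$iters  # one resampling iteration per fold level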
The previous example does not cover how to perform repeated k-fold CV or time series CV with predefined indices. This is possible via mlr_resamplings_custom, to which a list of predefined train indices and a list of predefined test indices can be assigned. In the following example, a custom resampling is created using indices generated by caret::createMultiFolds():
= tsk("german_credit")
task_gc = caret::createMultiFolds(task_gc$truth(), k = 5, times = 10)
train_ind = lapply(train_ind, function(x) setdiff(1:task_gc$nrow, x))
test_ind = rsmp("custom")
rc $instantiate(task_gc, train_ind, test_ind) rc
We now check if the instantiated custom resampling contains the intended folds:
# check it for the first fold
all.equal(train_ind[[1]], rc$train_set(1))
[1] TRUE
# check it for all folds
unlist(lapply(1:rc$iters, function(i) all.equal(train_ind[[i]], rc$train_set(i))))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[24] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[47] TRUE TRUE TRUE TRUE
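If you prefer to avoid the caret dependency, the same kind of index lists can be built in base R. A sketch for 5-fold CV repeated 2 times (all numbers are illustrative), reusing the rsmp("custom") mechanism from above:
# sketch: repeated-CV indices in base R (5 folds, 2 repeats assumed)
set.seed(1)
train_ind = list()
test_ind = list()
for (r in 1:2) {
  # a fresh random fold assignment for each repetition
  fold_id = sample(rep(1:5, length.out = task_gc$nrow))
  for (f in 1:5) {
    test_ind[[length(test_ind) + 1]] = which(fold_id == f)
    train_ind[[length(train_ind) + 1]] = which(fold_id != f)
  }
}
rc2 = rsmp("custom")
rc2$instantiate(task_gc, train_ind, test_ind)
rc2$iters
The same list-of-indices pattern also covers time series CV: build train_ind and test_ind from consecutive index windows instead of random folds.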
Conclusions
This post showed how to control the resampling process in mlr3 in order to account for data specificities: maintaining the target class distribution with stratification, keeping grouped observations together with blocking, and using predefined or fully custom folds.