Intro
When evaluating machine learning algorithms through resampling, it is preferable that each train/test partition be a representative subset of the whole data set. This post covers three ways to achieve such reliable resampling procedures:
- Stratified resampling for classification problems where each train/test split maintains the target class distribution of the original data set.
- Block resampling where a grouping factor determines which observations should be together in train/test splits.
- Custom resampling using predefined and manually created folds for the train/test splits.
Prerequisites
We load the most important packages for this post.
library(mlr3verse)
library(mlbench)
library(data.table)
We initialize the random number generator with a fixed seed for reproducibility.
set.seed(7832)
Stratified resampling
In classification tasks, the ratio of the target class distribution should be similar in each train/test split, which is achieved by stratification. This is particularly useful in the case of imbalanced classes and small data sets.
Stratification can also be performed with respect to explanatory categorical variables to ensure that all subgroups are represented in all training and test sets.
In mlr3, each Task has a slot $col_roles. This slot shows the general roles certain features will have throughout different stages of the machine learning process. At the very least, the $col_roles slot shows which variables will be used as features and which as the target. However, the $col_roles slot can be more diverse, and some variables might even serve multiple roles. We can specify the variable used for stratification in task$col_roles$stratum. This is illustrated in the following example using the german_credit data:
= tsk("german_credit")
task_gc $col_roles task_gc
$feature
[1] "age" "amount" "credit_history" "duration"
[5] "employment_duration" "foreign_worker" "housing" "installment_rate"
[9] "job" "number_credits" "other_debtors" "other_installment_plans"
[13] "people_liable" "personal_status_sex" "present_residence" "property"
[17] "purpose" "savings" "status" "telephone"
$target
[1] "credit_risk"
$name
character(0)
$order
character(0)
$stratum
character(0)
$group
character(0)
$weight
character(0)
We use the target column credit_risk to specify stratification with respect to the target variable:
task_gc$col_roles$stratum = "credit_risk"
# alternatively: task_gc$col_roles$stratum = task_gc$col_roles$target
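Instead of writing to task_gc$col_roles directly, newer mlr3 releases also provide the $set_col_roles() method, which adds or removes individual roles without touching the others. A minimal sketch, assuming your mlr3 version ships this method:
# minimal sketch, assuming $set_col_roles() is available in your mlr3 version
task_tmp = tsk("german_credit")
task_tmp$set_col_roles("credit_risk", add_to = "stratum")
task_tmp$col_roles$stratum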
After specifying task$col_roles$stratum, the active binding task$strata will show the number of observations in each group and the corresponding row ids:
task_gc$strata
N row_id
1: 700 1,3,4,6,7,8,...
2: 300 2, 5,10,11,12,14,...
Specify 3-fold cross validation and instantiate the resampling on the task:
= rsmp("cv", folds = 3)
cv3 $instantiate(task_gc)
cv3$instance cv3
row_id fold
1: 7 1
2: 8 1
3: 9 1
4: 17 1
5: 22 1
---
996: 959 3
997: 967 3
998: 980 3
999: 984 3
1000: 999 3
Check if the target class distribution is similar in each fold:
dt = merge(cv3$instance, task_gc$data()[, row_id := .I], by = "row_id")
dt[, .(class_ratio = sum(credit_risk == "bad") /
  sum(credit_risk == "good")), by = fold]
fold class_ratio
1: 2 0.4291845
2: 3 0.4291845
3: 1 0.4273504
And compare it with the target class distribution from the whole data set:
dt[, .(class_ratio = sum(credit_risk == "bad") / sum(credit_risk == "good"))]
class_ratio
1: 0.4285714
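For contrast, here is a sketch of the same check without stratification; on a fresh task without the stratum role, the per-fold ratios will typically scatter more around the overall value of 0.4286:
# sketch: unstratified 3-fold CV for comparison (fresh task, no stratum role)
task_unstrat = tsk("german_credit")
cv3_unstrat = rsmp("cv", folds = 3)
cv3_unstrat$instantiate(task_unstrat)
dt = merge(cv3_unstrat$instance, task_unstrat$data()[, row_id := .I], by = "row_id")
dt[, .(class_ratio = sum(credit_risk == "bad") /
  sum(credit_risk == "good")), by = fold]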
Note that the variable used for stratification does not necessarily have to be the target class. In fact, multiple categorical features can be used for stratification to maintain their frequency distribution in each fold:
$col_roles$stratum = c("housing", "telephone")
task_gc$strata task_gc
N row_id
1: 280 1,13,20,21,26,30,...
2: 433 2, 3, 7, 9,10,14,...
3: 47 4, 5, 45, 76,134,192,...
4: 61 6,19,37,55,63,69,...
5: 63 8, 48, 60, 72, 96,100,...
6: 116 11,12,15,22,23,28,...
To check whether stratification based on multiple categorical variables works, we need to instantiate the CV folds again, as we changed the variables used for stratification:
cv3$instantiate(task_gc)
cv3$instance
row_id fold
1: 13 1
2: 21 1
3: 31 1
4: 33 1
5: 43 1
---
996: 945 3
997: 973 3
998: 974 3
999: 986 3
1000: 993 3
Again, we check the relative frequency of observations in each group (combination of housing and telephone) across all folds:
dt = merge(cv3$instance, task_gc$data()[, row_id := .I], by = "row_id")
dt = dt[, .(freq = .N), by = list(fold, housing, telephone)]
dt = dcast(dt, housing + telephone ~ fold)
Using 'freq' as value column. Use 'value.var' to override
dt[, c(3:5) := lapply(.SD, function(x) x / sum(x)), .SDcols = 3:5]
dt
housing telephone 1 2 3
1: for free no 0.11607143 0.11711712 0.11480363
2: for free yes (under customer name) 0.06250000 0.06306306 0.06344411
3: rent no 0.43154762 0.43243243 0.43504532
4: rent yes (under customer name) 0.27976190 0.27927928 0.28096677
5: own no 0.04761905 0.04804805 0.04531722
6: own yes (under customer name) 0.06250000 0.06006006 0.06042296
And compare it with the relative frequency from the whole data set:
task_gc$data()[, .(freq = .N / max(.I)),
  by = list(housing, telephone)
][order(housing, telephone), ]
housing telephone freq
1: for free no 0.11681772
2: for free yes (under customer name) 0.06415479
3: rent no 0.43300000
4: rent yes (under customer name) 0.28084253
5: own no 0.04895833
6: own yes (under customer name) 0.06106106
It is evident that in each fold the combinations of housing and telephone have similar frequencies, which also coincide with the frequencies from the whole data set.
Block resampling
An additional concern when specifying resampling is respecting the natural grouping of the data. Blocking refers to the situation where subsets of observations belong together and must not be separated during resampling. Hence, for one train/test set pair the entire block is either in the training set or in the test set.
The following example is based on the BreastCancer data set from the mlbench package:
data(BreastCancer, package = "mlbench")
task_bc = as_task_classif(BreastCancer, target = "Class", positive = "malignant")
In the BreastCancer data set, for example, several observations share the same “Id” (sample code number), which implies that these are samples taken from the same patient at different times.
# Count how many Ids actually occur more than once
sum(table(BreastCancer$Id) > 1)
[1] 46
There are 46 Ids with more than one observation (row).
The model trained on this data set will be used to predict the cancer status of new patients. Hence, we have to make sure that each Id occurs in exactly one fold, so that all observations with the same Id are used either for training or for evaluating the model. This way, we get less biased performance estimates via k-fold cross validation. The following example illustrates block cross validation, which can be achieved by specifying a blocking factor in the task$col_roles$group slot:
# Use Id column as block factor
$col_roles$group = "Id"
task_bc# Remove Id from feature
# task_bc$col_roles$feature = setdiff(task_bc$col_roles$feature, "Id")
= rsmp("cv", folds = 5)
cv5 set.seed(123)
$instantiate(task_bc)
cv5$instance cv5
row_id fold
1: 1016277 1
2: 1044572 1
3: 1049815 1
4: 1050718 1
5: 1054590 1
---
641: 1369821 5
642: 1371026 5
643: 1371920 5
644: 714039 5
645: 841769 5
In this case, the row_id column of the cv5$instance slot refers to values of the grouping variable Id. Additionally, the number of rows of cv5$instance is the same as the number of unique groups:
all(cv5$instance$row_id %in% BreastCancer$Id)
[1] TRUE
nrow(cv5$instance) == length(unique(BreastCancer$Id))
[1] TRUE
If the specified blocking groups are respected, each Id appears in exactly one fold. To inspect whether blocking was successful when generating the folds, we count how often each Id appears in a specific fold and print the Ids that appear in more than one fold:
dt = merge(task_bc$data(), cv5$instance, by.x = "Id", by.y = "row_id")
dt = dt[, .(unique_folds = length(unique(fold))), by = Id]
dt[unique_folds > 1, ]
Empty data.table (0 rows and 2 cols): Id,unique_folds
As expected, the table is empty, as no Id is present in more than one fold.
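The instantiated blocked resampling can then be passed to resample() for performance estimation like any other resampling. A sketch, assuming the rpart learner shipped with mlr3 and the classification error measure; we uncomment the feature removal from above so the Id column is not used as a predictor:
# sketch: estimate performance with the blocked CV, assuming lrn("classif.rpart")
task_bc$col_roles$feature = setdiff(task_bc$col_roles$feature, "Id")
rr = resample(task_bc, lrn("classif.rpart"), cv5)
rr$aggregate(msr("classif.ce"))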
Resampling with predefined folds
In some use cases, it might be necessary to use predefined folds. When using k-fold cross validation without repetition, this can be achieved by manually creating a column that denotes the folds and assigning it to the task$col_roles$group slot. First, we create a vector that contains 5 predefined folds:
folds = sample(rep(1:5, length.out = nrow(BreastCancer)),
  size = nrow(BreastCancer),
  replace = FALSE
)
head(folds, 20)
[1] 2 2 4 1 5 2 5 3 1 5 4 3 3 4 5 3 3 5 2 4
table(folds)
folds
1 2 3 4 5
140 140 140 140 139
This vector is now added to the data set and will be used as a grouping factor, just as when defining block resampling:
task_bc = TaskClassif$new(
  id = "BreastCancer",
  backend = data.frame(BreastCancer, foldIds = as.factor(folds)),
  target = "Class",
  positive = "malignant"
)
task_bc$col_roles$group = "foldIds"
# Remove "foldIds" from features
# task_bc$col_roles$feature = setdiff(task_bc$col_roles$feature, "foldIds")
We now instantiate a 5-fold CV that will respect the predefined folds:
= rsmp("cv", folds = 5)
cv5 $instantiate(task_bc)
cv5$instance cv5
row_id fold
1: 1 1
2: 2 2
3: 3 3
4: 4 4
5: 5 5
Since we have only five predefined folds, the cv5$instance data table has five rows and shows which of our foldIds values (contained in the row_id column) belongs to which instantiated fold. To check whether the predefined groups are respected, we count how often each foldIds value appears in a specific fold:
dt = merge(task_bc$data(), cv5$instance, by.x = "foldIds", by.y = "row_id")
dt[, .(unique_folds = length(unique(fold))), by = foldIds]
foldIds unique_folds
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
There are five groups, and each foldIds value appears in exactly one fold. This means that each instantiated fold corresponds to one of the predefined folds.
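As an aside, newer mlr3 versions also provide a dedicated custom_cv resampling that consumes a fold assignment directly, without the detour over the group role. A hedged sketch, assuming rsmp("custom_cv") and its f argument are available in your version:
# hedged sketch, assuming rsmp("custom_cv") accepts a fold factor via `f`
task_tmp = as_task_classif(BreastCancer, target = "Class", positive = "malignant")
custom_cv = rsmp("custom_cv")
custom_cv$instantiate(task_tmp, f = as.factor(folds))
custom_cv$iters  # one resampling iteration per fold level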
The previous example does not cover how to perform repeated k-fold CV or time series CV with predefined indices. This is possible via mlr_resamplings_custom, to which a list of predefined train indices and a list of predefined test indices can be assigned. In the following example, a custom resampling is created using indices generated by caret::createMultiFolds():
= tsk("german_credit")
task_gc = caret::createMultiFolds(task_gc$truth(), k = 5, times = 10)
train_ind = lapply(train_ind, function(x) setdiff(1:task_gc$nrow, x))
test_ind = rsmp("custom")
rc $instantiate(task_gc, train_ind, test_ind) rc
We now check if the instantiated custom resampling contains the intended folds:
# check it for the first fold
all.equal(train_ind[[1]], rc$train_set(1))
[1] TRUE
# check it for all folds
unlist(lapply(1:rc$iters, function(i) all.equal(train_ind[[i]], rc$train_set(i))))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[24] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[47] TRUE TRUE TRUE TRUE
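If you prefer to avoid the caret dependency, the same kind of index lists can be built in base R. A sketch for 5-fold CV repeated 2 times (all numbers are illustrative), reusing the rsmp("custom") mechanism from above:
# sketch: repeated-CV indices in base R (5 folds, 2 repeats assumed)
set.seed(1)
train_ind = list()
test_ind = list()
for (r in 1:2) {
  # a fresh random fold assignment for each repetition
  fold_id = sample(rep(1:5, length.out = task_gc$nrow))
  for (f in 1:5) {
    test_ind[[length(test_ind) + 1]] = which(fold_id == f)
    train_ind[[length(train_ind) + 1]] = which(fold_id != f)
  }
}
rc2 = rsmp("custom")
rc2$instantiate(task_gc, train_ind, test_ind)
rc2$iters
The same list-of-indices pattern also covers time series CV: build train_ind and test_ind from consecutive index windows instead of random folds.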
Conclusions
This post showed how to control the resampling process in mlr3 in order to account for data specificities: maintaining the target class distribution with stratification, keeping grouped observations together with blocking, and using predefined or fully custom folds.