Working with spatial data in R requires a lot of data wrangling, e.g. reading from different file formats, converting between spatial formats, creating tables from point layers, and predicting raster images. The goal of mlr3spatial is to simplify these workflows within the mlr3 ecosystem. As a practical example, we will perform a land cover classification for the city of Leipzig, Germany. Figure 1 illustrates the typical workflow for this type of task: load the training data, create a spatial task, train a learner with it, and predict the final raster image.

We assume that you are familiar with the mlr3 ecosystem and know the basic concepts of remote sensing. If not, we recommend reading the mlr3book first. If you are interested in spatial resampling, check out the book chapter on spatial analysis.

Land cover is the physical material or vegetation that covers the surface of the earth, including both natural and human-made features. Understanding land cover patterns and changes over time is critical for addressing global environmental challenges, such as climate change, land degradation, and loss of biodiversity. Land cover classification is the process of assigning land cover classes to pixels in a raster image. With mlr3spatial, we can easily perform a land cover classification within the mlr3 ecosystem.

Before we can start the land cover classification, we need to load the necessary packages. The mlr3spatial package relies on terra for processing raster data and sf for vector data. These widely used packages read all common raster and vector formats. Additionally, the stars and raster packages are supported.

```
library(mlr3verse)
library(mlr3spatial)
library(terra, exclude = "resample")
library(sf)
```

We will work with a Sentinel-2 scene of the city of Leipzig which consists of 7 bands with a 10 or 20m spatial resolution and an NDVI band. The data is included in the mlr3spatial package. We use `terra::rast()` to load the TIFF raster file.

```
leipzig_raster = rast(system.file("extdata", "leipzig_raster.tif", package = "mlr3spatial"))
leipzig_raster
```

```
class : SpatRaster
dimensions : 206, 154, 8 (nrow, ncol, nlyr)
resolution : 10, 10 (x, y)
extent : 731810, 733350, 5692030, 5694090 (xmin, xmax, ymin, ymax)
coord. ref. : WGS 84 / UTM zone 32N (EPSG:32632)
source : leipzig_raster.tif
names : b02, b03, b04, b06, b07, b08, ...
min values : 846, 645, 366, 375, 401, 374, ...
max values : 4705, 4880, 5451, 4330, 5162, 5749, ...
```

The training data is a GeoPackage point layer with land cover labels and spectral features. We load the file and create a `simple feature point layer`.

```
leipzig_vector = read_sf(system.file("extdata", "leipzig_points.gpkg", package = "mlr3spatial"), stringsAsFactors = TRUE)
leipzig_vector
```

```
Simple feature collection with 97 features and 9 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 731930.5 ymin: 5692136 xmax: 733220.3 ymax: 5693968
Projected CRS: WGS 84 / UTM zone 32N
# A tibble: 97 × 10
b02 b03 b04 b06 b07 b08 b11 ndvi land_cover geom
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <POINT [m]>
1 903 772 426 2998 4240 4029 1816 0.809 forest (732480.1 5693957)
2 1270 1256 1081 1998 2493 2957 2073 0.465 urban (732217.4 5692769)
3 1033 996 777 2117 2748 2799 1595 0.565 urban (732737.2 5692469)
4 962 773 500 465 505 396 153 -0.116 water (733169.3 5692777)
5 1576 1527 1626 1715 1745 1768 1980 0.0418 urban (732202.2 5692644)
6 1125 1185 920 3058 3818 3758 2682 0.607 pasture (732153 5693059)
7 880 746 424 2502 3500 3397 1469 0.778 forest (731937.9 5693722)
8 1332 1251 1385 1663 1799 1640 1910 0.0843 urban (732416.2 5692324)
9 940 741 475 452 515 400 139 -0.0857 water (732933.7 5693344)
10 902 802 454 2764 3821 3666 1567 0.780 forest (732411.3 5693352)
# … with 87 more rows
```

We plot both layers to get an overview of the data. The training points are located in the districts of Lindenau and Zentrum West.

```
library(ggplot2)
library(tidyterra, exclude = "filter")
ggplot() +
geom_spatraster_rgb(data = leipzig_raster, r = 3, g = 2, b = 1, max_col_value = 5451) +
geom_spatvector(data = leipzig_vector, aes(color = land_cover)) +
scale_color_viridis_d(name = "Land cover", labels = c("Forest", "Pastures", "Urban", "Water")) +
theme_minimal()
```

The `as_task_classif_st()` function directly creates a spatial task from the point layer. This makes it unnecessary to transform the point layer to a `data.frame` with coordinates. Spatial tasks additionally store the coordinates of the training points. The coordinates are useful when estimating the performance of the model with spatial resampling.

```
task = as_task_classif_st(leipzig_vector, target = "land_cover")
task
```

```
<TaskClassifST:leipzig_vector> (97 x 9)
* Target: land_cover
* Properties: multiclass
* Features (8):
- dbl (8): b02, b03, b04, b06, b07, b08, b11, ndvi
* Coordinates:
X Y
1: 732480.1 5693957
2: 732217.4 5692769
3: 732737.2 5692469
4: 733169.3 5692777
5: 732202.2 5692644
---
93: 733018.7 5692342
94: 732551.4 5692887
95: 732520.4 5692589
96: 732542.2 5692204
97: 732437.8 5692300
```
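Because the task stores the coordinates, spatial resampling strategies can be applied directly. The following is a minimal sketch using the mlr3spatiotempcv package and its coordinate-based `"spcv_coords"` resampling; the package and resampling id are assumptions and are not used elsewhere in this post.

```
# estimate performance with a coordinate-based spatial cross-validation
# (requires the mlr3spatiotempcv package)
library(mlr3spatiotempcv)
spatial_resampling = rsmp("spcv_coords", folds = 4)
rr = resample(task, lrn("classif.rpart"), spatial_resampling)
rr$aggregate(msr("classif.ce"))
```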

Now we can train a model with the task. We use a simple decision tree learner from the rpart package. The `"classif_st"` task is a specialization of the `"classif"` task and therefore works with all `"classif"` learners.

```
learner = lrn("classif.rpart")
learner$train(task)
```

To get a complete land cover classification of Leipzig, we have to predict on each pixel and return a raster image with these predictions. The `$predict()` method of the learner only works for tabular data. To predict a raster image, we use the `predict_spatial()` function.

```
# predict land cover map
land_cover = predict_spatial(leipzig_raster, learner)
```
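The prediction is returned as a raster object. If needed, it can be written to disk with terra; a small sketch (the file name is arbitrary):

```
# write the predicted land cover map to a GeoTIFF
writeRaster(land_cover, "leipzig_land_cover.tif", overwrite = TRUE)
```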

```
ggplot() +
geom_spatraster(data = land_cover) +
scale_fill_viridis_d(name = "Land cover", labels = c("Forest", "Pastures", "Urban", "Water")) +
theme_minimal()
```

Working with spatial data in R is very easy with the mlr3spatial package. You can quickly train a model with a point layer and predict a raster image. The mlr3spatial package is still in development and we are looking forward to your feedback and contributions.

Feature selection is the process of finding an optimal subset of features in order to improve the performance, interpretability and robustness of machine learning algorithms. In this article, we introduce the wrapper feature selection method *Recursive Feature Elimination*. Wrapper methods iteratively select features that optimize a performance measure. As an example, we will search for the optimal set of features for a `gradient boosting machine` and a `support vector machine` on the `Sonar` data set. We assume that you are already familiar with the basic building blocks of the mlr3 ecosystem. If you are new to feature selection, we recommend reading the feature selection chapter of the mlr3book first.

Recursive Feature Elimination (RFE) is a widely used feature selection method for high-dimensional data sets. The idea is to iteratively remove the least predictive feature from a model until the desired number of features is reached. This feature is determined by the built-in feature importance method of the model. Currently, RFE works with support vector machines (SVM), decision tree algorithms and gradient boosting machines (GBM). Supported learners are tagged with the `"importance"` property. For a full list of supported learners, see the learner page on the mlr-org website and search for `"importance"`.

Guyon et al. (2002) developed the RFE algorithm for SVMs (SVM-RFE) to select informative genes in cancer classification. The importance of the features is given by the weight vector of a linear support vector machine. This method was later extended to other machine learning algorithms. The only requirement is that the models can internally measure the feature importance. The random forest algorithm offers multiple options for measuring feature importance. The commonly used methods are the mean decrease in accuracy (MDA) and the mean decrease in impurity (MDI). The MDA measures the decrease in accuracy for a feature if it was randomly permuted in the out-of-bag sample. The MDI is the total reduction in node impurity when the feature is used for splitting. Gradient boosting algorithms like `XGBoost`, `LightGBM` and `GBM` use similar methods to measure the importance of the features.

Resampling strategies can be combined with the algorithm in different ways. The frameworks scikit-learn (Pedregosa et al. 2011) and caret (Kuhn 2008) implement a variant called recursive feature elimination with cross-validation (RFE-CV) that estimates the optimal number of features with cross-validation first. Then one more RFE is carried out on the complete dataset with the optimal number of features as the final feature set size. The RFE implementation in mlr3 can rank and aggregate importance scores across resampling iterations. We will explore both variants in more detail below.

mlr3fselect is the feature selection package of the mlr3 ecosystem. It implements the `RFE` and `RFE-CV` algorithms. We load all packages of the ecosystem with the `mlr3verse` package.

`library(mlr3verse)`

We retrieve the `RFE` optimizer with the `fs()` function.

```
optimizer = fs("rfe",
n_features = 1,
feature_number = 1,
aggregation = "rank")
```

The algorithm has multiple control parameters. The optimizer stops when the number of features equals `n_features`. The parameters `feature_number`, `feature_fraction` and `subset_size` determine the number of features that are removed in each iteration. The `feature_number` option removes a fixed number of features in each iteration, whereas `feature_fraction` removes a fraction of the features. The `subset_size` argument is a vector that specifies exactly how many features are removed in each iteration. The parameters are mutually exclusive and the default is `feature_fraction = 0.5`. Usually, RFE fits a new model in each resampling iteration and calculates the feature importance again. We can deactivate this behavior by setting `recursive = FALSE`. The selection of feature subsets in all iterations is then based solely on the importance scores of the first model trained with all features. When running an RFE with a resampling strategy like cross-validation, multiple models and importance scores are generated. The `aggregation` parameter determines how the importance scores are aggregated. The option `"rank"` ranks the importance scores in each iteration and then averages the ranks of the features. The feature with the lowest average rank is removed. The option `"mean"` averages the importance scores of the features across the iterations. The `"mean"` option should only be used if the learner's importance scores can be reasonably averaged.
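To illustrate these parameters, here are two alternative parameterizations of the optimizer; both are only sketches and are not used in the example below.

```
# remove half of the remaining features in each iteration (the default behavior)
optimizer_fraction = fs("rfe", n_features = 1, feature_fraction = 0.5)

# compute the importance scores only once on the model trained with all features
optimizer_single = fs("rfe", n_features = 1, feature_number = 1, recursive = FALSE)
```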

The objective of the `Sonar` data set is to predict whether a sonar signal bounced off a metal cylinder or a rock. The data set includes 60 numerical features (see Figure 1).

`task = tsk("sonar")`

```
library(ggplot2)
library(data.table)
data = melt(as.data.table(task), id.vars = task$target_names, measure.vars = task$feature_names)
data = data[c("V1", "V10", "V11", "V12", "V13", "V14"), , on = "variable"]
ggplot(data, aes(x = value, fill = Class)) +
geom_density(alpha = 0.5) +
facet_wrap(~ variable, ncol = 6, scales = "free") +
scale_fill_viridis_d(end = 0.8) +
theme_minimal() +
theme(axis.title.x = element_blank())
```

We start with the `GBM learner` and set the predict type to `"prob"` to obtain class probabilities.

```
learner = lrn("classif.gbm",
distribution = "bernoulli",
predict_type = "prob")
```

Now we define the feature selection problem by using the `fsi()` function that constructs an `FSelectInstanceSingleCrit`. In addition to the task and learner, we have to select a `resampling strategy` and `performance measure` to determine how the performance of a feature subset is evaluated. We pass the `"none"` terminator because the `n_features` parameter of the optimizer determines when the feature selection stops.

```
instance = fsi(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 6),
measures = msr("classif.auc"),
terminator = trm("none"))
```

We are now ready to start the RFE. To do this, we simply pass the instance to the `$optimize()` method of the optimizer.

`optimizer$optimize(instance)`

The optimizer saves the best feature set and the corresponding estimated performance in `instance$result`.

Figure 2 shows the optimization path of the feature selection. We observe that the performance increases first as the number of features decreases. As soon as informative features are removed, the performance drops.

```
library(viridisLite)
library(mlr3misc)
data = as.data.table(instance$archive)
data[, n:= map_int(importance, length)]
ggplot(data, aes(x = n, y = classif.auc)) +
geom_line(
color = viridis(1, begin = 0.5),
linewidth = 1) +
geom_point(
fill = viridis(1, begin = 0.5),
shape = 21,
size = 3,
stroke = 0.5,
alpha = 0.8) +
xlab("Number of Features") +
scale_x_reverse() +
theme_minimal()
```

The importance scores of the feature sets are recorded in the archive.

`as.data.table(instance$archive)[, list(features, classif.auc, importance)]`

```
features classif.auc importance
1: V1,V10,V11,V12,V13,V14,... 0.8929304 58.83333,58.83333,54.50000,54.00000,53.33333,52.50000,...
2: V1,V10,V11,V12,V13,V15,... 0.9177811 57.33333,56.00000,54.00000,53.66667,50.50000,50.00000,...
3: V1,V10,V11,V12,V13,V15,... 0.9045253 54.83333,54.66667,54.66667,53.00000,51.83333,51.33333,...
4: V1,V10,V11,V12,V13,V15,... 0.8927833 56.00000,55.83333,53.00000,52.00000,50.16667,50.00000,...
5: V1,V10,V11,V12,V13,V15,... 0.9016274 55.50000,53.50000,51.33333,50.00000,49.00000,48.50000,...
---
56: V11,V12,V16,V48,V9 0.8311625 4.166667,3.333333,2.833333,2.500000,2.166667
57: V11,V12,V16,V9 0.8216772 3.833333,2.666667,2.000000,1.500000
58: V11,V12,V16 0.8065807 2.833333,1.833333,1.333333
59: V11,V12 0.8023780 1.833333,1.166667
60: V11 0.7515904 1
```

Now we will select the optimal feature set for an SVM with a linear kernel. The importance scores are the weights of the model.

```
learner = lrn("classif.svm",
type = "C-classification",
kernel = "linear",
predict_type = "prob")
```

The `SVM learner` does not support the calculation of importance scores out of the box, because importance scores cannot be determined for all kernels. This can be seen from the missing `"importance"` property.

`learner$properties`

`[1] "multiclass" "twoclass" `

Using the `"mlr3fselect.svm_rfe"` callback, however, makes it possible to use a linear SVM with the `RFE` optimizer. The callback adds the `$importance()` method internally to the learner. We load the callback with the `clbk()` function and pass it as the `"callback"` argument to `fsi()`.

```
instance = fsi(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 6),
measures = msr("classif.auc"),
terminator = trm("none"),
callback = clbk("mlr3fselect.svm_rfe"))
```

We start the feature selection.

`optimizer$optimize(instance)`

Figure 3 shows the average performance of the SVMs depending on the number of features. We can see that the performance increases significantly with a reduced feature set.

```
library(viridisLite)
library(mlr3misc)
data = as.data.table(instance$archive)
data[, n:= map_int(importance, length)]
ggplot(data, aes(x = n, y = classif.auc)) +
geom_line(
color = viridis(1, begin = 0.5),
linewidth = 1) +
geom_point(
fill = viridis(1, begin = 0.5),
shape = 21,
size = 3,
stroke = 0.5,
alpha = 0.8) +
xlab("Number of Features") +
scale_x_reverse() +
theme_minimal()
```

For datasets with a lot of features, it is more efficient to remove several features per iteration. We show an example where 25% of the features are removed in each iteration.

```
optimizer = fs("rfe", n_features = 1, feature_fraction = 0.75)
instance = fsi(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 6),
measures = msr("classif.auc"),
terminator = trm("none"),
callback = clbk("mlr3fselect.svm_rfe"))
optimizer$optimize(instance)
```

Figure 4 shows a similar optimization curve as Figure 3 but with fewer evaluated feature sets.

```
library(viridisLite)
library(mlr3misc)
data = as.data.table(instance$archive)
data[, n:= map_int(importance, length)]
ggplot(data, aes(x = n, y = classif.auc)) +
geom_line(
color = viridis(1, begin = 0.5),
linewidth = 1) +
geom_point(
fill = viridis(1, begin = 0.5),
shape = 21,
size = 3,
stroke = 0.5,
alpha = 0.8) +
xlab("Number of Features") +
scale_x_reverse() +
theme_minimal()
```

RFE-CV estimates the optimal number of features before selecting a feature set. For this, an RFE is run in each resampling iteration and the number of features with the best mean performance is selected (see Figure 5). Then one more RFE is carried out on the complete dataset with the optimal number of features as the final feature set size.

We retrieve the `RFE-CV` optimizer. RFE-CV has almost the same control parameters as the RFE optimizer. The only difference is that no aggregation is needed.

```
optimizer = fs("rfecv",
n_features = 1,
feature_number = 1)
```

The chosen resampling strategy is used to estimate the optimal number of features. The 6-fold cross-validation results in 6 RFE runs. You can choose any other resampling strategy with multiple iterations. Let’s start the feature selection.

```
learner = lrn("classif.svm",
type = "C-classification",
kernel = "linear",
predict_type = "prob")
instance = fsi(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 6),
measures = msr("classif.auc"),
terminator = trm("none"),
callback = clbk("mlr3fselect.svm_rfe"))
optimizer$optimize(instance)
```

Warning

The performance of the optimal feature set is calculated on the complete data set and should not be reported as the performance of the final model. Estimate the performance of the final model with nested resampling.
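A minimal sketch of such a nested resampling is shown below, using the GBM learner from above. We assume the `auto_fselector()` helper from mlr3fselect; in older package versions the first argument is named `method` instead of `fselector`.

```
# inner 6-fold CV selects the features, outer 3-fold CV estimates the performance
afs = auto_fselector(
  fselector = fs("rfe", n_features = 1, feature_number = 1, aggregation = "rank"),
  learner = lrn("classif.gbm", distribution = "bernoulli", predict_type = "prob"),
  resampling = rsmp("cv", folds = 6),
  measure = msr("classif.auc"),
  terminator = trm("none")
)
rr = resample(tsk("sonar"), afs, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.auc"))
```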

We visualize the selection of the optimal number of features. Each point is the mean performance of the number of features. We achieved the best performance with 19 features.

```
library(ggplot2)
library(viridisLite)
library(mlr3misc)
data = as.data.table(instance$archive)[!is.na(iteration), ]
aggr = data[, list("y" = mean(unlist(.SD))), by = "batch_nr", .SDcols = "classif.auc"]
aggr[, batch_nr := 61 - batch_nr]
data[, n:= map_int(importance, length)]
ggplot(aggr, aes(x = batch_nr, y = y)) +
geom_line(
color = viridis(1, begin = 0.5),
linewidth = 1) +
geom_point(
fill = viridis(1, begin = 0.5),
shape = 21,
size = 3,
stroke = 0.5,
alpha = 0.8) +
geom_vline(
xintercept = aggr[y == max(y)]$batch_nr,
colour = viridis(1, begin = 0.33),
linetype = 3
) +
xlab("Number of Features") +
ylab("Mean AUC") +
scale_x_reverse() +
theme_minimal()
```

The archive contains the extra column `"iteration"` that indicates in which resampling iteration the feature set was evaluated. The feature subsets of the final RFE run have no value in the `"iteration"` column because they were evaluated on the complete data set.

`as.data.table(instance$archive)[, list(features, classif.auc, iteration, importance)]`

```
features classif.auc iteration importance
1: V1,V10,V11,V12,V13,V14,... 0.8782895 1 2.864018,1.532774,1.408485,1.399930,1.326165,1.167745,...
2: V1,V10,V11,V12,V13,V14,... 0.7026144 2 2.056442,1.706077,1.258703,1.191762,1.190752,1.178514,...
3: V1,V10,V11,V12,V13,V14,... 0.8790850 3 1.950412,1.887710,1.820891,1.616219,1.231928,1.138675,...
4: V1,V10,V11,V12,V13,V14,... 0.8125000 4 2.6958580,1.5623759,1.4990138,1.3902109,0.9385757,0.9232132,...
5: V1,V10,V11,V12,V13,V14,... 0.8807018 5 2.487483,1.470778,1.356517,1.033764,0.635383,0.575074,...
---
398: V1,V11,V12,V16,V23,V3,... 0.9605275 NA 2.0089739,1.1047492,1.0011253,0.6602411,0.6015470,0.5431803,...
399: V1,V12,V16,V23,V3,V30,... 0.9595988 NA 1.8337471,1.1937962,0.9853467,0.7751384,0.7296726,0.6222569,...
400: V1,V12,V16,V23,V3,V30,... 0.9589486 NA 1.8824952,1.2468164,1.0106654,0.8090618,0.6983925,0.6568389,...
401: V1,V12,V16,V23,V3,V30,... 0.9559766 NA 2.3872902,0.9094028,0.8809098,0.8277941,0.7841591,0.7792772,...
402: V1,V12,V16,V23,V3,V30,... 0.9521687 NA 1.9485133,1.1482257,1.1098823,0.9591012,0.8234140,0.8118616,...
```

The learner we use to make predictions on new data is called the final model. The final model is trained with the optimal feature set on the full data set. The optimal set consists of 19 features and is stored in `instance$result_feature_set`. We subset the task to the optimal feature set and train the learner.

```
task$select(instance$result_feature_set)
learner$train(task)
```

The trained model can now be used to predict new, external data.
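For illustration, a minimal sketch of predicting with the final model; here we simply reuse a few rows of the task as a stand-in for external observations.

```
# predict on new observations with the final model
new_data = tsk("sonar")$data(rows = 1:5, cols = instance$result_feature_set)
learner$predict_newdata(new_data)
```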

The RFE algorithm is a valuable feature selection method, especially for high-dimensional datasets with only a few observations. The numerous settings of the algorithm in mlr3 make it possible to apply it to many datasets and learners. If you want to know more about feature selection in general, we recommend having a look at our book.

Guyon, Isabelle, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. “Gene Selection for Cancer Classification Using Support Vector Machines.” *Machine Learning* 46 (1): 389–422. https://doi.org/10.1023/A:1012487302797.

Kuhn, Max. 2008. “Building Predictive Models in r Using the Caret Package.” *Journal of Statistical Software* 28 (November): 1–26. https://doi.org/10.18637/jss.v028.i05.

Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” *Journal of Machine Learning Research* 12 (85): 2825–30. http://jmlr.org/papers/v12/pedregosa11a.html.

Feature selection is the process of finding an optimal set of features to improve the performance, interpretability and robustness of machine learning algorithms. In this article, we introduce the *Shadow Variable Search* algorithm which is a wrapper method for feature selection. Wrapper methods iteratively add features to the model that optimize a performance measure. As an example, we will search for the optimal set of features for a `support vector machine` on the `Pima Indian Diabetes` data set. We assume that you are already familiar with the basic building blocks of the mlr3 ecosystem. If you are new to feature selection, we recommend reading the feature selection chapter of the mlr3book first. Some knowledge about mlr3pipelines is beneficial but not necessary to understand the example.

Adding shadow variables to a data set is a well-known method in machine learning (Wu, Boos, and Stefanski 2007; Thomas et al. 2017). The idea is to add permuted copies of the original features to the data set. These permuted copies are called shadow variables or pseudovariables and the permutation breaks any relationship with the target variable, making them useless for prediction. The subsequent search is similar to the sequential forward selection algorithm, where one new feature is added in each iteration of the algorithm. This new feature is selected as the one that improves the performance of the model the most. This selection is computationally expensive, as one model has to be trained for each of the not yet included features. The difference between shadow variable search and sequential forward selection is that the former uses the selection of a shadow variable as the termination criterion. Selecting a shadow variable means that the best improvement is achieved by adding a feature that is unrelated to the target variable. Consequently, the variables not yet selected are most likely also correlated to the target variable only by chance. Therefore, only the previously selected features have a true influence on the target variable.
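To make the idea concrete, the following sketch shows how shadow variables could be constructed by hand; mlr3fselect does this internally, so the snippet is purely illustrative.

```
# add a permuted (shadow) copy of each feature of the Pima data
library(mlr3)
library(data.table)
task = tsk("pima")
data = as.data.table(task$data())
features = task$feature_names
shadow = data[, lapply(.SD, sample), .SDcols = features]  # permute each feature
setnames(shadow, paste0("permuted__", features))
augmented = cbind(data, shadow)
```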

mlr3fselect is the feature selection package of the mlr3 ecosystem. It implements the `shadow variable search` algorithm. We load all packages of the ecosystem with the `mlr3verse` package.

`library(mlr3verse)`

We retrieve the `shadow variable search` optimizer with the `fs()` function. The algorithm has no control parameters.

`optimizer = fs("shadow_variable_search")`

The objective of the `Pima Indian Diabetes` data set is to predict whether a person has diabetes or not. The data set includes 768 patients with 8 measurements (see Figure 1).

`task = tsk("pima")`

```
library(ggplot2)
library(data.table)
data = melt(as.data.table(task), id.vars = task$target_names, measure.vars = task$feature_names)
ggplot(data, aes(x = value, fill = diabetes)) +
geom_density(alpha = 0.5) +
facet_wrap(~ variable, ncol = 8, scales = "free") +
scale_fill_viridis_d(end = 0.8) +
theme_minimal() +
theme(axis.title.x = element_blank())
```

The data set contains missing values.

`task$missings()`

```
diabetes age glucose insulin mass pedigree pregnant pressure triceps
0 0 5 374 11 0 0 35 227
```

Support vector machines cannot handle missing values. We impute the missing values with the `histogram imputation` method.

`learner = po("imputehist") %>>% lrn("classif.svm", predict_type = "prob")`

Now we define the feature selection problem by using the `fsi()` function that constructs an `FSelectInstanceSingleCrit`. In addition to the task and learner, we have to select a `resampling strategy` and `performance measure` to determine how the performance of a feature subset is evaluated. We pass the `"none"` terminator because the shadow variable search algorithm terminates by itself.

```
instance = fsi(
task = task,
learner = learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.auc"),
terminator = trm("none")
)
```

We are now ready to start the shadow variable search. To do this, we simply pass the instance to the `$optimize()` method of the optimizer.

`optimizer$optimize(instance)`

```
age glucose insulin mass pedigree pregnant pressure triceps features classif.auc
1: TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE age,glucose,mass,pedigree 0.835165
```

The optimizer returns the best feature set and the corresponding estimated performance.

Figure 2 shows the optimization path of the feature selection. The feature glucose was selected first, followed by age, mass and pedigree in the subsequent iterations. Then a shadow variable was selected and the feature selection was terminated.

```
library(data.table)
library(ggplot2)
library(mlr3misc)
library(viridisLite)
data = as.data.table(instance$archive)[order(-classif.auc), head(.SD, 1), by = batch_nr][order(batch_nr)]
data[, features := map_chr(features, str_collapse)]
data[, batch_nr := as.character(batch_nr)]
ggplot(data, aes(x = batch_nr, y = classif.auc)) +
geom_bar(
stat = "identity",
width = 0.5,
fill = viridis(1, begin = 0.5),
alpha = 0.8) +
geom_text(
data = data,
mapping = aes(x = batch_nr, y = 0, label = features),
hjust = 0,
nudge_y = 0.05,
color = "white",
size = 5
) +
coord_flip() +
xlab("Iteration") +
theme_minimal()
```

The archive contains all evaluated feature sets. We can see that each feature has a corresponding shadow variable. We only show the variables age, glucose and insulin and their shadow variables here.

`as.data.table(instance$archive)[, .(age, glucose, insulin, permuted__age, permuted__glucose, permuted__insulin, classif.auc)]`

```
age glucose insulin permuted__age permuted__glucose permuted__insulin classif.auc
1: TRUE FALSE FALSE FALSE FALSE FALSE 0.6437052
2: FALSE TRUE FALSE FALSE FALSE FALSE 0.7598155
3: FALSE FALSE TRUE FALSE FALSE FALSE 0.4900280
4: FALSE FALSE FALSE FALSE FALSE FALSE 0.6424026
5: FALSE FALSE FALSE FALSE FALSE FALSE 0.5690107
---
54: TRUE TRUE FALSE FALSE FALSE FALSE 0.8266713
55: TRUE TRUE FALSE FALSE FALSE FALSE 0.8063568
56: TRUE TRUE FALSE FALSE FALSE FALSE 0.8244232
57: TRUE TRUE FALSE FALSE FALSE FALSE 0.8234605
58: TRUE TRUE FALSE FALSE FALSE FALSE 0.8164784
```

The learner we use to make predictions on new data is called the final model. The final model is trained with the optimal feature set on the full data set. We subset the task to the optimal feature set and train the learner.

```
task$select(instance$result_feature_set)
learner$train(task)
```

The trained model can now be used to predict new, external data.

The shadow variable search is a fast feature selection method that is easy to use. More information on the theoretical background can be found in Wu, Boos, and Stefanski (2007) and Thomas et al. (2017). If you want to know more about feature selection in general, we recommend having a look at our book.

Thomas, Janek, Tobias Hepp, Andreas Mayr, and Bernd Bischl. 2017. “Probing for Sparse and Fast Variable Selection with Model-Based Boosting.” *Computational and Mathematical Methods in Medicine* 2017 (July): e1421409. https://doi.org/10.1155/2017/1421409.

Wu, Yujun, Dennis D Boos, and Leonard A Stefanski. 2007. “Controlling Variable Selection by the Addition of Pseudovariables.” *Journal of the American Statistical Association* 102 (477): 235–43. https://doi.org/10.1198/016214506000000843.

The predictive performance of modern machine learning algorithms is highly dependent on the choice of their hyperparameter configuration. Options for setting hyperparameters are tuning, manual selection by the user, and using the default configuration of the algorithm. The default configurations are chosen to work with a wide range of data sets but they usually do not achieve the best predictive performance. When tuning a learner in mlr3, we can run the default configuration as a baseline. Seeing how well it performs will tell us whether tuning pays off. If the optimized configurations perform worse, we could expand the search space or try a different optimization algorithm. Of course, it could also be that tuning on the given data set is simply not worth it.

Probst, Boulesteix, and Bischl (2019) studied the tunability of machine learning algorithms. They found that the tunability of algorithms varies widely. Algorithms like glmnet and XGBoost are highly tunable, while algorithms like random forests work well with their default configuration. The highly tunable algorithms should thus beat their baselines more easily with optimized hyperparameters. In this article, we will tune the hyperparameters of a random forest and compare the performance of the default configuration with the optimized configurations.

We tune the hyperparameters of the `ranger learner` on the `spam` data set. The search space is taken from Bischl et al. (2021).

```
library(mlr3verse)
learner = lrn("classif.ranger",
mtry.ratio = to_tune(0, 1),
replace = to_tune(),
sample.fraction = to_tune(1e-1, 1),
num.trees = to_tune(1, 2000)
)
```

When creating the tuning instance, we set `evaluate_default = TRUE` to test the default hyperparameter configuration. The default configuration is evaluated in the first batch of the tuning run. The other batches use the specified tuning method. In this example, they are randomly drawn configurations.

```
instance = tune(
method = tnr("random_search", batch_size = 5),
task = tsk("spam"),
learner = learner,
resampling = rsmp("holdout"),
measures = msr("classif.ce"),
term_evals = 51,
evaluate_default = TRUE
)
```

The default configuration is recorded in the first row of the archive. The other rows contain the results of the random search.

`as.data.table(instance$archive)[, .(batch_nr, mtry.ratio, replace, sample.fraction, num.trees, classif.ce)]`

```
batch_nr mtry.ratio replace sample.fraction num.trees classif.ce
1: 1 0.12280702 TRUE 1.0000000 500 0.04889179
2: 2 0.81757154 FALSE 0.8117389 1528 0.06518905
3: 2 0.90097848 FALSE 0.9188504 571 0.06975228
4: 2 0.65584252 TRUE 0.3145144 681 0.06323338
5: 2 0.40363652 FALSE 0.7508936 1807 0.05801825
---
47: 11 0.71528316 TRUE 0.4398745 1394 0.06127771
48: 11 0.19136788 FALSE 0.8293552 249 0.04889179
49: 11 0.09430346 FALSE 0.6233559 1307 0.04889179
50: 11 0.52643368 FALSE 0.5993606 1403 0.05997392
51: 11 0.17115160 TRUE 0.3309041 114 0.05867014
```
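A quick way to compare the default configuration with the best configuration found by the random search is to extract both from the archive and the result; this is just a convenience sketch.

```
# classification error of the default configuration (first batch) vs. the tuned result
default_ce = as.data.table(instance$archive)[batch_nr == 1, classif.ce]
c(default = default_ce, tuned = unname(instance$result_y))
```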

We plot the performances of the evaluated hyperparameter configurations. The blue line connects the best configuration of each batch. We see that the default configuration already performs well and the optimized configurations cannot beat it.

```
library(mlr3viz)
autoplot(instance, type = "performance")
```

The time required to test the default configuration is negligible compared to the time required to run the hyperparameter optimization. It gives us a valuable indication of whether our tuning is properly configured. Running the default configuration as a baseline is a good practice that should be used in every tuning run.

Bischl, Bernd, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, et al. 2021. “Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges.” *arXiv:2107.05847 [Cs, Stat]*, July. http://arxiv.org/abs/2107.05847.

Probst, Philipp, Anne-Laure Boulesteix, and Bernd Bischl. 2019. “Tunability: Importance of Hyperparameters of Machine Learning Algorithms.” *Journal of Machine Learning Research* 20 (53): 1–32. http://jmlr.org/papers/v20/18-444.html.

Hotstarting a learner resumes the training from an already fitted model. An example would be to train an already fitted XGBoost model for an additional 500 boosting iterations. In mlr3, we call this process **Hotstarting**, where a learner has access to a cache of already trained models called the `mlr3::HotstartStack`.

We distinguish between forward and backward hotstarting. We start this post with backward hotstarting and then talk about the less efficient forward hotstarting.

In this example, we optimize the hyperparameters of a random forest and use hotstarting to reduce the runtime. Hotstarting a random forest backwards is very simple: the model remains unchanged and only a subset of the trees is used for prediction, i.e. a new model is not fitted. For example, a random forest is trained with 1000 trees and a specific hyperparameter configuration. If another random forest with 500 trees but the same hyperparameter configuration has to be trained, the model with 1000 trees is copied and only 500 trees are used for prediction.
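The following minimal sketch shows this mechanism outside of tuning; it is not part of the tuning example below.

```
# backward hotstarting: the 500-tree learner reuses the already trained 1000-tree model
library(mlr3verse)
task = tsk("spam")
learner_1000 = lrn("classif.ranger", num.trees = 1000)
learner_1000$train(task)
learner_500 = lrn("classif.ranger", num.trees = 500)
learner_500$hotstart_stack = HotstartStack$new(learner_1000)
learner_500$train(task) # no new model is fitted, only 500 trees are used for prediction
```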

We load the `ranger learner` and set the search space from the Bischl et al. (2021) article.

```
library(mlr3verse)
learner = lrn("classif.ranger",
mtry.ratio = to_tune(0, 1),
replace = to_tune(),
sample.fraction = to_tune(1e-1, 1),
num.trees = to_tune(1, 2000)
)
```

We activate hotstarting with the `allow_hotstart` option. When running a grid search with hotstarting, the grid is sorted by the hot start parameter. This means the models with 2000 trees are trained first. The models with fewer than 2000 trees hot start on the 2000-tree models, which allows the training to be completed immediately.

```
instance = tune(
method = tnr("grid_search", resolution = 5, batch_size = 5),
task = tsk("spam"),
learner = learner,
resampling = rsmp("holdout"),
measure = msr("classif.ce"),
allow_hotstart = TRUE
)
```

For comparison, we perform the same tuning without hotstarting.

```
instance_2 = tune(
method = tnr("grid_search", resolution = 5, batch_size = 5),
task = tsk("spam"),
learner = learner,
resampling = rsmp("holdout"),
measure = msr("classif.ce"),
allow_hotstart = FALSE
)
```

We plot the time of completion of each batch (see Figure 1). Each batch includes 5 configurations. We can see that tuning with hotstarting is slower at first. As soon as all models are fitted with 2000 trees, the tuning runs much faster and overtakes the tuning without hotstarting.

Forward hotstarting is currently only supported by XGBoost. However, we have observed that hotstarting only provides a speed advantage for very large datasets and models with more than 5000 boosting rounds. The reason is that copying the models from the main process to the workers is a major bottleneck. The parallelization package future copies the models sequentially to the workers. Consequently, it takes a long time until the last worker can even start. Moreover, copying itself consumes a lot of time, and copying the model back from the worker blocks the main process again. During the development process, we overestimated the speed benefits of hotstarting and underestimated the overhead of parallelization. We can therefore only advise against using forward hotstarting during tuning. It is much more efficient to use the internal early-stopping mechanism of XGBoost. This eliminates the need to copy models to the worker. See the gallery post on early stopping for an example. We might improve the efficiency of the hotstarting mechanism in the future, if there are convincing use cases.

Nevertheless, forward hotstarting can be useful without parallelization, for example when you have an already trained model and want to add more boosting iterations to it. In this example, `learner_5000` is the already trained model. We create a new learner with the same hyperparameters but double the number of boosting iterations. To activate hotstarting, we create a `HotstartStack` and copy it to the `$hotstart_stack` slot of the new learner.

```
task = tsk("spam")
learner_5000 = lrn("classif.xgboost", nrounds = 5000, eta = 0.1)
learner_5000$train(task)
learner_10000 = lrn("classif.xgboost", nrounds = 10000, eta = 0.1)
learner_10000$hotstart_stack = HotstartStack$new(learner_5000)
learner_10000$train(task)
```

Training the initial model took 59.885 seconds.

`learner_5000$state$train_time`

`[1] 59.885`

Adding 5000 boosting rounds took 46.837 seconds.

`learner_10000$state$train_time - learner_5000$state$train_time`

`[1] 46.837`

Training the model from the beginning would have taken about two minutes. This means, without parallelization, we get the expected speed advantage.

We have seen how mlr3 enables us to reduce the training time by building on a hotstart stack of already trained learners. One has to be careful, however, when using forward hotstarting during tuning because of the high parallelization overhead that arises from copying the models between the processes. If a model has an internal early-stopping implementation, it should usually be relied upon instead of the mlr3 hotstarting mechanism. However, manual forward hotstarting can be helpful in situations where we do not want to train a large model from the beginning.

Bischl, Bernd, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, et al. 2021. “Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges.” *arXiv:2107.05847 [Cs, Stat]*, July. http://arxiv.org/abs/2107.05847.

We continue working with the *Hyperband* optimization algorithm (Li et al. 2018). The previous post used the number of boosting iterations of an XGBoost model as the resource. However, Hyperband is not limited to machine learning algorithms that are trained iteratively. The resource can also be the number of features, the training time of a model, or the size of the training data set. In this post, we will tune a support vector machine and use the size of the training data set as the fidelity parameter. Both the training time and the performance of a support vector machine increase with the size of the data set, which makes the data set size a suitable fidelity parameter for Hyperband. This is the second part of the Hyperband series; the first part, Hyperband Series - Iterative Training, explains the algorithm in detail, so check it out if you don't know much about Hyperband. We assume that you are already familiar with tuning in the mlr3 ecosystem. If not, you should start with the book chapter on optimization or the Hyperparameter Optimization on the Palmer Penguins Data Set post. A little knowledge about mlr3pipelines is beneficial but not necessary to understand the example.

In this post, we will optimize the hyperparameters of the support vector machine on the `Sonar` data set. We begin by constructing a classification support vector machine by setting `type` to `"C-classification"`.

```
library("mlr3verse")
learner = lrn("classif.svm", id = "svm", type = "C-classification")
```

The mlr3pipelines package features a `PipeOp` for subsampling.

`po("subsample")`

```
PipeOp: <subsample> (not trained)
values: <frac=0.6321, stratify=FALSE, replace=FALSE>
Input channels <name [train type, predict type]>:
input [Task,Task]
Output channels <name [train type, predict type]>:
output [Task,Task]
```

The `PipeOp` controls the size of the training data set with the `frac` parameter. We connect the `PipeOp` with the learner and get a `GraphLearner`.

```
graph_learner = as_learner(
po("subsample") %>>%
learner
)
```

The graph learner subsamples and then fits a support vector machine on the data subset. The parameter set of the graph learner is a combination of the parameter sets of the `PipeOp` and the learner.

`as.data.table(graph_learner$param_set)[, .(id, lower, upper, levels)]`

```
id lower upper levels
1: subsample.frac 0 Inf
2: subsample.stratify NA NA TRUE,FALSE
3: subsample.replace NA NA TRUE,FALSE
4: svm.cachesize -Inf Inf
5: svm.class.weights NA NA
---
15: svm.nu -Inf Inf
16: svm.scale NA NA
17: svm.shrinking NA NA TRUE,FALSE
18: svm.tolerance 0 Inf
19: svm.type NA NA C-classification,nu-classification
```

Next, we create the search space. We use `TuneToken` to mark which hyperparameters should be tuned. We have to prefix the hyperparameters with the id of the `PipeOp`. The `subsample.frac` is the fidelity parameter that must be tagged with `"budget"` in the search space. The data set size is increased from 3.7% to 100%. For the other hyperparameters, we took the search space for support vector machines from the Kuehn et al. (2018) article. This search space works for a wide range of data sets.

```
graph_learner$param_set$set_values(
subsample.frac = to_tune(p_dbl(3^-3, 1, tags = "budget")),
svm.kernel = to_tune(c("linear", "polynomial", "radial")),
svm.cost = to_tune(1e-4, 1e3, logscale = TRUE),
svm.gamma = to_tune(1e-4, 1e3, logscale = TRUE),
svm.tolerance = to_tune(1e-4, 2, logscale = TRUE),
svm.degree = to_tune(2, 5)
)
```

Support vector machines often crash or never finish the training with certain hyperparameter configurations. We set a timeout of 30 seconds and a fallback learner to handle these cases.

```
graph_learner$encapsulate = c(train = "evaluate", predict = "evaluate")
graph_learner$timeout = c(train = 30, predict = 30)
graph_learner$fallback = lrn("classif.featureless")
```

Let's create the tuning instance. We use the `"none"` terminator because Hyperband controls the termination itself.

```
instance = ti(
task = tsk("sonar"),
learner = graph_learner,
resampling = rsmp("cv", folds = 3),
measures = msr("classif.ce"),
terminator = trm("none")
)
instance
```

```
<TuningInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveTuning:subsample.svm_on_sonar>
* Search Space:
id class lower upper nlevels
1: subsample.frac ParamDbl 0.03703704 1.0000000 Inf
2: svm.cost ParamDbl -9.21034037 6.9077553 Inf
3: svm.degree ParamInt 2.00000000 5.0000000 4
4: svm.gamma ParamDbl -9.21034037 6.9077553 Inf
5: svm.kernel ParamFct NA NA 3
6: svm.tolerance ParamDbl -9.21034037 0.6931472 Inf
* Terminator: <TerminatorNone>
```

We load the `Hyperband tuner` and set `eta = 3`.

```
library("mlr3hyperband")
tuner = tnr("hyperband", eta = 3)
```

Using `eta = 3` and a lower bound of 3.7% for the data set size results in the following schedule. Configurations with the same data set size are evaluated in parallel.
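The schedule can be inspected without running the tuning; we assume here that the `hyperband_schedule()` helper exported by mlr3hyperband is available.

```
# brackets and stages for r_min = 3.7%, r_max = 100% and eta = 3
library(mlr3hyperband)
hyperband_schedule(r_min = 3^-3, r_max = 1, eta = 3)
```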

Now we are ready to start the tuning.

`tuner$optimize(instance)`

The best model is a support vector machine with a polynomial kernel.

`instance$result[, .(subsample.frac, svm.cost, svm.degree, svm.gamma, svm.kernel, svm.tolerance, classif.ce)]`

```
subsample.frac svm.cost svm.degree svm.gamma svm.kernel svm.tolerance classif.ce
1: 1 3.04282 3 6.850538 polynomial -1.914435 0.1926156
```

The archive contains all evaluated configurations. We look at the 8 configurations that were evaluated on the complete data set. The configuration with the best classification error on the full data set was sampled in bracket 2. The classification error was estimated to be 26% on 33% of the data set and increased to 19% on the full data set (see green line in Figure 1).

Using the data set size as the budget parameter in Hyperband allows the tuning of machine learning models that are not trained iteratively. We have tried to keep the runtime of the example low. For your optimization, you should use cross-validation and run multiple iterations of Hyperband.

Kuehn, Daniel, Philipp Probst, Janek Thomas, and Bernd Bischl. 2018. “Automatic Exploration of Machine Learning Experiments on OpenML.” https://arxiv.org/abs/1806.10961.

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” *Journal of Machine Learning Research* 18 (185): 1–52. https://jmlr.org/papers/v18/16-558.html.

Increasingly large data sets and search spaces make hyperparameter optimization a time-consuming task. *Hyperband* (Li et al. 2018) solves this by approximating the performance of a configuration on a simplified version of the problem, such as a small subset of the training data, just a few training epochs in a neural network, or only a small number of iterations in a gradient-boosting model. Starting from randomly sampled configurations, Hyperband iteratively allocates more resources to promising configurations and terminates low-performing ones. This type of optimization is called *multi-fidelity* optimization. The fidelity parameter is part of the search space and controls the tradeoff between the runtime and the accuracy of the performance approximation. In this post, we will optimize XGBoost and use the number of boosting iterations as the fidelity parameter. This means Hyperband will allocate more boosting iterations to well-performing configurations. The number of boosting iterations increases the time to train a model and improves the performance until the model is overfitting to the training data. It is therefore a suitable fidelity parameter. We assume that you are already familiar with tuning in the mlr3 ecosystem. If not, you should start with the book chapter on optimization or the Hyperparameter Optimization on the Palmer Penguins Data Set post. This is the first part of the Hyperband series. The second part can be found here: Hyperband Series - Data Set Subsampling.

Hyperband is an advancement of the Successive Halving algorithm by Jamieson and Talwalkar (2016). Successive Halving is initialized with the number of starting configurations $n$, the halving parameter $\eta$ which controls the proportion of configurations discarded in each stage, and the minimum budget $r_{\min}$ and maximum budget $r_{\max}$ of a single evaluation. The algorithm starts by sampling $n$ random configurations and allocating the minimum budget $r_{\min}$ to them. The configurations are evaluated and the worst-performing ones are discarded, so that only a fraction of $1/\eta$ remains. The remaining configurations are promoted to the next stage and evaluated on a larger budget. This continues until one or more configurations are evaluated on the maximum budget $r_{\max}$ and the best performing configuration is selected. The number of stages is calculated so that each stage consumes approximately the same budget. This sometimes results in the minimum budget having to be slightly adjusted by the algorithm. Successive Halving has the disadvantage that it is not clear whether we should choose a large $n$ and try many configurations on a small budget or choose a small $n$ and train more configurations on the full budget.

Hyperband solves this problem by running Successive Halving with different numbers of starting configurations. The algorithm is initialized with the same parameters as Successive Halving but without $n$. Each run of Successive Halving is called a bracket and starts with a different budget. A smaller starting budget means that more configurations can be tried out. The most explorative bracket allocates the minimum budget $r_{\min}$. The next bracket increases the starting budget by a factor of $\eta$. In each bracket, the starting budget increases further until the last bracket essentially performs a random search with the full budget $r_{\max}$. The number of brackets is $s_{\max} + 1$ with $s_{\max} = \lfloor \log_{\eta}(r_{\max} / r_{\min}) \rfloor$. Since the budget increases by the factor $\eta$ with each bracket, $r_{\min}$ sometimes has to be adjusted slightly in order not to use more than $r_{\max}$ resources in the last bracket. The number of configurations in the base stages is calculated so that each bracket uses approximately the same amount of budget. A full run of Hyperband thus covers the whole range: the most explorative bracket starts many configurations on the minimum budget, while the last bracket performs a random search on the full budget.
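As a small worked example of the bracket calculation, we can compute the number of brackets for the budget range used in the XGBoost example below ($r_{\min} = 16$, $r_{\max} = 128$, $\eta = 2$).

```
# number of brackets: floor(log(r_max / r_min, base = eta)) + 1
eta = 2
r_min = 16
r_max = 128
s_max = floor(log(r_max / r_min, base = eta))
s_max + 1 # 4 brackets, numbered s = 3, 2, 1, 0
```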

The Hyperband implementation in mlr3hyperband evaluates configurations with the same budget in parallel. This results in all brackets finishing at approximately the same time. The colors in Figure 1 indicate batches that are evaluated in parallel.

In this practical example, we will optimize the hyperparameters of XGBoost on the `Spam` data set. We begin by loading the `XGBoost learner`.

```
library("mlr3verse")
learner = lrn("classif.xgboost")
```

The next thing we do is define the search space. The `nrounds` parameter controls the number of boosting iterations. We set a range from 16 to 128 boosting iterations. This range is used as the minimum budget $r_{\min}$ and the maximum budget $r_{\max}$ by the Hyperband algorithm. We need to tag the parameter with `"budget"` to identify it as a fidelity parameter. For the other hyperparameters, we take the search space for XGBoost from the Bischl et al. (2021) article. This search space works for a wide range of data sets.

```
learner$param_set$set_values(
nrounds = to_tune(p_int(16, 128, tags = "budget")),
eta = to_tune(1e-4, 1, logscale = TRUE),
max_depth = to_tune(1, 20),
colsample_bytree = to_tune(1e-1, 1),
colsample_bylevel = to_tune(1e-1, 1),
lambda = to_tune(1e-3, 1e3, logscale = TRUE),
alpha = to_tune(1e-3, 1e3, logscale = TRUE),
subsample = to_tune(1e-1, 1)
)
```

We construct the tuning instance. We use the `"none"` terminator because Hyperband terminates itself when all brackets are evaluated.

```
instance = ti(
task = tsk("spam"),
learner = learner,
resampling = rsmp("holdout"),
measures = msr("classif.ce"),
terminator = trm("none")
)
instance
```

```
<TuningInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveTuning:classif.xgboost_on_spam>
* Search Space:
id class lower upper nlevels
1: nrounds ParamInt 16.000000 128.000000 113
2: eta ParamDbl -9.210340 0.000000 Inf
3: max_depth ParamInt 1.000000 20.000000 20
4: colsample_bytree ParamDbl 0.100000 1.000000 Inf
5: colsample_bylevel ParamDbl 0.100000 1.000000 Inf
6: lambda ParamDbl -6.907755 6.907755 Inf
7: alpha ParamDbl -6.907755 6.907755 Inf
8: subsample ParamDbl 0.100000 1.000000 Inf
* Terminator: <TerminatorNone>
```

We load the `Hyperband tuner` and set `eta = 2`. Hyperband can start from the beginning when the last bracket is evaluated. We control the number of Hyperband runs with the `repetitions` argument. The setting `repetitions = Inf` is useful when a terminator should stop the optimization.

```
library("mlr3hyperband")
tuner = tnr("hyperband", eta = 2, repetitions = 1)
```

The Hyperband implementation in mlr3hyperband evaluates configurations with the same budget in parallel. This results in all brackets finishing at approximately the same time. You can think of it as going diagonally through Figure 1. Using `eta = 2` and a range from 16 to 128 boosting iterations results in the following schedule.

Now we are ready to start the tuning.

`tuner$optimize(instance)`

The result of a run is the configuration with the best performance. This does not necessarily have to be a configuration evaluated with the highest budget since we can overfit the data with too many boosting iterations.

`instance$result[, .(nrounds, eta, max_depth, colsample_bytree, colsample_bylevel, lambda, alpha, subsample)]`

```
nrounds eta max_depth colsample_bytree colsample_bylevel lambda alpha subsample
1: 128 -0.4334209 20 0.1574264 0.2886485 -1.333902 -3.394965 0.764349
```

The archive of a Hyperband run has the additional columns `"bracket"` and `"stage"`.

`as.data.table(instance$archive)[, .(bracket, stage, classif.ce, eta, max_depth, colsample_bytree)]`

```
bracket stage classif.ce eta max_depth colsample_bytree
1: 3 0 0.06518905 -7.0150617 9 0.2885488
2: 3 0 0.23859192 -2.5492834 17 0.2052036
3: 3 0 0.33898305 -9.1773946 6 0.3447989
4: 3 0 0.07692308 -2.1745616 12 0.1800334
5: 3 0 0.28617992 -6.5822516 1 0.5811652
---
31: 0 0 0.51760104 -5.2073558 4 0.3352148
32: 3 3 0.05019557 -0.8928307 6 0.4666395
33: 2 2 0.05541069 -8.0585126 15 0.6845195
34: 1 1 0.08018253 -8.3130313 7 0.7767661
35: 1 1 0.08279009 -5.5293970 7 0.5714851
```

The handling of Hyperband in mlr3tuning is very similar to that of other tuners. We only have to select an additional fidelity parameter and tag it with `"budget"`. We have tried to keep the runtime of the example low. For your optimization, you should use cross-validation and increase the maximum number of boosting rounds. The Bischl et al. (2021) search space suggests 5000 boosting rounds. Check out our next post on Hyperband which uses the size of the training data set as the fidelity parameter.

Jamieson, Kevin, and Ameet Talwalkar. 2016. “Non-Stochastic Best Arm Identification and Hyperparameter Optimization.” In *Proceedings of the 19th International Conference on Artificial Intelligence and Statistics*, edited by Arthur Gretton and Christian C. Robert, 51:240–48. Proceedings of Machine Learning Research. Cadiz, Spain: PMLR. http://proceedings.mlr.press/v51/jamieson16.html.

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” *Journal of Machine Learning Research* 18 (185): 1–52. https://jmlr.org/papers/v18/16-558.html.

We showcase the visualization functions of the mlr3 ecosystem. The mlr3viz package creates a plot for almost all mlr3 objects. This post displays all available plots with their reproducible code. We start with plots of the base mlr3 objects. This includes boxplots of tasks, dendrograms of cluster learners and ROC curves of predictions. After that, we tune a classification tree and visualize the results. Finally, we show visualizations for filters.

The mlr3viz package defines `autoplot()` functions to draw plots with ggplot2. Often there is more than one type of plot for an object. You can change the plot with the `type` argument. The help pages list all possible choices. The easiest way to access the help pages is via the pkgdown website. The plots use the viridis color palette and the appearance is controlled with the `theme` argument. By default, the `minimal theme` is applied.

We begin with plots of the classification task `Palmer Penguins`. We plot the class frequency of the target variable.

```
library(mlr3verse)
library(mlr3viz)
task = tsk("penguins")
task$select(c("body_mass", "bill_length"))
autoplot(task, type = "target")
```
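The appearance can be changed via the `theme` argument; a small sketch, reusing the plot type from above:

```
# draw the same plot with a different ggplot2 theme
autoplot(task, type = "target", theme = ggplot2::theme_bw())
```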

The `"duo"` plot shows the distribution of multiple features.

`autoplot(task, type = "duo")`

The `"pairs"` plot shows the pairwise comparison of multiple features. The classes of the target variable are shown in different colors.

`autoplot(task, type = "pairs")`

Next, we plot the regression task `mtcars`. We create a boxplot of the target variable.

```
task = tsk("mtcars")
task$select(c("am", "carb"))
autoplot(task, type = "target")
```

The `"pairs"` plot shows the pairwise comparison of multiple features and the target variable.

`autoplot(task, type = "pairs")`

Finally, we plot the cluster task `US Arrests`. The `"pairs"` plot shows the pairwise comparison of multiple features.

```
library(mlr3cluster)
task = mlr_tasks$get("usarrests")
autoplot(task, type = "pairs")
```

The `classification` and `regression` GLMNet learners are equipped with a plot function.

```
library(mlr3data)
task = tsk("ilpd")
task$select(setdiff(task$feature_names, "gender"))
learner = lrn("classif.glmnet")
learner$train(task)
autoplot(learner)
```

```
task = tsk("mtcars")
learner = lrn("regr.glmnet")
learner$train(task)
autoplot(learner)
```

We plot a `classification tree` of the rpart package. We have to fit the learner with `keep_model = TRUE` to keep the model object.

```
task = tsk("penguins")
learner = lrn("classif.rpart", keep_model = TRUE)
learner$train(task)
autoplot(learner)
```

We can also plot regression trees.

```
task = tsk("mtcars")
learner = lrn("regr.rpart", keep_model = TRUE)
learner$train(task)
autoplot(learner)
```

The `"dend"` plot shows the result of the hierarchical clustering of the data.

```
library(mlr3cluster)
task = tsk("usarrests")
learner = lrn("clust.hclust")
learner$train(task)
autoplot(learner, type = "dend", task = task)
```

The `"scree"`

type plots the number of clusters and the height.

`autoplot(learner, type = "scree")`

We plot the predictions of a classification learner. The `"stacked"` plot shows the predicted and true class labels.

```
task = tsk("spam")
learner = lrn("classif.rpart", predict_type = "prob")
pred = learner$train(task)$predict(task)
autoplot(pred, type = "stacked")
```

The ROC curve plots the true positive rate against the false positive rate at different thresholds.

`autoplot(pred, type = "roc")`

The precision-recall curve plots the precision against the recall at different thresholds.

`autoplot(pred, type = "prc")`

The `"threshold"`

plot varies the threshold of a binary classification and plots against the resulting performance.

`autoplot(pred, type = "threshold")`

The predictions of a regression learner are often presented as a scatterplot of truth and predicted response.

```
task = tsk("boston_housing")
learner = lrn("regr.rpart")
pred = learner$train(task)$predict(task)
autoplot(pred, type = "xy")
```

Additionally, we plot the response with the residuals.

`autoplot(pred, type = "residual")`

We can also plot the distribution of the residuals.

`autoplot(pred, type = "histogram")`

The predictions of a cluster learner are often presented as a scatterplot of the data points colored by the cluster.

```
library(mlr3cluster)
task = tsk("usarrests")
learner = lrn("clust.kmeans", centers = 3)
pred = learner$train(task)$predict(task)
autoplot(pred, task, type = "scatter")
```

The `"sil"`

plot shows the silhouette width of the clusters. The dashed line is the mean silhouette width.

`autoplot(pred, task, type = "sil")`

The `"pca"`

plot shows the first two principal components of the data colored by the cluster.

`autoplot(pred, task, type = "pca")`

The `"boxplot"`

shows the distribution of the performance measures.

```
task = tsk("sonar")
learner = lrn("classif.rpart", predict_type = "prob")
resampling = rsmp("cv")
rr = resample(task, learner, resampling)
autoplot(rr, type = "boxplot")
```

We can also plot the distribution of the performance measures as a `"histogram"`.

`autoplot(rr, type = "histogram")`

The ROC curve plots the true positive rate against the false positive rate at different thresholds.

`autoplot(rr, type = "roc")`

The precision-recall curve plots the precision against the recall at different thresholds.

`autoplot(rr, type = "prc")`

The `"prediction"`

plot shows two features and the predicted class in the background. Points mark the observations of the test set and the color presents the truth.

```
task = tsk("pima")
task$filter(seq(100))
task$select(c("age", "glucose"))
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction")
```

Alternatively, we can plot class probabilities.

```
task = tsk("pima")
task$filter(seq(100))
task$select(c("age", "glucose"))
learner = lrn("classif.rpart", predict_type = "prob")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction")
```

In addition to the test set, we can also plot the train set.

```
task = tsk("pima")
task$filter(seq(100))
task$select(c("age", "glucose"))
learner = lrn("classif.rpart", predict_type = "prob", predict_sets = c("train", "test"))
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction", predict_sets = c("train", "test"))
```

The `"prediction"`

plot can also show categorical features.

```
task = tsk("german_credit")
task$filter(seq(100))
task$select(c("housing", "employment_duration"))
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction")
```

The `"prediction"` plot shows one feature and the response. Points mark the observations of the test set.

```
task = tsk("boston_housing")
task$select("age")
task$filter(seq(100))
learner = lrn("regr.rpart")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction")
```

Additionally, we can add confidence bounds.

```
task = tsk("boston_housing")
task$select("age")
task$filter(seq(100))
learner = lrn("regr.lm", predict_type = "se")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction")
```

And add the train set.

```
task = tsk("boston_housing")
task$select("age")
task$filter(seq(100))
learner = lrn("regr.lm", predict_type = "se", predict_sets = c("train", "test"))
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction", predict_sets = c("train", "test"))
```

We can also add the prediction surface to the background.

```
task = tsk("boston_housing")
task$select(c("age", "rm"))
task$filter(seq(100))
learner = lrn("regr.rpart")
resampling = rsmp("cv", folds = 3)
rr = resample(task, learner, resampling, store_models = TRUE)
autoplot(rr, type = "prediction")
```

We show the performance distribution of a benchmark with multiple tasks.

```
tasks = tsks(c("pima", "sonar"))
learner = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"), predict_type = "prob")
resampling = rsmps("cv")
bmr = benchmark(benchmark_grid(tasks, learner, resampling))
autoplot(bmr, type = "boxplot")
```

We plot a benchmark result with one task and multiple learners.

```
tasks = tsk("pima")
learner = lrns(c("classif.featureless", "classif.rpart", "classif.xgboost"), predict_type = "prob")
resampling = rsmps("cv")
bmr = benchmark(benchmark_grid(tasks, learner, resampling))
```

We plot a ROC curve for each learner.

`autoplot(bmr, type = "roc")`

Alternatively, we can plot precision-recall curves.

`autoplot(bmr, type = "prc")`

We tune the hyperparameters of a decision tree on the sonar task. The `"performance"` plot shows the performance over batches.

```
library(mlr3tuning)
library(mlr3tuningspaces)
library(mlr3learners)
instance = tune(
method = tnr("gensa"),
task = tsk("sonar"),
learner = lts(lrn("classif.rpart")),
resampling = rsmp("holdout"),
measures = msr("classif.ce"),
term_evals = 100
)
autoplot(instance, type = "performance")
```

The `"parameter"`

plot shows the performance for each hyperparameter setting.

`autoplot(instance, type = "parameter", cols_x = c("cp", "minsplit"))`

The `"marginal"`

plot shows the performance of different hyperparameter values. The color indicates the batch.

`autoplot(instance, type = "marginal", cols_x = "cp")`

The `"parallel"`

plot visualizes the relationship of hyperparameters.

`autoplot(instance, type = "parallel")`

We plot `cp` against `minsplit` and color the points by the performance.

`autoplot(instance, type = "points", cols_x = c("cp", "minsplit"))`

Next, we plot all hyperparameters against each other.

`autoplot(instance, type = "pairs")`

We plot the performance surface of two hyperparameters. The surface is interpolated with a learner.

`autoplot(instance, type = "surface", cols_x = c("cp", "minsplit"), learner = mlr3::lrn("regr.ranger"))`

We plot filter scores for the mtcars task.

```
library(mlr3filters)
task = tsk("mtcars")
f = flt("correlation")
f$calculate(task)
autoplot(f, n = 5)
```

The mlr3viz package brings together the visualization functions of the mlr3 ecosystem. All plots are drawn with the `autoplot()` function and the appearance can be customized with the `theme` argument. If you need to customize a plot extensively, e.g. for a publication, we encourage you to check our code on GitHub. The code should be easily adaptable to your needs. We are also looking forward to new visualizations. You can suggest new plots in an issue on GitHub.

In this post, we optimize the hyperparameters of a simple `classification tree` on the `Palmer Penguins` data set with only a few lines of code.

First, we introduce tuning spaces and show the importance of transformation functions. Next, we execute the tuning and present the basic building blocks of tuning in mlr3. Finally, we fit a classification tree with optimized hyperparameters on the full data set.

We load the mlr3verse package which pulls the most important packages for this example. Among other packages, it loads the hyperparameter optimization package of the mlr3 ecosystem mlr3tuning.

`library(mlr3verse)`

In this example, we use the `Palmer Penguins` data set which classifies 344 penguins into three species. The data set was collected from 3 islands in the Palmer Archipelago in Antarctica. It includes the name of the island, size measurements (flipper length, body mass, and bill dimensions), and the sex of the penguin.

`tsk("penguins")`

```
<TaskClassif:penguins> (344 x 8): Palmer Penguins
* Target: species
* Properties: multiclass
* Features (7):
- int (3): body_mass, flipper_length, year
- dbl (2): bill_depth, bill_length
- fct (2): island, sex
```

```
library(palmerpenguins)
library(ggplot2)
ggplot(data = penguins, aes(x = flipper_length_mm, y = bill_length_mm)) +
geom_point(aes(color = species, shape = species), size = 3, alpha = 0.8) +
geom_smooth(method = "lm", se = FALSE, aes(color = species)) +
theme_minimal() +
scale_color_manual(values = c("darkorange","purple","cyan4")) +
labs(x = "Flipper length (mm)", y = "Bill length (mm)", color = "Penguin species", shape = "Penguin species") +
theme(
legend.position = c(0.85, 0.15),
legend.background = element_rect(fill = "white", color = NA),
text = element_text(size = 10))
```

We use the `rpart classification tree`. A learner stores all information about its hyperparameters in the slot `$param_set`. Not all parameters are tunable. We have to choose a subset of the hyperparameters we want to tune.

```
learner = lrn("classif.rpart")
as.data.table(learner$param_set)[, list(id, class, lower, upper, nlevels)]
```

```
id class lower upper nlevels
1: cp ParamDbl 0 1 Inf
2: keep_model ParamLgl NA NA 2
3: maxcompete ParamInt 0 Inf Inf
4: maxdepth ParamInt 1 30 30
5: maxsurrogate ParamInt 0 Inf Inf
6: minbucket ParamInt 1 Inf Inf
7: minsplit ParamInt 1 Inf Inf
8: surrogatestyle ParamInt 0 1 2
9: usesurrogate ParamInt 0 2 3
10: xval ParamInt 0 Inf Inf
```

The package mlr3tuningspaces is a collection of search spaces for hyperparameter tuning from peer-reviewed articles. We use the search space from the Bischl et al. (2021) article.

`lts("classif.rpart.default")`

```
<TuningSpace:classif.rpart.default>: Classification Rpart with Default
id lower upper levels logscale
1: minsplit 2e+00 128.0 TRUE
2: minbucket 1e+00 64.0 TRUE
3: cp 1e-04 0.1 TRUE
```

The classification tree is mainly influenced by three hyperparameters:

- The complexity hyperparameter `cp` that controls when the learner considers introducing another branch.
- The `minsplit` hyperparameter that controls how many observations must be present in a node for another split to be attempted.
- The `minbucket` hyperparameter that sets the minimum number of observations in any terminal node.
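
For illustration, the same bounds could also be written out manually with `to_tune()`; this is a sketch based on the tuning space printed above, not code from the original post:

```
learner = lrn("classif.rpart",
  minsplit  = to_tune(p_int(2, 128, logscale = TRUE)),
  minbucket = to_tune(p_int(1, 64, logscale = TRUE)),
  cp        = to_tune(p_dbl(1e-04, 0.1, logscale = TRUE))
)
```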

We augment the learner with the search space in one go.

`lts(lrn("classif.rpart"))`

```
<LearnerClassifRpart:classif.rpart>: Classification Tree
* Model: -
* Parameters: xval=0, minsplit=<RangeTuneToken>, minbucket=<RangeTuneToken>, cp=<RangeTuneToken>
* Packages: mlr3, rpart
* Predict Types: [response], prob
* Feature Types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features, twoclass, weights
```

The column `logscale` indicates that the hyperparameters are tuned on the logarithmic scale. The tuning algorithm proposes hyperparameter values that are transformed with the exponential function before they are passed to the learner. For example, the `cp` parameter is bounded between 0 and 1. The tuning algorithm searches between `log(1e-04)` and `log(1e-01)` but the learner gets the transformed values between `1e-04` and `1e-01`. The log transformation puts more emphasis on smaller `cp` values while still allowing large values to be proposed.
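
A minimal base R sketch of this mapping (the variable names are ours, not part of the mlr3 API):

```
cp_log = runif(1, min = log(1e-04), max = log(1e-01))  # value proposed by the tuner
cp_learner = exp(cp_log)                               # transformed value passed to rpart
```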

`lts("classif.rpart.default")`

```
<TuningSpace:classif.rpart.default>: Classification Rpart with Default
id lower upper levels logscale
1: minsplit 2e+00 128.0 TRUE
2: minbucket 1e+00 64.0 TRUE
3: cp 1e-04 0.1 TRUE
```

The `tune()` function controls and executes the tuning. The `method` argument sets the optimization algorithm. The mlr3 ecosystem offers various optimization algorithms e.g. `Random Search`, `GenSA`, and `Hyperband`. In this example, we will use a simple grid search with a grid resolution of 5. Our three-dimensional grid consists of 5^3 = 125 hyperparameter configurations. The `resampling strategy` and `performance measure` specify how the performance of a model is evaluated. We choose a `3-fold cross-validation` and use the `classification error`.

```
instance = tune(
method = "grid_search",
task = tsk("penguins"),
learner = lts(lrn("classif.rpart")),
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
resolution = 5
)
```

The `tune()` function returns a tuning instance that includes an archive with all evaluated hyperparameter configurations.

`as.data.table(instance$archive)[, list(minsplit, minbucket, cp, classif.ce, resample_result)]`

```
minsplit minbucket cp classif.ce resample_result
1: 0.6931472 3.130790 -4.029524 0.05519959 <ResampleResult[21]>
2: 1.7348135 2.087194 -2.302585 0.05519959 <ResampleResult[21]>
3: 2.7764798 4.174387 -5.756463 0.14233410 <ResampleResult[21]>
4: 1.7348135 4.174387 -5.756463 0.14233410 <ResampleResult[21]>
5: 1.7348135 1.043597 -5.756463 0.03773201 <ResampleResult[21]>
---
121: 4.8598124 1.043597 -2.302585 0.05519959 <ResampleResult[21]>
122: 0.6931472 2.087194 -2.302585 0.05519959 <ResampleResult[21]>
123: 1.7348135 3.130790 -2.302585 0.05519959 <ResampleResult[21]>
124: 0.6931472 0.000000 -7.483402 0.02903636 <ResampleResult[21]>
125: 3.8181461 1.043597 -9.210340 0.04645309 <ResampleResult[21]>
```

The best configuration and the corresponding measured performance can be retrieved from the tuning instance.

`instance$result`

```
minsplit minbucket cp learner_param_vals x_domain classif.ce
1: 0.6931472 0 -5.756463 <list[4]> <list[3]> 0.02903636
```

The `$result_learner_param_vals` field contains the best hyperparameter setting on the learner scale.

`instance$result_learner_param_vals`

```
$xval
[1] 0
$minsplit
[1] 2
$minbucket
[1] 1
$cp
[1] 0.003162278
```

The learner we use to make predictions on new data is called the final model. We add the optimized hyperparameters to the learner and train it on the full data set.

```
learner = lrn("classif.rpart")
learner$param_set$values = instance$result_learner_param_vals
learner$train(tsk("penguins"))
```

The trained model can now be used to predict new, external data.

Horst, Allison. 2022. “Palmer Penguins Artwork and Figures.” https://github.com/allisonhorst.

In this post, we use early stopping to reduce overfitting when training an `XGBoost model`. We start with a short recap on early stopping and overfitting. After that, we use the early stopping mechanism of XGBoost and train a model on the `Spam Classification` data set. Finally, we show how to simultaneously tune hyperparameters and use early stopping. The reader should be familiar with tuning in the mlr3 ecosystem.

Early stopping is a technique used to reduce overfitting when fitting a model in an iterative process. Overfitting occurs when a model fits increasingly to the training data but the performance on unseen data decreases. This means the model’s training error decreases, while its test performance deteriorates. When using early stopping, the performance is monitored on a test set, and the training stops when the performance has not improved for a specified number of iterations.

We initialize the random number generator with a fixed seed for reproducibility. The mlr3verse package provides all functions required for this example.

```
set.seed(7832)
library(mlr3verse)
```

When training an XGBoost model, we can use early stopping to find the optimal number of boosting rounds. The `partition()` function splits the observations of the task into two disjoint sets. We use 80% of the observations to train the model and the remaining 20% as the test set to monitor the performance.

```
task = tsk("spam")
split = partition(task, ratio = 0.8)
task$set_row_roles(split$test, "test")
```

The `early_stopping_set` parameter controls which set is used to monitor the performance. Additionally, we need to define the number of rounds within which the performance must improve with `early_stopping_rounds` and the maximum number of boosting rounds with `nrounds`. In this example, the training is stopped when the classification error has not decreased for 100 rounds or when 1000 rounds are reached.

```
learner = lrn("classif.xgboost",
nrounds = 1000,
early_stopping_rounds = 100,
early_stopping_set = "test",
eval_metric = "error"
)
```

We train the learner with early stopping.

`learner$train(task)`

The `$evaluation_log` of the model stores the performance scores on the training and test set. Figure 1 shows that the classification error on the training set decreases, whereas the error on the test set increases after 20 rounds.

```
library(ggplot2)
library(data.table)
data = melt(
learner$model$evaluation_log,
id.vars = "iter",
variable.name = "set",
value.name = "error"
)
ggplot(data, aes(x = iter, y = error, group = set)) +
geom_line(aes(color = set)) +
geom_vline(aes(xintercept = learner$model$best_iteration), color = "grey") +
scale_colour_manual(values=c("#f8766d", "#00b0f6"), labels = c("Train", "Test")) +
labs(x = "Rounds", y = "Classification Error", color = "Set") +
theme_minimal()
```

The slot `$best_iteration` contains the optimal number of boosting rounds.

`learner$model$best_iteration`

`[1] 20`

Note that `learner$predict()` will use the model from the last iteration, not the best one. See the next section on how to fit a model with the optimal number of boosting rounds and hyperparameter configuration.
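
If we only needed the early stopping result from above (without tuning), a minimal sketch, based on our own assumption rather than the original post, would be to refit with `nrounds` set to the best iteration:

```
# Refit with the optimal number of rounds so that predictions use the best model.
learner_final = lrn("classif.xgboost",
  nrounds = learner$model$best_iteration,
  eval_metric = "error"
)
learner_final$train(task)
```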

In this section, we want to tune the hyperparameters of an XGBoost model and find the optimal number of boosting rounds in one go. For this, we need the `early stopping callback` which handles early stopping during the tuning process. The performance of a hyperparameter configuration is evaluated with a resampling strategy while tuning, e.g. 3-fold cross-validation. In each resampling iteration, a new XGBoost model is trained and early stopping is used to find the optimal number of boosting rounds. This results in three different optimal numbers of boosting rounds for one hyperparameter configuration when applying 3-fold cross-validation. The callback picks the maximum of the three values and writes it to the archive. It uses the maximum value because the final model is fitted on the complete data set. Now let’s start with a practical example.

First, we load the XGBoost learner and set the early stopping parameters.

```
learner = lrn("classif.xgboost",
nrounds = 1000,
early_stopping_rounds = 100,
early_stopping_set = "test"
)
```

Next, we load a predefined tuning space from the mlr3tuningspaces package. The tuning space includes the most commonly tuned parameters of XGBoost.

```
tuning_space = lts("classif.xgboost.default")
as.data.table(tuning_space)
```

```
id lower upper logscale
1: eta 1e-04 1 TRUE
2: nrounds 1e+00 5000 FALSE
3: max_depth 1e+00 20 FALSE
4: colsample_bytree 1e-01 1 FALSE
5: colsample_bylevel 1e-01 1 FALSE
6: lambda 1e-03 1000 TRUE
7: alpha 1e-03 1000 TRUE
8: subsample 1e-01 1 FALSE
```

We augment the learner with the tuning space.

`learner = lts(learner)`

The default tuning space contains the `nrounds` hyperparameter. We have to overwrite it with an upper bound for early stopping.

`learner$param_set$set_values(nrounds = 1000)`

We run a small batch of random hyperparameter configurations.

```
instance = tune(
method = "random_search",
task = task,
learner = learner,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 4,
batch_size = 2,
callbacks = clbk("mlr3tuning.early_stopping")
)
```

We can see that the optimal number of boosting rounds (`max_nrounds`) strongly depends on the other hyperparameters.

`as.data.table(instance$archive)[, list(batch_nr, max_nrounds, eta, max_depth, colsample_bytree, colsample_bylevel, lambda, alpha, subsample)]`

```
batch_nr max_nrounds eta max_depth colsample_bytree colsample_bylevel lambda alpha subsample
1: 1 273 -1.04605873 1 0.8918211 0.1841578 -6.2828642 -2.748495 0.6890264
2: 1 93 -0.01516098 19 0.5108089 0.2405859 -0.8666842 4.442711 0.5464676
3: 2 1000 -8.46723302 13 0.8662932 0.5460656 -5.7251541 -3.850319 0.2734089
4: 2 1000 -7.04702376 8 0.6054186 0.5921445 -4.8507050 -2.466443 0.5887968
```

In the best hyperparameter configuration, the value of `nrounds` is replaced by `max_nrounds` and early stopping is deactivated.

`instance$result_learner_param_vals`

```
$nrounds
[1] 273
$nthread
[1] 1
$verbose
[1] 0
$early_stopping_set
[1] "none"
$eta
[1] 0.3513197
$max_depth
[1] 1
$colsample_bytree
[1] 0.8918211
$colsample_bylevel
[1] 0.1841578
$lambda
[1] 0.001868043
$alpha
[1] 0.06402412
$subsample
[1] 0.6890264
```

Finally, fit the final model on the complete data set.

```
learner = lrn("classif.xgboost")
learner$param_set$values = instance$result_learner_param_vals
learner$train(task)
```

The trained model can now be used to make predictions on new data.

We can also use the `AutoTuner` to get a tuned XGBoost model. Note that early stopping is deactivated when the final model is fitted.
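
A minimal sketch of such an `AutoTuner`, mirroring the `tune()` call above (we assume the `callbacks` argument is also accepted here):

```
at = AutoTuner$new(
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 4),
  tuner = tnr("random_search"),
  callbacks = clbk("mlr3tuning.early_stopping")
)
at$train(task) # early stopping is deactivated for the final fit
```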

The package mlr3tuningspaces offers a selection of published search spaces for many popular machine learning algorithms. In this post, we show how to tune `mlr3 learners` with these search spaces.

The packages mlr3verse and mlr3tuningspaces are required for this demonstration:

```
library(mlr3verse)
library(mlr3tuningspaces)
```

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented.

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In this example, we use the `pima indian diabetes data set` which is used to predict whether or not a patient has diabetes. The patients are characterized by 8 numeric features, some of which have missing values.

```
# retrieve the task from mlr3
task = tsk("pima")
# generate a quick textual overview using the skimr package
skimr::skim(task$data())
```

|                        |             |
|------------------------|-------------|
| Name                   | task$data() |
| Number of rows         | 768         |
| Number of columns      | 9           |
| Key                    | NULL        |
| Column type frequency: |             |
| factor                 | 1           |
| numeric                | 8           |
| Group variables        | None        |

**Variable type: factor**

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| diabetes | 0 | 1 | FALSE | 2 | neg: 500, pos: 268 |

**Variable type: numeric**

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1.00 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
| glucose | 5 | 0.99 | 121.69 | 30.54 | 44.00 | 99.00 | 117.00 | 141.00 | 199.00 | ▁▇▇▃▂ |
| insulin | 374 | 0.51 | 155.55 | 118.78 | 14.00 | 76.25 | 125.00 | 190.00 | 846.00 | ▇▂▁▁▁ |
| mass | 11 | 0.99 | 32.46 | 6.92 | 18.20 | 27.50 | 32.30 | 36.60 | 67.10 | ▅▇▃▁▁ |
| pedigree | 0 | 1.00 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| pregnant | 0 | 1.00 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| pressure | 35 | 0.95 | 72.41 | 12.38 | 24.00 | 64.00 | 72.00 | 80.00 | 122.00 | ▁▃▇▂▁ |
| triceps | 227 | 0.70 | 29.15 | 10.48 | 7.00 | 22.00 | 29.00 | 36.00 | 99.00 | ▆▇▁▁▁ |

For tuning, it is important to create a search space that defines the type and range of the hyperparameters. A learner stores all information about its hyperparameters in the slot `$param_set`. Usually, we have to choose a subset of hyperparameters we want to tune.

`lrn("classif.rpart")$param_set`

```
<ParamSet>
id class lower upper nlevels default value
1: cp ParamDbl 0 1 Inf 0.01
2: keep_model ParamLgl NA NA 2 FALSE
3: maxcompete ParamInt 0 Inf Inf 4
4: maxdepth ParamInt 1 30 30 30
5: maxsurrogate ParamInt 0 Inf Inf 5
6: minbucket ParamInt 1 Inf Inf <NoDefault[3]>
7: minsplit ParamInt 1 Inf Inf 20
8: surrogatestyle ParamInt 0 1 2 0
9: usesurrogate ParamInt 0 2 3 2
10: xval ParamInt 0 Inf Inf 10 0
```

At the heart of mlr3tuningspaces is the R6 class `TuningSpace`. It stores a list of `TuneToken`, helper functions and additional meta information. The list of `TuneToken` can be directly applied to the `$values` slot of a learner’s `ParamSet`. The search spaces are stored in the `mlr_tuning_spaces` dictionary.

`as.data.table(mlr_tuning_spaces)`

```
key label learner n_values
1: classif.glmnet.default Classification GLM with Default classif.glmnet 2
2: classif.glmnet.rbv2 Classification GLM with RandomBot classif.glmnet 2
3: classif.kknn.default Classification KKNN with Default classif.kknn 3
4: classif.kknn.rbv2 Classification KKNN with RandomBot classif.kknn 1
5: classif.ranger.default Classification Ranger with Default classif.ranger 4
6: classif.ranger.rbv2 Classification Ranger with RandomBot classif.ranger 8
7: classif.rpart.default Classification Rpart with Default classif.rpart 3
8: classif.rpart.rbv2 Classification Rpart with RandomBot classif.rpart 4
9: classif.svm.default Classification SVM with Default classif.svm 4
10: classif.svm.rbv2 Classification SVM with RandomBot classif.svm 5
11: classif.xgboost.default Classification XGBoost with Default classif.xgboost 8
12: classif.xgboost.rbv2 Classification XGBoost with RandomBot classif.xgboost 13
13: regr.glmnet.default Regression GLM with Default regr.glmnet 2
14: regr.glmnet.rbv2 Regression GLM with RandomBot regr.glmnet 2
15: regr.kknn.default Regression KKNN with Default regr.kknn 3
16: regr.kknn.rbv2 Regression KKNN with RandomBot regr.kknn 1
17: regr.ranger.default Regression Ranger with Default regr.ranger 4
18: regr.ranger.rbv2 Regression Ranger with RandomBot regr.ranger 7
19: regr.rpart.default Regression Rpart with Default regr.rpart 3
20: regr.rpart.rbv2 Regression Rpart with RandomBot regr.rpart 4
21: regr.svm.default Regression SVM with Default regr.svm 4
22: regr.svm.rbv2 Regression SVM with RandomBot regr.svm 5
23: regr.xgboost.default Regression XGBoost with Default regr.xgboost 8
24: regr.xgboost.rbv2 Regression XGBoost with RandomBot regr.xgboost 13
key label learner n_values
```

We can use the sugar function `lts()` to retrieve a `TuningSpace`.

```
tuning_space_rpart = lts("classif.rpart.default")
tuning_space_rpart
```

```
<TuningSpace:classif.rpart.default>: Classification Rpart with Default
id lower upper levels logscale
1: minsplit 2e+00 128.0 TRUE
2: minbucket 1e+00 64.0 TRUE
3: cp 1e-04 0.1 TRUE
```

The `$values` slot contains the list of `TuneToken`.

`tuning_space_rpart$values`

```
$minsplit
Tuning over:
range [2, 128] (log scale)
$minbucket
Tuning over:
range [1, 64] (log scale)
$cp
Tuning over:
range [1e-04, 0.1] (log scale)
```

We apply the search space and tune the `learner`.

```
learner = lrn("classif.rpart")
learner$param_set$values = tuning_space_rpart$values
instance = tune(
method = "random_search",
task = tsk("pima"),
learner = learner,
resampling = rsmp ("holdout"),
measure = msr("classif.ce"),
term_evals = 10)
instance$result
```

```
minsplit minbucket cp learner_param_vals x_domain classif.ce
1: 1.377705 2.369973 -5.610915 <list[3]> <list[3]> 0.2265625
```

We can also get the `learner` with the search space already applied from the `TuningSpace`.

```
learner = tuning_space_rpart$get_learner()
print(learner$param_set)
```

```
<ParamSet>
id class lower upper nlevels default value
1: cp ParamDbl 0 1 Inf 0.01 <RangeTuneToken[2]>
2: keep_model ParamLgl NA NA 2 FALSE
3: maxcompete ParamInt 0 Inf Inf 4
4: maxdepth ParamInt 1 30 30 30
5: maxsurrogate ParamInt 0 Inf Inf 5
6: minbucket ParamInt 1 Inf Inf <NoDefault[3]> <RangeTuneToken[2]>
7: minsplit ParamInt 1 Inf Inf 20 <RangeTuneToken[2]>
8: surrogatestyle ParamInt 0 1 2 0
9: usesurrogate ParamInt 0 2 3 2
10: xval ParamInt 0 Inf Inf 10 0
```

This method also allows us to set constant parameters.

```
learner = tuning_space_rpart$get_learner(maxdepth = 15)
print(learner$param_set)
```

```
<ParamSet>
id class lower upper nlevels default value
1: cp ParamDbl 0 1 Inf 0.01 <RangeTuneToken[2]>
2: keep_model ParamLgl NA NA 2 FALSE
3: maxcompete ParamInt 0 Inf Inf 4
4: maxdepth ParamInt 1 30 30 30 15
5: maxsurrogate ParamInt 0 Inf Inf 5
6: minbucket ParamInt 1 Inf Inf <NoDefault[3]> <RangeTuneToken[2]>
7: minsplit ParamInt 1 Inf Inf 20 <RangeTuneToken[2]>
8: surrogatestyle ParamInt 0 1 2 0
9: usesurrogate ParamInt 0 2 3 2
10: xval ParamInt 0 Inf Inf 10 0
```

The `lts()` function sets the default search space directly on a `learner`.

```
learner = lts(lrn("classif.rpart", maxdepth = 15))
print(learner$param_set)
```

```
<ParamSet>
id class lower upper nlevels default value
1: cp ParamDbl 0 1 Inf 0.01 <RangeTuneToken[2]>
2: keep_model ParamLgl NA NA 2 FALSE
3: maxcompete ParamInt 0 Inf Inf 4
4: maxdepth ParamInt 1 30 30 30 15
5: maxsurrogate ParamInt 0 Inf Inf 5
6: minbucket ParamInt 1 Inf Inf <NoDefault[3]> <RangeTuneToken[2]>
7: minsplit ParamInt 1 Inf Inf 20 <RangeTuneToken[2]>
8: surrogatestyle ParamInt 0 1 2 0
9: usesurrogate ParamInt 0 2 3 2
10: xval ParamInt 0 Inf Inf 10 0
```

This is the fourth part of the practical tuning series. The other parts can be found here:

- Part I - Tune a Support Vector Machine
- Part II - Tune a Preprocessing Pipeline
- Part III - Build an Automated Machine Learning System

In this post, we teach how to run various jobs in mlr3 in parallel. The goal is to map *computational jobs* (e.g. evaluation of one configuration) to a pool of *workers* (usually physical CPU cores, sometimes remote computational nodes) to reduce the run time needed for tuning.

We load the mlr3verse package which pulls in the most important packages for this example. Additionally, make sure you have installed the packages future and future.apply.

`library(mlr3verse)`

We decrease the verbosity of the logger to keep the output clearly represented. The `lgr` package is used for logging in all mlr3 packages. The mlr3 logger prints the logging messages from the base package, whereas the bbotk logger is responsible for logging messages from the optimization packages (e.g. mlr3tuning).

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

The workers are specified by the parallel backend which orchestrates starting up, shutting down, and communication with the workers. On a single machine, `multisession` and `multicore` are common backends. The `multisession` backend spawns new background R processes. It is available on all platforms.

`future::plan("multisession")`

The `multicore` backend uses forked R processes which allows the workers to access R objects in shared memory. This reduces the overhead since R objects are only copied in memory if they are modified. Unfortunately, forking processes is not supported on Windows or when running R from within RStudio.

`future::plan("multicore")`

Both backends support the `workers` argument that specifies the number of used cores.
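
For example, four workers can be requested explicitly (the number is an arbitrary choice for illustration):

```
# Start four background R sessions as workers.
future::plan("multisession", workers = 4)
```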

Use this code if your code should run with the `multicore` backend when possible.

```
if (future::supportsMulticore()) {
future::plan(future::multicore)
} else {
future::plan(future::multisession)
}
```

The `resample()` and `benchmark()` functions in mlr3 can be executed in parallel. The parallelization is triggered by simply declaring a plan via `future::plan()`.

```
future::plan("multisession")
task = tsk("pima")
learner = lrn("classif.rpart") # classification tree
resampling = rsmp("cv", folds = 3)
resample(task, learner, resampling)
```

```
<ResampleResult> of 3 iterations
* Task: pima
* Learner: classif.rpart
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations
```

The 3-fold cross-validation gives us 3 jobs since each resampling iteration is executed in parallel.

The `benchmark()` function accepts a design of experiments as input where each experiment is defined as a combination of a task, a learner, and a resampling strategy. For each experiment, resampling is performed. The nested loop over experiments and resampling iterations is flattened so that all resampling iterations of all experiments can be executed in parallel.

```
future::plan("multisession")
tasks = list(tsk("pima"), tsk("iris"))
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
grid = benchmark_grid(tasks, learner, resampling)
benchmark(grid)
```

```
<BenchmarkResult> of 6 rows with 2 resampling runs
nr task_id learner_id resampling_id iters warnings errors
1 pima classif.rpart cv 3 0 0
2 iris classif.rpart cv 3 0 0
```

The 2 experiments and the 3-fold cross-validation result in 6 jobs which are executed in parallel.

The mlr3tuning package internally calls `benchmark()` during tuning. If the tuner is capable of suggesting multiple configurations per iteration (such as random search, grid search, or hyperband), these configurations represent individual experiments, and the loop flattening of `benchmark()` is triggered. For example, all resampling iterations of all hyperparameter configurations on a grid can be executed in parallel.

```
future::plan("multisession")
learner = lrn("classif.rpart")
learner$param_set$values$cp = to_tune(0.001, 0.1)
learner$param_set$values$minsplit = to_tune(1, 10)
instance = tune(
method = "random_search",
task = tsk("pima"),
learner = learner,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 10,
batch_size = 5 # random search suggests 5 configurations per batch
)
```

The batch size of 5 and the 3-fold cross-validation give us 15 jobs per batch. This is done twice because of the limit of 10 evaluations in total.

Nested resampling results in two nested resampling loops. For this, an `AutoTuner` is passed to `resample()` or `benchmark()`. We can choose different parallelization backends for the inner and outer resampling loop, respectively. We just have to pass a list of backends.

```
# Runs the outer loop in parallel and the inner loop sequentially
future::plan(list("multisession", "sequential"))
# Runs the outer loop sequentially and the inner loop in parallel
future::plan(list("sequential", "multisession"))
learner = lrn("classif.rpart")
learner$param_set$values$cp = to_tune(0.001, 0.1)
learner$param_set$values$minsplit = to_tune(1, 10)
rr = tune_nested(
method = "random_search",
task = tsk("pima"),
learner = learner,
inner_resampling = rsmp ("cv", folds = 3),
outer_resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 10,
batch_size = 5
)
```

While nesting real parallelization backends is often unintended and causes unnecessary overhead, it is useful in some distributed computing setups. It can be achieved with future by forcing a fixed number of workers for each loop.

```
# Runs both loops in parallel
future::plan(list(future::tweak("multisession", workers = 2),
future::tweak("multisession", workers = 4)))
```

This example would run on 8 cores (`= 2 * 4`) on the local machine.

The mlr3book includes a chapter on parallelization. The mlr3cheatsheets contain frequently used commands and workflows of mlr3.

This is the third part of the practical tuning series. The other parts can be found here:

- Part I - Tune a Support Vector Machine
- Part II - Tune a Preprocessing Pipeline
- Part IV - Tuning and Parallel Processing

In this post, we implement a simple automated machine learning (AutoML) system which includes preprocessing, a switch between multiple learners and hyperparameter tuning. For this, we build a pipeline with the mlr3pipelines extension package. Additionally, we use nested resampling to get an unbiased performance estimate of our AutoML system.

We load the mlr3verse package which pulls in the most important packages for this example.

`library(mlr3verse)`

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented. The `lgr` package is used for logging in all mlr3 packages. The mlr3 logger prints the logging messages from the base package, whereas the bbotk logger is responsible for logging messages from the optimization packages (e.g. mlr3tuning).

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In this example, we use the Pima Indians Diabetes data set which is used to predict whether or not a patient has diabetes. The patients are characterized by 8 numeric features, some of which have missing values.

`task = tsk("pima")`

We use three popular machine learning algorithms: k-nearest-neighbors, support vector machines and random forests.

```
learners = list(
lrn("classif.kknn", id = "kknn"),
lrn("classif.svm", id = "svm", type = "C-classification"),
lrn("classif.ranger", id = "ranger")
)
```

The `PipeOpBranch` allows us to specify multiple alternative paths. In this graph, the paths lead to the different learner models. The `selection` hyperparameter controls which path is executed, i.e., which learner is used to fit a model. It is important to use `PipeOpUnbranch` after the branching so that the outputs are merged into one result object. We visualize the graph with branching below.

```
graph =
po("branch", options = c("kknn", "svm", "ranger")) %>>%
gunion(lapply(learners, po)) %>>%
po("unbranch")
graph$plot(html = FALSE)
```

Alternatively, we can use the `ppl()` shortcut to load a predefined graph from the `mlr_graphs` dictionary. For this, the learner list must be named.

```
learners = list(
kknn = lrn("classif.kknn", id = "kknn"),
svm = lrn("classif.svm", id = "svm", type = "C-classification"),
ranger = lrn("classif.ranger", id = "ranger")
)
graph = ppl("branch", lapply(learners, po))
```

The task has missing data in five columns.

`round(task$missings() / task$nrow, 2)`

```
diabetes age glucose insulin mass pedigree pregnant pressure triceps
0.00 0.00 0.01 0.49 0.01 0.00 0.00 0.05 0.30
```

The pipeline `"robustify"`

function creates a preprocessing pipeline based on our task. The resulting pipeline imputes missing values with `PipeOpImputeHist`

and creates a dummy column (`PipeOpMissInd`

) which indicates the imputed missing values. Internally, this creates two paths and the results are combined with `PipeOpFeatureUnion`

. In contrast to `PipeOpBranch`

, both paths are executed. Additionally, `"robustify"`

adds `PipeOpEncode`

to encode factor columns and `PipeOpRemoveConstants`

to remove features with a constant value.

```
graph = ppl("robustify", task = task, factors_to_numeric = TRUE) %>>%
graph
plot(graph, html = FALSE)
```

We could also create the preprocessing pipeline manually.

```
gunion(list(po("imputehist"),
po("missind", affect_columns = selector_type(c("numeric", "integer"))))) %>>%
po("featureunion") %>>%
po("encode") %>>%
po("removeconstants")
```

```
Graph with 5 PipeOps:
ID State sccssors prdcssors
imputehist <<UNTRAINED>> featureunion
missind <<UNTRAINED>> featureunion
featureunion <<UNTRAINED>> encode imputehist,missind
encode <<UNTRAINED>> removeconstants featureunion
removeconstants <<UNTRAINED>> encode
```

We use `as_learner()` to create a `GraphLearner` which encapsulates the pipeline and can be used like a learner.

`graph_learner = as_learner(graph)`

The parameter set of the graph learner includes all hyperparameters from all contained learners. The hyperparameter ids are prefixed with the corresponding learner ids. The hyperparameter `branch.selection` controls which learner is used.

`as.data.table(graph_learner$param_set)[, .(id, class, lower, upper, nlevels)]`

```
id class lower upper nlevels
1: removeconstants_prerobustify.ratio ParamDbl 0 1 Inf
2: removeconstants_prerobustify.rel_tol ParamDbl 0 Inf Inf
3: removeconstants_prerobustify.abs_tol ParamDbl 0 Inf Inf
4: removeconstants_prerobustify.na_ignore ParamLgl NA NA 2
5: removeconstants_prerobustify.affect_columns ParamUty NA NA Inf
6: imputehist.affect_columns ParamUty NA NA Inf
7: missind.which ParamFct NA NA 2
8: missind.type ParamFct NA NA 4
9: missind.affect_columns ParamUty NA NA Inf
10: imputesample.affect_columns ParamUty NA NA Inf
11: encode.method ParamFct NA NA 5
12: encode.affect_columns ParamUty NA NA Inf
13: removeconstants_postrobustify.ratio ParamDbl 0 1 Inf
14: removeconstants_postrobustify.rel_tol ParamDbl 0 Inf Inf
15: removeconstants_postrobustify.abs_tol ParamDbl 0 Inf Inf
16: removeconstants_postrobustify.na_ignore ParamLgl NA NA 2
17: removeconstants_postrobustify.affect_columns ParamUty NA NA Inf
18: kknn.k ParamInt 1 Inf Inf
19: kknn.distance ParamDbl 0 Inf Inf
20: kknn.kernel ParamFct NA NA 10
21: kknn.scale ParamLgl NA NA 2
22: kknn.ykernel ParamUty NA NA Inf
23: kknn.store_model ParamLgl NA NA 2
24: svm.cachesize ParamDbl -Inf Inf Inf
25: svm.class.weights ParamUty NA NA Inf
26: svm.coef0 ParamDbl -Inf Inf Inf
27: svm.cost ParamDbl 0 Inf Inf
28: svm.cross ParamInt 0 Inf Inf
29: svm.decision.values ParamLgl NA NA 2
30: svm.degree ParamInt 1 Inf Inf
31: svm.epsilon ParamDbl 0 Inf Inf
32: svm.fitted ParamLgl NA NA 2
33: svm.gamma ParamDbl 0 Inf Inf
34: svm.kernel ParamFct NA NA 4
35: svm.nu ParamDbl -Inf Inf Inf
36: svm.scale ParamUty NA NA Inf
37: svm.shrinking ParamLgl NA NA 2
38: svm.tolerance ParamDbl 0 Inf Inf
39: svm.type ParamFct NA NA 2
40: ranger.alpha ParamDbl -Inf Inf Inf
41: ranger.always.split.variables ParamUty NA NA Inf
42: ranger.class.weights ParamUty NA NA Inf
43: ranger.holdout ParamLgl NA NA 2
44: ranger.importance ParamFct NA NA 4
45: ranger.keep.inbag ParamLgl NA NA 2
46: ranger.max.depth ParamInt 0 Inf Inf
47: ranger.min.node.size ParamInt 1 Inf Inf
48: ranger.min.prop ParamDbl -Inf Inf Inf
49: ranger.minprop ParamDbl -Inf Inf Inf
50: ranger.mtry ParamInt 1 Inf Inf
51: ranger.mtry.ratio ParamDbl 0 1 Inf
52: ranger.num.random.splits ParamInt 1 Inf Inf
53: ranger.num.threads ParamInt 1 Inf Inf
54: ranger.num.trees ParamInt 1 Inf Inf
55: ranger.oob.error ParamLgl NA NA 2
56: ranger.regularization.factor ParamUty NA NA Inf
57: ranger.regularization.usedepth ParamLgl NA NA 2
58: ranger.replace ParamLgl NA NA 2
59: ranger.respect.unordered.factors ParamFct NA NA 3
60: ranger.sample.fraction ParamDbl 0 1 Inf
61: ranger.save.memory ParamLgl NA NA 2
62: ranger.scale.permutation.importance ParamLgl NA NA 2
63: ranger.se.method ParamFct NA NA 2
64: ranger.seed ParamInt -Inf Inf Inf
65: ranger.split.select.weights ParamUty NA NA Inf
66: ranger.splitrule ParamFct NA NA 3
67: ranger.verbose ParamLgl NA NA 2
68: ranger.write.forest ParamLgl NA NA 2
69: branch.selection ParamFct NA NA 3
id class lower upper nlevels
```

We will only tune one hyperparameter for each learner in this example. Additionally, we tune the branching parameter which selects one of the three learners. We have to specify that a hyperparameter is only valid for a certain learner by using `depends = branch.selection == <learner_id>`.

```
# branch
graph_learner$param_set$values$branch.selection =
to_tune(c("kknn", "svm", "ranger"))
# kknn
graph_learner$param_set$values$kknn.k =
to_tune(p_int(3, 50, logscale = TRUE, depends = branch.selection == "kknn"))
# svm
graph_learner$param_set$values$svm.cost =
to_tune(p_dbl(-1, 1, trafo = function(x) 10^x, depends = branch.selection == "svm"))
# ranger
graph_learner$param_set$values$ranger.mtry =
to_tune(p_int(1, 8, depends = branch.selection == "ranger"))
# short learner id for printing
graph_learner$id = "graph_learner"
```

We define a tuning instance and select a random search which is stopped after 20 evaluated configurations.

```
instance = tune(
method = "random_search",
task = task,
learner = graph_learner,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 20
)
```

The following shows a quick way to visualize the tuning results.

```
autoplot(instance, type = "marginal",
cols_x = c("x_domain_kknn.k", "x_domain_svm.cost", "ranger.mtry"))
```

We add the optimized hyperparameters to the graph learner and train the learner on the full dataset.

```
learner = as_learner(graph)
learner$param_set$values = instance$result_learner_param_vals
learner$train(task)
```

The trained model can now be used to make predictions on new data. A common mistake is to report the performance estimated on the resampling sets on which the tuning was performed (`instance$result_y`) as the model’s performance. Instead, we have to use nested resampling to get an unbiased performance estimate.

We use nested resampling to get an unbiased estimate of the predictive performance of our graph learner.

```
graph_learner = as_learner(graph)
graph_learner$param_set$values$branch.selection =
to_tune(c("kknn", "svm", "ranger"))
graph_learner$param_set$values$kknn.k =
to_tune(p_int(3, 50, logscale = TRUE, depends = branch.selection == "kknn"))
graph_learner$param_set$values$svm.cost =
to_tune(p_dbl(-1, 1, trafo = function(x) 10^x, depends = branch.selection == "svm"))
graph_learner$param_set$values$ranger.mtry =
to_tune(p_int(1, 8, depends = branch.selection == "ranger"))
graph_learner$id = "graph_learner"
inner_resampling = rsmp("cv", folds = 3)
at = AutoTuner$new(
learner = graph_learner,
resampling = inner_resampling,
measure = msr("classif.ce"),
terminator = trm("evals", n_evals = 10),
tuner = tnr("random_search")
)
outer_resampling = rsmp("cv", folds = 3)
rr = resample(task, at, outer_resampling, store_models = TRUE)
```

We check the inner tuning results for stable hyperparameters. This means that the selected hyperparameters should not vary too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduce too much randomness. Usually, we aim for the selection of stable hyperparameters for all outer training sets.

`extract_inner_tuning_results(rr)`

Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling. Significantly lower predictive performances on the outer resampling indicate that the models with the optimized hyperparameters overfit the data.

`rr$score()[, .(iteration, task_id, learner_id, resampling_id, classif.ce)]`

```
iteration task_id learner_id resampling_id classif.ce
1: 1 pima graph_learner.tuned cv 0.2304688
2: 2 pima graph_learner.tuned cv 0.2578125
3: 3 pima graph_learner.tuned cv 0.2070312
```

The aggregated performance of all outer resampling iterations is essentially the unbiased performance of the graph learner with the optimal hyperparameters found by random search.

`rr$aggregate()`

```
classif.ce
0.2317708
```

Applying nested resampling can be shortened by using the `tune_nested()` shortcut.

```
graph_learner = as_learner(graph)
graph_learner$param_set$values$branch.selection =
to_tune(c("kknn", "svm", "ranger"))
graph_learner$param_set$values$kknn.k =
to_tune(p_int(3, 50, logscale = TRUE, depends = branch.selection == "kknn"))
graph_learner$param_set$values$svm.cost =
to_tune(p_dbl(-1, 1, trafo = function(x) 10^x, depends = branch.selection == "svm"))
graph_learner$param_set$values$ranger.mtry =
to_tune(p_int(1, 8, depends = branch.selection == "ranger"))
graph_learner$id = "graph_learner"
rr = tune_nested(
method = "random_search",
task = task,
learner = graph_learner,
inner_resampling = rsmp("cv", folds = 3),
outer_resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 10
)
```

The mlr3book includes chapters on pipelines and hyperparameter tuning. The mlr3cheatsheets contain frequently used commands and workflows of mlr3.

This is the second part of the practical tuning series. The other parts can be found here:

- Part I - Tune a Support Vector Machine
- Part III - Build an Automated Machine Learning System
- Part IV - Tuning and Parallel Processing

In this post, we build a simple preprocessing pipeline and tune it. For this, we are using the mlr3pipelines extension package. First, we start by imputing missing values in the Pima Indians Diabetes data set. After that, we encode a factor column to numerical dummy columns in the data set. Next, we combine both preprocessing steps to a `Graph` and create a `GraphLearner`. Finally, nested resampling is used to compare the performance of two imputation methods.

We load the mlr3verse package which pulls in the most important packages for this example.

`library(mlr3verse)`

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented. The `lgr` package is used for logging in all mlr3 packages. The mlr3 logger prints the logging messages from the base package, whereas the bbotk logger is responsible for logging messages from the optimization packages (e.g. mlr3tuning).

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In this example, we use the Pima Indians Diabetes data set which is used to predict whether or not a patient has diabetes. The patients are characterized by 8 numeric features of which some have missing values. We alter the data set by categorizing the feature `pressure` (blood pressure) into the categories `"low"`, `"mid"`, and `"high"`.

```
# retrieve the task from mlr3
task = tsk("pima")
# create data frame with categorized pressure feature
data = task$data(cols = "pressure")
breaks = quantile(data$pressure, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE)
data$pressure = cut(data$pressure, breaks, labels = c("low", "mid", "high"))
# overwrite the feature in the task
task$cbind(data)
# generate a quick textual overview
skimr::skim(task$data())
```

|                        |             |
|------------------------|-------------|
| Name                   | task$data() |
| Number of rows         | 768         |
| Number of columns      | 9           |
| Key                    | NULL        |
| Column type frequency: |             |
| factor                 | 2           |
| numeric                | 7           |
| Group variables        | None        |

**Variable type: factor**

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| diabetes | 0 | 1.00 | FALSE | 2 | neg: 500, pos: 268 |
| pressure | 36 | 0.95 | FALSE | 3 | low: 282, mid: 245, hig: 205 |

**Variable type: numeric**

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1.00 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
| glucose | 5 | 0.99 | 121.69 | 30.54 | 44.00 | 99.00 | 117.00 | 141.00 | 199.00 | ▁▇▇▃▂ |
| insulin | 374 | 0.51 | 155.55 | 118.78 | 14.00 | 76.25 | 125.00 | 190.00 | 846.00 | ▇▂▁▁▁ |
| mass | 11 | 0.99 | 32.46 | 6.92 | 18.20 | 27.50 | 32.30 | 36.60 | 67.10 | ▅▇▃▁▁ |
| pedigree | 0 | 1.00 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| pregnant | 0 | 1.00 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| triceps | 227 | 0.70 | 29.15 | 10.48 | 7.00 | 22.00 | 29.00 | 36.00 | 99.00 | ▆▇▁▁▁ |

We choose the xgboost algorithm from the xgboost package as learner.

`learner = lrn("classif.xgboost", nrounds = 100, id = "xgboost", verbose = 0)`

The task has missing data in five columns.

`round(task$missings() / task$nrow, 2)`

```
diabetes age glucose insulin mass pedigree pregnant pressure triceps
0.00 0.00 0.01 0.49 0.01 0.00 0.00 0.05 0.30
```

The `xgboost` learner has an internal method for handling missing data but some learners cannot handle missing values. We will try to beat the internal method in terms of predictive performance. The mlr3pipelines package offers various methods to impute missing values.

```
[1] "imputeconstant" "imputehist" "imputelearner" "imputemean" "imputemedian" "imputemode"
[7] "imputeoor" "imputesample"
```

We choose the `PipeOpImputeOOR` which adds the new factor level `".MISSING"` to factorial features and imputes numerical features by constant values shifted below the minimum (default) or above the maximum.

```
imputer = po("imputeoor")
print(imputer)
```

```
PipeOp: <imputeoor> (not trained)
values: <min=TRUE, offset=1, multiplier=1>
Input channels <name [train type, predict type]>:
input [Task,Task]
Output channels <name [train type, predict type]>:
output [Task,Task]
```

As the output suggests, the in- and output of this pipe operator is a `Task` for both the training and the predict step. We can manually train the pipe operator to check its functionality:

```
task_imputed = imputer$train(list(task))[[1]]
task_imputed$missings()
```

```
diabetes age pedigree pregnant glucose insulin mass pressure triceps
0 0 0 0 0 0 0 0 0
```

Let’s compare an observation with missing values to the same observation after imputation.

```
rbind(
task$data()[8,],
task_imputed$data()[8,]
)
```

```
diabetes age glucose insulin mass pedigree pregnant pressure triceps
1: neg 29 115 NA 35.3 0.134 10 <NA> NA
2: neg 29 115 -819 35.3 0.134 10 .MISSING -86
```

Note that OOR imputation is in particular useful for tree-based models, but should not be used for linear models or distance-based models.
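
As a sketch of this advice (our own example, not part of the original post), a linear learner would typically be combined with a different imputation method such as histogram imputation:

```
# Hypothetical alternative pipeline for a linear model: encode factors,
# then impute the numeric (dummy) columns from their empirical distribution.
graph_linear = po("encode") %>>% po("imputehist") %>>% lrn("classif.log_reg")
```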

The `xgboost` learner cannot handle categorical features. Therefore, we must convert factor columns to numerical dummy columns. For this, we augment the `xgboost` learner with automatic factor encoding.

The `PipeOpEncode` encodes factor columns with one of six methods. In this example, we use `one-hot` encoding which creates a new binary column for each factor level.

`factor_encoding = po("encode", method = "one-hot")`

We manually trigger the encoding on the task.

`factor_encoding$train(list(task))`

```
$output
<TaskClassif:pima> (768 x 11): Pima Indian Diabetes
* Target: diabetes
* Properties: twoclass
* Features (10):
- dbl (10): age, glucose, insulin, mass, pedigree, pregnant, pressure.high, pressure.low, pressure.mid,
triceps
```

The factor column `pressure` has been converted to the three binary columns `"pressure.low"`, `"pressure.mid"`, and `"pressure.high"`.

We created two preprocessing steps which could be used to create a new task with encoded factor variables and imputed missing values. However, if we do this before resampling, information from the test set can leak into our training step which typically leads to overoptimistic performance measures. To avoid this, we add the preprocessing steps to the `Learner` itself, creating a `GraphLearner`. For this, we create a `Graph` first.

```
graph = po("encode") %>>%
po("imputeoor") %>>%
learner
plot(graph, html = FALSE)
```

We use `as_learner()` to wrap the `Graph` into a `GraphLearner`, which allows us to use the graph like a normal learner.

```
graph_learner = as_learner(graph)
# short learner id for printing
graph_learner$id = "graph_learner"
```

The `GraphLearner` can be trained and used for making predictions. Instead of calling `$train()` or `$predict()` manually, we will directly use it for resampling. We choose a 3-fold cross-validation as the resampling strategy.

```
resampling = rsmp("cv", folds = 3)
rr = resample(task = task, learner = graph_learner, resampling = resampling)
```

`rr$score()[, c("iteration", "task_id", "learner_id", "resampling_id", "classif.ce"), with = FALSE]`

```
iteration task_id learner_id resampling_id classif.ce
1: 1 pima graph_learner cv 0.2851562
2: 2 pima graph_learner cv 0.2460938
3: 3 pima graph_learner cv 0.2968750
```

For each resampling iteration, the following steps are performed:

- The task is subsetted to the training indices.
- The factor encoder replaces factor features with dummy columns in the training task.
- The OOR imputer determines values to impute from the training task and then replaces all missing values with learned imputation values.
- The learner is applied on the modified training task and the model is stored inside the learner.

Next is the predict step:

- The task is subsetted to the test indices.
- The factor encoder replaces all factor features with dummy columns in the test task.
- The OOR imputer replaces all missing values of the test task with the imputation values learned on the training set.
- The learner’s predict method is applied on the modified test task.

By following this procedure, it is guaranteed that no information can leak from the training step to the predict step.

Let’s have a look at the parameter set of the `GraphLearner`. It consists of the `xgboost` hyperparameters and, additionally, the parameters of the `PipeOp`s `encode` and `imputeoor`. All hyperparameters are prefixed with the id of the respective `PipeOp` or learner.

`as.data.table(graph_learner$param_set)[, c("id", "class", "lower", "upper", "nlevels"), with = FALSE]`

```
id class lower upper nlevels
1: encode.method ParamFct NA NA 5
2: encode.affect_columns ParamUty NA NA Inf
3: imputeoor.min ParamLgl NA NA 2
4: imputeoor.offset ParamDbl 0 Inf Inf
5: imputeoor.multiplier ParamDbl 0 Inf Inf
6: imputeoor.affect_columns ParamUty NA NA Inf
7: xgboost.alpha ParamDbl 0 Inf Inf
8: xgboost.approxcontrib ParamLgl NA NA 2
9: xgboost.base_score ParamDbl -Inf Inf Inf
10: xgboost.booster ParamFct NA NA 3
11: xgboost.callbacks ParamUty NA NA Inf
12: xgboost.colsample_bylevel ParamDbl 0 1 Inf
13: xgboost.colsample_bynode ParamDbl 0 1 Inf
14: xgboost.colsample_bytree ParamDbl 0 1 Inf
15: xgboost.disable_default_eval_metric ParamLgl NA NA 2
16: xgboost.early_stopping_rounds ParamInt 1 Inf Inf
17: xgboost.early_stopping_set ParamFct NA NA 3
18: xgboost.eta ParamDbl 0 1 Inf
19: xgboost.eval_metric ParamUty NA NA Inf
20: xgboost.feature_selector ParamFct NA NA 5
21: xgboost.feval ParamUty NA NA Inf
22: xgboost.gamma ParamDbl 0 Inf Inf
23: xgboost.grow_policy ParamFct NA NA 2
24: xgboost.interaction_constraints ParamUty NA NA Inf
25: xgboost.iterationrange ParamUty NA NA Inf
26: xgboost.lambda ParamDbl 0 Inf Inf
27: xgboost.lambda_bias ParamDbl 0 Inf Inf
28: xgboost.max_bin ParamInt 2 Inf Inf
29: xgboost.max_delta_step ParamDbl 0 Inf Inf
30: xgboost.max_depth ParamInt 0 Inf Inf
31: xgboost.max_leaves ParamInt 0 Inf Inf
32: xgboost.maximize ParamLgl NA NA 2
33: xgboost.min_child_weight ParamDbl 0 Inf Inf
34: xgboost.missing ParamDbl -Inf Inf Inf
35: xgboost.monotone_constraints ParamUty NA NA Inf
36: xgboost.normalize_type ParamFct NA NA 2
37: xgboost.nrounds ParamInt 1 Inf Inf
38: xgboost.nthread ParamInt 1 Inf Inf
39: xgboost.ntreelimit ParamInt 1 Inf Inf
40: xgboost.num_parallel_tree ParamInt 1 Inf Inf
41: xgboost.objective ParamUty NA NA Inf
42: xgboost.one_drop ParamLgl NA NA 2
43: xgboost.outputmargin ParamLgl NA NA 2
44: xgboost.predcontrib ParamLgl NA NA 2
45: xgboost.predictor ParamFct NA NA 2
46: xgboost.predinteraction ParamLgl NA NA 2
47: xgboost.predleaf ParamLgl NA NA 2
48: xgboost.print_every_n ParamInt 1 Inf Inf
49: xgboost.process_type ParamFct NA NA 2
50: xgboost.rate_drop ParamDbl 0 1 Inf
51: xgboost.refresh_leaf ParamLgl NA NA 2
52: xgboost.reshape ParamLgl NA NA 2
53: xgboost.seed_per_iteration ParamLgl NA NA 2
54: xgboost.sampling_method ParamFct NA NA 2
55: xgboost.sample_type ParamFct NA NA 2
56: xgboost.save_name ParamUty NA NA Inf
57: xgboost.save_period ParamInt 0 Inf Inf
58: xgboost.scale_pos_weight ParamDbl -Inf Inf Inf
59: xgboost.skip_drop ParamDbl 0 1 Inf
60: xgboost.strict_shape ParamLgl NA NA 2
61: xgboost.subsample ParamDbl 0 1 Inf
62: xgboost.top_k ParamInt 0 Inf Inf
63: xgboost.training ParamLgl NA NA 2
64: xgboost.tree_method ParamFct NA NA 5
65: xgboost.tweedie_variance_power ParamDbl 1 2 Inf
66: xgboost.updater ParamUty NA NA Inf
67: xgboost.verbose ParamInt 0 2 3
68: xgboost.watchlist ParamUty NA NA Inf
69: xgboost.xgb_model ParamUty NA NA Inf
id class lower upper nlevels
```

We will tune the encode method.

`graph_learner$param_set$values$encode.method = to_tune(c("one-hot", "treatment"))`

We define a tuning instance and use grid search since we want to try all encode methods.

```
instance = tune(
method = "grid_search",
task = task,
learner = graph_learner,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce")
)
```

The archive shows us the performance of the model with different encoding methods.

`print(instance$archive)`

```
<ArchiveTuning>
encode.method classif.ce runtime_learners timestamp batch_nr warnings errors resample_result
1: one-hot 0.27 1.1 2023-03-03 10:39:06.57 1 0 0 <ResampleResult[21]>
2: treatment 0.27 1.2 2023-03-03 10:39:07.95 2 0 0 <ResampleResult[21]>
```

We create one `GraphLearner`

with `imputeoor`

and test it against a `GraphLearner`

that uses the internal imputation method of `xgboost`

. Applying nested resampling ensures a fair comparison of the predictive performances.

```
graph_1 = po("encode") %>>%
learner
graph_learner_1 = GraphLearner$new(graph_1)
graph_learner_1$param_set$values$encode.method = to_tune(c("one-hot", "treatment"))
at_1 = AutoTuner$new(
learner = graph_learner_1,
resampling = resampling,
measure = msr("classif.ce"),
terminator = trm("none"),
tuner = tnr("grid_search"),
store_models = TRUE
)
```

```
graph_2 = po("encode") %>>%
po("imputeoor") %>>%
learner
graph_learner_2 = GraphLearner$new(graph_2)
graph_learner_2$param_set$values$encode.method = to_tune(c("one-hot", "treatment"))
at_2 = AutoTuner$new(
learner = graph_learner_2,
resampling = resampling,
measure = msr("classif.ce"),
terminator = trm("none"),
tuner = tnr("grid_search"),
store_models = TRUE
)
```

We run the benchmark.

```
resampling_outer = rsmp("cv", folds = 3)
design = benchmark_grid(task, list(at_1, at_2), resampling_outer)
bmr = benchmark(design, store_models = TRUE)
```

We compare the aggregated performances on the outer test sets which give us an unbiased performance estimate of the `GraphLearner`

s with the different encoding methods.

`bmr$aggregate()`

```
nr resample_result task_id learner_id resampling_id iters classif.ce
1: 1 <ResampleResult[21]> pima encode.xgboost.tuned cv 3 0.2695312
2: 2 <ResampleResult[21]> pima encode.imputeoor.xgboost.tuned cv 3 0.2682292
```

`autoplot(bmr)`

Note that in practice, preprocessing hyperparameters should be tuned jointly with the hyperparameters of the learner. Otherwise, the comparison of preprocessing steps is not fair and can lead to wrong conclusions.
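As a hedged sketch of what such a joint search space could look like, we could mark both the encoding method and one `xgboost` hyperparameter for tuning; the `eta` range below is an arbitrary choice for illustration.

```
# hypothetical sketch: jointly tune the encoding method and xgboost's learning rate
graph_learner$param_set$values$encode.method = to_tune(c("one-hot", "treatment"))
graph_learner$param_set$values$xgboost.eta = to_tune(0.01, 0.3)
```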

Applying nested resampling can be shortened by using the `auto_tuner()`

-shortcut.

```
graph_1 = po("encode") %>>% learner
graph_learner_1 = as_learner(graph_1)
graph_learner_1$param_set$values$encode.method = to_tune(c("one-hot", "treatment"))
at_1 = auto_tuner(
method = "grid_search",
learner = graph_learner_1,
resampling = resampling,
measure = msr("classif.ce"),
store_models = TRUE)
graph_2 = po("encode") %>>% po("imputeoor") %>>% learner
graph_learner_2 = as_learner(graph_2)
graph_learner_2$param_set$values$encode.method = to_tune(c("one-hot", "treatment"))
at_2 = auto_tuner(
method = "grid_search",
learner = graph_learner_2,
resampling = resampling,
measure = msr("classif.ce"),
store_models = TRUE)
design = benchmark_grid(task, list(at_1, at_2), rsmp("cv", folds = 3))
bmr = benchmark(design, store_models = TRUE)
```

We train the chosen `GraphLearner`

with the `AutoTuner`

to get a final model with optimized hyperparameters.

`at_2$train(task)`

The trained model can now be used to make predictions on new data with `at_2$predict()`. The pipeline ensures that the preprocessing is always part of the train and predict steps.
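For example, a small sketch of predicting on new observations with the trained `AutoTuner` could look as follows; we simply reuse a few rows of the task as stand-in "new data" for illustration.

```
# hypothetical sketch: predict on new observations with the trained AutoTuner
newdata = task$data(rows = 1:5, cols = task$feature_names)
at_2$predict_newdata(newdata)
```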

The mlr3book includes chapters on pipelines and hyperparameter tuning. The mlr3cheatsheets contain frequently used commands and workflows of mlr3.

This is the first part of the practical tuning series. The other parts can be found here:

- Part II - Tune a Preprocessing Pipeline
- Part III - Build an Automated Machine Learning System
- Part IV - Tuning and Parallel Processing

In this post, we demonstrate how to optimize the hyperparameters of a support vector machine (SVM). We are using the mlr3 machine learning framework with the mlr3tuning extension package.

First, we start by showing the basic building blocks of mlr3tuning and tune the `cost`

and `gamma`

hyperparameters of an SVM with a radial basis function on the Iris data set. After that, we use transformations to tune both hyperparameters on the logarithmic scale. Next, we explain the importance of dependencies to tune hyperparameters like `degree`

which are dependent on the choice of kernel. After that, we fit an SVM with optimized hyperparameters on the full dataset. Finally, nested resampling is used to compute an unbiased performance estimate of our tuned SVM.

We load the mlr3verse package which pulls in the most important packages for this example.

`library(mlr3verse)`

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented. The `lgr`

package is used for logging in all mlr3 packages. The mlr3 logger prints the logging messages from the base package, whereas the bbotk logger is responsible for logging messages from the optimization packages (e.g. mlr3tuning ).

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In the example, we use the Iris data set which classifies 150 flowers in three species of Iris. The flowers are characterized by sepal length and width and petal length and width. The Iris data set allows us to quickly fit models to it. However, the influence of hyperparameter tuning on the predictive performance might be minor. Other data sets might give more meaningful tuning results.

```
# retrieve the task from mlr3
task = tsk("iris")
# generate a quick textual overview using the skimr package
skimr::skim(task$data())
```

Name | task$data() |
---|---|
Number of rows | 150 |
Number of columns | 5 |
Key | NULL |
Column type frequency: factor | 1 |
Column type frequency: numeric | 4 |
Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |
Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |
Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |
Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |

We choose the support vector machine implementation from the e1071 package (which is based on LIBSVM) and use it as a classification machine by setting `type`

to `"C-classification"`

.

`learner = lrn("classif.svm", type = "C-classification", kernel = "radial")`

For tuning, it is important to create a search space that defines the type and range of the hyperparameters. A learner stores all information about its hyperparameters in the slot `$param_set`

. Not all parameters are tunable. We have to choose a subset of the hyperparameters we want to tune.

`as.data.table(learner$param_set)[, .(id, class, lower, upper, nlevels)]`

We use the `to_tune()`

function to define the range over which the hyperparameter should be tuned. We opt for the `cost`

and `gamma`

hyperparameters of the `radial`

kernel and set the tuning ranges with lower and upper bounds.

```
learner$param_set$values$cost = to_tune(0.1, 10)
learner$param_set$values$gamma = to_tune(0, 5)
```

We specify how to evaluate the performance of the different hyperparameter configurations. For this, we choose 3-fold cross validation as the resampling strategy and the classification error as the performance measure.

```
resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")
```

Usually, we have to select a budget for the tuning. This is done by choosing a `Terminator`

, which stops the tuning e.g. after a performance level is reached or after a given time. However, some tuners like grid search terminate themselves. In this case, we choose a terminator that never stops and the tuning is not stopped before all grid points are evaluated.

`terminator = trm("none")`

At this point, we can construct a `TuningInstanceSingleCrit`

that describes the tuning problem.

```
instance = TuningInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator
)
print(instance)
```

```
<TuningInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveTuning:classif.svm_on_iris>
* Search Space:
id class lower upper nlevels
1: cost ParamDbl 0.1 10 Inf
2: gamma ParamDbl 0.0 5 Inf
* Terminator: <TerminatorNone>
```

Finally, we have to choose a `Tuner`

. Grid Search discretizes numeric parameters into a given resolution and constructs a grid from the Cartesian product of these sets. Categorical parameters produce a grid over all levels specified in the search space. In this example, we only use a resolution of 5 to keep the runtime low. Usually, a higher resolution is used to create a denser grid.

```
tuner = tnr("grid_search", resolution = 5)
print(tuner)
```

```
<TunerGridSearch>: Grid Search
* Parameters: resolution=5, batch_size=1
* Parameter classes: ParamLgl, ParamInt, ParamDbl, ParamFct
* Properties: dependencies, single-crit, multi-crit
* Packages: mlr3tuning
```

We can preview the proposed configurations by using `generate_design_grid()`

. This function is internally executed by `TunerGridSearch`

.

`generate_design_grid(learner$param_set$search_space(), resolution = 5)`

```
<Design> with 25 rows:
cost gamma
1: 0.100 0.00
2: 0.100 1.25
3: 0.100 2.50
4: 0.100 3.75
5: 0.100 5.00
6: 2.575 0.00
7: 2.575 1.25
8: 2.575 2.50
9: 2.575 3.75
10: 2.575 5.00
11: 5.050 0.00
12: 5.050 1.25
13: 5.050 2.50
14: 5.050 3.75
15: 5.050 5.00
16: 7.525 0.00
17: 7.525 1.25
18: 7.525 2.50
19: 7.525 3.75
20: 7.525 5.00
21: 10.000 0.00
22: 10.000 1.25
23: 10.000 2.50
24: 10.000 3.75
25: 10.000 5.00
cost gamma
```

We trigger the tuning by passing the `TuningInstanceSingleCrit`

to the `$optimize()`

method of the `Tuner`

. The instance is modified in-place.

`tuner$optimize(instance)`

```
cost gamma learner_param_vals x_domain classif.ce
1: 5.05 1.25 <list[4]> <list[2]> 0.04
```

We plot the performances depending on the evaluated `cost`

and `gamma`

values.

```
autoplot(instance, type = "surface", cols_x = c("cost", "gamma"),
learner = lrn("regr.km"))
```

The points mark the evaluated `cost`

and `gamma`

values. We should not infer the performance of new values from the heatmap since it is only an interpolation. However, we can see the general interaction between the hyperparameters.

Tuning a learner can be shortened by using the `tune()`

-shortcut.

```
learner = lrn("classif.svm", type = "C-classification", kernel = "radial")
learner$param_set$values$cost = to_tune(0.1, 10)
learner$param_set$values$gamma = to_tune(0, 5)
instance = tune(
method = "grid_search",
task = tsk("iris"),
learner = learner,
resampling = rsmp ("holdout"),
measure = msr("classif.ce"),
resolution = 5
)
```

Next, we want to tune the `cost` and `gamma` hyperparameters more efficiently. It is recommended to tune `cost` and `gamma` on the logarithmic scale (Hsu, Chang, and Lin 2003). The log transformation emphasizes smaller `cost` and `gamma` values while still covering large ones; we therefore use it to search this region of the space with a denser grid.

Generally speaking, transformations can be used to convert hyperparameters to a new scale. These transformations are applied before the proposed configuration is passed to the `Learner`

. We can directly define the transformation in the `to_tune()`

function. The lower and upper bound is set on the original scale.

```
learner = lrn("classif.svm", type = "C-classification", kernel = "radial")
# tune from 2^-15 to 2^15 on a log scale
learner$param_set$values$cost = to_tune(p_dbl(-15, 15, trafo = function(x) 2^x))
# tune from 2^-15 to 2^5 on a log scale
learner$param_set$values$gamma = to_tune(p_dbl(-15, 5, trafo = function(x) 2^x))
```

Transformations to the log scale are the ones most commonly used. We can use a shortcut for this transformation. The lower and upper bound is set on the transformed scale.

```
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
```

We use the `tune()`

-shortcut to run the tuning.

```
instance = tune(
method = "grid_search",
task = task,
learner = learner,
resampling = resampling,
measure = measure,
resolution = 5
)
```

The hyperparameter values after the transformation are stored in the `x_domain`

column as lists. We can expand these lists into multiple columns by using `as.data.table()`

. The hyperparameter names are prefixed by `x_domain`

.

`as.data.table(instance$archive)[, .(cost, gamma, x_domain_cost, x_domain_gamma)]`

```
cost gamma x_domain_cost x_domain_gamma
1: 11.512925 -11.512925 1.000000e+05 1.000000e-05
2: 5.756463 0.000000 3.162278e+02 1.000000e+00
3: -11.512925 11.512925 1.000000e-05 1.000000e+05
4: 0.000000 5.756463 1.000000e+00 3.162278e+02
5: -11.512925 -5.756463 1.000000e-05 3.162278e-03
6: 0.000000 0.000000 1.000000e+00 1.000000e+00
7: 11.512925 5.756463 1.000000e+05 3.162278e+02
8: -5.756463 -11.512925 3.162278e-03 1.000000e-05
9: -11.512925 -11.512925 1.000000e-05 1.000000e-05
10: -5.756463 11.512925 3.162278e-03 1.000000e+05
11: -11.512925 5.756463 1.000000e-05 3.162278e+02
12: 11.512925 0.000000 1.000000e+05 1.000000e+00
13: -11.512925 0.000000 1.000000e-05 1.000000e+00
14: 5.756463 -11.512925 3.162278e+02 1.000000e-05
15: 5.756463 5.756463 3.162278e+02 3.162278e+02
16: 5.756463 -5.756463 3.162278e+02 3.162278e-03
17: 5.756463 11.512925 3.162278e+02 1.000000e+05
18: 11.512925 11.512925 1.000000e+05 1.000000e+05
19: 11.512925 -5.756463 1.000000e+05 3.162278e-03
20: -5.756463 -5.756463 3.162278e-03 3.162278e-03
21: 0.000000 -11.512925 1.000000e+00 1.000000e-05
22: 0.000000 11.512925 1.000000e+00 1.000000e+05
23: 0.000000 -5.756463 1.000000e+00 3.162278e-03
24: -5.756463 0.000000 3.162278e-03 1.000000e+00
25: -5.756463 5.756463 3.162278e-03 3.162278e+02
cost gamma x_domain_cost x_domain_gamma
```

We plot the performances depending on the evaluated `cost`

and `gamma`

values.

```
library(ggplot2)
library(scales)
autoplot(instance, type = "points", cols_x = c("x_domain_cost", "x_domain_gamma")) +
scale_x_continuous(
trans = log2_trans(),
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
scale_y_continuous(
trans = log2_trans(),
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
```

Dependencies ensure that certain parameters are only proposed depending on values of other hyperparameters. We want to tune the `degree`

hyperparameter that is only needed for the `polynomial`

kernel.

```
learner = lrn("classif.svm", type = "C-classification")
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))
learner$param_set$values$degree = to_tune(1, 4)
```

The dependencies are already stored in the learner parameter set.

`learner$param_set$deps`

```
id on cond
1: cost type <CondEqual[9]>
2: nu type <CondEqual[9]>
3: degree kernel <CondEqual[9]>
4: coef0 kernel <CondAnyOf[9]>
5: gamma kernel <CondAnyOf[9]>
```

The `gamma`

hyperparameter depends on the kernel being `polynomial`

, `radial`

or `sigmoid`

`learner$param_set$deps$cond[[5]]`

`CondAnyOf: x ∈ {polynomial, radial, sigmoid}`

whereas the `degree`

hyperparameter is solely used by the `polynomial`

kernel.

`learner$param_set$deps$cond[[3]]`

`CondEqual: x = polynomial`

We preview the grid to show the effect of the dependencies.

`generate_design_grid(learner$param_set$search_space(), resolution = 2)`

```
<Design> with 12 rows:
cost gamma kernel degree
1: -11.51293 -11.51293 polynomial 1
2: -11.51293 -11.51293 polynomial 4
3: -11.51293 -11.51293 radial NA
4: -11.51293 11.51293 polynomial 1
5: -11.51293 11.51293 polynomial 4
6: -11.51293 11.51293 radial NA
7: 11.51293 -11.51293 polynomial 1
8: 11.51293 -11.51293 polynomial 4
9: 11.51293 -11.51293 radial NA
10: 11.51293 11.51293 polynomial 1
11: 11.51293 11.51293 polynomial 4
12: 11.51293 11.51293 radial NA
```

The value for `degree`

is `NA`

if the dependency on the `kernel`

is not satisfied.

We use the `tune()`

-shortcut to run the tuning.

```
instance = tune(
method = "grid_search",
task = task,
learner = learner,
resampling = resampling,
measure = measure,
resolution = 3
)
```

`instance$result`

```
cost gamma kernel degree learner_param_vals x_domain classif.ce
1: 0 0 polynomial 1 <list[5]> <list[4]> 0.02
```

We add the optimized hyperparameters to the learner and train the learner on the full dataset.

```
learner = lrn("classif.svm")
learner$param_set$values = instance$result_learner_param_vals
learner$train(task)
```

The trained model can now be used to make predictions on new data. A common mistake is to report the performance estimated on the resampling sets on which the tuning was performed (`instance$result_y`

) as the model’s performance. These scores might be biased and overestimate the ability of the fitted model to predict with new data. Instead, we have to use nested resampling to get an unbiased performance estimate.

Tuning should not be performed on the same resampling sets which are used for evaluating the model itself, since this would result in a biased performance estimate. Nested resampling uses an outer and inner resampling to separate the tuning from the performance estimation of the model. We can use the `AutoTuner`

class for running nested resampling. The `AutoTuner`

wraps a `Learner`

and tunes the hyperparameter of the learner during `$train()`

. This is our inner resampling loop.

```
learner = lrn("classif.svm", type = "C-classification")
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))
learner$param_set$values$degree = to_tune(1, 4)
resampling_inner = rsmp("cv", folds = 3)
terminator = trm("none")
tuner = tnr("grid_search", resolution = 3)
at = AutoTuner$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
tuner = tuner,
store_models = TRUE)
```

We put the `AutoTuner`

into a `resample()`

call to get the outer resampling loop.

```
resampling_outer = rsmp("cv", folds = 3)
rr = resample(task = task, learner = at, resampling = resampling_outer, store_models = TRUE)
```

We check the inner tuning results for stable hyperparameters. This means that the selected hyperparameters should not vary too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduce too much randomness. Usually, we aim for the selection of stable hyperparameters for all outer training sets.

`extract_inner_tuning_results(rr)[, .SD, .SDcols = !c("learner_param_vals", "x_domain")]`

```
iteration cost gamma kernel degree classif.ce task_id learner_id resampling_id
1: 1 0.00000 11.51293 polynomial 1 0.04010695 iris classif.svm.tuned cv
2: 2 11.51293 -11.51293 radial NA 0.04961378 iris classif.svm.tuned cv
3: 3 11.51293 -11.51293 radial NA 0.03030303 iris classif.svm.tuned cv
```

Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling (`extract_inner_tuning_results(rr)`

). Significantly lower predictive performances on the outer resampling indicate that the models with the optimized hyperparameters overfit the data.

`rr$score()[, .(iteration, task_id, learner_id, resampling_id, classif.ce)]`

```
iteration task_id learner_id resampling_id classif.ce
1: 1 iris classif.svm.tuned cv 0.06
2: 2 iris classif.svm.tuned cv 0.04
3: 3 iris classif.svm.tuned cv 0.04
```

The archives of the `AutoTuner`s allow us to inspect all evaluated hyperparameter configurations with the associated predictive performances.

`extract_inner_tuning_archives(rr, unnest = NULL, exclude_columns = c("resample_result", "uhash", "x_domain", "timestamp"))`

```
iteration cost gamma kernel degree classif.ce runtime_learners batch_nr warnings errors task_id
1: 1 11.51293 11.51293 polynomial 2 0.17052882 0.029 1 0 0 iris
2: 1 -11.51293 -11.51293 polynomial 1 0.53921569 0.019 2 0 0 iris
3: 1 -11.51293 11.51293 radial NA 0.62002377 0.020 3 0 0 iris
4: 1 0.00000 0.00000 polynomial 4 0.12091503 0.020 4 0 0 iris
5: 1 0.00000 0.00000 radial NA 0.07040998 0.018 5 0 0 iris
---
104: 3 11.51293 0.00000 polynomial 4 0.14884135 0.017 32 0 0 iris
105: 3 -11.51293 11.51293 polynomial 4 0.14884135 0.017 33 0 0 iris
106: 3 11.51293 -11.51293 radial NA 0.03030303 0.018 34 0 0 iris
107: 3 -11.51293 0.00000 polynomial 2 0.71925134 0.018 35 0 0 iris
108: 3 0.00000 11.51293 polynomial 2 0.09001783 0.018 36 0 0 iris
learner_id resampling_id
1: classif.svm.tuned cv
2: classif.svm.tuned cv
3: classif.svm.tuned cv
4: classif.svm.tuned cv
5: classif.svm.tuned cv
---
104: classif.svm.tuned cv
105: classif.svm.tuned cv
106: classif.svm.tuned cv
107: classif.svm.tuned cv
108: classif.svm.tuned cv
```

The aggregated performance of all outer resampling iterations is essentially the unbiased performance of an SVM with optimal hyperparameters found by grid search.

`rr$aggregate()`

```
classif.ce
0.04666667
```

Applying nested resampling can be shortened by using the `tune_nested()`

-shortcut.

```
learner = lrn("classif.svm", type = "C-classification")
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))
learner$param_set$values$degree = to_tune(1, 4)
rr = tune_nested(
method = "grid_search",
task = tsk("iris"),
learner = learner,
inner_resampling = rsmp ("cv", folds = 3),
outer_resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
resolution = 3
)
```

The mlr3book includes chapters on tuning spaces and hyperparameter tuning. The mlr3cheatsheets contain frequently used commands and workflows of mlr3.

Hsu, Chih-wei, Chih-chung Chang, and Chih-Jen Lin. 2003. “A Practical Guide to Support Vector Classification.”

In this post, we will:

- Build a `Graph` that consists of two common preprocessing steps, then switches between two dimensionality reduction techniques followed by a `Learner` vs. no dimensionality reduction followed by another `Learner`
- Define the search space for tuning that handles inter-dependencies between pipeline steps and hyperparameters
- Run a `grid search` to find an optimal choice of preprocessing steps and hyperparameters.

Ideally, you have already had a look at how to tune over multiple learners.

First, we load the packages we will need:

```
library(mlr3verse)
library(mlr3learners)
```

The `lgr` package is used for logging in all mlr3 packages. The mlr3 logger prints the logging messages from the base package, whereas the bbotk logger is responsible for logging messages from the optimization packages (e.g. mlr3tuning).

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

We are going to work with some gene expression data included as a supplement in the bst package. The data consists of 2308 gene profiles in 63 training and 20 test samples. The following data preprocessing steps are done analogously to `vignette("khan", package = "bst")`:

```
datafile = system.file("extdata", "supplemental_data", package = "bst")
dat0 = read.delim(datafile, header = TRUE, skip = 1)[, -(1:2)]
dat0 = t(dat0)
dat = data.frame(dat0[!(rownames(dat0) %in%
c("TEST.9", "TEST.13", "TEST.5", "TEST.3", "TEST.11")), ])
dat$class = as.factor(
c(substr(rownames(dat)[1:63], start = 1, stop = 2),
c("NB", "RM", "NB", "EW", "RM", "BL", "EW", "RM", "EW", "EW", "EW", "RM",
"BL", "RM", "NB", "NB", "NB", "NB", "BL", "EW")
)
)
```

We then construct our training and test `Task`

:

```
task = as_task_classif(dat, target = "class", id = "SRBCT")
task_train = task$clone(deep = TRUE)
task_train$filter(1:63)
task_test = task$clone(deep = TRUE)
task_test$filter(64:83)
```

Our graph will start with log transforming the features, followed by scaling them. Then, either a `PCA` or an `ICA` is applied to extract principal / independent components, followed by fitting an `LDA`; alternatively, a `ranger` random forest is fitted without any dimensionality reduction (the log transformation and scaling should most likely affect the `LDA` more than the `ranger` random forest). Regarding the `PCA` and `ICA`, the number of principal / independent components is a tuning parameter. Regarding the `LDA`, we can further choose different methods for estimating the mean and variance, and regarding the `ranger`, we want to tune the `mtry` and `num.trees` parameters. Note that the `PCA-LDA` combination has already been successfully applied in different cancer diagnostic contexts when the feature space is of high dimensionality (Morais and Lima 2018).

To allow for switching between the `PCA`

/ `ICA`

-`LDA`

and `ranger`

we can either use branching or proxy pipelines, i.e., `PipeOpBranch`

and `PipeOpUnbranch`

or `PipeOpProxy`

. We will first cover branching in detail and later show how the same can be done using `PipeOpProxy`

.

First, we have a look at the baseline `classification accuracy`

of the `LDA`

and `ranger`

on the training task:

```
base = benchmark(benchmark_grid(
task_train,
learners = list(lrn("classif.lda"), lrn("classif.ranger")),
resamplings = rsmp("cv", folds = 3)))
```

```
Warning in lda.default(x, grouping, ...): variables are collinear
Warning in lda.default(x, grouping, ...): variables are collinear
Warning in lda.default(x, grouping, ...): variables are collinear
```

`base$aggregate(measures = msr("classif.acc"))`

```
nr resample_result task_id learner_id resampling_id iters classif.acc
1: 1 <ResampleResult[21]> SRBCT classif.lda cv 3 0.6666667
2: 2 <ResampleResult[21]> SRBCT classif.ranger cv 3 0.9206349
```

The out-of-the-box `ranger` already appears to perform well on the training task. Regarding the `LDA`, we do get a warning message that some features are collinear. This strongly suggests reducing the dimensionality of the feature space. Let’s see if we can get better performance, at least for the `LDA`.

Our graph starts with log transforming the features (we explicitly use base 10 only for better interpretability when inspecting the model later), using `PipeOpColApply`

, followed by scaling the features using `PipeOpScale`

. Then, the first branch allows for switching between the `PCA`

/ `ICA`

-`LDA`

and `ranger`

, and within `PCA`

/ `ICA`

-`LDA`

, the second branch allows for switching between `PCA`

and `ICA`

:

```
graph1 =
po("colapply", applicator = function(x) log(x, base = 10)) %>>%
po("scale") %>>%
# pca / ica followed by lda vs. ranger
po("branch", id = "branch_learner", options = c("pca_ica_lda", "ranger")) %>>%
gunion(list(
po("branch", id = "branch_preproc_lda", options = c("pca", "ica")) %>>%
gunion(list(
po("pca"), po("ica")
)) %>>%
po("unbranch", id = "unbranch_preproc_lda") %>>%
lrn("classif.lda"),
lrn("classif.ranger")
)) %>>%
po("unbranch", id = "unbranch_learner")
```

Note that the names of the options within each branch are arbitrary, but ideally they describe what is happening. Therefore we go with `"pca_ica_lda"` / `"ranger"` and `"pca"` / `"ica"`. Finally, we also could have used the `branch` `ppl` to make branching easier (we will come back to this in the Proxy section). The graph looks like the following:

`graph1$plot(html = FALSE)`

We can inspect the parameters of the `ParamSet`

of the graph to see which parameters can be set:

`graph1$param_set$ids()`

```
[1] "colapply.applicator" "colapply.affect_columns"
[3] "scale.center" "scale.scale"
[5] "scale.robust" "scale.affect_columns"
[7] "branch_learner.selection" "branch_preproc_lda.selection"
[9] "pca.center" "pca.scale."
[11] "pca.rank." "pca.affect_columns"
[13] "ica.n.comp" "ica.alg.typ"
[15] "ica.fun" "ica.alpha"
[17] "ica.method" "ica.row.norm"
[19] "ica.maxit" "ica.tol"
[21] "ica.verbose" "ica.w.init"
[23] "ica.affect_columns" "classif.lda.dimen"
[25] "classif.lda.method" "classif.lda.nu"
[27] "classif.lda.predict.method" "classif.lda.predict.prior"
[29] "classif.lda.prior" "classif.lda.tol"
[31] "classif.ranger.alpha" "classif.ranger.always.split.variables"
[33] "classif.ranger.class.weights" "classif.ranger.holdout"
[35] "classif.ranger.importance" "classif.ranger.keep.inbag"
[37] "classif.ranger.max.depth" "classif.ranger.min.node.size"
[39] "classif.ranger.min.prop" "classif.ranger.minprop"
[41] "classif.ranger.mtry" "classif.ranger.mtry.ratio"
[43] "classif.ranger.num.random.splits" "classif.ranger.num.threads"
[45] "classif.ranger.num.trees" "classif.ranger.oob.error"
[47] "classif.ranger.regularization.factor" "classif.ranger.regularization.usedepth"
[49] "classif.ranger.replace" "classif.ranger.respect.unordered.factors"
[51] "classif.ranger.sample.fraction" "classif.ranger.save.memory"
[53] "classif.ranger.scale.permutation.importance" "classif.ranger.se.method"
[55] "classif.ranger.seed" "classif.ranger.split.select.weights"
[57] "classif.ranger.splitrule" "classif.ranger.verbose"
[59] "classif.ranger.write.forest"
```

The `id`s are prefixed by the respective `PipeOp` they belong to, e.g., `pca.rank.` refers to the `rank.` parameter of `PipeOpPCA`.
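For example, a single parameter of the graph can be inspected and set through its prefixed id; the value `10` below is only for illustration and is removed again before tuning.

```
# inspect and set a parameter of the graph via its prefixed id
graph1$param_set$params$pca.rank.
graph1$param_set$values$pca.rank. = 10    # hypothetical value, for illustration only
graph1$param_set$values$pca.rank. = NULL  # unset it again before tuning
```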

Our graph either fits a `LDA`

after applying `PCA`

or `ICA`

, or alternatively a `ranger`

with no preprocessing. These two **options** each define selection parameters that we can tune. Moreover, within the respective `PipeOp`

’s we want to tune the following parameters: `pca.rank.`

, `ica.n.comp`

, `classif.lda.method`

, `classif.ranger.mtry`

, and `classif.ranger.num.trees`

. The first two parameters are integers that in principle could range from 1 to the number of features. However, for `ICA`

, the upper bound must not exceed the number of observations and as we will later use `3-fold`

`cross-validation`

as the resampling method for the tuning, we just set the upper bound to 30 (and do the same for `PCA`

). Regarding the `classif.lda.method`

we will only be interested in `"moment"`

estimation vs. minimum volume ellipsoid covariance estimation (`"mve"`

). Moreover, we set the lower bound of `classif.ranger.mtry`

to 200 (which is around the number of features divided by 10) and the upper bound to 1000.

```
tune_ps1 = ps(
branch_learner.selection =
p_fct(c("pca_ica_lda", "ranger")),
branch_preproc_lda.selection =
p_fct(c("pca", "ica"), depends = branch_learner.selection == "pca_ica_lda"),
pca.rank. =
p_int(1, 30, depends = branch_preproc_lda.selection == "pca"),
ica.n.comp =
p_int(1, 30, depends = branch_preproc_lda.selection == "ica"),
classif.lda.method =
p_fct(c("moment", "mve"), depends = branch_preproc_lda.selection == "ica"),
classif.ranger.mtry =
p_int(200, 1000, depends = branch_learner.selection == "ranger"),
classif.ranger.num.trees =
p_int(500, 2000, depends = branch_learner.selection == "ranger"))
```

The parameter `branch_learner.selection`

defines whether we go down the left (`PCA`

/ `ICA`

followed by `LDA`

) or the right branch (`ranger`

). The parameter `branch_preproc_lda.selection`

defines whether a `PCA`

or `ICA`

will be applied prior to the `LDA`

. The other parameters directly belong to the `ParamSet`

of the `PCA`

/ `ICA`

/ `LDA`

/ `ranger`

. Note that it only makes sense to switch between `PCA`

/ `ICA`

if the `"pca_ica_lda"`

branch was selected beforehand. We have to specify this via the `depends`

parameter.

Finally, we also could have tuned the numeric parameters on a log scale. For example, looking at `pca.rank.`, the performance difference between rank 1 and rank 2 is probably much larger than between rank 29 and rank 30. The mlr3tuning tutorial covers such transformations.
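As a hedged sketch (not part of the original analysis), such a transformation could look as follows, sampling `pca.rank.` on a log scale and mapping it back to an integer:

```
# hypothetical sketch: tune pca.rank. on a log scale and map back to an integer
tune_ps_log = ps(
  pca.rank. = p_dbl(log(1), log(30), trafo = function(x) as.integer(round(exp(x))))
)
generate_design_grid(tune_ps_log, resolution = 4)$transpose()
```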

We can now tune the parameters of our graph as defined in the search space with respect to a measure. We will use the `classification accuracy`

. As a resampling method we use `3-fold cross-validation`

. We will use the `TerminatorNone`

(i.e., no early termination) for terminating the tuning because we will apply a `grid search` (we use a `grid search` because it gives nicely plottable and understandable results; if there were many more parameters, `random search` or more intelligent optimization methods would be preferable):

```
tune1 = TuningInstanceSingleCrit$new(
task_train,
learner = graph1,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.acc"),
search_space = tune_ps1,
terminator = trm("none")
)
```

We then perform a `grid search`

using a resolution of 4 for the numeric parameters. The grid being used will look like the following (note that the dependencies we specified above are handled automatically):

`generate_design_grid(tune_ps1, resolution = 4)`

We trigger the tuning.

```
tuner_gs = tnr("grid_search", resolution = 4, batch_size = 10)
tuner_gs$optimize(tune1)
```

```
branch_learner.selection branch_preproc_lda.selection pca.rank. ica.n.comp classif.lda.method classif.ranger.mtry
1: pca_ica_lda ica NA 10 mve NA
classif.ranger.num.trees learner_param_vals x_domain classif.acc
1: NA <list[8]> <list[4]> 0.984127
```

Now, we can inspect the results ordered by the `classification accuracy`

:

`as.data.table(tune1$archive)[order(classif.acc), ]`

We achieve very good accuracy using `ranger`, more or less regardless of how `mtry` and `num.trees` are set. However, the `LDA` also shows very good accuracy when combined with `PCA` or `ICA` retaining 30 components.

For now, we decide to use `ranger`

with `mtry`

set to 200 and `num.trees`

set to 1000.

Setting these parameters manually in our graph, then training on the training task and predicting on the test task yields an accuracy of:

```
graph1$param_set$values$branch_learner.selection = "ranger"
graph1$param_set$values$classif.ranger.mtry = 200
graph1$param_set$values$classif.ranger.num.trees = 1000
graph1$train(task_train)
```

```
$unbranch_learner.output
NULL
```

`graph1$predict(task_test)[[1L]]$score(msr("classif.acc"))`

```
classif.acc
1
```

Note that we also could have wrapped our graph in a `GraphLearner`

and proceeded to use this as a learner in an `AutoTuner`

.
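A minimal sketch of this approach (assuming `graph1` as constructed above, before the parameter values were fixed manually) could look like the following; it is not run here.

```
# hypothetical sketch: wrap the graph in a GraphLearner and tune it with an AutoTuner
graph_learner1 = as_learner(graph1)
at = auto_tuner(
  method = "grid_search",
  learner = graph_learner1,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.acc"),
  search_space = tune_ps1,
  store_models = TRUE
)
# at$train(task_train) would then run the inner tuning internally
```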

Instead of using branches to split our graph with respect to the learner and preprocessing options, we can also use `PipeOpProxy`

. `PipeOpProxy`

accepts a single `content`

parameter that can contain any other `PipeOp`

or `Graph`

. This is extremely flexible in the sense that we do not have to specify our options during construction. However, the parameters of the contained `PipeOp`

or `Graph`

are no longer directly contained in the `ParamSet`

of the resulting graph. Therefore, when tuning the graph, we do have to make use of a `trafo`

function.

```
graph2 =
po("colapply", applicator = function(x) log(x, base = 10)) %>>%
po("scale") %>>%
po("proxy")
```

This graph now looks like the following:

`graph2$plot(html = FALSE)`

At first, this may look like a linear graph. However, as the `content`

parameter of `PipeOpProxy`

can be tuned and set to contain any other `PipeOp`

or `Graph`

, this will allow for a similar non-linear graph as when doing branching.
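For example, the content of the proxy can be swapped at any time by setting the `proxy.content` parameter, mirroring what the `trafo` function will later do automatically during tuning:

```
# illustration: manually set the proxy content to a plain ranger learner
graph2$param_set$values$proxy.content = lrn("classif.ranger")
```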

`graph2$param_set$ids()`

```
[1] "colapply.applicator" "colapply.affect_columns" "scale.center" "scale.scale"
[5] "scale.robust" "scale.affect_columns" "proxy.content"
```

We can tune the graph by using the same search space as before. However, here the `trafo`

function is of central importance to actually set our options and parameters:

`tune_ps2 = tune_ps1$clone(deep = TRUE)`

The `trafo`

function does all the work, i.e., selecting either the `PCA`

/ `ICA`

-`LDA`

or `ranger`

as the `proxy.content`

as well as setting the parameters of the respective preprocessing `PipeOp`

s and `Learner`

s.

```
proxy_options = list(
pca_ica_lda =
ppl("branch", graphs = list(pca = po("pca"), ica = po("ica"))) %>>%
lrn("classif.lda"),
ranger = lrn("classif.ranger")
)
```

Above, we made use of the `branch`

`ppl`

allowing us to easily construct a branching graph. Of course, we also could have used another nested `PipeOpProxy`

to specify the preprocessing options (`"pca"`

vs. `"ica"`

) within `proxy_options`

if for some reason we do not want to do branching at all. The `trafo`

function below selects one of the `proxy_options`

from above and sets the respective parameters for the `PCA`

, `ICA`

, `LDA`

and `ranger`

. Here, the argument `x`

is a list which will contain sampled / selected parameters from our `ParamSet`

(in our case, `tune_ps2`

). The return value is a list only including the appropriate `proxy.content`

parameter. In each tuning iteration, the `proxy.content`

parameter of our graph will be set to this value.

```
tune_ps2$trafo = function(x, param_set) {
proxy.content = proxy_options[[x$branch_learner.selection]]
if (x$branch_learner.selection == "pca_ica_lda") {
# pca_ica_lda
proxy.content$param_set$values$branch.selection = x$branch_preproc_lda.selection
if (x$branch_preproc_lda.selection == "pca") {
proxy.content$param_set$values$pca.rank. = x$pca.rank.
} else {
proxy.content$param_set$values$ica.n.comp = x$ica.n.comp
}
proxy.content$param_set$values$classif.lda.method = x$classif.lda.method
} else {
# ranger
proxy.content$param_set$values$mtry = x$classif.ranger.mtry
proxy.content$param_set$values$num.trees = x$classif.ranger.num.trees
}
list(proxy.content = proxy.content)
}
```

For example, suppose that the following parameters are selected from our `ParamSet`:

```
x = list(
branch_learner.selection = "ranger",
classif.ranger.mtry = 200,
classif.ranger.num.trees = 500)
```

The `trafo`

function will then return:

`tune_ps2$trafo(x)`

```
$proxy.content
<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=1, mtry=200, num.trees=500
* Packages: mlr3, mlr3learners, ranger
* Predict Types: [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: hotstart_backward, importance, multiclass, oob_error, twoclass, weights
```

Tuning can be carried out analogously as done above:

```
tune2 = TuningInstanceSingleCrit$new(
task_train,
learner = graph2,
resampling = rsmp("cv", folds = 3),
measure = msr("classif.acc"),
search_space = tune_ps2,
terminator = trm("none")
)
tuner_gs$optimize(tune2)
```

`as.data.table(tune2$archive)[order(classif.acc), ]`

Morais, Camilo LM, and Kássio MG Lima. 2018. “Principal Component Analysis with Linear and Quadratic Discriminant Analysis for Identification of Cancer Samples Based on Mass Spectrometry.” *Journal of the Brazilian Chemical Society* 29 (3): 472–81. https://doi.org/10.21577/0103-5053.20170159.

`Tuner`s for real-valued search spaces are not able to tune integer hyperparameters directly. However, it is possible to round the real values proposed by a `Tuner` to integers before passing them to the learner in the evaluation. We show how to apply a parameter transformation to a `ParamSet` and use this set in the tuning process.

We load the mlr3verse package which pulls in the most important packages for this example.

`library(mlr3verse)`

`Loading required package: mlr3`

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented.

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In this example, we use the k-Nearest-Neighbor classification learner. We want to tune the integer-valued hyperparameter `k`

which defines the number of neighbors.

```
learner = lrn("classif.kknn")
print(learner$param_set$params$k)
```

```
id class lower upper levels default
1: k ParamInt 1 Inf 7
```

We choose generalized simulated annealing as tuning strategy. The `param_classes`

field of `TunerGenSA`

states that the tuner only supports real-valued (`ParamDbl`

) hyperparameter tuning.

`print(tnr("gensa"))`

```
<TunerGenSA>: Generalized Simulated Annealing
* Parameters: trace.mat=FALSE, smooth=FALSE
* Parameter classes: ParamDbl
* Properties: single-crit
* Packages: mlr3tuning, bbotk, GenSA
```

To get integer-valued hyperparameter values for `k`

, we construct a search space with a transformation function. The `as.integer()`

function converts any real-valued number to an integer by truncating the decimal places.

```
search_space = ps(
k = p_dbl(lower = 3, upper = 7.99, trafo = as.integer)
)
```

We start the tuning and compare the results in the search space to the results in the learner's hyperparameter space.

```
instance = tune(
method = "gensa",
task = tsk("iris"),
learner = learner,
resampling = rsmp("holdout"),
measure = msr("classif.ce"),
term_evals = 20,
search_space = search_space)
```

```
Warning in optim(theta.old, fun, gradient, control = control, method = method, : one-dimensional optimization by Nelder-Mead is unreliable:
use "Brent" or optimize() directly
```

The optimal `k`

is still a real number in the search space.

`instance$result_x_search_space`

```
k
1: 3.82686
```

However, in the learner's hyperparameter space, `k` is an integer value.

`instance$result_x_domain`

```
$k
[1] 3
```

The archive shows us that for all real-valued `k`

proposed by GenSA, an integer-valued `k`

in the learner hyperparameter space (`x_domain_k`

) was created.

`as.data.table(instance$archive)[, .(k, classif.ce, x_domain_k)]`

```
k classif.ce x_domain_k
1: 3.826860 0.06 3
2: 5.996323 0.06 5
3: 5.941332 0.06 5
4: 3.826860 0.06 3
5: 3.826860 0.06 3
6: 3.826860 0.06 3
7: 4.209546 0.06 4
8: 3.444174 0.06 3
9: 4.018203 0.06 4
10: 3.635517 0.06 3
11: 3.922532 0.06 3
12: 3.731189 0.06 3
13: 3.874696 0.06 3
14: 3.779024 0.06 3
15: 3.850778 0.06 3
16: 3.802942 0.06 3
17: 3.838819 0.06 3
18: 3.814901 0.06 3
19: 3.832840 0.06 3
20: 3.820881 0.06 3
```

Internally, `TunerGenSA`

was given the parameter types of the search space and therefore suggested real numbers for `k`

. Before the performance of the different `k`

values was evaluated, the transformation function of the `search_space`

parameter set was called and `k`

was transformed to an integer value.
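We can reproduce this mechanism by hand on a few sampled configurations; `$transpose()` applies the transformation stored in the search space.

```
# small illustration: sample configurations and apply the trafo manually
design = generate_design_random(search_space, 3)
design$transpose()  # each real-valued k is converted to an integer
```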

Note that the tuner is not aware of the transformation. This has two problematic consequences: First, the tuner might propose different real-valued configurations that, after rounding, turn out to be already evaluated configurations, so we end up re-evaluating the same hyperparameter configuration. This is only problematic if we optimize integer parameters exclusively. Second, the rounding introduces discontinuities which can be problematic for some tuners.

We successfully tuned an integer-valued hyperparameter with `TunerGenSA`, which is only suitable for a real-valued search space. This technique is not limited to tuning problems: an `Optimizer` in bbotk can also be used in the same way to produce points with integer parameters.

In this tutorial, we introduce the mlr3fselect package by comparing feature selection methods on the Titanic disaster data set. The objective of feature selection is to enhance the interpretability of models, speed up the learning process and increase the predictive performance.

We load the mlr3verse package which pulls in the most important packages for this example.

```
library(mlr3verse)
library(mlr3fselect)
```

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented.

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

The Titanic data set contains data for 887 Titanic passengers, including whether they survived when the Titanic sank. Our goal will be to predict the survival of the Titanic passengers.

After loading the data set from the mlr3data package, we impute the missing age values with the median age of the passengers, set missing embarked values to `"S"` and remove `character` features. We could use feature engineering to create new features from the `character` features; however, we want to focus on feature selection in this tutorial.

In addition to the `survived`

column, the reduced data set contains the following attributes for each passenger:

Feature | Description |
---|---|
`age` | Age |
`sex` | Sex |
`sib_sp` | Number of siblings / spouses aboard |
`parch` | Number of parents / children aboard |
`fare` | Amount paid for the ticket |
`pclass` | Passenger class |
`embarked` | Port of embarkation |

```
library(mlr3data)
data("titanic", package = "mlr3data")
titanic$age[is.na(titanic$age)] = median(titanic$age, na.rm = TRUE)
titanic$embarked[is.na(titanic$embarked)] = "S"
titanic$ticket = NULL
titanic$name = NULL
titanic$cabin = NULL
titanic = titanic[!is.na(titanic$survived),]
```

We construct a binary classification task.

`task = as_task_classif(titanic, target = "survived", positive = "yes")`

We use the logistic regression learner provided by the mlr3learners package.

```
library(mlr3learners)
learner = lrn("classif.log_reg")
```

To evaluate the predictive performance, we choose a 3-fold cross-validation and the classification error as the measure.

```
resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")
resampling$instantiate(task)
```

The `FSelectInstanceSingleCrit`

class specifies a general feature selection scenario. It includes the `ObjectiveFSelect`

object that encodes the black box objective function which is optimized by a feature selection algorithm. The evaluated feature sets are stored in an `ArchiveFSelect`

object. The archive provides a method for querying the best performing feature set.

The `Terminator`

classes determine when to stop the feature selection. In this example we choose a terminator that stops the feature selection after 10 seconds. The sugar functions `trm()`

and `trms()`

can be used to retrieve terminators from the `mlr_terminators`

dictionary.

```
terminator = trm("run_time", secs = 10)
FSelectInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator)
```

```
<FSelectInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveFSelect:classif.log_reg_on_titanic>
* Terminator: <TerminatorRunTime>
```

The `FSelector`

subclasses describe the feature selection strategy. The sugar function `fs()`

can be used to retrieve feature selection algorithms from the `mlr_fselectors`

dictionary.

`mlr_fselectors`

```
<DictionaryFSelector> with 8 stored values
Keys: design_points, exhaustive_search, genetic_search, random_search, rfe, rfecv, sequential,
shadow_variable_search
```

Random search randomly draws feature sets and evaluates them in batches. We retrieve the `FSelectorRandomSearch`

class with the `fs()`

sugar function and choose `TerminatorEvals`

. We set the `n_evals`

parameter to `10`

which means that 10 feature sets are evaluated.

```
terminator = trm("evals", n_evals = 10)
instance = FSelectInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator)
fselector = fs("random_search", batch_size = 5)
```

The feature selection is started by passing the `FSelectInstanceSingleCrit`

object to the `$optimize()`

method of `FSelectorRandomSearch`

which generates the feature sets. These feature sets are internally passed to the `$eval_batch()` method of `FSelectInstanceSingleCrit`, which evaluates them with the objective function and stores the results in the archive. This general interaction between the objects of **mlr3fselect** stays the same for the different feature selection methods. However, the way new feature sets are generated differs depending on the chosen `FSelector` subclass.

`fselector$optimize(instance)`

```
age embarked fare parch pclass sex sib_sp features classif.ce
1: TRUE FALSE TRUE TRUE TRUE TRUE TRUE age,fare,parch,pclass,sex,sib_sp 0.2020202
```

The `ArchiveFSelect`

stores a `data.table::data.table()`

which consists of the evaluated feature sets and the corresponding estimated predictive performances.

`as.data.table(instance$archive, exclude_columns = c("runtime_learners", "resample_result", "uhash"))`

```
age embarked fare parch pclass sex sib_sp classif.ce timestamp batch_nr warnings errors
1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2031425 2023-03-03 10:45:07 1 0 0
2: TRUE FALSE FALSE FALSE FALSE FALSE TRUE 0.3838384 2023-03-03 10:45:07 1 0 0
3: FALSE FALSE FALSE TRUE FALSE FALSE TRUE 0.3804714 2023-03-03 10:45:07 1 0 0
4: FALSE FALSE TRUE FALSE FALSE FALSE FALSE 0.3288440 2023-03-03 10:45:07 1 0 0
5: FALSE FALSE TRUE FALSE FALSE TRUE FALSE 0.2188552 2023-03-03 10:45:07 1 0 0
6: FALSE FALSE FALSE FALSE TRUE FALSE FALSE 0.3209877 2023-03-03 10:45:08 2 0 0
7: TRUE FALSE FALSE FALSE FALSE FALSE TRUE 0.3838384 2023-03-03 10:45:08 2 0 0
8: TRUE FALSE TRUE TRUE TRUE TRUE TRUE 0.2020202 2023-03-03 10:45:08 2 0 0
9: TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2031425 2023-03-03 10:45:08 2 0 0
10: TRUE FALSE TRUE TRUE FALSE FALSE FALSE 0.3389450 2023-03-03 10:45:08 2 0 0
features
1: age,embarked,fare,parch,pclass,sex,...
2: age,sib_sp
3: parch,sib_sp
4: fare
5: fare,sex
6: pclass
7: age,sib_sp
8: age,fare,parch,pclass,sex,sib_sp
9: age,embarked,fare,parch,pclass,sex,...
10: age,fare,parch
```

The associated resampling iterations can be accessed in the `BenchmarkResult`

by calling

`instance$archive$benchmark_result`

```
<BenchmarkResult> of 30 rows with 10 resampling runs
nr task_id learner_id resampling_id iters warnings errors
1 titanic classif.log_reg cv 3 0 0
2 titanic classif.log_reg cv 3 0 0
3 titanic classif.log_reg cv 3 0 0
4 titanic classif.log_reg cv 3 0 0
5 titanic classif.log_reg cv 3 0 0
6 titanic classif.log_reg cv 3 0 0
7 titanic classif.log_reg cv 3 0 0
8 titanic classif.log_reg cv 3 0 0
9 titanic classif.log_reg cv 3 0 0
10 titanic classif.log_reg cv 3 0 0
```

We retrieve the best performing feature set with

`instance$result`

```
age embarked fare parch pclass sex sib_sp features classif.ce
1: TRUE FALSE TRUE TRUE TRUE TRUE TRUE age,fare,parch,pclass,sex,sib_sp 0.2020202
```

We try sequential forward selection. We choose `TerminatorStagnation`, which stops the feature selection if the predictive performance does not increase anymore.

```
terminator = trm("stagnation", iters = 5)
instance = FSelectInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator)
fselector = fs("sequential")
fselector$optimize(instance)
```

```
age embarked fare parch pclass sex sib_sp features classif.ce
1: FALSE FALSE FALSE TRUE TRUE TRUE TRUE parch,pclass,sex,sib_sp 0.1964085
```

The `FSelectorSequential`

object has a special method for displaying the optimization path of the sequential feature selection.

`fselector$optimization_path(instance)`

```
age embarked fare parch pclass sex sib_sp classif.ce batch_nr
1: TRUE FALSE FALSE FALSE FALSE FALSE FALSE 0.3838384 1
2: TRUE FALSE FALSE FALSE FALSE TRUE FALSE 0.2132435 2
3: TRUE FALSE FALSE FALSE FALSE TRUE TRUE 0.2087542 3
4: TRUE FALSE FALSE FALSE TRUE TRUE TRUE 0.2143659 4
5: TRUE FALSE FALSE TRUE TRUE TRUE TRUE 0.2065095 5
6: TRUE FALSE TRUE TRUE TRUE TRUE TRUE 0.2020202 6
```

Recursive feature elimination utilizes the `$importance()`

method of learners. In each iteration the feature(s) with the lowest importance score is dropped. We choose the non-recursive algorithm (`recursive = FALSE`

) which calculates the feature importance once on the complete feature set. The recursive version (`recursive = TRUE`

) recomputes the feature importance on the reduced feature set in every iteration.

```
learner = lrn("classif.ranger", importance = "impurity")
terminator = trm("none")
instance = FSelectInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator,
store_models = TRUE)
fselector = fs("rfe", recursive = FALSE)
fselector$optimize(instance)
```

```
age embarked fare parch pclass sex sib_sp features classif.ce
1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE age,embarked,fare,parch,pclass,sex,... 0.1694725
```

We access the results.

`as.data.table(instance$archive, exclude_columns = c("runtime_learners", "timestamp", "batch_nr", "resample_result", "uhash"))`

```
age embarked fare parch pclass sex sib_sp classif.ce warnings errors importance
1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.1694725 0 0 7,6,5,4,3,2,...
2: TRUE FALSE TRUE FALSE FALSE TRUE FALSE 0.2132435 0 0 7,6,5
features
1: age,embarked,fare,parch,pclass,sex,...
2: age,fare,sex
```

It is a common mistake to report the predictive performance estimated on resampling sets during the feature selection as the performance that can be expected from the combined feature selection and model training. The repeated evaluation of the model might leak information about the test sets into the model and thus lead to over-fitting and over-optimistic performance results. Nested resampling uses an outer and an inner resampling to separate the feature selection from the performance estimation of the model. We can use the `AutoFSelector` class for running nested resampling. The `AutoFSelector` essentially combines a given `Learner` and a feature selection method into a `Learner` with internal automatic feature selection. The inner resampling loop that is used to determine the best feature set is conducted internally each time the `AutoFSelector` `Learner` object is trained.

```
resampling_inner = rsmp("cv", folds = 5)
measure = msr("classif.ce")
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("sequential"),
store_models = TRUE)
```

We put the `AutoFSelector` into a `resample()` call to get the outer resampling loop.

```
resampling_outer = rsmp("cv", folds = 3)
rr = resample(task, at, resampling_outer, store_models = TRUE)
```

The aggregated performance of all outer resampling iterations is the unbiased predictive performance we can expect from the random forest model with an optimized feature set found by sequential forward selection.

`rr$aggregate()`

```
classif.ce
0.1829405
```

We check whether the feature sets that were selected in the inner resampling are stable. The selected feature sets should not differ too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduce too much randomness. Usually, we aim for the selection of similar feature sets for all outer training sets.

`extract_inner_fselect_results(rr)`

Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling. Significantly lower predictive performances on the outer resampling indicate that the models with the optimized feature sets overfit the data.

`rr$score()[, .(iteration, task_id, learner_id, resampling_id, classif.ce)]`

```
iteration task_id learner_id resampling_id classif.ce
1: 1 titanic classif.ranger.fselector cv 0.1515152
2: 2 titanic classif.ranger.fselector cv 0.1952862
3: 3 titanic classif.ranger.fselector cv 0.2020202
```

The archives of the `AutoFSelector`s give us all evaluated feature sets with the associated predictive performances.

`extract_inner_fselect_archives(rr)`

Selecting a feature subset can be shortened by using the `fselect()` shortcut.

```
instance = fselect(
method = "random_search",
task = tsk("iris"),
learner = lrn("classif.log_reg"),
resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 10
)
```

Applying nested resampling can be shortened by using the `fselect_nested()` shortcut.

```
rr = fselect_nested(
method = "random_search",
task = tsk("iris"),
learner = lrn("classif.log_reg"),
inner_resampling = rsmp ("cv", folds = 3),
outer_resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
term_evals = 10
)
```

Predicting probabilities in classification tasks allows us to adjust the probability thresholds required for assigning an observation to a certain class. This can lead to improved classification performance, especially in cases where we aim to balance metrics such as the false positive and false negative rates.

This is, for example, often done in ROC analysis. The mlr3book also has a chapter on ROC Analysis for the interested reader. This post does not focus on ROC analysis, but instead focuses on the general problem of adjusting classification thresholds for arbitrary metrics.

This post assumes some familiarity with mlr3, and also with the mlr3pipelines and mlr3tuning packages, as both are used throughout. The mlr3book contains more details on these two packages. This post is a more in-depth version of the article on threshold tuning in the mlr3book.

We load the mlr3verse package which pulls in the most important packages for this example.

`library(mlr3verse)`

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In order to understand thresholds, we will quickly showcase the effect of setting different thresholds:

First we create a learner that predicts probabilities and use it to predict on holdout data, storing the prediction.

```
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(tsk("pima"), learner, rsmp("holdout"))
prd = rr$prediction()
prd
```

```
<PredictionClassif> for 256 observations:
row_ids truth response prob.pos prob.neg
4 neg neg 0.1057692 0.8942308
6 neg neg 0.0200000 0.9800000
10 pos neg 0.1428571 0.8571429
---
764 neg neg 0.2777778 0.7222222
766 neg neg 0.0200000 0.9800000
767 pos pos 0.8000000 0.2000000
```

If we now look at the confusion matrix, the off-diagonal elements are errors made by our model (*false positives* and *false negatives*), while the on-diagonal elements are where our model predicted correctly.

```
# Print confusion matrix
prd$confusion
```

```
truth
response pos neg
pos 53 27
neg 37 139
```

```
# Print False Positives and False Negatives
prd$score(list(msr("classif.fp"), msr("classif.fn")))
```

```
classif.fp classif.fn
27 37
```

By adjusting the **classification threshold**, in this case the probability required to predict the positive class, we can now trade off predicting more positive cases (first row) against predicting fewer negative cases (second row) or vice versa.

```
# Lower threshold: More positives
prd$set_threshold(0.25)$confusion
```

```
truth
response pos neg
pos 78 71
neg 12 95
```

```
# Higher threshold: Fewer positives
prd$set_threshold(0.75)$confusion
```

```
truth
response pos neg
pos 52 20
neg 38 146
```

This threshold value can now be adjusted optimally for a given measure, such as accuracy. How this can be done is discussed in the following section.
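For illustration, a simple manual sweep over candidate thresholds, using the `prd` object from above and accuracy as the measure, could look like the following sketch:

```
# Illustrative sketch: score accuracy for a grid of thresholds on the stored prediction
thresholds = seq(0.1, 0.9, by = 0.1)
acc = sapply(thresholds, function(t) prd$set_threshold(t)$score(msr("classif.acc")))
round(setNames(acc, thresholds), 3)
```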

Currently, mlr3pipelines offers two main strategies for adjusting classification thresholds. We can either expose the thresholds as a hyperparameter of the Learner by using `PipeOpThreshold`, which allows us to tune the thresholds via an outside optimizer from mlr3tuning. Alternatively, we can use `PipeOpTuneThreshold`, which automatically tunes the threshold after each learner fit.

In this blog post, we will go through both strategies.

`PipeOpThreshold` can be put directly after a `Learner`. A simple example would be:

```
gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = GraphLearner$new(gr)
```

Note that `predict_type = "prob"` is required for `po("threshold")` to have any effect.

The `thresholds` are now exposed as a hyperparameter of the `GraphLearner` we created:

`as.data.table(l$param_set)[, .(id, class, lower, upper, nlevels)]`

```
id class lower upper nlevels
1: classif.rpart.cp ParamDbl 0 1 Inf
2: classif.rpart.keep_model ParamLgl NA NA 2
3: classif.rpart.maxcompete ParamInt 0 Inf Inf
4: classif.rpart.maxdepth ParamInt 1 30 30
5: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
6: classif.rpart.minbucket ParamInt 1 Inf Inf
7: classif.rpart.minsplit ParamInt 1 Inf Inf
8: classif.rpart.surrogatestyle ParamInt 0 1 2
9: classif.rpart.usesurrogate ParamInt 0 2 3
10: classif.rpart.xval ParamInt 0 Inf Inf
11: threshold.thresholds ParamUty NA NA Inf
```

We can now tune those thresholds from the outside as follows:

Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the `thresholds` parameter of the `threshold` `PipeOp`. You can easily imagine that we could also jointly tune over additional hyperparameters, e.g. rpart's `cp` parameter.

As the `Task` we aim to optimize for is a binary task, we can simply specify the threshold parameter:

```
search_space = ps(
threshold.thresholds = p_dbl(lower = 0, upper = 1)
)
```

We now create an `AutoTuner`, which automatically tunes the supplied learner over the `ParamSet` we defined above.

```
at = auto_tuner(
method = "random_search",
learner = l,
resampling = rsmp("cv", folds = 3L),
measure = msr("classif.ce"),
search_space = search_space,
term_evals = 5L
)
at$train(tsk("german_credit"))
```
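After training, we can peek at the threshold the tuner selected. A quick check, using the `tuning_result` field provided by `AutoTuner`:

```
# Inspect the best threshold configuration found during tuning
at$tuning_result
```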

For multi-class `Tasks`, this is a little more complicated. We have to use a `trafo` to transform a set of `ParamDbl` into the desired format for `threshold.thresholds`: a named numeric vector containing the thresholds. This can easily be achieved via a `trafo` function:

```
search_space = ps(
versicolor = p_dbl(lower = 0, upper = 1),
setosa = p_dbl(lower = 0, upper = 1),
virginica = p_dbl(lower = 0, upper = 1),
.extra_trafo = function(x, param_set) {
list(threshold.thresholds = mlr3misc::map_dbl(x, identity))
}
)
```

Inside the `.extra_trafo`, we simply collect all set params into a named vector via `map_dbl` and store it in the `threshold.thresholds` slot expected by the learner.

Again, we create an `AutoTuner`, which automatically tunes the supplied learner over the `ParamSet` we defined above.

```
at_2 = auto_tuner(
method = "random_search",
learner = l,
resampling = rsmp("cv", folds = 3L),
measure = msr("classif.ce"),
search_space = search_space,
term_evals = 5L
)
at_2$train(tsk("iris"))
```

One drawback of this strategy is that it requires us to fit a new model for each threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.

`PipeOpTuneThreshold`, on the other hand, works together with `PipeOpLearnerCV`. It directly optimizes the cross-validated predictions made by this `PipeOp`.

A simple example would be:

```
gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>%
po("tunethreshold")
l2 = GraphLearner$new(gr)
```

Note that `predict_type = "prob"` is required for `po("tunethreshold")` to have any effect. Additionally, note that this time no `threshold` parameter is exposed; it is tuned automatically and internally.

`as.data.table(l2$param_set)[, .(id, class, lower, upper, nlevels)]`

```
id class lower upper nlevels
1: classif.rpart.resampling.method ParamFct NA NA 2
2: classif.rpart.resampling.folds ParamInt 2 Inf Inf
3: classif.rpart.resampling.keep_response ParamLgl NA NA 2
4: classif.rpart.cp ParamDbl 0 1 Inf
5: classif.rpart.keep_model ParamLgl NA NA 2
6: classif.rpart.maxcompete ParamInt 0 Inf Inf
7: classif.rpart.maxdepth ParamInt 1 30 30
8: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
9: classif.rpart.minbucket ParamInt 1 Inf Inf
10: classif.rpart.minsplit ParamInt 1 Inf Inf
11: classif.rpart.surrogatestyle ParamInt 0 1 2
12: classif.rpart.usesurrogate ParamInt 0 2 3
13: classif.rpart.xval ParamInt 0 Inf Inf
14: classif.rpart.affect_columns ParamUty NA NA Inf
15: tunethreshold.measure ParamUty NA NA Inf
16: tunethreshold.optimizer ParamUty NA NA Inf
17: tunethreshold.log_level ParamUty NA NA Inf
```

If we now use the `GraphLearner`, it automatically adjusts the thresholds during prediction.

Note that we can set `ResamplingInsample` as the resampling strategy for `PipeOpLearnerCV` in order to evaluate predictions on the "training" data. This is generally not advised, as it might lead to over-fitting on the thresholds, but it can significantly reduce runtime.
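As a sketch, this could be done by setting the parameter id shown in the table above (trading a small risk of threshold over-fitting for speed):

```
# Sketch: use insample resampling inside PipeOpLearnerCV to reduce runtime
l2$param_set$values$classif.rpart.resampling.method = "insample"
```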

Finally, we can compare no threshold tuning to the `tunethreshold` approach:

```
bmr = benchmark(benchmark_grid(
learners = list(no_tuning = lrn("classif.rpart"), internal = l2),
tasks = tsk("german_credit"),
rsmp("cv", folds = 3L)
))
```

`bmr$aggregate(list(msr("classif.ce"), msr("classif.fnr")))`

```
nr resample_result task_id learner_id resampling_id iters classif.ce classif.fnr
1: 1 <ResampleResult[21]> german_credit classif.rpart cv 3 0.2760095 0.12723983
2: 2 <ResampleResult[21]> german_credit classif.rpart.tunethreshold cv 3 0.2879916 0.04485325
```

The following examples were created as part of the Introduction to Machine Learning Lecture at LMU Munich. The goal of the project was to create and compare one or several machine learning pipelines for the problem at hand, together with exploratory analysis and an exposition of results. The posts were contributed to the mlr3gallery by the authors and edited for better legibility by the editor. We want to thank the authors for allowing us to publish their results. Note that the correctness of the results cannot be guaranteed.

This tutorial assumes familiarity with the basics of mlr3tuning and mlr3pipelines. Consult the mlr3book if some aspects are not fully understandable. We load the most important packages for this example.

```
library(mlr3verse)
library(dplyr)
library(tidyr)
library(DataExplorer)
library(ggplot2)
library(gridExtra)
```

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

Note that expensive calculations are pre-saved in .rds files in this tutorial to save computational time.

Machine learning (ML), a branch of both computer science and statistics, in conjunction with new computing technologies has been transforming research and industries across the board over the past decade. A prime example for this is the healthcare industry, where applications of ML, as well as artificial intelligence in general, have become more and more popular in recent years. One very frequently researched and applied use of ML in the medical field is the area of disease identification and diagnosis. ML technologies have shown potential in detecting anomalies and diseases through pattern recognition, even though an entirely digital diagnosis by a computer is probably still something for the far future. However, suitable and reliable models estimating the risk of diseases could help real doctors make quicker and better decisions today already. In this use case we examined machine learning algorithms and learners for the specific application of liver disease detection. The task is therefore a binary classification task to predict whether a patient has liver disease or not based on some common diagnostic measurements. This report is organized as follows. Section 1 introduces the data and section 2 provides more in-depth data exploration. Section 3 presents learners and their hyperparameter tuning while section 4, dealing with model fitting and benchmarking, presents results and conclusions.

The data set we used for our project is the “Indian Liver Patient Dataset”, which was obtained from the mlr3data package. It was donated by three professors from India in 2012 (“UCI Machine Learning Repository” n.d.).

```
# Importing data
data("ilpd", package = "mlr3data")
```

It contains data for 583 patients collected in the north east of Andhra Pradesh, one of the 28 states of India. The observations are divided into two classes based on the patient having liver disease or not. Besides the class variable, which is our target, ten, mostly numerical, features are provided. To describe the features in more detail, the table below lists the variables included in the dataset.

Variable | Description |
---|---|

age | Age of the patient (all patients above 89 are labelled as 90) |

gender | Sex of the patient (1 = female, 0 = male) |

total_bilirubin | Total serum bilirubin (in mg/dL) |

direct_bilirubin | Direct bilirubin level (in mg/dL) |

alkaline_phosphatase | Serum alkaline phosphatase level (in U/L) |

alanine_transaminase | Serum alanine transaminase level (in U/L) |

aspartate_transaminase | Serum aspartate transaminase level (in U/L) |

total_protein | Total serum protein (in g/dL) |

albumin | Serum albumin level (in g/dL) |

albumin_globulin_ratio | Albumin-to-globulin ratio |

diseased | Target variable (1 = liver disease, 0 = no liver disease) |

As one can see, besides age and gender, the dataset contains eight additional numerical features. While the names and corresponding measurements look rather cryptic to the uninformed eye, they are all part of standard blood tests conducted to gather information about the state of a patient's liver, so-called liver function tests. All of these measurements are frequently used markers for liver disease. For the first five, measuring the chemical compound bilirubin and the three enzymes alkaline phosphatase, alanine transaminase and aspartate transaminase, elevated levels indicate liver disease (Gowda et al. 2009; Oh and Hustead 2011). For the remaining three, which concern protein levels, lower-than-normal values suggest a liver problem (Carvalho and Machado 2018; “Total Protein, Albumin-Globulin (A/G) Ratio” n.d.). Lastly, one should note that some of the measurements are part of more than one variable. For example, the total serum bilirubin is simply the sum of both the direct and indirect bilirubin levels, and the amount of albumin is used to calculate the values of the total serum protein as well as the albumin-to-globulin ratio. So, one might already suspect that some of the features are highly correlated with one another, but more on that kind of analysis in the following section.

Next, we looked into the univariate distribution of each of the variables. We began with the target and the only discrete feature, gender, which are both binary.

The distribution of the target variable is quite imbalanced, as the barplot shows: the number of patients with and without liver disease equals 416 and 167, respectively. The underrepresentation of a class, in our case those without liver disease, might worsen the performance of ML models. In order to examine this, we additionally fitted the models on a dataset where we randomly over-sampled the minority class, resulting in a perfectly balanced dataset. Furthermore, we applied stratified sampling to ensure the proportion of the classes is maintained during cross-validation.

The only discrete feature gender is quite imbalanced, too. As one can see in the next section, this proportion is also observed within each target class. Prior to that, we looked into the distributions of the metric features.

Strikingly, some of the metric features are extremely right-skewed and contain several extreme values. To reduce the impact of outliers and since some models assume normality of features, we log-transformed these variables.

To picture the relationship between the target and the features, we analysed the distributions of the features by class. First, we examined the discrete feature gender.

The percentage of males in the “disease” class is slightly higher, but overall the difference is small. Besides that, the gender imbalance can be observed in both classes, as we mentioned before. To see the differences in metric features, we compare the following boxplots, where right-skewed features are not log-transformed yet.

Except for the total amount of protein, for each feature we obtain differences between the median values of the two classes. Notably, in the case of strongly right-skewed features the “disease” class contains far more extreme values than the “no disease” class, which is probably because of its larger size. This effect is weakened by log-transforming such features, as can be seen in the boxplots below. Moreover, the dispersion in the class “disease” is greater for these features, as the length of the boxes indicates. Overall, the features seem to be correlated to the target, so it makes sense to use them for this task and model their relationship with the target.

Note that the same result can be achieved more easily by using `PipeOpMutate` from mlr3pipelines, which provides a convenient way to transform numeric features of mlr3 tasks.
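As a sketch, the log-transformations applied in the preprocessing code below could be expressed as such a step (column names as in the table above):

```
# Sketch: log-transform the skewed features via PipeOpMutate instead of dplyr
po_log = po("mutate", mutation = list(
  alkaline_phosphatase = ~ log(alkaline_phosphatase),
  aspartate_transaminase = ~ log(aspartate_transaminase),
  direct_bilirubin = ~ log(direct_bilirubin)
))
```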

As we mentioned in the description of the data, there are features that are indirectly measured by another one. This suggests that they are highly correlated. Some of the models we want to compare assume independent features or have problems with multicollinearity. Therefore, we checked for correlations between features.

For four of the pairs we obtained a very high correlation coefficient. Looking at these features, it is clear they affect each other. As the complexity of the model should be minimized and due to multicollinearity concerns, we decided to keep only one feature of each pair. When deciding on which features to keep, we chose those that are more specific and relevant regarding liver disease. Therefore, we chose albumin over the ratio between albumin and globulin and also over the total amount of protein. The same argument applies to using the amount of direct bilirubin instead of the total amount of bilirubin. Regarding aspartate transaminase and alanine transaminase, it was not clear which one to use, especially since we have no concrete real-world application for the task and no medical training. Since we did not notice any fundamental differences in the data for these two features, we arbitrarily chose aspartate transaminase.

```
## Reducing, transforming and scaling dataset
ilpd = ilpd %>%
select(-total_bilirubin, -alanine_transaminase, -total_protein,
-albumin_globulin_ratio) %>%
mutate(
# Recode gender
gender = as.numeric(ifelse(gender == "Female", 1, 0)),
# Remove labels for class
diseased = factor(ifelse(diseased == "yes", 1, 0)),
# Log for features with skewed distributions
alkaline_phosphatase = log(alkaline_phosphatase),
aspartate_transaminase = log(aspartate_transaminase),
direct_bilirubin = log(direct_bilirubin)
)
po_scale = po("scale")
po_scale$param_set$values$affect_columns =
selector_name(c("age", "direct_bilirubin", "alkaline_phosphatase",
"aspartate_transaminase", "albumin"))
```

Lastly, we standardized all metric features, as different ranges and units might otherwise weight the features unequally. This is especially important for the k-NN model. The following table shows the final dataset and the transformations we applied. **Note**: Different from `log` or other fixed transformations, scaling depends on the data themselves. Scaling the data before they are split leads to data leakage, where information is shared between the train and test sets. As data leakage causes over-optimistic performance estimates, scaling should always be applied separately within each data split induced by the ML workflow. Therefore we strongly recommend the usage of `PipeOpScale` in such cases.
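A minimal sketch of this recommendation, combining the `po_scale` step defined above with one of the learners into a `GraphLearner`, so that scaling is re-estimated within each resampling split:

```
# Sketch: apply scaling inside a GraphLearner so it is fitted per training split
glrn_knn = GraphLearner$new(
  po_scale %>>% lrn("classif.kknn", scale = FALSE, predict_type = "prob")
)
```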

Variable | Transformation |
---|---|

age | scaled |

albumin | scaled |

alkaline_phosphatase | scaled and log-transformed |

aspartate_transaminase | scaled and log-transformed |

direct_bilirubin | scaled and log-transformed |

diseased | none |

gender | none |

First, we need to define a task which contains the final dataset and some meta information. Furthermore, we need to specify the positive class, since the package takes the first one as the positive class by default. The specification of the positive class has an impact on the evaluation later on.

```
## Task definition
task_liver = as_task_classif(ilpd, target = "diseased", positive = "1")
```

In the following we are going to evaluate logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), naive Bayes, k-nearest neighbour (k-NN), classification trees (CART) and random forest on the binary target.

```
## Learner definition
# Use predict type "prob" for the AUC score. Predict on train and test sets to
# detect overfitting
learners = list(
learner_logreg = lrn("classif.log_reg", predict_type = "prob",
predict_sets = c("train", "test")),
learner_lda = lrn("classif.lda", predict_type = "prob",
predict_sets = c("train", "test")),
learner_qda = lrn("classif.qda", predict_type = "prob",
predict_sets = c("train", "test")),
learner_nb = lrn("classif.naive_bayes", predict_type = "prob",
predict_sets = c("train", "test")),
learner_knn = lrn("classif.kknn", scale = FALSE,
predict_type = "prob"),
learner_rpart = lrn("classif.rpart",
predict_type = "prob"),
learner_rf = lrn("classif.ranger", num.trees = 1000,
predict_type = "prob")
)
```

In order to find optimal hyperparameters through tuning, we used random search to better cover the hyperparameter space. Below, we define the hyperparameters to tune. We only tuned hyperparameters for k-NN, CART and random forest, since the other methods have strong assumptions and serve as baselines. The following table shows the assumptions of the methods we chose.

Learners | Assumption |
---|---|

Logistic regression | No (or little) multicollinearity among features |

Linear discriminant analysis | Normality of classes, equal covariance (target) |

Quadratic discriminant analysis | Normality of classes |

Naive Bayes | Conditional independence of features |

CART | None |

k-NN | None |

Random forest | None |

The following table shows the hyperparameters we tuned.

Learner | Hyperparameters |
---|---|

k-NN | k, distance, kernel |

CART | minsplit, cp |

Random forest | min.node.size, mtry |

For k-NN we chose 3 as the lower limit and 50 as the upper limit for `k` (the number of neighbors), since a k that is too small can lead to overfitting. We also tried different distance measures (e.g. 1 for Manhattan distance, 2 for Euclidean distance) and kernels. For CART we tuned the hyperparameters `cp` (complexity parameter) and `minsplit` (minimum number of observations in a node in order to attempt a split). `cp` controls the size of the tree: small values can result in overfitting while large values can cause underfitting. For the random forest we tuned the minimum size of terminal nodes and the number of variables randomly sampled as candidates at each split (from 1 to the number of features).

```
tune_ps_knn = ps(
k = p_int(lower = 3, upper = 50), # Number of neighbors considered
distance = p_dbl(lower = 1, upper = 3),
kernel = p_fct(levels = c("rectangular", "gaussian", "rank", "optimal"))
)
tune_ps_rpart = ps(
# Minimum number of observations that must exist in a node in order for a
# split to be attempted
minsplit = p_int(lower = 10, upper = 40),
cp = p_dbl(lower = 0.001, upper = 0.1) # Complexity parameter
)
tune_ps_rf = ps(
# Minimum size of terminal nodes
min.node.size = p_int(lower = 10, upper = 50),
# Number of variables randomly sampled as candidates at each split
mtry = p_int(lower = 1, upper = 6)
)
```

The next step is to instantiate the AutoTuner from mlr3tuning. We employed 5-fold cross-validation for the inner loop of the nested resampling. The number of evaluations was set to 100 as the stopping criterion. As an evaluation metric we used AUC.

```
# AutoTuner for k-NN, CART and random forest
learners$learner_knn = auto_tuner(
method = "random_search",
learner = learners$learner_knn,
resampling = rsmp("cv", folds = 5L),
measure = msr("classif.auc"),
search_space = tune_ps_knn,
term_evals = 100
)
learners$learner_knn$predict_sets = c("train", "test")
learners$learner_rpart = auto_tuner(
method = "random_search",
learner = learners$learner_rpart,
resampling = rsmp("cv", folds = 5L),
measure = msr("classif.auc"),
search_space = tune_ps_rpart,
term_evals = 100
)
learners$learner_rpart$predict_sets = c("train", "test")
learners$learner_rf = auto_tuner(
method = "random_search",
learner = learners$learner_rf,
resampling = rsmp("cv", folds = 5L),
measure = msr("classif.auc"),
search_space = tune_ps_rf,
term_evals = 100
)
learners$learner_rf$predict_sets = c("train", "test")
```

During our research we found that oversampling can potentially increase the performance of the learners. As mentioned in section 2.2, we opted for perfectly balancing the classes. By using mlr3pipelines we can apply the benchmark function later on.

```
# Oversampling minority class to get perfectly balanced classes
po_over = po("classbalancing", id = "oversample", adjust = "minor",
reference = "minor", shuffle = FALSE, ratio = 416/167)
table(po_over$train(list(task_liver))$output$truth()) # Check class balance
```

```
1 0
416 416
```

```
# Learners with balanced/oversampled data
learners_bal = lapply(learners, function(x) {
GraphLearner$new(po_scale %>>% po_over %>>% x)
})
lapply(learners_bal, function(x) x$predict_sets <- c("train", "test"))
```

With the learners defined, the inner method of the nested resampling chosen and the tuners set up, we proceeded to choose the outer resampling method. We opted for stratified 5-fold cross-validation to maintain the distribution of the target variable, independent of oversampling. However, it turned out that normal cross-validation without stratification yields very similar results.

```
# 5-fold cross-validation
resampling_outer = rsmp("cv", folds = 5L)
# Stratification
task_liver$col_roles$stratum = task_liver$target_names
```

To rank the different learners and finally decide which one fits best for the task at hand, we used benchmarking. The following code chunk executes our benchmarking with all learners.

```
design = benchmark_grid(
tasks = task_liver,
learners = c(learners, learners_bal),
resamplings = resampling_outer
)
bmr = benchmark(design, store_models = FALSE)
```

As mentioned above, stratified 5-fold cross-validation was chosen. This means that performance is determined as the average across five model evaluations with a train-test-split of 80% to 20%. Furthermore, the choice of performance metrics is crucial in ranking different learners. While each one of them has its specific use case, we opted for AUC, a performance metric taking into account both sensitivity and specificity, which we also used for hyperparameter tuning.

We first present a comparison of all learners by AUC, with and without oversampling, and for both training and test data.
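One way to produce such a comparison is to aggregate the benchmark result with one AUC measure per predict set. A sketch; the measure ids `auc_train` and `auc_test` are chosen here only for readable column names:

```
# Sketch: aggregate AUC on the train and test predict sets
measures_auc = list(
  msr("classif.auc", predict_sets = "train", id = "auc_train"),
  msr("classif.auc", id = "auc_test")
)
bmr$aggregate(measures_auc)[, .(learner_id, auc_train, auc_test)]
```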

```
learner_id auc_train auc_test
1: classif.log_reg 0.7555646 0.7416606
2: classif.lda 0.7555611 0.7390708
3: classif.qda 0.7697367 0.7347738
4: classif.naive_bayes 0.7539943 0.7457096
5: classif.kknn.tuned 0.8876589 0.7323200
6: classif.rpart.tuned 0.8045344 0.6627003
7: classif.ranger.tuned 0.9586871 0.7439351
8: scale.oversample.classif.log_reg 0.7556586 0.7434089
9: scale.oversample.classif.lda 0.7547323 0.7434744
10: scale.oversample.classif.qda 0.7678794 0.7340827
11: scale.oversample.classif.naive_bayes 0.7537216 0.7441955
12: scale.oversample.classif.kknn.tuned 1.0000000 0.7026322
13: scale.oversample.classif.rpart.tuned 0.8611873 0.6250559
14: scale.oversample.classif.ranger.tuned 1.0000000 0.7440514
```

As can be seen in the results above, regardless of whether oversampling was applied or not, logistic regression, LDA, QDA, and naive Bayes have very similar performance on training and test data. On the other hand, k-NN, CART and random forest predict much better on the training data, indicating overfitting.

Furthermore, oversampling leaves AUC performance almost untouched for all learners.

The boxplots below graphically summarize AUC performance of all learners, with the blue dots indicating mean AUC performance.

Random forest is the learner with the best AUC performance, both with and without oversampling. Whereas mean AUC is roughly between 0.65 and 0.75 for all learners, the individual components of AUC might differ substantially.

As a first step towards “AUC decomposition”, we consider the ROC curve, which provides valuable graphical insights into performance - even more so since AUC is directly derived from it.
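Such ROC curves can be drawn, for instance, with mlr3viz. A sketch, assuming the package and its precrec backend are installed:

```
# Sketch: ROC curves for all learners in the benchmark
library(mlr3viz)
autoplot(bmr, type = "roc")
```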

Subsequently, sensitivity, specificity, false negative rate (FNR), and false positive rate (FPR) for each learner are shown explicitly in the output below, next to AUC.
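A table like the one below can be obtained by aggregating the corresponding measures (sketch):

```
# Sketch: aggregate AUC together with sensitivity, specificity, FNR and FPR
bmr$aggregate(msrs(c("classif.auc", "classif.sensitivity",
  "classif.specificity", "classif.fnr", "classif.fpr")))
```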

```
learner_id classif.auc classif.sensitivity classif.specificity classif.fnr classif.fpr
1: classif.log_reg 0.7416606 0.8920252 0.2452763 0.10797476 0.7547237
2: classif.lda 0.7390708 0.9135972 0.1673797 0.08640275 0.8326203
3: classif.qda 0.7347738 0.6875215 0.6709447 0.31247849 0.3290553
4: classif.naive_bayes 0.7457096 0.6393574 0.7606061 0.36064257 0.2393939
5: classif.kknn.tuned 0.7323200 0.8339931 0.3896613 0.16600688 0.6103387
6: classif.rpart.tuned 0.6627003 0.8436317 0.1982175 0.15636833 0.8017825
7: classif.ranger.tuned 0.7439351 0.9422834 0.1376114 0.05771658 0.8623886
8: scale.oversample.classif.log_reg 0.7434089 0.6202524 0.7609626 0.37974756 0.2390374
9: scale.oversample.classif.lda 0.7434744 0.5865175 0.7848485 0.41348250 0.2151515
10: scale.oversample.classif.qda 0.7340827 0.5552209 0.8267380 0.44477912 0.1732620
11: scale.oversample.classif.naive_bayes 0.7441955 0.5407917 0.8386809 0.45920826 0.1613191
12: scale.oversample.classif.kknn.tuned 0.7026322 0.7281985 0.5254902 0.27180149 0.4745098
13: scale.oversample.classif.rpart.tuned 0.6250559 0.6032702 0.6060606 0.39672978 0.3939394
14: scale.oversample.classif.ranger.tuned 0.7440514 0.7548480 0.5388592 0.24515204 0.4611408
```

As it turned out, without oversampling, logistic regression, LDA, k-NN, CART, and random forest score very high on sensitivity and rather low on specificity; QDA and naive Bayes, on the other hand, score relatively high on specificity, but not as high on sensitivity. By definition, high sensitivity (specificity) results from a low false negative (positive) rate, which is also reflected in the data.

With oversampling, specificity increases at the cost of sensitivity for all learners (even for those which already had high specificity), as can be seen in the two graphs below.

For a given learner, say random forest, the different performance metrics and their dependence upon target variable balance are shown in the following graph.

To get a look at performance from yet another angle, we next considered the confusion matrix for each learner, which simply contrasts the absolute numbers of predictions and true values by category, with and without oversampling. You can have a look at all the confusion matrices, if you run the script.
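For example, a single learner's pooled confusion matrix can be extracted from the benchmark result as in the sketch below; the index of the tuned random forest is an assumption and would need to be looked up first, e.g. in the aggregated table above:

```
# Sketch: confusion matrix of one resample result from the benchmark
rr_rf = bmr$resample_result(7)  # assumed index of the tuned random forest
rr_rf$prediction()$confusion
```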

The confusion matrices confirm the above conclusions: without oversampling, all learners (except QDA and naive Bayes) display very high numbers of true positives, but also of false positives, implying high sensitivity and low specificity. Also note that a trivial model classifying all individuals as “1” would cause fewer misclassifications than all of our models except random forest, casting doubt on the predictive power of the features in our dataset. Regarding learner performance with oversampling, the confusion matrices add another valuable insight:

- the total number of misclassifications increases for all learners
- the correct predictions become more balanced (by partly shifting from true positives to true negatives)
- the misclassifications partly shift from false positives to false negatives

The final decision regarding which learner works best - and also whether oversampling should be used or not - strongly depends on the real-world implications of sensitivity and specificity. One of the two might outweigh the other many times over in terms of practical importance. Think of the typical HIV rapid diagnostic test example, where high sensitivity at the cost of low specificity might cause an (unwarranted) shock but is otherwise not dangerous, whereas low sensitivity would be highly perilous. As is usually the case, no black and white “best model” exists here - recall that, even with oversampling, none of our models perform well on both sensitivity and specificity. In our case, we would need to ask ourselves: what would be the consequences of high specificity at the cost of low sensitivity, which implies telling many patients with a liver disease that they are healthy; versus what would be the consequences of high sensitivity at the cost of low specificity, which would mean telling many healthy patients they have a liver disease. In the absence of further topic-specific information, we can only state the best-performing learners for the particular performance metric chosen. As mentioned above, random forest performs best based on AUC. Random forest is furthermore the learner with the highest sensitivity score (and the lowest FNR), while naive Bayes is the one with the best specificity (and the lowest FPR). These results - and the ranking of learners in general, independent of the performance metric - are not affected by oversampling.

The analysis we conducted is, however, by no means exhaustive. On the feature level, while we focused almost exclusively on the machine learning and statistical analysis aspect during our analysis, one could also dig deeper into the actual topic (liver disease) and try to understand the variables as well as potential correlations and interactions more thoroughly. This might also mean reconsidering variables that were already discarded. Furthermore, feature engineering as well as data preprocessing, for instance using principal component analysis, could be applied to the dataset. Regarding hyperparameter tuning, different hyperparameters with larger hyperparameter spaces and numbers of evaluations could be considered. Furthermore, tuning could also be applied to some of those learners that we labeled as baseline learners, though to a lesser extent. Finally, we limited ourselves to those classifiers discussed in detail in the course. More classifiers exist, however; in particular, gradient boosting and support vector machines could additionally be applied to this task and potentially yield better results.

Carvalho, Joana R., and Mariana Verdelho Machado. 2018. “New Insights About Albumin and Liver Disease.” *Annals of Hepatology* 17 (4): 547–60. https://doi.org/10.5604/01.3001.0012.0916.

Gowda, Shivaraj, Prakash B Desai, Vinayak V Hull, Avinash A K Math, Sonal N Vernekar, and Shruthi S Kulkarni. 2009. “A review on laboratory liver function tests.” *The Pan African Medical Journal* 3: 17. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2984286/.

Oh, Robert C, and Thomas R Hustead. 2011. “Causes and evaluation of mildly elevated liver transaminase levels.” *American Family Physician* 84: 1003–8. https://www.aafp.org/afp/2011/1101/p1003.html.

“Total Protein, Albumin-Globulin (A/G) Ratio.” n.d. https://labtestsonline.org/tests/total-protein-albumin-globulin-ag-ratio.

“UCI Machine Learning Repository.” n.d. University of California, Irvine, School of Information; Computer Sciences. http://archive.ics.uci.edu/ml.