Introduction to mlr3tuningspaces

Apply predefined search spaces from scientific articles.

tuning
classification
Author

Marc Becker

Published

July 6, 2021

Scope

The package mlr3tuningspaces offers a selection of published search spaces for many popular machine learning algorithms. In this post, we show how to tune mlr3 learners with these search spaces.

Prerequisites

The packages mlr3verse and mlr3tuningspaces are required for this demonstration:

library(mlr3verse)
library(mlr3tuningspaces)

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output concise.

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

In the example, we use the Pima Indian Diabetes data set, which is used to predict whether or not a patient has diabetes. The patients are characterized by 8 numeric features, some of which have missing values.

# retrieve the task from mlr3
task = tsk("pima")

# generate a quick textual overview using the skimr package
skimr::skim(task$data())
Data summary
Name task$data()
Number of rows 768
Number of columns 9
Key NULL
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
diabetes 0 1 FALSE 2 neg: 500, pos: 268

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 0 1.00 33.24 11.76 21.00 24.00 29.00 41.00 81.00 ▇▃▁▁▁
glucose 5 0.99 121.69 30.54 44.00 99.00 117.00 141.00 199.00 ▁▇▇▃▂
insulin 374 0.51 155.55 118.78 14.00 76.25 125.00 190.00 846.00 ▇▂▁▁▁
mass 11 0.99 32.46 6.92 18.20 27.50 32.30 36.60 67.10 ▅▇▃▁▁
pedigree 0 1.00 0.47 0.33 0.08 0.24 0.37 0.63 2.42 ▇▃▁▁▁
pregnant 0 1.00 3.85 3.37 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
pressure 35 0.95 72.41 12.38 24.00 64.00 72.00 80.00 122.00 ▁▃▇▂▁
triceps 227 0.70 29.15 10.48 7.00 22.00 29.00 36.00 99.00 ▆▇▁▁▁

Tuning Search Space

For tuning, it is important to create a search space that defines the type and range of the hyperparameters. A learner stores all information about its hyperparameters in the slot $param_set. Usually, we have to choose a subset of the hyperparameters we want to tune.

lrn("classif.rpart")$param_set
<ParamSet>
                id    class lower upper nlevels        default value
 1:             cp ParamDbl     0     1     Inf           0.01      
 2:     keep_model ParamLgl    NA    NA       2          FALSE      
 3:     maxcompete ParamInt     0   Inf     Inf              4      
 4:       maxdepth ParamInt     1    30      30             30      
 5:   maxsurrogate ParamInt     0   Inf     Inf              5      
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]>      
 7:       minsplit ParamInt     1   Inf     Inf             20      
 8: surrogatestyle ParamInt     0     1       2              0      
 9:   usesurrogate ParamInt     0     2       3              2      
10:           xval ParamInt     0   Inf     Inf             10     0
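
Without a predefined search space, the range of each chosen hyperparameter has to be specified by hand. The following is a minimal sketch of that manual step, assuming the to_tune() helper from paradox (attached via mlr3verse). The bounds and the name learner_manual are illustrative choices, not recommendations from a published search space.

```r
library(mlr3verse)

# Illustrative only: tune cp on a log scale and maxdepth on a linear scale.
learner_manual = lrn("classif.rpart")
learner_manual$param_set$values$cp = to_tune(1e-04, 1e-01, logscale = TRUE)
learner_manual$param_set$values$maxdepth = to_tune(1, 30)

# The tune tokens now appear in the value column of the parameter set.
print(learner_manual$param_set)
```

Choosing sensible ranges this way requires experience with the algorithm, which is the gap the predefined search spaces are meant to fill.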

Package

At the heart of mlr3tuningspaces is the R6 class TuningSpace. It stores a list of TuneToken, helper functions and additional meta information. The list of TuneToken can be directly applied to the $values slot of a learner’s ParamSet. The search spaces are stored in the mlr_tuning_spaces dictionary.

as.data.table(mlr_tuning_spaces)
                        key                                 label         learner n_values
 1:  classif.glmnet.default       Classification GLM with Default  classif.glmnet        2
 2:     classif.glmnet.rbv2     Classification GLM with RandomBot  classif.glmnet        2
 3:    classif.kknn.default      Classification KKNN with Default    classif.kknn        3
 4:       classif.kknn.rbv2    Classification KKNN with RandomBot    classif.kknn        1
 5:  classif.ranger.default    Classification Ranger with Default  classif.ranger        4
 6:     classif.ranger.rbv2  Classification Ranger with RandomBot  classif.ranger        8
 7:   classif.rpart.default     Classification Rpart with Default   classif.rpart        3
 8:      classif.rpart.rbv2   Classification Rpart with RandomBot   classif.rpart        4
 9:     classif.svm.default       Classification SVM with Default     classif.svm        4
10:        classif.svm.rbv2     Classification SVM with RandomBot     classif.svm        5
11: classif.xgboost.default   Classification XGBoost with Default classif.xgboost        8
12:    classif.xgboost.rbv2 Classification XGBoost with RandomBot classif.xgboost       13
13:     regr.glmnet.default           Regression GLM with Default     regr.glmnet        2
14:        regr.glmnet.rbv2         Regression GLM with RandomBot     regr.glmnet        2
15:       regr.kknn.default          Regression KKNN with Default       regr.kknn        3
16:          regr.kknn.rbv2        Regression KKNN with RandomBot       regr.kknn        1
17:     regr.ranger.default        Regression Ranger with Default     regr.ranger        4
18:        regr.ranger.rbv2      Regression Ranger with RandomBot     regr.ranger        7
19:      regr.rpart.default         Regression Rpart with Default      regr.rpart        3
20:         regr.rpart.rbv2       Regression Rpart with RandomBot      regr.rpart        4
21:        regr.svm.default           Regression SVM with Default        regr.svm        4
22:           regr.svm.rbv2         Regression SVM with RandomBot        regr.svm        5
23:    regr.xgboost.default       Regression XGBoost with Default    regr.xgboost        8
24:       regr.xgboost.rbv2     Regression XGBoost with RandomBot    regr.xgboost       13
                        key                                 label         learner n_values
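
Since the overview is a regular data.table, it can be filtered to find the search spaces available for a single learner. A small sketch, assuming the column names shown in the table above (the variable name spaces is our own):

```r
library(mlr3verse)
library(mlr3tuningspaces)
library(data.table)

# Convert the dictionary to a table and keep only the rpart classifier rows.
spaces = as.data.table(mlr_tuning_spaces)
print(spaces[learner == "classif.rpart", ])
```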

We can use the sugar function lts() to retrieve a TuningSpace.

tuning_space_rpart = lts("classif.rpart.default")
tuning_space_rpart
<TuningSpace:classif.rpart.default>: Classification Rpart with Default
          id lower upper levels logscale
1:  minsplit 2e+00 128.0            TRUE
2: minbucket 1e+00  64.0            TRUE
3:        cp 1e-04   0.1            TRUE

The $values slot contains the list of TuneToken objects.

tuning_space_rpart$values
$minsplit
Tuning over:
range [2, 128] (log scale)


$minbucket
Tuning over:
range [1, 64] (log scale)


$cp
Tuning over:
range [1e-04, 0.1] (log scale)

We apply the search space and tune the learner.

learner = lrn("classif.rpart")

learner$param_set$values = tuning_space_rpart$values

instance = tune(
  method = "random_search",
  task = tsk("pima"),
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  term_evals = 10)

instance$result
   minsplit minbucket        cp learner_param_vals  x_domain classif.ce
1: 1.377705  2.369973 -5.610915          <list[3]> <list[3]>  0.2265625
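
Because all three hyperparameters are tuned on a log scale, the values in the first three columns of instance$result are reported on that scale rather than as the values passed to the learner. The sketch below back-transforms the row shown above by hand. It assumes that integer-valued hyperparameters are truncated to whole numbers after exponentiation, which is our understanding of how paradox handles log-scale integer ranges. The transformed values should also be available via instance$result_learner_param_vals.

```r
# Values copied from the instance$result row above (log scale).
result_log = c(minsplit = 1.377705, minbucket = 2.369973, cp = -5.610915)

# Undo the log transformation.
result_natural = exp(result_log)

# Assumption: integer hyperparameters are truncated to whole numbers.
minsplit = as.integer(result_natural[["minsplit"]])
minbucket = as.integer(result_natural[["minbucket"]])
cp = result_natural[["cp"]]

print(list(minsplit = minsplit, minbucket = minbucket, cp = cp))
```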

We can also get the learner with the search space already applied from the TuningSpace.

learner = tuning_space_rpart$get_learner()
print(learner$param_set)
<ParamSet>
                id    class lower upper nlevels        default               value
 1:             cp ParamDbl     0     1     Inf           0.01 <RangeTuneToken[2]>
 2:     keep_model ParamLgl    NA    NA       2          FALSE                    
 3:     maxcompete ParamInt     0   Inf     Inf              4                    
 4:       maxdepth ParamInt     1    30      30             30                    
 5:   maxsurrogate ParamInt     0   Inf     Inf              5                    
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]> <RangeTuneToken[2]>
 7:       minsplit ParamInt     1   Inf     Inf             20 <RangeTuneToken[2]>
 8: surrogatestyle ParamInt     0     1       2              0                    
 9:   usesurrogate ParamInt     0     2       3              2                    
10:           xval ParamInt     0   Inf     Inf             10                   0

This method also allows setting constant parameters.

learner = tuning_space_rpart$get_learner(maxdepth = 15)
print(learner$param_set)
<ParamSet>
                id    class lower upper nlevels        default               value
 1:             cp ParamDbl     0     1     Inf           0.01 <RangeTuneToken[2]>
 2:     keep_model ParamLgl    NA    NA       2          FALSE                    
 3:     maxcompete ParamInt     0   Inf     Inf              4                    
 4:       maxdepth ParamInt     1    30      30             30                  15
 5:   maxsurrogate ParamInt     0   Inf     Inf              5                    
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]> <RangeTuneToken[2]>
 7:       minsplit ParamInt     1   Inf     Inf             20 <RangeTuneToken[2]>
 8: surrogatestyle ParamInt     0     1       2              0                    
 9:   usesurrogate ParamInt     0     2       3              2                    
10:           xval ParamInt     0   Inf     Inf             10                   0

When a learner is passed to lts(), the default search space is applied directly to that learner.

learner = lts(lrn("classif.rpart", maxdepth = 15))
print(learner$param_set)
<ParamSet>
                id    class lower upper nlevels        default               value
 1:             cp ParamDbl     0     1     Inf           0.01 <RangeTuneToken[2]>
 2:     keep_model ParamLgl    NA    NA       2          FALSE                    
 3:     maxcompete ParamInt     0   Inf     Inf              4                    
 4:       maxdepth ParamInt     1    30      30             30                  15
 5:   maxsurrogate ParamInt     0   Inf     Inf              5                    
 6:      minbucket ParamInt     1   Inf     Inf <NoDefault[3]> <RangeTuneToken[2]>
 7:       minsplit ParamInt     1   Inf     Inf             20 <RangeTuneToken[2]>
 8: surrogatestyle ParamInt     0     1       2              0                    
 9:   usesurrogate ParamInt     0     2       3              2                    
10:           xval ParamInt     0   Inf     Inf             10                   0
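
To check which hyperparameters will actually be tuned, the tune tokens can be converted into a search space. A short sketch, assuming the $search_space() method of the learner's ParamSet in paradox:

```r
library(mlr3verse)
library(mlr3tuningspaces)

# Apply the default search space; maxdepth stays fixed at 15.
learner = lts(lrn("classif.rpart", maxdepth = 15))

# Derive the search space from the tune tokens stored in $values.
search_space = learner$param_set$search_space()
print(search_space$ids())
```

Constant parameters such as maxdepth are not part of the derived search space, since they carry a fixed value instead of a TuneToken.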