Intro
mlr3pipelines offers a very flexible way to create data preprocessing steps. This is achieved by a modular approach using PipeOps. For a detailed overview, check the mlr3book.
Recommended prior reading: the post on imputing missing variables.
This post covers:
- How to apply different preprocessing steps on different features
- How to branch different preprocessing steps, which allows selecting the best-performing path
- How to tune the whole pipeline
Prerequisites
We load the mlr3verse package, which pulls in the most important packages for this example. We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the loggers to keep the output concise.
library(mlr3verse)

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
The Pima Indian Diabetes classification task will be used.
= tsk("pima")
task_pima ::skim(task_pima$data()) skimr
Name | task_pima$data() |
Number of rows | 768 |
Number of columns | 9 |
Key | NULL |
_______________________ | |
Column type frequency: | |
factor | 1 |
numeric | 8 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
diabetes | 0 | 1 | FALSE | 2 | neg: 500, pos: 268 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 0 | 1.00 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
glucose | 5 | 0.99 | 121.69 | 30.54 | 44.00 | 99.00 | 117.00 | 141.00 | 199.00 | ▁▇▇▃▂ |
insulin | 374 | 0.51 | 155.55 | 118.78 | 14.00 | 76.25 | 125.00 | 190.00 | 846.00 | ▇▂▁▁▁ |
mass | 11 | 0.99 | 32.46 | 6.92 | 18.20 | 27.50 | 32.30 | 36.60 | 67.10 | ▅▇▃▁▁ |
pedigree | 0 | 1.00 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
pregnant | 0 | 1.00 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
pressure | 35 | 0.95 | 72.41 | 12.38 | 24.00 | 64.00 | 72.00 | 80.00 | 122.00 | ▁▃▇▂▁ |
triceps | 227 | 0.70 | 29.15 | 10.48 | 7.00 | 22.00 | 29.00 | 36.00 | 99.00 | ▆▇▁▁▁ |
Selection of features for preprocessing steps
Several features of the pima task have missing values:
task_pima$missings()
diabetes age glucose insulin mass pedigree pregnant pressure triceps
0 0 5 374 11 0 0 35 227
A common approach in such situations is to impute the missing values and to add a missing indicator column, as explained in the Impute missing variables post. Suppose we want to use PipeOpImputeHist on the features "glucose", "mass" and "pressure", which have only a few missing values, and PipeOpImputeMedian on the features "insulin" and "triceps", which have many more missing values.
In the following subsections, we show two approaches to implement this.
1. Consider all features and apply the preprocessing step only to certain features
We use the affect_columns argument of a PipeOp, together with an appropriate Selector function, to define the variables on which the PipeOp will operate:
# imputes values based on histogram
imputer_hist = po("imputehist",
  affect_columns = selector_name(c("glucose", "mass", "pressure")))

# imputes values using the median
imputer_median = po("imputemedian",
  affect_columns = selector_name(c("insulin", "triceps")))

# adds an indicator column for each feature with missing values
miss_ind = po("missind")
When PipeOps are constructed this way, they will perform the specified preprocessing step on the appropriate features and pass all the input features to the subsequent steps:
# no missings in "glucose", "mass" and "pressure"
imputer_hist$train(list(task_pima))[[1]]$missings()
diabetes age insulin pedigree pregnant triceps glucose mass pressure
0 0 374 0 0 227 0 0 0
# no missings in "insulin" and "triceps"
imputer_median$train(list(task_pima))[[1]]$missings()
diabetes age glucose mass pedigree pregnant pressure insulin triceps
0 0 5 11 0 0 35 0 0
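selector_name() is only one of the available Selector functions. As a quick sketch (both Selectors below ship with mlr3pipelines), features can also be targeted by type or by a regular expression on their names:

# select all numeric features instead of listing them by name
po("imputehist", affect_columns = selector_type("numeric"))
# select features whose names match a regular expression
po("imputemedian", affect_columns = selector_grep("^tri"))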
We construct a pipeline that combines imputer_hist and imputer_median. Here, imputer_hist will impute the features "glucose", "mass" and "pressure", and imputer_median will impute "insulin" and "triceps". In each preprocessing step, all the input features are passed to the next step. In the end, we obtain a data set without missing values:
# combine the two imputation methods
impute_graph = imputer_hist %>>% imputer_median
impute_graph$plot(html = FALSE)
impute_graph$train(task_pima)[[1]]$missings()
diabetes age pedigree pregnant glucose mass pressure insulin triceps
0 0 0 0 0 0 0 0 0
The PipeOpMissInd operator replaces features that have missing values with missing value indicator columns:
miss_ind$train(list(task_pima))[[1]]$data()
diabetes missing_glucose missing_insulin missing_mass missing_pressure missing_triceps
1: pos present missing present present present
2: neg present missing present present present
3: pos present missing present present missing
4: neg present present present present present
5: pos present present present present present
---
764: neg present present present present present
765: neg present missing present present present
766: neg present present present present present
767: pos present missing present present missing
768: neg present missing present present present
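The indicator behavior can be adjusted via the missind.which and missind.type hyperparameters (they also appear in the parameter table shown later). As a sketch, indicators can be created for all features and encoded as logical columns:

# a sketch: indicators for all features, as logical columns
miss_ind_all = po("missind", which = "all", type = "logical")
miss_ind_all$train(list(task_pima))[[1]]$head()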
Obviously, this step cannot be applied to the already imputed data, as there are no missing values left. If we want to combine the previous two imputation steps with a third step that adds missing value indicators, we need to copy the data with PipeOpCopy and supply the first copy to impute_graph and the second copy to miss_ind using gunion(). Finally, the two outputs can be combined with PipeOpFeatureUnion:
= po("copy", 2) %>>%
impute_missind gunion(list(impute_graph, miss_ind)) %>>%
po("featureunion")
$plot(html = FALSE) impute_missind
$train(task_pima)[[1]]$data() impute_missind
diabetes age pedigree pregnant glucose mass pressure insulin triceps missing_glucose missing_insulin missing_mass
1: pos 50 0.627 6 148 33.6 72 125 35 present missing present
2: neg 31 0.351 1 85 26.6 66 125 29 present missing present
3: pos 32 0.672 8 183 23.3 64 125 29 present missing present
4: neg 21 0.167 1 89 28.1 66 94 23 present present present
5: pos 33 2.288 0 137 43.1 40 168 35 present present present
---
764: neg 63 0.171 10 101 32.9 76 180 48 present present present
765: neg 27 0.340 2 122 36.8 70 125 27 present missing present
766: neg 30 0.245 5 121 26.2 72 112 23 present present present
767: pos 47 0.349 1 126 30.1 60 125 29 present missing present
768: neg 23 0.315 1 93 30.4 70 125 31 present missing present
missing_pressure missing_triceps
1: present present
2: present present
3: present missing
4: present present
5: present present
---
764: present present
765: present present
766: present present
767: present missing
768: present present
2. Select the features for each preprocessing step and apply the preprocessing steps to this subset
We can use PipeOpSelect to select the appropriate features and then apply the desired impute PipeOp on them:
= po("select",
imputer_hist_2 selector = selector_name(c("glucose", "mass", "pressure")),
id = "slct1") %>>% # unique id so we can combine it in a pipeline with other select PipeOps
po("imputehist")
$plot(html = FALSE) imputer_hist_2
$train(task_pima)[[1]]$data() imputer_hist_2
diabetes glucose mass pressure
1: pos 148 33.6 72
2: neg 85 26.6 66
3: pos 183 23.3 64
4: neg 89 28.1 66
5: pos 137 43.1 40
---
764: neg 101 32.9 76
765: neg 122 36.8 70
766: neg 121 26.2 72
767: pos 126 30.1 60
768: neg 93 30.4 70
imputer_median_2 =
  po("select", selector = selector_name(c("insulin", "triceps")), id = "slct2") %>>%
  po("imputemedian")
imputer_median_2$train(task_pima)[[1]]$data()
diabetes insulin triceps
1: pos 125 35
2: neg 125 29
3: pos 125 29
4: neg 94 23
5: pos 168 35
---
764: neg 180 48
765: neg 125 27
766: neg 112 23
767: pos 125 29
768: neg 125 31
To reproduce the result of the first approach (1.), we need to copy the data four times and apply imputer_hist_2, imputer_median_2 and miss_ind to three of the copies. The fourth copy is required to select the features without missing values and to append them to the final result. We can do this as follows:
other_features = task_pima$feature_names[task_pima$missings()[-1] == 0]

imputer_missind_2 = po("copy", 4) %>>%
  gunion(list(imputer_hist_2,
    imputer_median_2,
    miss_ind,
    po("select", selector = selector_name(other_features), id = "slct3"))) %>>%
  po("featureunion")
imputer_missind_2$plot(html = FALSE)
imputer_missind_2$train(task_pima)[[1]]$data()
diabetes glucose mass pressure insulin triceps missing_glucose missing_insulin missing_mass missing_pressure
1: pos 148 33.6 72 125 35 present missing present present
2: neg 85 26.6 66 125 29 present missing present present
3: pos 183 23.3 64 125 29 present missing present present
4: neg 89 28.1 66 94 23 present present present present
5: pos 137 43.1 40 168 35 present present present present
---
764: neg 101 32.9 76 180 48 present present present present
765: neg 122 36.8 70 125 27 present missing present present
766: neg 121 26.2 72 112 23 present present present present
767: pos 126 30.1 60 125 29 present missing present present
768: neg 93 30.4 70 125 31 present missing present present
missing_triceps age pedigree pregnant
1: present 50 0.627 6
2: present 31 0.351 1
3: missing 32 0.672 8
4: present 21 0.167 1
5: present 33 2.288 0
---
764: present 63 0.171 10
765: present 27 0.340 2
766: present 30 0.245 5
767: missing 47 0.349 1
768: present 23 0.315 1
Note that when there is a single input, it is automatically copied as many times as needed for the downstream PipeOps. In other words, the code above also works without po("copy", 4):
imputer_missind_3 = gunion(list(imputer_hist_2,
    imputer_median_2,
    miss_ind,
    po("select", selector = selector_name(other_features), id = "slct3"))) %>>%
  po("featureunion")
imputer_missind_3$train(task_pima)[[1]]$data()
diabetes glucose mass pressure insulin triceps missing_glucose missing_insulin missing_mass missing_pressure
1: pos 148 33.6 72 125 35 present missing present present
2: neg 85 26.6 66 125 29 present missing present present
3: pos 183 23.3 64 125 29 present missing present present
4: neg 89 28.1 66 94 23 present present present present
5: pos 137 43.1 40 168 35 present present present present
---
764: neg 101 32.9 76 180 48 present present present present
765: neg 122 36.8 70 125 27 present missing present present
766: neg 121 26.2 72 112 23 present present present present
767: pos 126 30.1 60 125 29 present missing present present
768: neg 93 30.4 70 125 31 present missing present present
missing_triceps age pedigree pregnant
1: present 50 0.627 6
2: present 31 0.351 1
3: missing 32 0.672 8
4: present 21 0.167 1
5: present 33 2.288 0
---
764: present 63 0.171 10
765: present 27 0.340 2
766: present 30 0.245 5
767: missing 47 0.349 1
768: present 23 0.315 1
Usually, po("copy")
is required when there are more than one input channels and multiple output channels, and their numbers do not match.
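For example (a sketch with arbitrary PipeOps), a step with a single output channel cannot be chained directly into a gunion() that expects two inputs; an explicit copy bridges the mismatch:

# "scale" has one output channel, the gunion below has two input
# channels, so an explicit copy is required in between
po("scale") %>>%
  po("copy", 2) %>>%
  gunion(list(po("pca"), po("imputemedian"))) %>>%
  po("featureunion")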
Branching
We cannot know in advance whether the combination of a learner with this preprocessing graph will benefit from the imputation steps and the added missing value indicators. Maybe it would have been better to just use imputemedian on all the variables. We can investigate this assumption by adding an alternative path to the graph with the mentioned imputemedian. This is possible using the "branch" PipeOp:
= po("imputemedian", id = "simple_median") # add the id so it does not clash with `imputer_median`
imputer_median_3
= c("impute_missind", "simple_median") # names of the branches
branches
= po("branch", branches) %>>%
graph_branch gunion(list(impute_missind, imputer_median_3)) %>>%
po("unbranch")
$plot(html = FALSE) graph_branch
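The active path is controlled by the branch.selection hyperparameter. As a quick sketch, it can also be set manually before training, without any tuning:

# a sketch: activate the simple median path by hand
graph_branch$param_set$values$branch.selection = "simple_median"
graph_branch$train(task_pima)[[1]]$missings()
# switch back to the first path for the tuning below
graph_branch$param_set$values$branch.selection = "impute_missind"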
Tuning the pipeline
To finalize the graph, we combine it with an rpart learner:
graph = graph_branch %>>%
  lrn("classif.rpart")
graph$plot(html = FALSE)
To define the parameters to be tuned, we first check the available ones in the graph:
as.data.table(graph$param_set)[, .(id, class, lower, upper, nlevels)]
id class lower upper nlevels
1: branch.selection ParamFct NA NA 2
2: imputehist.affect_columns ParamUty NA NA Inf
3: imputemedian.affect_columns ParamUty NA NA Inf
4: missind.which ParamFct NA NA 2
5: missind.type ParamFct NA NA 4
6: missind.affect_columns ParamUty NA NA Inf
7: simple_median.affect_columns ParamUty NA NA Inf
8: classif.rpart.cp ParamDbl 0 1 Inf
9: classif.rpart.keep_model ParamLgl NA NA 2
10: classif.rpart.maxcompete ParamInt 0 Inf Inf
11: classif.rpart.maxdepth ParamInt 1 30 30
12: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
13: classif.rpart.minbucket ParamInt 1 Inf Inf
14: classif.rpart.minsplit ParamInt 1 Inf Inf
15: classif.rpart.surrogatestyle ParamInt 0 1 2
16: classif.rpart.usesurrogate ParamInt 0 2 3
17: classif.rpart.xval ParamInt 0 Inf Inf
We decide to jointly tune the "branch.selection", "classif.rpart.cp" and "classif.rpart.minbucket" hyperparameters:
search_space = ps(
  branch.selection = p_fct(c("impute_missind", "simple_median")),
  classif.rpart.cp = p_dbl(0.001, 0.1),
  classif.rpart.minbucket = p_int(1, 10))
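As an aside, the same ranges could also be tagged directly onto the graph learner that we create below, using paradox's to_tune() tokens (a sketch):

# a sketch: tag the tuning ranges on a graph learner directly
gl = as_learner(graph)
gl$param_set$values$branch.selection = to_tune(c("impute_missind", "simple_median"))
gl$param_set$values$classif.rpart.cp = to_tune(0.001, 0.1)
gl$param_set$values$classif.rpart.minbucket = to_tune(1, 10)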
In order to tune the graph, it needs to be converted to a learner:
graph_learner = as_learner(graph)

cv3 = rsmp("cv", folds = 3)
cv3$instantiate(task_pima) # to generate folds for cross validation

instance = tune(
  tuner = tnr("random_search"),
  task = task_pima,
  learner = graph_learner,
  resampling = cv3,
  measure = msr("classif.ce"),
  search_space = search_space,
  term_evals = 5)
as.data.table(instance$archive, unnest = NULL, exclude_columns = c("x_domain", "uhash", "resample_result"))
branch.selection classif.rpart.cp classif.rpart.minbucket classif.ce runtime_learners timestamp batch_nr
1: simple_median 0.02172886 2 0.2799479 2.774 2023-11-02 16:33:12 1
2: impute_missind 0.07525939 1 0.2760417 2.701 2023-11-02 16:33:25 2
3: impute_missind 0.09207969 3 0.2773438 1.031 2023-11-02 16:33:36 3
4: impute_missind 0.03984117 6 0.2721354 2.184 2023-11-02 16:33:47 4
5: impute_missind 0.09872643 7 0.2773438 2.507 2023-11-02 16:33:57 5
warnings errors
1: 0 0
2: 0 0
3: 0 0
4: 0 0
5: 0 0
The best performance in this short tuning experiment was achieved with:
instance$result
branch.selection classif.rpart.cp classif.rpart.minbucket learner_param_vals x_domain classif.ce
1: impute_missind 0.03984117 6 <list[9]> <list[3]> 0.2721354
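To make use of the tuning result (a quick sketch), the best configuration can be written back to the graph learner, which is then trained on the full task:

# a sketch: apply the best configuration and train on the full task
graph_learner$param_set$values = instance$result_learner_param_vals
graph_learner$train(task_pima)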
Conclusion
This post showed how to specify the features on which preprocessing steps are performed, and how to create alternative paths in a learner graph. The preprocessing steps that can be used are not limited to imputation: check out the list of available PipeOps.