The following example describes a situation where we aim to remove correlated features. In essence, this means that we drop features until no two features have a correlation higher than a given cutoff. This is often useful when we want to use, for example, linear models.
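For intuition, consider the four numeric features of the iris dataset, which we also use below: Petal.Length and Petal.Width correlate at roughly 0.96, so any cutoff below that value forces one of them to be dropped. A quick base-R check:

round(cor(iris[, 1:4]), 2)  # pairwise correlations of the iris features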
Prerequisites
This tutorial assumes familiarity with the basics of mlr3pipelines. Consult the mlr3book if any aspects are unclear. Additionally, we compare different cutoff values via tuning with the mlr3tuning package. Again, the mlr3book contains an introduction to mlr3tuning and paradox.
The example describes a rather involved use case, in which the behavior of PipeOpSelect is manipulated via a trafo on its ParamSet.
Getting started
We load the mlr3verse package which pulls in the most important packages for this example.
We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output readable.
library(mlr3verse)
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
The basic pipeline looks as follows: we use PipeOpSelect to select a set of variables, followed by an rpart learner.
= po("select") %>>% lrn("classif.rpart") graph_learner
Now we get to the magic: we want to use the function caret::findCorrelation() from the caret package in order to select uncorrelated variables. This function has a cutoff parameter that specifies the maximum correlation allowed between variables. In order to expose this value as a numeric parameter we can tune over, we specify the following ParamSet:
search_space = ps(cutoff = p_dbl(0, 1))
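To get a feeling for what caret::findCorrelation() does before wiring it into the pipeline, we can call it directly on the iris features; a minimal sketch (the exact result depends on the cutoff chosen):

data = iris[, 1:4]
# returns the names of the features to drop, e.g. "Petal.Length" at cutoff 0.9
caret::findCorrelation(cor(data), cutoff = 0.9, exact = TRUE, names = TRUE)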
We define a selector function that takes a Task as input and returns the set of features we aim to keep.
We then use a trafo to transform the cutoff into such a selector, which is what PipeOpSelect can work with. Note that we set x$cutoff = NULL in order to remove the temporary parameter we introduced, as PipeOpSelect does not know what to do with it.
search_space$trafo = function(x, param_set) {
  cutoff = x$cutoff
  x$select.selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  x$cutoff = NULL
  x
}
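To convince ourselves that the trafo behaves as intended, we can call it manually with a fixed cutoff; a quick sketch (the selected features depend on the data):

x = search_space$trafo(list(cutoff = 0.9), search_space)
names(x)                       # only "select.selector" remains, cutoff was removed
x$select.selector(tsk("iris")) # the features kept at this cutoff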
If you are not sure you understand the trafo concept, consult the mlr3book, which has a section on it.
Now we tune over different values for cutoff.
instance = tune(
  tuner = tnr("grid_search"),
  task = tsk("iris"),
  learner = graph_learner,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = search_space,
  # don't need the following line for optimization, this is for
  # demonstration that different features were selected
  store_models = TRUE)
In order to demonstrate that different cutoff values result in different features being selected, we can run the following to inspect the trained models. Note that this inspects only the trained model of the first CV fold of each evaluated configuration. The features being excluded depend on the training data seen by the pipeline and may differ between folds, even at the same cutoff value.
as.data.table(instance$archive)[
  order(cutoff),
  list(cutoff, classif.ce,
    featurenames = lapply(resample_result, function(x) {
      x$learners[[1]]$model$classif.rpart$train_task$feature_names
    }))]
cutoff classif.ce featurenames
1: 0.0000000 0.28666667 Sepal.Length
2: 0.1111111 0.28666667 Sepal.Length
3: 0.2222222 0.28666667 Sepal.Length
4: 0.3333333 0.27333333 Sepal.Length,Sepal.Width
5: 0.4444444 0.27333333 Sepal.Length,Sepal.Width
6: 0.5555556 0.27333333 Sepal.Length,Sepal.Width
7: 0.6666667 0.27333333 Sepal.Length,Sepal.Width
8: 0.7777778 0.27333333 Sepal.Length,Sepal.Width
9: 0.8888889 0.04000000 Petal.Width,Sepal.Length,Sepal.Width
10: 1.0000000 0.06666667 Petal.Length,Petal.Width,Sepal.Length,Sepal.Width
Voilà, we have effectively created our own PipeOp, using very advanced knowledge of mlr3pipelines and paradox in only a few lines of code.
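As a final step, the best configuration found by the tuner can be applied back to the graph learner for a full fit on the task; a short sketch using the standard mlr3tuning accessor:

graph_learner$param_set$values = instance$result_learner_param_vals
graph_learner$train(tsk("iris"))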