Select Uncorrelated Features

Remove correlated features with a pipeline.

Authors

Martin Binder

Florian Pfisterer

Published

February 25, 2020

The following example shows how to remove correlated features. In essence, we drop features until no pair of remaining features has a correlation higher than a given cutoff. This is often useful, for example, when we want to fit linear models.
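To make the idea concrete, here is a minimal base R sketch of such a filter. It is a simplified greedy variant written for illustration, not the algorithm used by caret::findCorrelation(); the helper name drop_correlated is hypothetical, and the input is assumed to be a data.frame of numeric columns.

```r
# Simplified greedy correlation filter (illustration only, NOT caret's algorithm).
# `data` is assumed to be a data.frame with numeric columns.
drop_correlated = function(data, cutoff) {
  keep = colnames(data)
  repeat {
    if (length(keep) < 2) break
    cm = abs(cor(data[, keep, drop = FALSE]))
    diag(cm) = 0
    # stop once no remaining pair exceeds the cutoff
    if (max(cm) <= cutoff) break
    # take the most strongly correlated pair ...
    pair = which(cm == max(cm), arr.ind = TRUE)[1, ]
    # ... and drop the member with the larger mean absolute correlation
    means = rowMeans(cm[pair, , drop = FALSE])
    keep = setdiff(keep, rownames(cm)[pair[which.max(means)]])
  }
  keep
}

kept = drop_correlated(iris[, 1:4], cutoff = 0.9)
print(kept)
```

On the full iris data, only Petal.Length and Petal.Width are correlated above 0.9, so exactly one of the two is removed.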

Prerequisites

This tutorial assumes familiarity with the basics of mlr3pipelines. Consult the mlr3book if anything is unclear. Additionally, we compare different cutoff values via tuning with the mlr3tuning package. Again, the mlr3book has an introduction to mlr3tuning and paradox.

The example describes a fairly involved use case, where the behavior of PipeOpSelect is manipulated via a trafo on its ParamSet.

Getting started

We load the mlr3verse package which pulls in the most important packages for this example.

library(mlr3verse)

We initialize the random number generator with a fixed seed for reproducibility, and reduce the verbosity of the loggers to keep the output concise.

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

The basic pipeline looks as follows: we use PipeOpSelect to select a set of variables, followed by an rpart learner.

graph_learner = po("select") %>>% lrn("classif.rpart")
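As a brief illustration of what PipeOpSelect expects, the sketch below passes it a selector, i.e. a function that takes a Task and returns the names of the features to keep. The selector keep_sepal is a hypothetical example and is not part of the pipeline used in this post.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("iris")

# A selector maps a Task to a character vector of feature names to keep.
# keep_sepal is a made-up example that keeps only the sepal measurements.
keep_sepal = function(task) grep("^Sepal", task$feature_names, value = TRUE)

op = po("select", selector = keep_sepal)

# Training the PipeOp returns a list with the reduced task
out = op$train(list(task))[[1]]
print(out$feature_names)
```

The tuning setup described next relies on exactly this mechanism: instead of a fixed selector, it generates one from a numeric cutoff.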

Now we get to the magic:

We want to use the function findCorrelation() from the caret package to select uncorrelated variables. This function has a cutoff parameter that specifies the maximum correlation allowed between variables. To expose this cutoff as a numeric parameter that we can tune over, we specify the following ParamSet:

search_space = ps(cutoff = p_dbl(0, 1))

PipeOpSelect does not know about a cutoff; it expects a selector, i.e. a function that takes a Task as input and returns the names of the features we aim to keep.

We therefore use a trafo to transform the cutoff into such a selector, which is what PipeOpSelect can work with. Note that we set x$cutoff = NULL to remove the temporary parameter we introduced, as PipeOpSelect does not know what to do with it.

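# Note: newer paradox releases (1.0.0+) name this field `extra_trafo`; adjust if needed.
search_space$trafo = function(x, param_set) {
  # capture the sampled cutoff so the selector below can use it
  cutoff = x$cutoff
  # replace the numeric cutoff by a selector function for PipeOpSelect
  x$select.selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    # names of the features that caret suggests removing
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  # remove the temporary parameter, which PipeOpSelect does not know
  x$cutoff = NULL
  x
}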
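One way to convince yourself that the trafo behaves as intended is to apply the same function by hand to a plain list. The sketch below copies the trafo body into a standalone function (the name cutoff_trafo is hypothetical) and calls the resulting selector on the iris task; it assumes the caret package is installed.

```r
library(mlr3)

# Standalone copy of the trafo, for manual inspection only
cutoff_trafo = function(x, param_set) {
  cutoff = x$cutoff
  x$select.selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  x$cutoff = NULL
  x
}

# Apply the trafo to a sampled configuration
x = cutoff_trafo(list(cutoff = 0.9), param_set = NULL)
print(names(x))  # the temporary cutoff entry is gone

# The selector can be called directly on a task
kept = x$select.selector(tsk("iris"))
print(kept)
```

Because the selector is a closure, it remembers the cutoff it was created with, even though cutoff itself is no longer part of the returned list.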

If you are not sure you understand the trafo concept, consult the mlr3book, which has a section on it.

Now we tune over different values for cutoff.

instance = tune(
  tuner = tnr("grid_search"),
  task = tsk("iris"),
  learner = graph_learner,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = search_space,
  # don't need the following line for optimization, this is for
  # demonstration that different features were selected
  store_models = TRUE)
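The grid search evaluates evenly spaced cutoff values. As a rough sketch, assuming the tuner's default grid resolution of 10, the evaluated values can be previewed with generate_design_grid() from paradox:

```r
library(paradox)

search_space = ps(cutoff = p_dbl(0, 1))

# Evenly spaced grid over [0, 1]; a resolution of 10 is assumed here
design = generate_design_grid(search_space, resolution = 10)
print(design$data)
```

Each row of this design corresponds to one configuration that is passed through the trafo and then evaluated with 3-fold cross-validation.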

To demonstrate that different cutoff values result in different features being selected, we can run the following to inspect the trained models. Note that this inspects only the model trained on the first CV fold for each evaluated cutoff. The features that are excluded depend on the training data seen by the pipeline and may differ between folds, even at the same cutoff value.

as.data.table(instance$archive)[
  order(cutoff),
  list(cutoff, classif.ce,
    featurenames = lapply(resample_result, function(x) {
      # features that reached rpart in the first CV fold
      x$learners[[1]]$model$classif.rpart$train_task$feature_names
    })
  )]
       cutoff classif.ce                                      featurenames
 1: 0.0000000 0.28666667                                      Sepal.Length
 2: 0.1111111 0.28666667                                      Sepal.Length
 3: 0.2222222 0.28666667                                      Sepal.Length
 4: 0.3333333 0.27333333                          Sepal.Length,Sepal.Width
 5: 0.4444444 0.27333333                          Sepal.Length,Sepal.Width
 6: 0.5555556 0.27333333                          Sepal.Length,Sepal.Width
 7: 0.6666667 0.27333333                          Sepal.Length,Sepal.Width
 8: 0.7777778 0.27333333                          Sepal.Length,Sepal.Width
 9: 0.8888889 0.04000000              Petal.Width,Sepal.Length,Sepal.Width
10: 1.0000000 0.06666667 Petal.Length,Petal.Width,Sepal.Length,Sepal.Width

Voila, we effectively created our own feature-filtering step by combining PipeOpSelect with a trafo. This draws on fairly advanced knowledge of mlr3pipelines and paradox, yet takes only a few lines of code.