Feature Selection

Select features from the German credit data set and evaluate model performance.

Goal

After this exercise, you should understand wrapper methods and be able to perform feature selection with them using mlr3fselect. You should also be able to optimize for several performance measures and estimate the generalization error.

Wrapper Methods

In addition to filtering, wrapper methods are another way of selecting features. In filtering, features are kept or dropped according to conditions on a score computed for each feature. In wrapper methods, the learner itself is fitted on different subsets of the feature set and the subsets are compared by their performance. Because the model has to be refitted for every candidate subset, wrapper methods are computationally expensive.
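
The wrapper idea can be illustrated without mlr3. The following base R sketch is not part of the exercise: it assumes the built-in mtcars data, a linear model, and an arbitrarily chosen set of three candidate features, and it simply refits the model for every non-empty subset and compares the cross-validated error.

```r
# Wrapper-style feature selection in base R: refit the model for every
# candidate feature subset and compare the cross-validated error.
set.seed(1)
data = mtcars
target = "mpg"
candidates = c("wt", "hp", "qsec")  # illustrative choice, not from the exercise

# Assign each row to one of 3 folds
folds = sample(rep(1:3, length.out = nrow(data)))

# Cross-validated RMSE of a linear model restricted to `features`
cv_rmse = function(features) {
  form = reformulate(features, response = target)
  errs = sapply(1:3, function(k) {
    fit = lm(form, data = data[folds != k, ])
    pred = predict(fit, newdata = data[folds == k, ])
    sqrt(mean((data[folds == k, target] - pred)^2))
  })
  mean(errs)
}

# Enumerate all 2^3 - 1 = 7 non-empty subsets (exhaustive search)
subsets = unlist(
  lapply(seq_along(candidates), function(m) combn(candidates, m, simplify = FALSE)),
  recursive = FALSE
)

results = data.frame(
  features = sapply(subsets, paste, collapse = ", "),
  rmse = sapply(subsets, cv_rmse)
)
results[order(results$rmse), ]
```

Even with three candidates, this already requires 7 subsets times 3 folds = 21 model fits, which is why search strategies and terminators matter for larger feature sets.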

Wrapper methods are provided by the package mlr3fselect, which is built around the following R6 classes:

  • FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
  • FSelector: This class is the base class for implementations of feature selection algorithms.

Prerequisites

We load the most important packages and use a fixed seed for reproducibility.

library(mlr3verse)
library(data.table)
library(mlr3fselect)
set.seed(7891)

In this exercise, we will use the german_credit data and the learner classif.ranger:

task_gc = tsk("german_credit")
lrn_ranger = lrn("classif.ranger")

1 Basic Application

1.1 Create the Framework

Create an FSelectInstanceSingleCrit object using fsi(). The instance should use 3-fold cross-validation, use classification accuracy as the measure, and terminate after 20 evaluations. For simplicity, consider only the features age, amount, credit_history and duration.

Hint 1:
task_gc$select(...)

instance = fsi(
  task = ...,
  learner = ...,
  resampling = ...,
  measures = ...,
  terminator = ...
)
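
One possible way to fill in the template is sketched below. It assumes that mlr3verse, mlr3fselect and the ranger package are installed, and it uses the sugar functions rsmp(), msr() and trm() to build the resampling, measure and terminator.

```r
library(mlr3verse)
library(mlr3fselect)

task_gc = tsk("german_credit")
# Restrict the task to the four features named in the exercise
task_gc$select(c("age", "amount", "credit_history", "duration"))

lrn_ranger = lrn("classif.ranger")

instance = fsi(
  task = task_gc,
  learner = lrn_ranger,
  resampling = rsmp("cv", folds = 3),       # 3-fold cross-validation
  measures = msr("classif.acc"),            # classification accuracy
  terminator = trm("evals", n_evals = 20)   # stop after 20 evaluations
)
instance
```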

1.2 Start the Feature Selection

Start the feature selection with a sequential search: create an FSelector object for the sequential strategy via fs() and pass the FSelectInstanceSingleCrit object to its $optimize() method.

Hint 1:
fselector = fs(...)
Hint 2:
fselector = fs(...)
fselector$optimize(...)
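
A sketch of this step, under the same assumptions as in 1.1 (the instance is rebuilt here so that the block runs on its own). The strategy is set explicitly to "sfs" (sequential forward selection); this choice is an assumption, since the exercise only asks for a sequential search.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))

instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)

# Sequential forward selection: start with single features and add
# one feature per batch
fselector = fs("sequential", strategy = "sfs")
fselector$optimize(instance)

# Best feature set found and its cross-validated accuracy
instance$result_feature_set
instance$result_y
```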

1.3 Evaluate

View the four features and the accuracy stored in the instance archive for each of the first two batches.

Hint 1:
instance$archive$data[...]
Hint 2:
instance$archive$data[batch_nr == ..., ...]
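
A sketch of how the archive might be queried, again rebuilding the optimized instance from 1.1 and 1.2 so that the block is self-contained. In the archive, each feature appears as a logical column (TRUE if the feature was part of the evaluated subset) and the measure as a column named after its id, here classif.acc.

```r
library(mlr3verse)
library(mlr3fselect)
library(data.table)
set.seed(7891)

task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))

instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)
fs("sequential", strategy = "sfs")$optimize(instance)

# Rows of the first two batches; keep only the feature indicators
# and the accuracy
first_two = instance$archive$data[
  batch_nr %in% 1:2,
  .(batch_nr, age, amount, credit_history, duration, classif.acc)
]
first_two
```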

1.4 Model Training

Which feature(s) should be selected? Train the model.

Hint 1:

Compare the accuracy values for the different feature combinations and select the feature(s) accordingly.

Hint 2:
task_gc = ...
task_gc$select(...)
lrn_ranger$train(...)
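
One way to complete this step is sketched below. Instead of reading the best combination off the archive by hand, it takes the feature set from instance$result_feature_set; which features end up selected depends on the resampling splits and the seed, so no particular result is assumed here.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

task_gc = tsk("german_credit")
task_gc$select(c("age", "amount", "credit_history", "duration"))
lrn_ranger = lrn("classif.ranger")

instance = fsi(
  task = task_gc,
  learner = lrn_ranger,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)
fs("sequential", strategy = "sfs")$optimize(instance)

# Fresh task, reduced to the feature set with the best accuracy
task_gc = tsk("german_credit")
task_gc$select(instance$result_feature_set)

# Final model, trained on all observations with the selected features
lrn_ranger$train(task_gc)
lrn_ranger$model
```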

2 Multiple Performance Measures

To optimize several performance measures at once, follow the same steps as above, but pass multiple measures. First, create an instance object as above using the measures classif.tpr and classif.tnr. Second, run the feature selection with random search. Third, take a look at the results.

We again use the german_credit data:

task_gc = tsk("german_credit")

Hint 1:
instance = fsi(...)
fselector = fs(...)
fselector$...(...)
features = unlist(lapply(...))
cbind(features,...)
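
A sketch of the three steps, under the following assumptions: all features of the task are considered, and the resampling (3-fold cross-validation) and terminator (20 evaluations) are carried over from 1.1. Passing two measures makes fsi() create a multi-criteria instance, whose result is a set of non-dominated feature subsets rather than a single best one. The sketch uses data.table() in place of the cbind() from the hint.

```r
library(mlr3verse)
library(mlr3fselect)
library(data.table)
set.seed(7891)

task_gc = tsk("german_credit")

# Step 1: instance with two measures (true positive and true negative rate)
instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msrs(c("classif.tpr", "classif.tnr")),
  terminator = trm("evals", n_evals = 20)
)

# Step 2: random search over feature subsets
fselector = fs("random_search")
fselector$optimize(instance)

# Step 3: one row per non-dominated subset, features collapsed to a string
features = unlist(lapply(instance$result$features, paste, collapse = ", "))
pareto = data.table(features = features, instance$result_y)
pareto
```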

3 Nested Resampling

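Nested resampling yields an unbiased estimate of the performance of a learner that includes feature selection, because the features are selected in an inner resampling loop and the resulting model is evaluated on data not used for the selection. In mlr3 this is possible with the class AutoFSelector, an instance of which can be created with the function auto_fselector().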

3.1 Create an AutoFSelector Instance

Implement an AutoFSelector object that uses random search to find a feature selection that gives the highest accuracy for a logistic regression with holdout resampling. It should terminate after 10 evaluations.

Hint 1:
afs = auto_fselector(
  fselector = ...,
  learner = ...,
  resampling = ...,
  measure = ...,
  terminator = ...
)
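
A possible way to fill in the template, assuming mlr3learners is available for the logistic regression learner classif.log_reg:

```r
library(mlr3verse)
library(mlr3fselect)

afs = auto_fselector(
  fselector = fs("random_search"),          # search strategy
  learner = lrn("classif.log_reg"),         # logistic regression
  resampling = rsmp("holdout"),             # inner resampling
  measure = msr("classif.acc"),             # maximize accuracy
  terminator = trm("evals", n_evals = 10)   # stop after 10 evaluations
)
afs
```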

3.2 Benchmark

Compare the AutoFSelector with a plain logistic regression using 3-fold cross-validation.

Hint 1:

The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.

Hint 2:

Implement a benchmark grid and aggregate the result.
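
A sketch of such a benchmark, rebuilding the AutoFSelector from 3.1 so that the block runs on its own. The 3-fold cross-validation is the outer resampling; the holdout inside the AutoFSelector is the inner one. Accuracy is used for the comparison, which is an assumption, since the exercise does not name a measure for this step.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)
)

# Both learners are evaluated on the same outer cross-validation splits
design = benchmark_grid(
  tasks = tsk("german_credit"),
  learners = list(afs, lrn("classif.log_reg")),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)

# Mean accuracy over the outer folds for each learner
agg = bmr$aggregate(msr("classif.acc"))
agg[, c("learner_id", "classif.acc")]
```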

Summary

  • Wrapper methods calculate performance measures for various combinations of features in order to perform feature selection.
  • They are computationally expensive since several models need to be fitted.
  • The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.