library(mlr3verse)
library(data.table)
library(mlr3fselect)
set.seed(7891)
Goal
After this exercise, you should understand and be able to perform feature selection using wrapper methods with mlr3fselect. You should also be able to integrate various performance measures and estimate the generalization error.
Wrapper Methods
In addition to filtering, wrapper methods are another way of selecting features. While filter methods score features based on conditions on their values, wrapper methods fit the learner on different subsets of the feature set. As models need to be refitted repeatedly, this approach is computationally expensive.
For wrapper methods, we need the package mlr3fselect, at whose heart the following R6 classes are:
- FSelectInstanceSingleCrit, FSelectInstanceMultiCrit: These two classes describe the feature selection problem and store the results.
- FSelector: This class is the base class for implementations of feature selection algorithms.
Prerequisites
We load the most important packages and use a fixed seed for reproducibility.
In this exercise, we will use the german_credit data and the learner classif.ranger:
task_gc = tsk("german_credit")
lrn_ranger = lrn("classif.ranger")
1 Basic Application
1.1 Create the Framework
Create an FSelectInstanceSingleCrit object using fsi(). The instance should use 3-fold cross-validation, classification accuracy as the measure, and terminate after 20 evaluations. For simplification, only consider the features age, amount, credit_history and duration.
Hint 1:
task_gc$select(...)
instance = fsi(
  task = ...,
  learner = ...,
  resampling = ...,
  measure = ...,
  terminator = ...
)
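A filled-in version of this template might look as follows. This is a sketch, not the official solution; depending on the mlr3fselect version, the measure argument of fsi() may be named measure or measures.

```r
library(mlr3verse)
library(mlr3fselect)
set.seed(7891)

task_gc = tsk("german_credit")
# Restrict the task to the four features under consideration
task_gc$select(c("age", "amount", "credit_history", "duration"))

# Describe the feature selection problem: 3-fold CV, accuracy,
# and a terminator that stops the search after 20 evaluations
instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.acc"),
  terminator = trm("evals", n_evals = 20)
)
```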
1.2 Start the Feature Selection
Start the feature selection by creating a sequential FSelector via fs() and passing the FSelectInstanceSingleCrit object to the $optimize() method of the initialized FSelector object.
Hint 1:
fselector = fs(...)
Hint 2:
fselector = fs(...)
fselector$optimize(...)
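Continuing from the instance created in step 1.1, the two hints combine into the following sketch (sequential forward search is the default strategy of fs("sequential")):

```r
# Initialize the sequential search strategy
fselector = fs("sequential")

# Run the search; the best feature set found is stored in the instance
fselector$optimize(instance)
```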
1.3 Evaluate
View the four feature columns and the accuracy from the instance archive for each of the first two batches.
Hint 1:
instance$archive$data[...]
Hint 2:
instance$archive$data[batch_nr == ..., ...]
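Assuming the instance from the previous steps, the archive can be filtered by batch number. The feature columns are logicals indicating whether a feature was part of the evaluated subset; the exact column set may differ between package versions.

```r
# One row per evaluated subset; restrict to the first two batches
cols = c("age", "amount", "credit_history", "duration", "classif.acc")
instance$archive$data[batch_nr <= 2, ..cols]
```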
1.4 Model Training
Which feature(s) should be selected? Train the model.
Hint 1:
Compare the accuracy values for the different feature combinations and select the feature(s) accordingly.
Hint 2:
task_gc = ...
task_gc$select(...)
lrn_ranger$train(...)
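One way to finish this step (a sketch; $result_feature_set holds the best feature subset found during the search):

```r
# Reduce the task to the best feature set and refit the learner
task_gc$select(instance$result_feature_set)
lrn_ranger$train(task_gc)
```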
2 Multiple Performance Measures
To optimize multiple performance metrics, the same steps must be followed as above except that multiple metrics are passed. Create an ´instance´ object as above considering the measures classif.tpr
and classif.tnr
. For the second step use random search
and take a look at the results in a third step.
We again use the german_credit data:
task_gc = tsk("german_credit")
Hint 1:
instance = fsi(...)
fselector = fs(...)
fselector$...(...)
features = unlist(lapply(...))
cbind(features, ...)
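Putting the three steps together might look like the sketch below. With multiple measures, fsi() creates an FSelectInstanceMultiCrit, whose $result can contain several Pareto-optimal feature sets; $result$features is a list column, which motivates the unlist(lapply(...)) pattern in the hint.

```r
library(mlr3verse)
library(mlr3fselect)
library(data.table)
set.seed(7891)

task_gc = tsk("german_credit")

# Two measures make this a multi-criteria selection problem
instance = fsi(
  task = task_gc,
  learner = lrn("classif.ranger"),
  resampling = rsmp("cv", folds = 3),
  measures = msrs(c("classif.tpr", "classif.tnr")),
  terminator = trm("evals", n_evals = 20)
)

fselector = fs("random_search")
fselector$optimize(instance)

# Collapse each Pareto-optimal feature set into a readable string
features = unlist(lapply(instance$result$features, paste, collapse = ", "))
cbind(features, instance$result[, .(classif.tpr, classif.tnr)])
```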
3 Nested Resampling
Nested resampling makes it possible to obtain unbiased performance estimates when feature selection is part of the model-building process. In mlr3 this is possible with the class AutoFSelector, whose instances can be created with the function auto_fselector().
3.1 Create an AutoFSelector Instance
Implement an AutoFSelector object that uses random search to find a feature selection that gives the highest accuracy for a logistic regression with holdout resampling. It should terminate after 10 evaluations.
Hint 1:
afs = auto_fselector(
  fselector = ...,
  learner = ...,
  resampling = ...,
  measure = ...,
  terminator = ...
)
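Filled in, the template might read as follows (a sketch; classif.log_reg is the logistic regression learner from mlr3learners, loaded via mlr3verse):

```r
# Inner loop: random search over feature subsets, evaluated by
# holdout accuracy, stopping after 10 evaluations
afs = auto_fselector(
  fselector = fs("random_search"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("holdout"),
  measure = msr("classif.acc"),
  terminator = trm("evals", n_evals = 10)
)
```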
3.2 Benchmark
Compare the AutoFSelector with a normal logistic regression using 3-fold cross-validation.
Hint 1:
The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.
Hint 2:
Implement a benchmark grid and aggregate the result.
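Assuming the afs object from step 3.1, the benchmark could be sketched as:

```r
# The AutoFSelector is itself a Learner, so it can enter the grid directly;
# the outer 3-fold CV provides the unbiased performance estimate
design = benchmark_grid(
  tasks = tsk("german_credit"),
  learners = list(afs, lrn("classif.log_reg")),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)
bmr$aggregate(msr("classif.acc"))
```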
Summary
- Wrapper methods calculate performance measures for various combinations of features in order to perform feature selection.
- They are computationally expensive since several models need to be fitted.
- The AutoFSelector inherits from the Learner base class, which is why it can be used like any other learner.