Introduction
In this tutorial, we introduce the mlr3fselect package by comparing feature selection methods on the Titanic disaster data set. The objective of feature selection is to enhance the interpretability of models, speed up the learning process, and increase predictive performance.
We load the mlr3verse package, which pulls in the most important packages for this example.
library(mlr3verse)
library(mlr3fselect)
We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output concise.
set.seed(7832)
::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn") lgr
Titanic Data Set
The Titanic data set contains data for 887 Titanic passengers, including whether they survived when the Titanic sank. Our goal will be to predict the survival of the Titanic passengers.
After loading the data set from the mlr3data package, we impute the missing age values with the median age of the passengers, set missing embarked values to "S", and remove character features. We could use feature engineering to create new features from the character features, however we want to focus on feature selection in this tutorial.
In addition to the survived column, the reduced data set contains the following attributes for each passenger:
Feature | Description |
---|---|
age | Age |
sex | Sex |
sib_sp | Number of siblings / spouses aboard |
parch | Number of parents / children aboard |
fare | Amount paid for the ticket |
pclass | Passenger class |
embarked | Port of embarkation |
library(mlr3data)
data("titanic", package = "mlr3data")
titanic$age[is.na(titanic$age)] = median(titanic$age, na.rm = TRUE)
titanic$embarked[is.na(titanic$embarked)] = "S"
titanic$ticket = NULL
titanic$name = NULL
titanic$cabin = NULL
titanic = titanic[!is.na(titanic$survived), ]
We construct a binary classification task.
task = as_task_classif(titanic, target = "survived", positive = "yes")
Model
We use the logistic regression learner provided by the mlr3learners package.
library(mlr3learners)
learner = lrn("classif.log_reg")
To evaluate the predictive performance, we choose a 3-fold cross-validation and the classification error as the measure.
= rsmp("cv", folds = 3)
resampling = msr("classif.ce")
measure
$instantiate(task) resampling
Classes
The FSelectInstanceSingleCrit class specifies a general feature selection scenario. It includes the ObjectiveFSelect object that encodes the black box objective function which is optimized by a feature selection algorithm. The evaluated feature sets are stored in an ArchiveFSelect object. The archive provides a method for querying the best performing feature set.
The Terminator classes determine when to stop the feature selection. In this example, we choose a terminator that stops the feature selection after 10 seconds. The sugar functions trm() and trms() can be used to retrieve terminators from the mlr_terminators dictionary.
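For illustration, trms() retrieves several terminators at once as a list; a quick sketch using two dictionary keys:
trms(c("run_time", "evals"))
We construct the run-time terminator and the feature selection instance: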
= trm("run_time", secs = 10)
terminator $new(
FSelectInstanceSingleCrittask = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator)
<FSelectInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveFSelect:classif.log_reg_on_titanic>
* Terminator: <TerminatorRunTime>
The FSelector subclasses describe the feature selection strategy. The sugar function fs() can be used to retrieve feature selection algorithms from the mlr_fselectors dictionary.
mlr_fselectors
<DictionaryFSelector> with 8 stored values
Keys: design_points, exhaustive_search, genetic_search, random_search, rfe, rfecv, sequential,
shadow_variable_search
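For a tabular overview of the available algorithms, the dictionary can also be converted with the standard as.data.table() converter; a small sketch:
as.data.table(mlr_fselectors)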
Random search
Random search randomly draws feature sets and evaluates them in batches. We retrieve the FSelectorRandomSearch class with the fs() sugar function and choose TerminatorEvals. We set the n_evals parameter to 10, which means that 10 feature sets are evaluated.
= trm("evals", n_evals = 10)
terminator = FSelectInstanceSingleCrit$new(
instance task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator)
= fs("random_search", batch_size = 5) fselector
The feature selection is started by passing the FSelectInstanceSingleCrit object to the $optimize() method of FSelectorRandomSearch, which generates the feature sets. These feature sets are internally passed to the $eval_batch() method of FSelectInstanceSingleCrit, which evaluates them with the objective function and stores the results in the archive. This general interaction between the objects of mlr3fselect stays the same for the different feature selection methods. However, the way new feature sets are generated differs depending on the chosen FSelector subclass.
fselector$optimize(instance)
age embarked fare parch pclass sex sib_sp features classif.ce
1: TRUE FALSE TRUE TRUE TRUE TRUE TRUE age,fare,parch,pclass,sex,sib_sp 0.2020202
The ArchiveFSelect stores a data.table::data.table() which consists of the evaluated feature sets and the corresponding estimated predictive performances.
as.data.table(instance$archive, exclude_columns = c("runtime_learners", "resample_result", "uhash"))
age embarked fare parch pclass sex sib_sp classif.ce timestamp batch_nr warnings errors
1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2031425 2023-03-03 10:45:07 1 0 0
2: TRUE FALSE FALSE FALSE FALSE FALSE TRUE 0.3838384 2023-03-03 10:45:07 1 0 0
3: FALSE FALSE FALSE TRUE FALSE FALSE TRUE 0.3804714 2023-03-03 10:45:07 1 0 0
4: FALSE FALSE TRUE FALSE FALSE FALSE FALSE 0.3288440 2023-03-03 10:45:07 1 0 0
5: FALSE FALSE TRUE FALSE FALSE TRUE FALSE 0.2188552 2023-03-03 10:45:07 1 0 0
6: FALSE FALSE FALSE FALSE TRUE FALSE FALSE 0.3209877 2023-03-03 10:45:08 2 0 0
7: TRUE FALSE FALSE FALSE FALSE FALSE TRUE 0.3838384 2023-03-03 10:45:08 2 0 0
8: TRUE FALSE TRUE TRUE TRUE TRUE TRUE 0.2020202 2023-03-03 10:45:08 2 0 0
9: TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.2031425 2023-03-03 10:45:08 2 0 0
10: TRUE FALSE TRUE TRUE FALSE FALSE FALSE 0.3389450 2023-03-03 10:45:08 2 0 0
features
1: age,embarked,fare,parch,pclass,sex,...
2: age,sib_sp
3: parch,sib_sp
4: fare
5: fare,sex
6: pclass
7: age,sib_sp
8: age,fare,parch,pclass,sex,sib_sp
9: age,embarked,fare,parch,pclass,sex,...
10: age,fare,parch
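Alternatively, the best performing feature set can be queried directly from the archive; a minimal sketch using the $best() method that ArchiveFSelect inherits from bbotk's Archive class:
instance$archive$best()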
The associated resampling iterations can be accessed in the BenchmarkResult by calling
instance$archive$benchmark_result
<BenchmarkResult> of 30 rows with 10 resampling runs
nr task_id learner_id resampling_id iters warnings errors
1 titanic classif.log_reg cv 3 0 0
2 titanic classif.log_reg cv 3 0 0
3 titanic classif.log_reg cv 3 0 0
4 titanic classif.log_reg cv 3 0 0
5 titanic classif.log_reg cv 3 0 0
6 titanic classif.log_reg cv 3 0 0
7 titanic classif.log_reg cv 3 0 0
8 titanic classif.log_reg cv 3 0 0
9 titanic classif.log_reg cv 3 0 0
10 titanic classif.log_reg cv 3 0 0
We retrieve the best performing feature set with
instance$result
age embarked fare parch pclass sex sib_sp features classif.ce
1: TRUE FALSE TRUE TRUE TRUE TRUE TRUE age,fare,parch,pclass,sex,sib_sp 0.2020202
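To make use of the optimized feature set, we could reduce the task to the selected features and train the learner on all observations; a minimal sketch (working on a clone so the original task stays intact):
# copy the task and subset it to the best feature set
task_opt = task$clone()
task_opt$select(instance$result_feature_set)

# fit the final model on the complete data
learner$train(task_opt)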
Sequential forward selection
We try sequential forward selection. We choose TerminatorStagnation, which stops the feature selection if the predictive performance does not increase anymore.
= trm("stagnation", iters = 5)
terminator = FSelectInstanceSingleCrit$new(
instance task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator)
= fs("sequential")
fselector $optimize(instance) fselector
age embarked fare parch pclass sex sib_sp features classif.ce
1: FALSE FALSE FALSE TRUE TRUE TRUE TRUE parch,pclass,sex,sib_sp 0.1964085
The FSelectorSequential object has a special method for displaying the optimization path of the sequential feature selection.
fselector$optimization_path(instance)
age embarked fare parch pclass sex sib_sp classif.ce batch_nr
1: TRUE FALSE FALSE FALSE FALSE FALSE FALSE 0.3838384 1
2: TRUE FALSE FALSE FALSE FALSE TRUE FALSE 0.2132435 2
3: TRUE FALSE FALSE FALSE FALSE TRUE TRUE 0.2087542 3
4: TRUE FALSE FALSE FALSE TRUE TRUE TRUE 0.2143659 4
5: TRUE FALSE FALSE TRUE TRUE TRUE TRUE 0.2065095 5
6: TRUE FALSE TRUE TRUE TRUE TRUE TRUE 0.2020202 6
Recursive feature elimination
Recursive feature elimination utilizes the $importance() method of learners. In each iteration, the feature(s) with the lowest importance score are dropped. We choose the non-recursive algorithm (recursive = FALSE), which calculates the feature importance once on the complete feature set. The recursive version (recursive = TRUE) recomputes the feature importance on the reduced feature set in every iteration.
= lrn("classif.ranger", importance = "impurity")
learner = trm("none")
terminator = FSelectInstanceSingleCrit$new(
instance task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator,
store_models = TRUE)
= fs("rfe", recursive = FALSE)
fselector $optimize(instance) fselector
age embarked fare parch pclass sex sib_sp features classif.ce
1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE age,embarked,fare,parch,pclass,sex,... 0.1694725
We access the results.
as.data.table(instance$archive, exclude_columns = c("runtime_learners", "timestamp", "batch_nr", "resample_result", "uhash"))
age embarked fare parch pclass sex sib_sp classif.ce warnings errors importance
1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE 0.1694725 0 0 7,6,5,4,3,2,...
2: TRUE FALSE TRUE FALSE FALSE TRUE FALSE 0.2132435 0 0 7,6,5
features
1: age,embarked,fare,parch,pclass,sex,...
2: age,fare,sex
Nested resampling
It is a common mistake to report the predictive performance estimated on resampling sets during the feature selection as the performance that can be expected from the combined feature selection and model training. The repeated evaluation of the model might leak information about the test sets into the model and thus lead to over-fitting and over-optimistic performance results. Nested resampling uses an outer and inner resampling to separate the feature selection from the performance estimation of the model. We can use the AutoFSelector class for running nested resampling. The AutoFSelector essentially combines a given Learner and feature selection method into a Learner with internal automatic feature selection. The inner resampling loop that is used to determine the best feature set is conducted internally each time the AutoFSelector Learner object is trained.
= rsmp("cv", folds = 5)
resampling_inner = msr("classif.ce")
measure
= AutoFSelector$new(
at learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("sequential"),
store_models = TRUE)
We put the AutoFSelector into a resample() call to get the outer resampling loop.
= rsmp("cv", folds = 3)
resampling_outer
= resample(task, at, resampling_outer, store_models = TRUE) rr
The aggregated performance of all outer resampling iterations is the unbiased predictive performance we can expect from the random forest model with an optimized feature set found by sequential selection.
rr$aggregate()
classif.ce
0.1829405
We check whether the feature sets that were selected in the inner resampling are stable. The selected feature sets should not differ too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduce too much randomness. Usually, we aim for the selection of similar feature sets for all outer training sets.
extract_inner_fselect_results(rr)
Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling. Significantly lower predictive performances on the outer resampling indicate that the models with the optimized feature sets overfit the data.
rr$score()[, .(iteration, task_id, learner_id, resampling_id, classif.ce)]
iteration task_id learner_id resampling_id classif.ce
1: 1 titanic classif.ranger.fselector cv 0.1515152
2: 2 titanic classif.ranger.fselector cv 0.1952862
3: 3 titanic classif.ranger.fselector cv 0.2020202
The archives of the AutoFSelectors give us all evaluated feature sets with the associated predictive performances.
extract_inner_fselect_archives(rr)
Shortcuts
Selecting a feature subset can be shortened by using the fselect() shortcut.
instance = fselect(
  method = "random_search",
  task = tsk("iris"),
  learner = lrn("classif.log_reg"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 10
)
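The returned instance can be inspected as before:
instance$result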
Applying nested resampling can be shortened by using the fselect_nested() shortcut.
rr = fselect_nested(
  method = "random_search",
  task = tsk("iris"),
  learner = lrn("classif.log_reg"),
  inner_resampling = rsmp("cv", folds = 3),
  outer_resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 10
)
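As before, the unbiased performance estimate can be obtained from the resulting ResampleResult:
rr$aggregate()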