Goal

Our goal for this exercise sheet is to use mlr3 to benchmark models in multiple scenarios, using hypothesis tests as an additional diagnostic tool to make the benchmark more rigorous.

Required packages

```r
library(mlr3oml)
library(mlr3verse)
library(mlr3learners)
library(mlr3benchmark)
library(tidyverse)
library(ggplot2)
library(PMCMRplus)
set.seed(20220801)
```
1 Two Algorithms on One Data Set
Let’s start with a simple example that compares two different learners on a single data set.
1.1 Train Models
Train a random forest from the ranger package and a classification tree from the rpart package using mlr3 with default hyperparameters on the German credit task "german_credit". The models are used in the next step to predict class probabilities.
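A minimal sketch of one possible solution, using the learner IDs classif.ranger and classif.rpart with probability predictions:

```r
# Load the German credit task and set up both learners with probability predictions
task <- tsk("german_credit")

learner_ranger <- lrn("classif.ranger", predict_type = "prob")
learner_rpart  <- lrn("classif.rpart", predict_type = "prob")

# Train both models with default hyperparameters on the full task
learner_ranger$train(task)
learner_rpart$train(task)
```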
1.2 Get Predictions
Create a data.frame. For each row in the credit data, it should contain the ground truth label as well as the predicted probabilities and predicted labels from rpart and ranger, respectively.
Hint 1:
You can call $predict_newdata() on the trained model object to make predictions.
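One way to assemble such a table; that the positive class of the task is "good" is an assumption about the target encoding:

```r
# Predict on the same data the models were trained on
credit_data <- task$data()
pred_ranger <- learner_ranger$predict_newdata(credit_data)
pred_rpart  <- learner_rpart$predict_newdata(credit_data)

# Ground truth, predicted probabilities (positive class assumed to be "good"),
# and predicted labels of both models in one data.frame
df <- data.frame(
  truth        = pred_ranger$truth,
  prob_ranger  = pred_ranger$prob[, "good"],
  prob_rpart   = pred_rpart$prob[, "good"],
  label_ranger = pred_ranger$response,
  label_rpart  = pred_rpart$response
)
```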
1.3 Evaluate models
Add two new columns with the observation-wise loss values for the Brier score. Compare the performance of both models using these columns.
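A sketch of the observation-wise Brier loss, again assuming "good" is the positive class:

```r
# Observation-wise Brier loss: squared difference between the predicted probability
# of the positive class and the 0/1 indicator of the observed label
df$brier_ranger <- (df$prob_ranger - as.numeric(df$truth == "good"))^2
df$brier_rpart  <- (df$prob_rpart  - as.numeric(df$truth == "good"))^2

# Mean Brier score per model
colMeans(df[, c("brier_ranger", "brier_rpart")])
```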
1.4 Two-sample t-test
Use a two-sample t-test at an alpha of 5% to evaluate whether both samples of performance scores come from different populations.
Hint 1:
Add another column with the difference between observation-wise loss values. Then run the t-test. The value of the quantile function of the t-distribution can be computed with qt().
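A sketch of the manual test statistic, with the built-in t.test() as a cross-check; since the differences are formed observation-wise, this is effectively a paired t-test:

```r
# Per-observation difference in Brier loss
df$diff <- df$brier_rpart - df$brier_ranger

n <- nrow(df)
t_stat <- mean(df$diff) / (sd(df$diff) / sqrt(n))

# Two-sided critical value at alpha = 0.05
t_crit <- qt(1 - 0.05 / 2, df = n - 1)
abs(t_stat) > t_crit  # reject H0 if TRUE

# Sanity check with the built-in test
t.test(df$brier_rpart, df$brier_ranger, paired = TRUE)
```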
1.5 McNemar test
Now run the McNemar test for an alpha of 5%. This is a non-parametric test that compares only the labels predicted by two models.
Hint 1:
You will need the total number of observations that are classified correctly by rpart only and those that are classified correctly by ranger only. The value of the quantile function of the chi-squared distribution can be computed with qchisq().
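A possible computation, with mcnemar.test() from stats as a sanity check:

```r
# Which observations does each model classify correctly?
correct_rpart  <- df$label_rpart  == df$truth
correct_ranger <- df$label_ranger == df$truth

# Observations classified correctly by exactly one of the two models
n_rpart_only  <- sum(correct_rpart & !correct_ranger)
n_ranger_only <- sum(!correct_rpart & correct_ranger)

# McNemar statistic (with continuity correction) and critical value
chi2 <- (abs(n_rpart_only - n_ranger_only) - 1)^2 / (n_rpart_only + n_ranger_only)
chi2 > qchisq(1 - 0.05, df = 1)  # reject H0 if TRUE

# Cross-check with the built-in test
mcnemar.test(table(correct_rpart, correct_ranger))
```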
2 Two Algorithms on Multiple Data Sets
Let us now check whether this result holds for other data sets as well. We will first scout OpenML (package mlr3oml) for suitable classification tasks.
2.1 Get Tasks from OpenML
Use the function list_oml_tasks() to look for tasks with the following characteristics (a possible query is sketched after the list):
- binary classification
- number of features between 5 and 10
- number of instances between 500 and 10000
- no missing values
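One possible query; the exact filter argument names are an assumption about the mlr3oml listing interface and may need to be checked against its documentation:

```r
# Query OpenML for binary classification tasks matching the criteria
# (filter argument names assumed to follow the mlr3oml listing interface)
otasks <- list_oml_tasks(
  type = "classif",
  number_classes = 2,
  number_features = c(5, 10),
  number_instances = c(500, 10000),
  number_missing_values = 0
)
```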
2.2 Filter tasks
Filter out tasks with the same data_id
. Use only tasks with balanced data sets where the ratio between the majority and minority class is smaller than 1.2. Also, remove tasks with a data_id
of 720 and with target “gender” or “Class”. You should receive a total number of 29 tasks.
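One way to apply these filters; the column names used here (data_id, target_feature, MajorityClassSize, MinorityClassSize) are assumed to match the OpenML listing returned above:

```r
# Keep one task per data set, require roughly balanced classes,
# and drop the explicitly excluded data set and target columns
otasks_filtered <- otasks |>
  distinct(data_id, .keep_all = TRUE) |>
  filter(MajorityClassSize / MinorityClassSize < 1.2) |>
  filter(data_id != 720, !target_feature %in% c("gender", "Class"))

nrow(otasks_filtered)  # the exercise states 29 tasks should remain
```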
2.3 Benchmark tasks
Benchmark rpart and ranger with mlr3 on all the tasks you found. Use one-hot encoding and three-fold cross-validation.
Hint 1:
Use po("encode", method = "one-hot")
.
Hint 2:
A benchmark design can be created with design = benchmark_grid(tasklist, learners, resampling) and evaluated with benchmark(design).
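A sketch of the benchmark setup; the tsk("oml", task_id = ...) shortcut for loading OpenML tasks is an assumption about mlr3oml:

```r
# Turn the OpenML tasks into mlr3 tasks (assuming the "oml" task shortcut from mlr3oml)
tasklist <- lapply(otasks_filtered$task_id, function(id) tsk("oml", task_id = id))

# Wrap both learners in a one-hot encoding pipeline
learners <- list(
  as_learner(po("encode", method = "one-hot") %>>% lrn("classif.rpart")),
  as_learner(po("encode", method = "one-hot") %>>% lrn("classif.ranger"))
)

# Three-fold cross-validation on every task
design <- benchmark_grid(tasklist, learners, rsmp("cv", folds = 3))
bmr <- benchmark(design)
```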
2.4 Compare Learners
Apply the $aggregate() function to the mlr3 benchmark object and compare the ranks of both algorithms on all tasks.
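One way to aggregate and rank, using dplyr on the aggregated data.table:

```r
# Aggregate the mean misclassification error per task and learner
aggr <- bmr$aggregate(msr("classif.ce"))

# Rank the two learners within each task (rank 1 = lower error)
ranks <- aggr |>
  select(task_id, learner_id, classif.ce) |>
  group_by(task_id) |>
  mutate(rank = rank(classif.ce)) |>
  ungroup()
```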
2.5 Wilcoxon test
Run the Wilcoxon signed-rank test using the ranks you computed. You can use qsignrank(p = 0.05 / 2, n = M) to compute the critical value for the lower tail of the two-sided test at a 5% significance level.
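A sketch working on the per-task score differences, with wilcox.test() as a cross-check; the grep() calls assume the pipeline learner IDs contain "rpart" and "ranger":

```r
# Wide table of per-task scores: one column per learner
scores <- aggr |>
  select(task_id, learner_id, classif.ce) |>
  pivot_wider(names_from = learner_id, values_from = classif.ce)

# Learner column names depend on the pipeline IDs, so look them up
col_rpart  <- grep("rpart",  names(scores), value = TRUE)
col_ranger <- grep("ranger", names(scores), value = TRUE)

d <- scores[[col_rpart]] - scores[[col_ranger]]  # per-task score difference
r <- rank(abs(d))                                # ranks of absolute differences
w <- min(sum(r[d > 0]), sum(r[d < 0]))           # signed-rank statistic W

M <- length(d)
w < qsignrank(p = 0.05 / 2, n = M)               # reject H0 if TRUE (lower tail)

# Cross-check with the built-in implementation
wilcox.test(scores[[col_rpart]], scores[[col_ranger]], paired = TRUE)
```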
3 Multiple Algorithms on Multiple Data Sets
3.1 Benchmark learners
Let us now compare more algorithms on each task. Rerun the benchmark with the learners "classif.featureless", "classif.cv_glmnet", "classif.rpart", "classif.ranger", "classif.kknn", and "classif.svm". As before, use one-hot encoding.
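A sketch of the extended benchmark, reusing the task list from above:

```r
# Learner IDs to compare; each is wrapped in a one-hot encoding pipeline
learner_ids <- c("classif.featureless", "classif.cv_glmnet", "classif.rpart",
                 "classif.ranger", "classif.kknn", "classif.svm")

learners <- lapply(learner_ids, function(id) {
  as_learner(po("encode", method = "one-hot") %>>% lrn(id))
})

design <- benchmark_grid(tasklist, learners, rsmp("cv", folds = 3))
bmr <- benchmark(design)
aggr <- bmr$aggregate(msr("classif.ce"))
```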
3.2 Friedman test
Given multiple algorithms on multiple data sets, we have to use an omnibus test such as the Friedman test. Compute a rank table that tells you the rank of each algorithm for each task. Then, compute the average rank of each algorithm and proceed with the computation of the Friedman statistic.
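A possible computation of the rank table, the average ranks, and the Friedman statistic:

```r
# Rank table: rank of each learner within each task (rank 1 = lowest error)
rank_table <- aggr |>
  select(task_id, learner_id, classif.ce) |>
  group_by(task_id) |>
  mutate(rank = rank(classif.ce)) |>
  ungroup()

# Average rank per learner
avg_ranks <- rank_table |>
  group_by(learner_id) |>
  summarise(avg_rank = mean(rank))

n <- length(unique(rank_table$task_id))  # number of tasks
k <- nrow(avg_ranks)                     # number of learners

# Friedman statistic (chi-squared approximation with k - 1 degrees of freedom)
chi2_f <- 12 * n / (k * (k + 1)) * (sum(avg_ranks$avg_rank^2) - k * (k + 1)^2 / 4)
chi2_f > qchisq(0.95, df = k - 1)  # reject H0 if TRUE
```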
3.3 Friedman test (stats)
Run a sanity check with the friedman.test function implemented in the stats package.
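A sketch that reshapes the aggregated scores into the matrix layout friedman.test() expects (blocks in rows, groups in columns):

```r
# One row per task (block), one column per learner (group)
score_matrix <- aggr |>
  select(task_id, learner_id, classif.ce) |>
  pivot_wider(names_from = learner_id, values_from = classif.ce) |>
  column_to_rownames("task_id") |>
  as.matrix()

friedman.test(score_matrix)
```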
3.4 Nemenyi test
As the Friedman test indicates that at least one algorithm performs differently, we can run pairwise comparisons with post-hoc tests such as the Nemenyi or Bonferroni-Dunn test.
Use the function frdAllPairsNemenyiTest from the PMCMRplus package to run all pairwise Nemenyi tests.
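A minimal call, passing the same score matrix as before:

```r
# All pairwise Nemenyi comparisons; rows are blocks (tasks), columns are groups (learners)
frdAllPairsNemenyiTest(y = score_matrix)
```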
3.5 Compute critical difference
Manually compute the critical difference for rpart and ranger.
Hint 1:
Use the qtukey function.
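A sketch of the Nemenyi critical difference, reusing k and n from the Friedman computation:

```r
# Studentized range quantile for k learners, rescaled as in the Nemenyi test
q_alpha <- qtukey(1 - 0.05, nmeans = k, df = Inf) / sqrt(2)

# Critical difference for average ranks; compare it against the observed
# difference in average rank between rpart and ranger (see avg_ranks above)
cd <- q_alpha * sqrt(k * (k + 1) / (6 * n))
cd
```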
3.6 Bonferroni-Dunn test
Manually compare rpart and ranger with the Bonferroni-Dunn test.
Hint 1:
The probability of observing the test statistic under the null hypothesis is given by pnorm(..., 0, 1, lower.tail = FALSE).
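A sketch of the pairwise comparison; treating the test as two-sided and dividing alpha by k - 1 comparisons is one common convention for the Bonferroni-Dunn correction:

```r
# Average ranks of the two learners of interest (names assumed to contain
# "rpart" and "ranger", matching the pipeline learner IDs)
r_rpart  <- avg_ranks$avg_rank[grepl("rpart",  avg_ranks$learner_id)]
r_ranger <- avg_ranks$avg_rank[grepl("ranger", avg_ranks$learner_id)]

# Test statistic of the pairwise comparison
z <- (r_rpart - r_ranger) / sqrt(k * (k + 1) / (6 * n))
p_one_sided <- pnorm(abs(z), 0, 1, lower.tail = FALSE)

# Bonferroni-Dunn: compare the two-sided p-value against alpha / (k - 1)
2 * p_one_sided < 0.05 / (k - 1)  # reject H0 if TRUE
```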
3.7 Critical difference plot
Interestingly, both tests differ in this case: the Nemenyi test lets us reject the null hypothesis, while the Bonferroni-Dunn test does not. Next, create a critical difference plot with mlr3.
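One possible approach via mlr3benchmark; as_benchmark_aggr() and the "cd" autoplot type are assumptions about that package's interface:

```r
# Aggregate the benchmark result and draw a critical difference plot
# (function name and plot type assumed from mlr3benchmark)
bma <- as_benchmark_aggr(bmr, measures = msrs("classif.ce"))
autoplot(bma, type = "cd")
```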