Benchmarking Hypothesis

Benchmark models in multiple scenarios, using hypothesis tests as an additional diagnostic tool to make the benchmark more rigorous.

Goal

Our goal for this exercise sheet is to use mlr3 to benchmark models in multiple scenarios, using hypothesis tests as an additional diagnostic tool to make the benchmark more rigorous.

Required packages

```r
library(mlr3oml)
library(mlr3verse)
library(mlr3learners)
library(mlr3benchmark)
library(tidyverse)
library(ggplot2)
library(PMCMRplus)
set.seed(20220801)
```

1 Two Algorithms on One Data Set

Let’s start with a simple example that compares two different learners on a single data set.

1.1 Train Models

Train a random forest from the ranger package and a classification tree from the rpart package using mlr3 with default hyperparameters on the German credit task "german_credit". The trained models are used in the next step to predict class probabilities.
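A minimal sketch of this step, assuming the task "german_credit" that ships with mlr3:

```r
# train a random forest and a classification tree with default hyperparameters;
# predict_type = "prob" is needed to obtain class probabilities later
task = tsk("german_credit")

learner_ranger = lrn("classif.ranger", predict_type = "prob")
learner_rpart = lrn("classif.rpart", predict_type = "prob")

learner_ranger$train(task)
learner_rpart$train(task)
```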

1.2 Get Predictions

Create a data.frame that contains, for each row of the credit data, the ground-truth label as well as the predicted probabilities and predicted labels from rpart and ranger, respectively.

Hint 1:

You can call $predict_newdata() on the trained model object to make predictions.
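A sketch of one way to collect the predictions, assuming the objects from the previous step; here the models predict on the full credit data via $predict(), which is equivalent to calling $predict_newdata() on the underlying data.frame. The column "good" is assumed to be the positive class of the task:

```r
# predict class probabilities and labels for every row of the credit data
pred_ranger = learner_ranger$predict(task)
pred_rpart = learner_rpart$predict(task)

df = data.frame(
  truth = pred_ranger$truth,
  prob_ranger = pred_ranger$prob[, "good"],  # probability of the positive class "good"
  prob_rpart = pred_rpart$prob[, "good"],
  label_ranger = pred_ranger$response,
  label_rpart = pred_rpart$response
)
head(df)
```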

1.3 Evaluate models

Add two new columns containing the observation-wise Brier score loss of each model. Compare the performance of both models using these columns.
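A sketch, assuming the data.frame df from above and "good" as the positive class:

```r
# observation-wise Brier score: squared difference between the predicted
# probability of the positive class and the 0/1 indicator of the true label
y = as.integer(df$truth == "good")
df$brier_ranger = (df$prob_ranger - y)^2
df$brier_rpart = (df$prob_rpart - y)^2

# mean Brier score per model (lower is better)
colMeans(df[, c("brier_ranger", "brier_rpart")])
```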

1.4 Two sample t-test

Use a two-sample t-test (paired, since both models are evaluated on the same observations) with an alpha of 5% to evaluate whether both samples of performance scores come from different populations.

Hint 1:

Add another column with the difference between observation-wise loss values. Then run the t-test. The value of the quantile function of the t-distribution can be computed with qt().
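A sketch following the hint; since the two loss samples are paired, the test on the differences is equivalent to a one-sample t-test:

```r
# per-observation difference of the Brier losses
df$diff = df$brier_rpart - df$brier_ranger

n = nrow(df)
t_stat = mean(df$diff) / (sd(df$diff) / sqrt(n))
t_crit = qt(1 - 0.05 / 2, df = n - 1)
abs(t_stat) > t_crit  # reject H0 at alpha = 0.05 if TRUE

# sanity check with the built-in test
t.test(df$brier_rpart, df$brier_ranger, paired = TRUE)
```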

1.5 McNemar test

Now run the McNemar test for an alpha of 5%. This is a non-parametric test that compares only the labels predicted by two models.

Hint 1:

You will need the number of observations that are classified correctly by rpart only and the number classified correctly by ranger only. The value of the quantile function of the chi-squared distribution can be computed with qchisq().
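A sketch of the manual computation, assuming the data.frame df from above; mcnemar.test() serves as a sanity check:

```r
# which observations does each model classify correctly?
correct_ranger = df$label_ranger == df$truth
correct_rpart = df$label_rpart == df$truth

b = sum(correct_rpart & !correct_ranger)  # correct by rpart only
c = sum(correct_ranger & !correct_rpart)  # correct by ranger only

# McNemar statistic with continuity correction, compared to the chi-squared quantile
chi2 = (abs(b - c) - 1)^2 / (b + c)
chi2 > qchisq(1 - 0.05, df = 1)  # reject H0 at alpha = 0.05 if TRUE

# sanity check
mcnemar.test(table(correct_rpart, correct_ranger))
```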

2 Two Algorithms on Multiple Data Sets

Let us now check whether this result holds for other data sets as well. We will first scout OpenML (package mlr3oml) for suitable classification tasks.

2.1 Get Tasks from OpenML

Use the function list_oml_tasks() to look for tasks with the following characteristics (a query sketch follows the list):

  • binary classification
  • number of features between 5 and 10
  • number of instances between 500 and 10000
  • no missing values
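A sketch of the query; it assumes that list_oml_tasks() accepts these filter arguments and length-two vectors as ranges, which may differ between mlr3oml versions:

```r
# query OpenML for tasks matching the criteria above;
# number_classes = 2 restricts the search to binary classification
otasks = list_oml_tasks(
  number_classes = 2,
  number_features = c(5, 10),
  number_instances = c(500, 10000),
  number_missing_values = 0
)
```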

2.2 Filter tasks

Remove duplicate tasks that share the same data_id, keeping only one task per data set. Use only tasks with balanced data sets, i.e., where the ratio between the majority and minority class is smaller than 1.2. Also, remove the task with a data_id of 720 and tasks with target "gender" or "Class". You should end up with a total of 29 tasks.
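A sketch of the filtering; the column names used here (data_id, target_feature, MajorityClassSize, MinorityClassSize) are assumptions that may differ between mlr3oml versions:

```r
# keep one task per data set
otasks = otasks[!duplicated(otasks$data_id), ]
# keep roughly balanced data sets only
otasks = otasks[otasks$MajorityClassSize / otasks$MinorityClassSize < 1.2, ]
# drop data_id 720 and tasks with target "gender" or "Class"
otasks = otasks[otasks$data_id != 720, ]
otasks = otasks[!(otasks$target_feature %in% c("gender", "Class")), ]
nrow(otasks)  # should be 29 according to the exercise
```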

2.3 Benchmark tasks

Benchmark rpart and ranger on all found tasks with mlr3. Use one-hot encoding and three-fold cross-validation.

Hint 1:

Use po("encode", method = "one-hot").

Hint 2:

A benchmark design can be created with design = benchmark_grid(tasklist, learners, resampling) and evaluated with benchmark(design).
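A sketch combining both hints; converting the OpenML tasks via tsk("oml", task_id = ...) is an assumption that requires a reasonably recent mlr3oml version:

```r
# build mlr3 tasks from the OpenML task ids found above
tasks = lapply(otasks$task_id, function(id) tsk("oml", task_id = id))

# wrap both learners in a one-hot encoding pipeline
learners = list(
  as_learner(po("encode", method = "one-hot") %>>% lrn("classif.rpart")),
  as_learner(po("encode", method = "one-hot") %>>% lrn("classif.ranger"))
)

design = benchmark_grid(tasks, learners, rsmp("cv", folds = 3))
bmr = benchmark(design)
```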

2.4 Compare Learners

Apply the $aggregate() method to the mlr3 benchmark object and compare the ranks of both algorithms across all tasks.
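A sketch, assuming the benchmark result bmr from the previous step and the classification error as performance measure:

```r
# aggregate the classification error per task and learner
aggr = bmr$aggregate(msr("classif.ce"))
scores = data.frame(task = aggr$task_id, learner = aggr$learner_id, ce = aggr$classif.ce)

# rank the learners within each task (rank 1 = lower error = better)
scores$rank = ave(scores$ce, scores$task, FUN = rank)
table(scores$learner, scores$rank)
```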

2.5 Wilcoxon test

Run the Wilcoxon signed-rank test on the per-task performance differences between the two learners. You can use qsignrank(p = 0.05 / 2, n = M), where M is the number of tasks, to compute the critical value for the lower tail of the two-sided test at a 5% significance level.
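A sketch of the manual test, assuming the scores data.frame from the previous step; the learner ids ("encode.classif.rpart", "encode.classif.ranger") depend on how the pipeline learners were constructed, and both score vectors are assumed to be in the same task order:

```r
ce_rpart = scores$ce[scores$learner == "encode.classif.rpart"]
ce_ranger = scores$ce[scores$learner == "encode.classif.ranger"]

# Wilcoxon signed-rank statistic: rank the absolute per-task differences
d = ce_rpart - ce_ranger
r = rank(abs(d))
w = min(sum(r[d > 0]), sum(r[d < 0]))

M = length(d)  # number of tasks
w_crit = qsignrank(p = 0.05 / 2, n = M)
w <= w_crit  # reject H0 at alpha = 0.05 if TRUE

# sanity check (handles ties and zero differences slightly differently)
wilcox.test(ce_rpart, ce_ranger, paired = TRUE)
```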

3 Multiple Algorithms on Multiple Data Sets

3.1 Benchmark learners

Let us now compare more algorithms on each task. Rerun the benchmark with the learners "classif.featureless", "classif.cv_glmnet", "classif.rpart", "classif.ranger", "classif.kknn", and "classif.svm". As before, use one-hot encoding.
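A sketch of the extended benchmark, reusing the tasks from above:

```r
# six learners, each wrapped in a one-hot encoding pipeline
learner_ids = c("classif.featureless", "classif.cv_glmnet", "classif.rpart",
                "classif.ranger", "classif.kknn", "classif.svm")
learners = lapply(learner_ids, function(id) {
  as_learner(po("encode", method = "one-hot") %>>% lrn(id))
})

design = benchmark_grid(tasks, learners, rsmp("cv", folds = 3))
bmr = benchmark(design)
```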

3.2 Friedman test

Given multiple algorithms on multiple data sets, we have to use an omnibus test such as the Friedman test. Compute a rank table that tells you the rank of each algorithm for each task. Then, compute the average rank of each algorithm and proceed with the computation of the Friedman statistic.
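A sketch of the manual computation, assuming the benchmark result bmr from 3.1 and the classification error as measure:

```r
aggr = bmr$aggregate(msr("classif.ce"))
scores = data.frame(
  task = factor(aggr$task_id),
  learner = factor(aggr$learner_id),
  ce = aggr$classif.ce
)

# rank table: one row per task, one column per learner (rank 1 = best)
rank_tab = do.call(rbind, lapply(split(scores, scores$task), function(x) {
  x = x[order(x$learner), ]          # fixed learner order across tasks
  setNames(rank(x$ce), x$learner)
}))

M = nrow(rank_tab)             # number of tasks
k = ncol(rank_tab)             # number of algorithms
mean_ranks = colMeans(rank_tab)

# Friedman statistic, approximately chi-squared with k - 1 degrees of freedom
chi2_f = 12 * M / (k * (k + 1)) * (sum(mean_ranks^2) - k * (k + 1)^2 / 4)
chi2_f > qchisq(1 - 0.05, df = k - 1)  # reject H0 at alpha = 0.05 if TRUE
```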

3.3 Friedman test (stats)

Run a sanity check with the friedman.test function implemented in the stats package.
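A sketch of the sanity check, reusing the long-format scores data.frame from 3.2:

```r
# friedman.test() with the formula interface: values ~ groups | blocks
friedman.test(ce ~ learner | task, data = scores)
```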

3.4 Nemenyi test

As the Friedman test indicates that at least one algorithm performs differently, we can run pairwise comparisons with post-hoc tests such as the Nemenyi or Bonferroni-Dunn test.

Use the function frdAllPairsNemenyiTest from the PMCMRplus package to run all pairwise Nemenyi tests.
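A sketch using the long-format interface of frdAllPairsNemenyiTest(); the column names again come from the scores data.frame built in 3.2:

```r
# all pairwise Nemenyi comparisons after the Friedman test
frdAllPairsNemenyiTest(y = scores$ce, groups = scores$learner, blocks = scores$task)
```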

3.5 Compute critical difference

Manually compute the critical difference for rpart and ranger.

Hint 1:

Use the qtukey function.
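A sketch of the manual computation, reusing k, M, and mean_ranks from 3.2; the learner ids used for indexing are assumptions that depend on how the pipeline learners were named:

```r
# critical difference of the Nemenyi test:
# CD = q_alpha * sqrt(k * (k + 1) / (6 * M)), with q_alpha from the studentized range
alpha = 0.05
q_alpha = qtukey(1 - alpha, nmeans = k, df = Inf) / sqrt(2)
cd = q_alpha * sqrt(k * (k + 1) / (6 * M))

# rpart and ranger differ significantly if their mean ranks differ by more than cd
abs(mean_ranks["encode.classif.rpart"] - mean_ranks["encode.classif.ranger"]) > cd
```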

3.6 Bonferroni-Dunn test

Manually compare rpart and ranger with the Bonferroni-Dunn test.

Hint 1:

The probability of observing the test statistic under the null hypothesis is given by pnorm(..., 0, 1, lower.tail = FALSE).
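A sketch of the manual comparison, again reusing k, M, and mean_ranks from 3.2:

```r
# z statistic for the difference in mean ranks of rpart and ranger
se = sqrt(k * (k + 1) / (6 * M))
z = (mean_ranks["encode.classif.rpart"] - mean_ranks["encode.classif.ranger"]) / se

# two-sided p-value, compared against alpha corrected for k - 1 comparisons
p_value = 2 * pnorm(abs(z), 0, 1, lower.tail = FALSE)
p_value < 0.05 / (k - 1)  # reject H0 if TRUE
```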

3.7 Critical difference plot

Interestingly, the two tests disagree in this case: the Nemenyi test lets us reject the null hypothesis, while the Bonferroni-Dunn test does not. Next, compute a critical difference plot with mlr3.
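A sketch of the plot; as_benchmark_aggr() and the "cd" autoplot type come from the mlr3benchmark package:

```r
# aggregate the benchmark result and draw the critical difference plot
bma = as_benchmark_aggr(bmr, measures = msr("classif.ce"))
autoplot(bma, type = "cd")
```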