mlr3 - Runtime and Memory Benchmarks

Scope

This report analyzes the runtime and memory usage of mlr3 across the three most recent package versions. It focuses on the learner methods $train() and $predict() and on the evaluation functions resample() and benchmark(). The benchmarks quantify the runtime overhead introduced by mlr3, reported relative to the training time of the underlying models, as well as its memory usage. The study varies dataset size and the number of resampling iterations, assesses the effect of parallelization on runtime and memory, and examines the impact of encapsulation by comparing the available encapsulation methods.

Given the size of the mlr3 ecosystem, performance bottlenecks can arise at multiple stages. This report helps users assess whether observed runtimes fall within expected ranges. Substantial anomalies in runtime or memory should be reported by opening a GitHub issue. Benchmarks are executed on a high‑performance cluster optimized for multi‑core throughput rather than single‑core speed. Consequently, single‑core runtimes may be faster on a modern local machine.

Summary of Latest mlr3 Version

The benchmarks are comprehensive, so we begin with a summary of the results for the latest mlr3 version. The runtime overhead of mlr3 must be interpreted relative to model training and prediction times. For instance, if ranger::ranger() takes 100 ms to train and lrn("classif.ranger")$train() takes 110 ms, the overhead is 10%. If the same model requires 1 s to train, the overhead is only 1%. The tables report the overhead relative to model training time as the factors k1000, k100, k10, and k1, where the subscript denotes the model's training time in milliseconds. The factors pk1000, pk100, pk10, and pk1 report the speedup of parallel over sequential execution for the same model training times.
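As an illustration, such an overhead factor can be estimated by timing the bare model call against the wrapped learner. A minimal sketch, assuming the ranger and mlr3learners packages are installed (the spam task's target column is type):

```r
library(mlr3)
library(mlr3learners)  # provides lrn("classif.ranger")

task = tsk("spam")

# time the bare model training call
t_model = system.time(
  ranger::ranger(type ~ ., data = task$data())
)["elapsed"]

# time the same model wrapped in an mlr3 learner
learner = lrn("classif.ranger")
t_mlr3 = system.time(learner$train(task))["elapsed"]

# ratio of total mlr3 runtime to bare training time (the k factor)
unname(t_mlr3 / t_model)
```

For stable estimates, a dedicated benchmarking package that repeats the measurement many times is preferable to a single system.time() call.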

We first consider $train(). For models with training times of 1000 ms and 100 ms, the overhead is minimal. When training takes 10 ms, runtime approximately doubles. For 1 ms models, overhead is roughly ten times the bare model training time.

The overhead of $predict() is comparable to $train(), and dataset size has only a minor effect. $predict_newdata() converts newdata to a task and then predicts, which roughly doubles the overhead relative to $predict(). The recently introduced $predict_newdata_fast() is substantially faster than $predict_newdata(). For models with 10 ms prediction time, the overhead is about 10%. For models with 1 ms prediction time, the overhead is about 50%.
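The difference between the two prediction paths can be seen in a minimal sketch; $predict_newdata_fast() requires a recent mlr3 version:

```r
library(mlr3)

task = tsk("spam")
learner = lrn("classif.featureless")
learner$train(task)

newdata = task$data()

# converts newdata into a full Task object before predicting
learner$predict_newdata(newdata)

# skips most of the Task conversion, trading validation checks for speed
learner$predict_newdata_fast(newdata)
```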

The overhead of resample() and benchmark() is small for 1000 ms and 100 ms models. For 10 ms models, the total runtime is approximately twice the bare training time. For 1 ms models, the total runtime is approximately ten times the bare training time. An empty R session consumes 131 MB of memory. Resampling with 10 iterations uses approximately 164 MB, increasing to about 225 MB for 1000 iterations. Memory usage for benchmark() is comparable to resample().

mlr3 parallelizes over resampling iterations via the future package. Parallel execution adds overhead due to worker initialization. We therefore compare parallel and sequential runtimes. For 1 s models, parallel resample() and benchmark() reduce total runtime. For 100 ms models, parallelization is advantageous primarily for 100 or 1000 iterations. For 10 ms and 1 ms models, parallel execution overtakes sequential execution mainly at 1000 iterations. Memory grows with the number of cores because each core launches a separate R session. Using 10 cores results in a total memory footprint of approximately 1.2 GB.
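Enabling parallel execution only requires registering a future plan before the call. A minimal sketch with 10 multisession workers:

```r
library(mlr3)

task = tsk("spam")
learner = lrn("classif.featureless")
resampling = rsmp("subsampling", repeats = 100)

# each worker is a separate R session, so memory grows with the
# number of workers
future::plan("multisession", workers = 10)

rr = resample(task, learner, resampling)

# restore sequential execution
future::plan("sequential")
```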

Encapsulation captures and logs conditions such as messages, warnings, and errors without interrupting control flow. Encapsulation via callr introduces approximately 1 s of additional runtime per model training. Encapsulation via evaluate adds negligible runtime overhead.

Train

The runtime and memory usage of $train() are measured for different mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)
Runtime and memory usage of $train() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000 observations.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.3.0 10 5 1.0 1.1 1.5 6.2 147
1.2.0 10 5 1.0 1.1 1.5 6.3 146
1.1.0 10 5 1.0 1.0 1.5 5.9 151
100 Observations
1.3.0 100 5 1.0 1.1 1.5 6.2 150
1.2.0 100 5 1.0 1.1 1.5 6.2 148
1.1.0 100 5 1.0 1.0 1.5 6.0 146
1000 Observations
1.3.0 1000 5 1.0 1.1 1.5 6.5 150
1.2.0 1000 5 1.0 1.1 1.5 6.4 148
1.1.0 1000 5 1.0 1.1 1.5 6.2 146
10000 Observations
1.3.0 10000 6 1.0 1.1 1.6 7.0 174
1.2.0 10000 6 1.0 1.1 1.6 7.1 172
1.1.0 10000 6 1.0 1.1 1.6 6.9 174

Predict

The runtime of $predict() is measured across mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)

learner$predict(task)
Runtime and memory usage of $predict() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000 observations.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.3.0 10 4 1.0 1.0 1.4 4.9 148
1.2.0 10 4 1.0 1.0 1.4 4.9 147
1.1.0 10 5 1.0 1.1 1.5 6.2 146
100 Observations
1.3.0 100 4 1.0 1.0 1.4 4.9 146
1.2.0 100 4 1.0 1.0 1.4 5.0 146
1.1.0 100 5 1.0 1.1 1.5 6.2 147
1000 Observations
1.3.0 1000 4 1.0 1.0 1.4 5.0 151
1.2.0 1000 4 1.0 1.0 1.4 5.1 151
1.1.0 1000 5 1.0 1.1 1.5 6.4 148
10000 Observations
1.3.0 10000 5 1.0 1.1 1.5 6.0 174
1.2.0 10000 5 1.0 1.1 1.5 6.1 169
1.1.0 10000 6 1.0 1.1 1.6 7.2 177

Predict Newdata

The runtime of $predict_newdata() is measured across mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)

newdata = task$data()
learner$predict_newdata(newdata)
Runtime and memory usage of $predict_newdata() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000 observations.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.3.0 10 20 1.0 1.2 3.0 21 156
1.2.0 10 20 1.0 1.2 3.0 21 152
1.1.0 10 21 1.0 1.2 3.1 22 152
100 Observations
1.3.0 100 20 1.0 1.2 3.0 21 151
1.2.0 100 20 1.0 1.2 3.0 21 153
1.1.0 100 21 1.0 1.2 3.1 22 157
1000 Observations
1.3.0 1000 21 1.0 1.2 3.1 22 153
1.2.0 1000 21 1.0 1.2 3.1 22 154
1.1.0 1000 22 1.0 1.2 3.2 23 154
10000 Observations
1.3.0 10000 33 1.0 1.3 4.3 34 182
1.2.0 10000 33 1.0 1.3 4.3 34 173
1.1.0 10000 34 1.0 1.3 4.4 35 181

Predict Newdata Fast

The runtime of $predict_newdata_fast() is measured across mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)

newdata = task$data()
learner$predict_newdata_fast(newdata)
Runtime and memory usage of $predict_newdata_fast() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000 observations.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.3.0 10 0 1.0 1.0 1.0 1.3 149
1.2.0 10 0 1.0 1.0 1.0 1.3 151
1.1.0 10 0 1.0 1.0 1.0 1.3 150
100 Observations
1.3.0 100 0 1.0 1.0 1.0 1.3 150
1.2.0 100 0 1.0 1.0 1.0 1.3 153
1.1.0 100 0 1.0 1.0 1.0 1.3 150
1000 Observations
1.3.0 1000 0 1.0 1.0 1.0 1.4 154
1.2.0 1000 0 1.0 1.0 1.0 1.4 153
1.1.0 1000 0 1.0 1.0 1.0 1.4 155
10000 Observations
1.3.0 10000 1 1.0 1.0 1.1 2.0 162
1.2.0 10000 1 1.0 1.0 1.1 2.1 165
1.1.0 10000 1 1.0 1.0 1.1 2.0 160

Resampling

The runtime and memory usage of resample() are measured across mlr3 versions. The number of resampling iterations (evals) is set to 1000, 100, and 10. We also measure the runtime of resample() with future::multisession parallelization on 10 cores.

task = tsk("spam")
learner = lrn("classif.featureless")

evals = 10 # number of resampling iterations; also run with 100 and 1000
resampling = rsmp("subsampling", repeats = evals)

resample(task, learner, resampling)
Runtime and memory usage of resample() by mlr3 version and resampling iterations on the spam dataset with 1,000 observations. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The pk factors indicate how many times faster the parallel runtime is compared to the sequential runtime. No pk factor is shown when the parallel runtime is slower than the sequential runtime.
mlr3 Resampling Iterations Runtime, s k1000 k100 k10 k1 Memory, MB pk1000 pk100 pk10 pk1
10 Resampling Iterations
1.3.0 10 139 1.0 1.1 2.4 15 150 43 4.8 1.6 1.1
1.2.0 10 140 1.0 1.1 2.4 15 147 43 4.8 1.6 1.1
1.1.0 10 135 1.0 1.1 2.4 15 148 43 4.8 1.6 1.1
100 Resampling Iterations
1.3.0 100 1285 1.0 1.1 2.3 14 156 82 9.2 7.0 5.8
1.2.0 100 1261 1.0 1.1 2.3 14 155 83 9.2 7.0 5.8
1.1.0 100 1207 1.0 1.1 2.2 13 156 83 9.2 6.9 5.7
1000 Resampling Iterations
1.3.0 1000 12772 1.0 1.1 2.3 14 268 76 8.5 5.2 4.0
1.2.0 1000 12600 1.0 1.1 2.3 14 264 75 8.4 5.1 3.8
1.1.0 1000 11993 1.0 1.1 2.2 13 274 75 8.3 4.9 3.6

Benchmark

The runtime and memory usage of benchmark() are measured across mlr3 versions. The number of resampling iterations (evals) is set to 1000, 100, and 10. We also measure the runtime of benchmark() with future::multisession parallelization on 10 cores.

task = tsk("spam")
learner = lrn("classif.featureless")
evals = 10 # total resampling iterations; also run with 100 and 1000
resampling = rsmp("subsampling", repeats = evals / 5)

design = benchmark_grid(task, replicate(5, learner), resampling)

benchmark(design)
Runtime and memory usage of benchmark() by mlr3 version and resampling iterations on the spam dataset with 1,000 observations. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The pk factors indicate how many times faster the parallel runtime is compared to the sequential runtime. No pk factor is shown when the parallel runtime is slower than the sequential runtime.
mlr3 Resampling Iterations Runtime, s k1000 k100 k10 k1 Memory, MB pk1000 pk100 pk10 pk1
10 Resampling Iterations
1.3.0 10 158 1.0 1.2 2.6 17 151 9.0 1.0
1.2.0 10 161 1.0 1.2 2.6 17 152 9.1 1.0
1.1.0 10 150 1.0 1.2 2.5 16 151 9.1 1.0
100 Resampling Iterations
1.3.0 100 1291 1.0 1.1 2.3 14 154 45 5.0 1.7 1.1
1.2.0 100 1280 1.0 1.1 2.3 14 155 45 5.0 1.7 1.1
1.1.0 100 1236 1.0 1.1 2.2 13 156 46 5.1 1.7 1.1
1000 Resampling Iterations
1.3.0 1000 12601 1.0 1.1 2.3 14 260 61 6.8 3.0 2.1
1.2.0 1000 13883 1.0 1.1 2.4 15 260 61 6.9 3.2 2.2
1.1.0 1000 14129 1.0 1.1 2.4 15 258 62 7.0 3.3 2.3

Encapsulation

The runtime and memory usage of $train() are measured for different encapsulation methods and mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")
method = "evaluate" # one of "none", "evaluate", or "callr"
learner$encapsulate(method, fallback = lrn("classif.featureless"))

learner$train(task)
Runtime and memory usage of $train() by mlr3 version and encapsulation method. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g. k100 refers to models trained in 100 ms.
mlr3 Method Runtime, s k1000 k100 k10 k1 Memory, MB
No Encapsulation
1.3.0 none 16 1.0 1.2 2.6 17 150
1.2.0 none 15 1.0 1.1 2.5 16 150
1.1.0 none 13 1.0 1.1 2.3 14 149
Evaluate
1.3.0 evaluate 32 1.0 1.3 4.2 33 152
1.2.0 evaluate 30 1.0 1.3 4.0 31 148
1.1.0 evaluate 27 1.0 1.3 3.7 28 152
Callr
1.3.0 callr 3696 4.7 38 370 3,700 151
1.2.0 callr 3452 4.5 36 350 3,500 149
1.1.0 callr 1935 2.9 20 190 1,900 149