mlr3 - Runtime and Memory Benchmarks

Scope

This report analyzes the runtime and memory usage of mlr3 across the four most recent package versions. It focuses on the learner methods $train() and $predict() and on the evaluation functions resample() and benchmark(). The benchmarks quantify the runtime overhead and memory usage introduced by mlr3, with overhead reported relative to the training time of the underlying models. The study varies dataset size and the number of resampling iterations, and all experiments also assess the effect of parallelization on runtime and memory. Finally, the impact of encapsulation is examined by comparing alternative encapsulation methods.

Given the size of the mlr3 ecosystem, performance bottlenecks can arise at multiple stages. This report helps users assess whether observed runtimes fall within expected ranges. Substantial anomalies in runtime or memory should be reported by opening a GitHub issue. Benchmarks are executed on a high‑performance cluster optimized for multi‑core throughput rather than single‑core speed. Consequently, single‑core runtimes may be faster on a modern local machine.

Summary of Latest mlr3 Version

The benchmarks are comprehensive, so we summarize the results for the latest mlr3 version. The runtime overhead of mlr3 must be interpreted relative to model training and prediction times. For instance, if ranger::ranger() takes 100 ms to train and lrn("classif.ranger")$train() takes 110 ms, the overhead is 10%. If the same 10 ms of overhead is added to a model that takes 1 s to train, the overhead is only 1%. The overhead is shown relative to the training time of the models with the factors k_1, k_10, k_100, and k_1000, where the subscript denotes the model’s training time in milliseconds. The factors pk_1, pk_10, pk_100, and pk_1000 report the speedup of parallel over sequential execution, with the same subscript convention.
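The overhead factors can be estimated on a local machine along these lines. This is a minimal sketch, not the benchmark harness used for this report; it assumes the mlr3learners, ranger, and bench packages are installed, and the resulting timings will differ from the tables below.

```r
library(mlr3)
library(mlr3learners)

task = tsk("spam")
learner = lrn("classif.ranger")

# Time the bare model and the mlr3 wrapper on identical data
data = task$data()
t_model = bench::mark(
  ranger::ranger(type ~ ., data = data),
  iterations = 10, check = FALSE
)
t_mlr3 = bench::mark(
  learner$train(task),
  iterations = 10, check = FALSE
)

# Overhead factor: total mlr3 runtime relative to bare training time
as.numeric(t_mlr3$median) / as.numeric(t_model$median)
```

A ratio close to 1 means mlr3 adds negligible overhead for this learner; the tables below report the same quantity as the k factors.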

We first consider $train(). For models with training times of 1000 ms and 100 ms, the overhead is minimal. When training takes 10 ms, runtime approximately doubles. For 1 ms models, overhead is roughly ten times the bare model training time.

The overhead of $predict() is comparable to $train(), and dataset size has only a minor effect. $predict_newdata() converts newdata to a task and then predicts, which roughly doubles the overhead relative to $predict(). The recently introduced $predict_newdata_fast() is substantially faster than $predict_newdata(): its overhead is about 10% for models with a 10 ms prediction time and about 50% for models with a 1 ms prediction time.
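The difference between the two prediction paths can be seen directly. This is a minimal sketch; $predict_newdata_fast() is only available in recent mlr3 versions, and timings are illustrative only.

```r
library(mlr3)

task = tsk("spam")
learner = lrn("classif.featureless")
learner$train(task)

newdata = task$data()

# Builds a temporary task from newdata, then predicts on it
learner$predict_newdata(newdata)

# Skips the task conversion for lower overhead
learner$predict_newdata_fast(newdata)
```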

The overhead of resample() and benchmark() is small for 1000 ms and 100 ms models. For 10 ms models, the total runtime is approximately twice the bare training time. For 1 ms models, the total runtime is approximately ten times the bare training time. An empty R session consumes 131 MB of memory. Resampling with 10 iterations uses approximately 164 MB, increasing to about 225 MB for 1000 iterations. Memory usage for benchmark() is comparable to resample().

mlr3 parallelizes over resampling iterations via the future package. Parallel execution adds overhead due to worker initialization. We therefore compare parallel and sequential runtimes. For 1 s models, parallel resample() and benchmark() reduce total runtime. For 100 ms models, parallelization is advantageous primarily for 100 or 1000 iterations. For 10 ms and 1 ms models, parallel execution overtakes sequential execution mainly at 1000 iterations. Memory grows with the number of cores because each core launches a separate R session. Using 10 cores results in a total memory footprint of approximately 1.2 GB.
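A typical parallel setup registers a backend via the future package before calling resample() or benchmark(); mlr3 then distributes the resampling iterations across the workers automatically. A minimal sketch, assuming 10 cores are available:

```r
library(mlr3)
library(future)

# Register a multisession backend: each of the 10 workers is a separate R session,
# which is why memory usage grows with the number of cores
plan(multisession, workers = 10)

task = tsk("spam")
learner = lrn("classif.featureless")
resampling = rsmp("subsampling", repeats = 1000)

rr = resample(task, learner, resampling)

# Return to sequential execution
plan(sequential)
```

Because each worker must be initialized, parallelization only pays off when the total sequential runtime exceeds this startup cost, as the pk factors below show.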

Encapsulation captures and logs conditions such as messages, warnings, and errors without interrupting control flow. Encapsulation via callr, which runs the training in a separate R process, introduces approximately 1 s of additional runtime per model training. Encapsulation via evaluate adds negligible runtime overhead.
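Encapsulation is enabled per learner. The following sketch uses the built-in debug learner, configured to always warn during training, to show how conditions are captured instead of surfacing; the exact contents of the log may vary by mlr3 version.

```r
library(mlr3)

# The debug learner emits a warning during training with probability 1
learner = lrn("classif.debug", warning_train = 1)
learner$encapsulate("evaluate", fallback = lrn("classif.featureless"))

# The warning is caught by the encapsulation instead of being raised
learner$train(tsk("spam"))

# Inspect the captured conditions
learner$log
```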

Train

The runtime and memory usage of $train() are measured for different mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)
Runtime and memory usage of $train() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.1.0.9000 10 5 1.0 1.0 1.5 5.8 146
1.1.0 10 5 1.0 1.0 1.5 5.9 147
1.0.1 10 5 1.0 1.0 1.5 5.8 145
1.0.0 10 5 1.0 1.0 1.5 5.9 146
100 Observations
1.1.0.9000 100 5 1.0 1.0 1.5 5.8 146
1.1.0 100 5 1.0 1.0 1.5 5.8 143
1.0.1 100 5 1.0 1.0 1.5 5.9 148
1.0.0 100 5 1.0 1.0 1.5 5.9 147
1000 Observations
1.1.0.9000 1000 5 1.0 1.0 1.5 5.8 147
1.1.0 1000 5 1.0 1.0 1.5 5.9 148
1.0.1 1000 5 1.0 1.1 1.5 6.0 148
1.0.0 1000 5 1.0 1.1 1.5 6.0 146
10000 Observations
1.1.0.9000 10000 6 1.0 1.1 1.6 6.8 178
1.1.0 10000 6 1.0 1.1 1.6 6.8 177
1.0.1 10000 6 1.0 1.1 1.6 6.9 169
1.0.0 10000 6 1.0 1.1 1.6 7.0 167

Predict

The runtime of $predict() is measured across mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)

learner$predict(task)
Runtime and memory usage of $predict() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.1.0.9000 10 5 1.0 1.0 1.5 5.8 149
1.1.0 10 5 1.0 1.0 1.5 5.9 147
1.0.1 10 5 1.0 1.1 1.5 6.0 147
1.0.0 10 5 1.0 1.1 1.5 6.1 148
100 Observations
1.1.0.9000 100 5 1.0 1.1 1.5 6.1 147
1.1.0 100 5 1.0 1.1 1.5 6.0 149
1.0.1 100 5 1.0 1.1 1.5 6.1 148
1.0.0 100 5 1.0 1.1 1.5 6.2 151
1000 Observations
1.1.0.9000 1000 5 1.0 1.1 1.5 6.1 152
1.1.0 1000 5 1.0 1.1 1.5 6.2 152
1.0.1 1000 5 1.0 1.1 1.5 6.1 150
1.0.0 1000 5 1.0 1.1 1.5 6.3 150
10000 Observations
1.1.0.9000 10000 6 1.0 1.1 1.6 7.3 173
1.1.0 10000 6 1.0 1.1 1.6 7.2 178
1.0.1 10000 6 1.0 1.1 1.6 7.5 176
1.0.0 10000 7 1.0 1.1 1.7 7.8 177

Predict Newdata

The runtime of $predict_newdata() is measured across mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)

newdata = task$data()
learner$predict_newdata(newdata)
Runtime and memory usage of $predict_newdata() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.1.0.9000 10 19 1.0 1.2 2.9 20 158
1.1.0 10 20 1.0 1.2 3.0 21 154
1.0.1 10 20 1.0 1.2 3.0 21 153
1.0.0 10 21 1.0 1.2 3.1 22 153
100 Observations
1.1.0.9000 100 19 1.0 1.2 2.9 20 157
1.1.0 100 20 1.0 1.2 3.0 21 156
1.0.1 100 21 1.0 1.2 3.1 22 155
1.0.0 100 22 1.0 1.2 3.2 23 153
1000 Observations
1.1.0.9000 1000 19 1.0 1.2 2.9 20 159
1.1.0 1000 21 1.0 1.2 3.1 22 161
1.0.1 1000 21 1.0 1.2 3.1 22 158
1.0.0 1000 22 1.0 1.2 3.2 23 158
10000 Observations
1.1.0.9000 10000 27 1.0 1.3 3.7 28 181
1.1.0 10000 31 1.0 1.3 4.1 32 183
1.0.1 10000 29 1.0 1.3 3.9 30 182
1.0.0 10000 35 1.0 1.3 4.5 36 184

Predict Newdata Fast

The runtime of $predict_newdata_fast() is measured across mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

learner$train(task)

newdata = task$data()
learner$predict_newdata_fast(newdata)
Runtime and memory usage of $predict_newdata_fast() by mlr3 version and dataset size. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The table includes runtime and memory usage for tasks of size 10, 100, 1,000, and 10,000.
mlr3 Task Size Overhead, ms k1000 k100 k10 k1 Memory, MB
10 Observations
1.1.0.9000 10 0 1.0 1.0 1.0 1.3 150
1.1.0 10 0 1.0 1.0 1.0 1.3 150
100 Observations
1.1.0.9000 100 NA NA NA NA NA 153
1.1.0 100 NA NA NA NA NA 150
1000 Observations
1.1.0.9000 1000 0 1.0 1.0 1.0 1.4 157
1.1.0 1000 0 1.0 1.0 1.0 1.4 152
10000 Observations
1.1.0.9000 10000 1 1.0 1.0 1.1 2.0 162
1.1.0 10000 1 1.0 1.0 1.1 2.0 159

Resampling

The runtime and memory usage of resample() are measured across mlr3 versions. The number of resampling iterations (evals) is set to 1000, 100, and 10. We also measure the runtime of resample() with future::multisession parallelization on 10 cores.

task = tsk("spam")
learner = lrn("classif.featureless")

evals = 10  # also run with 100 and 1000
resampling = rsmp("subsampling", repeats = evals)

resample(task, learner, resampling)
Runtime and memory usage of resample() by mlr3 version and number of resampling iterations on the spam dataset with 1,000 observations. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The pk factors indicate how many times faster the parallel runtime is compared to the sequential runtime. No pk factor is shown when the parallel runtime is slower than the sequential runtime.
mlr3 Resampling Iterations Runtime, s k1000 k100 k10 k1 Memory, MB pk1000 pk100 pk10 pk1
10 Resampling Iterations
1.1.0.9000 10 120 1.0 1.1 2.2 13 149 3.3
1.1.0 10 126 1.0 1.1 2.3 14 150 3.3
1.0.1 10 118 1.0 1.1 2.2 13 149 3.3
1.0.0 10 126 1.0 1.1 2.3 14 151 3.3
100 Resampling Iterations
1.1.0.9000 100 1057 1.0 1.1 2.1 12 156 27 2.9
1.1.0 100 1077 1.0 1.1 2.1 12 153 27 3.0
1.0.1 100 1112 1.0 1.1 2.1 12 154 27 2.9
1.0.0 100 1181 1.0 1.1 2.2 13 157 27 3.0
1000 Resampling Iterations
1.1.0.9000 1000 9710 1.0 1.1 2.0 11 257 58 6.3 2.4 1.4
1.1.0 1000 10847 1.0 1.1 2.1 12 264 58 6.4 2.5 1.6
1.0.1 1000 10580 1.0 1.1 2.1 12 276 58 6.3 2.4 1.5
1.0.0 1000 10745 1.0 1.1 2.1 12 254 58 6.3 2.4 1.6

Benchmark

The runtime and memory usage of benchmark() are measured across mlr3 versions. The number of resampling iterations (evals) is set to 1000, 100, and 10. We also measure the runtime of benchmark() with future::multisession parallelization on 10 cores.

task = tsk("spam")
learner = lrn("classif.featureless")

evals = 10  # also run with 100 and 1000
resampling = rsmp("subsampling", repeats = evals / 5)

# 5 learners x (evals / 5) repeats = evals resampling iterations in total
design = benchmark_grid(task, replicate(5, learner), resampling)

benchmark(design)
Runtime and memory usage of benchmark() by mlr3 version and number of resampling iterations on the spam dataset with 1,000 observations. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms. A green background highlights cases where the total runtime is less than three times the model training time. The pk factors indicate how many times faster the parallel runtime is compared to the sequential runtime. No pk factor is shown when the parallel runtime is slower than the sequential runtime.
mlr3 Resampling Iterations Runtime, s k1000 k100 k10 k1 Memory, MB pk1000 pk100 pk10 pk1
10 Resampling Iterations
1.1.0.9000 10 130 1.0 1.1 2.3 14 150
1.1.0 10 138 1.0 1.1 2.4 15 149
1.0.1 10 131 1.0 1.1 2.3 14 149
1.0.0 10 141 1.0 1.1 2.4 15 150
100 Resampling Iterations
1.1.0.9000 100 1086 1.0 1.1 2.1 12 155 6.9
1.1.0 100 1085 1.0 1.1 2.1 12 152 6.9
1.0.1 100 1112 1.0 1.1 2.1 12 154 6.9
1.0.0 100 1151 1.0 1.1 2.2 13 156 6.8
1000 Resampling Iterations
1.1.0.9000 1000 9949 1.0 1.1 2.0 11 259 40 4.4 1.2
1.1.0 1000 10668 1.0 1.1 2.1 12 257 41 4.5 1.3
1.0.1 1000 9571 1.0 1.1 2.0 11 255 40 4.4 1.2
1.0.0 1000 10280 1.0 1.1 2.0 11 254 41 4.4 1.3

Encapsulation

The runtime and memory usage of $train() are measured for different encapsulation methods and mlr3 versions.

task = tsk("spam")
learner = lrn("classif.featureless")

method = "evaluate"  # also run with "none" and "callr"
learner$encapsulate(method, fallback = lrn("classif.featureless"))

learner$train(task)
Runtime and memory usage of $train() by mlr3 version and encapsulation method. The k factors indicate how many times longer the total runtime is compared to the model training time; the subscript denotes the model training time in milliseconds, e.g., k100 refers to models trained for 100 ms.
mlr3 Method Runtime, s k1000 k100 k10 k1 Memory, MB
No Encapsulation
1.1.0.9000 none 7 1.0 1.1 1.7 7.5 149
1.1.0 none 8 1.0 1.1 1.8 9.3 148
1.0.1 none 8 1.0 1.1 1.8 9.3 152
1.0.0 none 8 1.0 1.1 1.8 9.3 152
Evaluate
1.1.0.9000 evaluate 20 1.0 1.2 3.0 21 149
1.1.0 evaluate 22 1.0 1.2 3.2 23 150
1.0.1 evaluate 22 1.0 1.2 3.2 23 149
1.0.0 evaluate 25 1.0 1.2 3.5 26 149
Callr
1.1.0.9000 callr 579 1.6 6.8 59 580 148
1.1.0 callr 1311 2.3 14 130 1,300 151
1.0.1 callr 668 1.7 7.7 68 670 151
1.0.0 callr 1401 2.4 15 140 1,400 151