Analyzing the Runtime Performance of tidymodels and mlr3

Compare the runtime performance of tidymodels and mlr3.

Published: October 30, 2023

Scope

In the realm of data science, machine learning frameworks play an important role in streamlining and accelerating the development of analytical workflows. Among these, tidymodels and mlr3 stand out as prominent tools within the R community. They provide a unified interface for data preprocessing, model training, resampling and tuning. This convenience, however, typically comes with a trade-off in runtime performance. This article undertakes a detailed comparison of the runtime efficiency of tidymodels and mlr3, focusing on their performance in training, resampling, and tuning machine learning models. Specifically, we assess the time efficiency of these frameworks in running the rpart::rpart() and ranger::ranger() models, using the Sonar dataset as a test case. Additionally, we analyze the runtime overhead of the frameworks by comparing them against training the models without a framework. Through this comparative analysis, the article aims to provide valuable insights into the operational trade-offs of using these advanced machine learning frameworks in practical data science applications.

Setup

We employ the microbenchmark package to measure the time required for training, resampling, and tuning models. This benchmarking process is applied to the Sonar dataset using the rpart and ranger algorithms.

library("mlr3verse")
library("tidymodels")
library("microbenchmark")

task = tsk("sonar")
data = task$data()
formula = Class ~ .

To ensure the robustness of our results, each function call within the benchmark is executed 100 times in randomized order. The microbenchmark package then reports the median, lower quartile, and upper quartile of the runtimes. To further enhance the reliability of our findings, we execute the benchmark on a cluster: each microbenchmark run is repeated 100 times with a different seed per iteration, resulting in a total of 10,000 calls of each command. Each worker in the cluster has 3 cores and 12 GB of RAM. For transparency and reproducibility, code examples for the experiment are provided as snippets in this article; the complete code, along with all details of the experiment, is available in our public repository, mlr-org/mlr-benchmark.
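
For illustration, a single benchmark run for the train functions looks roughly like the sketch below, where learner and tm_mod denote the mlr3 and tidymodels model configurations defined in the next section (a condensed example; the exact code is in the repository).

# condensed sketch of one benchmark run: each expression is evaluated
# 100 times in randomized order; the repository code wraps this in
# functions and varies the seed per repetition
bench = microbenchmark(
  mlr3       = learner$train(task),
  tidymodels = fit(tm_mod, formula, data = data),
  times = 100
)

# lower quartile, median and upper quartile of the runtimes in milliseconds
summary(bench, unit = "ms")[, c("expr", "lq", "median", "uq")]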

It’s important to note that our cluster setup is not specifically optimized for single-core performance. Consequently, executing the same benchmark on a local machine might yield faster results.

Benchmark

Train the Models

Our benchmark starts with the fundamental task of model training. To facilitate a direct comparison, we present the model setup in two parts. First, we show the initialization of the rpart model with both mlr3 and tidymodels. The rpart model is a decision tree classifier, a simple and fast-fitting algorithm for classification tasks. Then we turn to the initialization of the ranger model, known for its efficient implementation of the random forest algorithm. We mirror the configuration as closely as possible across both frameworks, maintaining consistency in parameters and settings.

# tidymodels
tm_mod = decision_tree() %>%
  set_engine("rpart",
    xval = 0L) %>%
  set_mode("classification")

# mlr3
learner = lrn("classif.rpart",
  xval = 0L)

# tidymodels
tm_mod = rand_forest(trees = 1000L) %>%
  set_engine("ranger",
    num.threads = 1L,
    seed = 1) %>%
  set_mode("classification")

# mlr3
learner = lrn("classif.ranger",
  num.trees = 1000L,
  num.threads = 1L,
  seed = 1,
  verbose = FALSE,
  predict_type = "prob")

We measure the runtime for the train functions within each framework. The result of the train function is a trained model in both frameworks. In addition, we invoke the rpart() and ranger() functions to establish a baseline for the minimum achievable runtime. This allows us to not only assess the efficiency of the train functions in each framework but also to understand how they perform relative to the base packages.

# tidymodels train
fit(tm_mod, formula, data = data)

# mlr3 train
learner$train(task)
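
The baseline invokes the underlying packages directly, roughly as sketched below, with parameters mirroring the framework configurations above (the exact calls may differ slightly from the repository code).

# baseline: rpart without a framework
rpart::rpart(formula, data = data, xval = 0L)

# baseline: ranger without a framework
# (probability forest to match predict_type = "prob")
ranger::ranger(formula, data = data, num.trees = 1000L,
  num.threads = 1L, seed = 1, probability = TRUE)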

When training an rpart model, tidymodels demonstrates superior speed, outperforming mlr3 (Table 1). Notably, the mlr3 package requires approximately twice the time compared to the baseline.

A key observation from our results is the significant relative overhead when using a framework for rpart model training. Given that rpart inherently requires a shorter training time, the additional processing time introduced by the frameworks becomes more pronounced. This aspect highlights the trade-off between the convenience offered by these frameworks and their impact on runtime for quicker tasks.

Conversely, when we shift our focus to training a ranger model, the scenario changes (Table 2). Here, the runtime performance of ranger is strikingly similar across both tidymodels and mlr3. This similarity in execution time can be attributed to the inherently longer training duration required by ranger models. As a result, the relative overhead introduced by either framework becomes minimal, effectively diminishing in the face of the more time-intensive training process. This pattern suggests that for more complex or time-consuming tasks, the choice of framework may have a less significant impact on overall runtime performance.

Table 1: Average runtime in milliseconds of training rpart depending on the framework.
Framework LQ Median UQ
base 11 11 12
mlr3 23 23 24
tidymodels 18 18 19
Table 2: Average runtime in milliseconds of training ranger depending on the framework.
Framework LQ Median UQ
base 286 322 347
mlr3 301 335 357
tidymodels 310 342 362

Resample Sequential

We proceed to evaluate the runtime performance of the resampling functions within both frameworks, specifically under conditions without parallelization. This step involves the generation of resampling splits, including 3-fold, 6-fold, and 9-fold cross-validation. Additionally, we run a 100 times repeated 3-fold cross-validation.

We generate the same resampling splits for both frameworks. This consistency is key to ensuring that any observed differences in runtime are attributable to the frameworks themselves, rather than variations in the resampling process.
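
A minimal sketch of how identical folds can be shared, assuming the splits are created with rsample::vfold_cv() and transferred to mlr3 via a custom resampling (the repository code differs in detail):

# create the splits once with rsample (3-fold CV shown; 6-fold, 9-fold and
# the 100 times repeated 3-fold CV are created analogously)
folds = vfold_cv(data, v = 3)

# reuse the same row indices in mlr3 through a custom resampling
train_sets = lapply(folds$splits, function(split) split$in_id)
test_sets  = lapply(folds$splits, function(split)
  setdiff(seq_len(nrow(data)), split$in_id))

resampling = rsmp("custom")
resampling$instantiate(task, train_sets, test_sets)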

In our pursuit of a fair and balanced comparison, we address certain inherent differences between the two frameworks. Notably, tidymodels includes scoring of the resampling results as part of its process. To align the comparison, we replicate this scoring step in mlr3, thus maintaining a level playing field for the evaluation. Furthermore, mlr3 saves predictions during the resampling process by default. To match this, we activate the saving of predictions in tidymodels.

# tidymodels resample
control = control_grid(save_pred = TRUE)
metrics = metric_set(accuracy)

tm_wf =
  workflow() %>%
  add_model(tm_mod) %>%
  add_formula(formula)

fit_resamples(tm_wf, folds, metrics = metrics, control = control)

# mlr3 resample
measure = msr("classif.acc")

rr = resample(task, learner, resampling)
rr$score(measure)

When resampling the fast-fitting rpart model, mlr3 demonstrates a notable edge in speed, as detailed in Table 3. In contrast, when it comes to resampling the more computationally intensive ranger models, the performance of tidymodels and mlr3 converges closely (Table 4). This parity in performance is particularly noteworthy, considering the differing internal mechanisms and optimizations of tidymodels and mlr3. A consistent trend observed across both frameworks is a linear increase in runtime proportional to the number of folds in cross-validation (Figure 1).

Table 3: Average runtime in milliseconds of rpart depending on the framework and resampling strategy.
Framework Resampling LQ Median UQ
mlr3 cv3 188 196 210
tidymodels cv3 233 242 257
mlr3 cv6 343 357 379
tidymodels cv6 401 415 436
mlr3 cv9 500 520 548
tidymodels cv9 568 588 616
mlr3 rcv100 15526 16023 16777
tidymodels rcv100 16409 16876 17527
Table 4: Average runtime in milliseconds of ranger depending on the framework and resampling strategy.
Framework Resampling LQ Median UQ
mlr3 cv3 923 1004 1062
tidymodels cv3 916 981 1023
mlr3 cv6 1990 2159 2272
tidymodels cv6 2089 2176 2239
mlr3 cv9 3074 3279 3441
tidymodels cv9 3260 3373 3453
mlr3 rcv100 85909 88642 91381
tidymodels rcv100 87828 88822 89843
Figure 1: Average runtime, measured in milliseconds, for cross-validations using rpart (displayed on the left) and ranger (on the right). The comparison encompasses variations across different frameworks and the number of folds in the cross-validation.

Resample Parallel

We conduct a second set of resampling tests, this time with parallelization, to explore its impact on runtime efficiency. For tidymodels, we use doFuture and doParallel as parallelization backends, given their robust support and compatibility. For mlr3, parallel processing is handled by the future package.
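
The backends are registered roughly as follows (a sketch with 3 workers, matching the cores available per cluster node; see the repository for the exact configuration).

# mlr3: resampling is parallelized via the future package
future::plan("multisession", workers = 3)

# tidymodels with doFuture
doFuture::registerDoFuture()
future::plan("multisession", workers = 3)

# tidymodels with doParallel
doParallel::registerDoParallel(cores = 3)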

Our findings, as presented in the respective tables (Table 5 and Table 6), reveal interesting dynamics about parallelization within the frameworks. When the number of folds in the resampling process is doubled, we observe only a marginal increase in the average runtime. This pattern suggests a significant overhead associated with initializing the parallel workers, a factor that becomes particularly influential in the overall efficiency of the parallelization process.

In the case of the rpart model, the parallelization overhead appears to outweigh the potential speedup benefits, as illustrated in the left section of Figure 2. This result indicates that for less complex models like rpart, where individual training times are relatively short, the initialization cost of parallel workers may not be sufficiently offset by the reduced processing time per fold.

Conversely, for the ranger model, the utilization of parallelization demonstrates a clear advantage over the sequential version, as evidenced in the right section of Figure 2. This finding underscores that for more computationally intensive models like ranger, which have longer individual training times, the benefits of parallel processing significantly overcome the initial overhead of worker setup. This differentiation highlights the importance of considering the complexity and inherent processing time of models when deciding to implement parallelization strategies in these frameworks.

Table 5: Average runtime in milliseconds of mlr3 with future and rpart depending on the resampling strategy.
Resampling LQ Median UQ
cv3 625 655 703
cv6 738 771 817
cv9 831 875 923
rcv100 8620 9043 9532
Table 6: Average runtime in milliseconds of mlr3 with future and ranger depending on the resampling strategy.
Resampling LQ Median UQ
cv3 836 884 943
cv6 1200 1249 1314
cv9 1577 1634 1706
rcv100 32047 32483 33022

When paired with doFuture, tidymodels exhibits significantly slower runtime compared to the mlr3 package utilizing future (Table 7 and Table 8). We observed that tidymodels exports more data to the parallel workers, which notably exceeds that of mlr3. This substantial difference in data export could plausibly account for the observed slower runtime when using tidymodels on small tasks.

Table 7: Average runtime in milliseconds of tidymodels with doFuture and rpart depending on the resampling strategy.
Resampling LQ Median UQ
cv3 2778 2817 3019
cv6 2808 2856 3033
cv9 2935 2975 3170
rcv100 9154 9302 9489
Table 8: Average runtime in milliseconds of tidymodels with doFuture and ranger depending on the resampling strategy.
Resampling LQ Median UQ
cv3 2982 3046 3234
cv6 3282 3366 3543
cv9 3568 3695 3869
rcv100 27546 27843 28166

Using the doParallel package notably improves the handling of smaller resampling tasks (Table 9 and Table 10). In these scenarios, tidymodels consistently outperforms mlr3 in terms of speed. However, even with this enhanced performance, doParallel does not always surpass the sequential version, especially when working with the rpart model. This observation is illustrated in the left section of Figure 2.

Table 9: Average runtime in milliseconds of tidymodels with doParallel and rpart depending on the resampling strategy.
Resampling LQ Median UQ
cv3 557 649 863
cv6 602 714 910
cv9 661 772 968
rcv100 10609 10820 11071
Table 10: Average runtime in milliseconds of tidymodels with doParallel and ranger depending on the resampling strategy.
Resampling LQ Median UQ
cv3 684 756 948
cv6 1007 1099 1272
cv9 1360 1461 1625
rcv100 31205 31486 31793
Figure 2: Average runtime, measured in milliseconds, for cross-validations using rpart (displayed on the left) and ranger (on the right). The comparison encompasses variations across different frameworks, the number of folds in the cross-validation, and the implementation of parallelization.

In the context of repeated cross-validation, our findings underscore the efficacy of parallelization (Figure 3). Across all frameworks tested, the adoption of parallel processing techniques yields a significant increase in speed. This enhancement is particularly noticeable in larger resampling tasks, where the demands on computational resources are more substantial.

Interestingly, within these more extensive resampling scenarios, the doFuture package emerges as a more efficient option compared to doParallel. This distinction is important, as it highlights the relative strengths of different parallelization packages under varying workload conditions. While doParallel shows proficiency in smaller tasks, doFuture demonstrates its capability to handle larger, more complex resampling processes with greater speed and efficiency.

Figure 3: Average runtime, measured in seconds, of a 100 times repeated 3-fold cross-validation using rpart (displayed on the left) and ranger (on the right). The comparison encompasses variations across different frameworks and the implementation of parallelization.

Tune Sequential

We then shift our focus to the runtime performance of the tuning functions. In this phase, the tidymodels package evaluates a predefined grid of hyperparameter configurations. To ensure a balanced and comparable analysis, we employ the "design_points" tuner from the mlr3tuning package, which lets us evaluate the same grid within the mlr3 framework and maintain consistency across both platforms. The grid contains 200 hyperparameter configurations for each of the rpart and ranger models. This comparison helps us understand how each framework handles the optimization of model hyperparameters, a key aspect of building effective and efficient machine learning models.

# tidymodels
tm_mod = decision_tree(
  cost_complexity = tune()) %>%
  set_engine("rpart",
    xval = 0) %>%
  set_mode("classification")

tm_design = data.table(
  cost_complexity = seq(0.1, 0.2, length.out = 200))

# mlr3
learner = lrn("classif.rpart",
  xval = 0,
  cp = to_tune())

mlr3_design = data.table(
  cp = seq(0.1, 0.2, length.out = 200))

# tidymodels
tm_mod = rand_forest(
  trees = tune()) %>%
  set_engine("ranger",
    num.threads = 1L,
    seed = 1) %>%
  set_mode("classification")

tm_design = data.table(
  trees = seq(1000, 1199))

# mlr3
learner = lrn("classif.ranger",
  num.trees = to_tune(1, 10000),
  num.threads = 1L,
  seed = 1,
  verbose = FALSE,
  predict_type = "prob")

mlr3_design = data.table(
  num.trees = seq(1000, 1199))

We measure the runtime of the tune functions within each framework. Both the tidymodels and mlr3 frameworks are tasked with identifying the optimal hyperparameter configuration.

# tidymodels tune
tune::tune_grid(
  tm_wf,
  resamples = resamples,
  grid = tm_design,
  metrics = metrics)

# mlr3 tune
tuner = tnr("design_points", design = mlr3_design, batch_size = nrow(mlr3_design))
mlr3tuning::tune(
  tuner = tuner,
  task = task,
  learner = learner,
  resampling = resampling,
  measures = measure,
  store_benchmark_result = FALSE)
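
Both calls return an object from which the best configuration can be read off, roughly as follows (a short sketch; tm_res and instance are hypothetical names for the return values of the two calls above).

# tidymodels: select the best configuration from the tune_grid() result
select_best(tm_res, metric = "accuracy")

# mlr3: the tuning instance stores the best configuration
instance$result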

In our sequential tuning tests, mlr3 demonstrates a notable advantage in terms of speed. This is clearly evidenced in our results, as shown in Table 11 for the rpart model and Table 12 for the ranger model. The faster performance of mlr3 in these sequential runs highlights its efficiency in handling the tuning process without parallelization.

Table 11: Average runtime in seconds of tuning 200 points of rpart depending on the framework.
Framework LQ Median UQ
mlr3 27 27 28
tidymodels 37 37 39
Table 12: Average runtime in seconds of tuning 200 points of ranger depending on the framework.
Framework LQ Median UQ
mlr3 167 171 175
tidymodels 194 195 196

Tune Parallel

Concluding our analysis, we proceed to evaluate the runtime performance of the tune functions, this time implementing parallelization to enhance efficiency. For these runs, parallelization is executed on 3 cores.

In the case of mlr3, we opt for the largest possible chunk size. This strategic choice means that all points within the tuning grid are sent to the workers in a single batch, effectively minimizing the overhead typically associated with parallelization. This approach is crucial in reducing the time spent in distributing tasks across multiple cores, thereby streamlining the tuning process. On the other hand, the tidymodels package also operates with the same chunk size, but this setting is determined and managed internally within the framework.

By conducting these parallelization tests, we aim to provide a deeper understanding of how each framework handles the distribution and management of computational tasks during the tuning process, particularly in a parallel computing environment. This final set of measurements is important in painting a complete picture of the runtime performance of the tune functions across both tidymodels and mlr3 under different operational settings.

options("mlr3.exec_chunk_size" = 200)

Our analysis of the parallelized tuning functions reveals that the runtimes for mlr3 and tidymodels are remarkably similar. However, subtle differences emerge upon closer inspection. For instance, the mlr3 package exhibits a slightly faster performance when tuning the rpart model, as indicated in Table 13. In contrast, it falls marginally behind tidymodels in tuning the ranger model, as shown in Table 14.

Interestingly, when considering the specific context of a 3-fold cross-validation, the doParallel package outperforms doFuture in terms of speed, as demonstrated in Figure 4. This outcome suggests that the choice of parallelization package can have a significant impact on tuning efficiency, particularly in scenarios with a smaller number of folds.

A key takeaway from our study is the clear benefit of enabling parallelization, regardless of the chosen framework-backend combination. Activating parallelization consistently enhances performance, making it a highly recommended strategy for tuning machine learning models, especially in tasks involving extensive hyperparameter exploration or larger datasets. This conclusion underscores the value of parallel processing in modern machine learning workflows, offering a practical solution for accelerating model tuning across various computational settings.

Table 13: Average runtime in seconds of tuning 200 points of rpart depending on the framework and backend.
Framework Backend LQ Median UQ
mlr3 future 11 12 12
tidymodels doFuture 17 17 17
tidymodels doParallel 13 13 13
Table 14: Average runtime in seconds of tuning 200 points of ranger depending on the framework and backend.
Framework Backend LQ Median UQ
mlr3 future 54 55 55
tidymodels doFuture 58 58 59
tidymodels doParallel 54 54 55
Figure 4: Average runtime, measured in seconds, of tuning 200 hyperparameter configurations of rpart (displayed on the left) and ranger (on the right). The comparison encompasses variations across different frameworks and the implementation of parallelization.

Conclusion

Our analysis reveals that both tidymodels and mlr3 exhibit comparable runtimes across key processes such as training, resampling, and tuning, each displaying its own set of strengths and efficiencies.

A notable observation is the relative overhead associated with using either framework, particularly when working with fast-fitting models like rpart. In these cases, the additional processing time introduced by the frameworks is more pronounced due to the inherently short training time of rpart models. This results in a higher relative overhead, reflecting the trade-offs between the convenience of a comprehensive framework and the directness of more basic approaches.

Conversely, when dealing with slower-fitting models such as ranger, the scenario shifts. For these more time-intensive models, the relative overhead introduced by the frameworks diminishes significantly. In such instances, the extended training times of the models absorb much of the frameworks’ inherent overhead, rendering it relatively negligible.

In summary, while there is no outright winner in terms of overall performance, the decision to use tidymodels or mlr3 should be informed by the specific requirements of the task at hand.