Parallelization

Set up a large scale benchmark experiment with parallelization

Goal

The objective of this exercise is to get hands-on experience with conducting large-scale machine learning experiments using mlr3, batchtools, and the mlr3oml OpenML connector. You will learn how to set up a parallelized benchmark experiment on your laptop.

Prerequisites

We need the following libraries:

library('mlr3verse')
library('mlr3oml')
library('data.table')
library('batchtools')
library('ggplot2')

Exercise 1: Getting Data from OpenML

To draw meaningful conclusions from benchmark experiments, a good choice of data sets and tasks is essential. OpenML is an open-source online platform that facilitates the sharing of machine learning research data, algorithms, and experimental results in a standardized format. Data from OpenML can be found via the website or its API, and the mlr3oml package offers an elegant connection between mlr3 and OpenML.

The function list_oml_tasks() can be used to filter tasks for specific properties. To get started, use it to create a list of tasks with 10-20 features, 500-1000 rows, and a categorical outcome with two classes. From this list, remove duplicate instances with similar names (sometimes, different versions of more or less the same data set are uploaded). For example, you could do this by removing instances whose first 3 letters of the name column match those of another instance. Further, exclude instances where the minority class makes up less than 10% of the overall number of observations.
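A possible sketch of these filtering steps (the filter arguments follow mlr3oml::list_oml_tasks(); the column names of the returned table, such as MinorityClassSize, are assumptions and may differ):

```r
library("mlr3oml")
library("data.table")

# query OpenML for binary classification tasks with 10-20 features and 500-1000 rows
tasks = list_oml_tasks(
  number_features  = c(10, 20),
  number_instances = c(500, 1000),
  number_classes   = 2
)

# drop near-duplicates: keep only the first task per 3-letter name prefix
tasks = tasks[!duplicated(substr(name, 1, 3))]

# keep tasks whose minority class is at least 10% of all observations
tasks = tasks[MinorityClassSize / NumberOfInstances >= 0.1]
```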

Exercise 2: Working with OpenML data

Notably, list_oml_tasks() only retrieves metadata about the tasks, not the data itself. Convert the list into mlr3 tasks by applying otsk() to each task_id and transforming the result into an mlr3 task. Find out how you can load and inspect the data of a single task in the list.
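One way this conversion could look (a sketch; it assumes the task list from Exercise 1 is stored in `tasks` with a task_id column, and uses the $data fields of the OMLTask/OMLData objects as documented in mlr3oml):

```r
library("mlr3oml")
library("mlr3verse")

# convert every OpenML task id into an mlr3 task
mlr3_tasks = lapply(tasks$task_id, function(id) as_task(otsk(id)))

# inspect a single task: otsk() returns an OMLTask object
otask = otsk(tasks$task_id[1])
otask$data             # the associated OMLData object
head(otask$data$data)  # the raw data as a data.table
```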

Exercise 3: Batch Tools

Exercise 3.1: OpenML Task and Learners

For this task, look at the German Credit data set. Download it from OpenML (task id: 31) and create a classification task. Define a decision tree, an SVM, and a random forest learner; they will be benchmarked later.
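This setup could be sketched as follows (the learner ids classif.rpart, classif.svm, and classif.ranger come from the mlr3learners ecosystem; the random forest via the ranger backend is one possible choice):

```r
library("mlr3verse")
library("mlr3oml")

# German Credit task from OpenML (task id 31)
task = as_task(otsk(31))

# the three learners to be benchmarked later
learners = list(
  lrn("classif.rpart"),   # decision tree
  lrn("classif.svm"),     # support vector machine
  lrn("classif.ranger")   # random forest
)
```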

Exercise 3.2: Batchtools experiment registry

Create and configure a Batchtools experiment registry:

reg = makeExperimentRegistry(
  file.dir = "mlr3_experiments",
  packages = c("mlr3", "mlr3verse"),
  seed = 1
)

You can add problems and algorithms to the registry using the following code:

addProblem("task", data = task)
addAlgorithm(name = "mlr3", fun = function(job, data, instance, learner) {
  learner = lrn(learner)          # instantiate the learner from its id
  task = as_task(data)
  learner$train(task)
  learner$predict(task)$score()   # naive baseline: evaluate on the training data
})

Define the design of experiments:

prob_design = list(task = data.table())  # one problem instance, no extra parameters
algo_design = list(mlr3 = data.table(learner = sapply(learners, function(x) x$id)))
addExperiments(prob.designs = prob_design, algo.designs = algo_design)
summarizeExperiments()
   problem algorithm .count
    <char>    <char>  <int>
1:    task      mlr3      3

Test a single job to ensure it works correctly:

testJob(1)
### [bt]: Generating problem instance for problem 'task' ...
### [bt]: Applying algorithm 'mlr3' on problem 'task' for job 1 (seed = 2) ...
  1. Add at least two more learners to the benchmark experiment. Choose any classification learners from the mlr3learners package.
  2. Configure and run a resampling strategy (e.g., 10-fold cross-validation) instead of using the whole dataset for training and testing.

Exercise 3.3: Benchmark

Submit jobs to be executed in parallel:

submitJobs()
waitForJobs() # Wait for the jobs to finish

Collect and analyze the results:

res = reduceResultsList()
print(res)

Plot the performance metrics of the different learners using the ggplot2 package.
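A possible starting point for the plot (a sketch; it assumes each job returned a single numeric score, as in the algorithm function above, and uses batchtools::getJobPars() joined with the collected results):

```r
library("batchtools")
library("ggplot2")

# combine job parameters with the collected scores
pars = unwrap(getJobPars())
pars$score = unlist(reduceResultsList())

# one box (or point, if there is only one job per learner) per learner
ggplot(pars, aes(x = learner, y = score)) +
  geom_boxplot() +
  labs(x = "Learner", y = "Classification error")
```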

Summary

We downloaded various data sets from OpenML and used batchtools to parallelize a benchmark study.