Goal
The objective of this exercise is to get hands-on experience with conducting large-scale machine learning experiments using mlr3, batchtools, and the OpenML connector. You will learn how to set up a parallelized benchmark experiment on your laptop.
Prerequisites
We need the following libraries:
library('mlr3verse')
library('mlr3oml')
library('data.table')
library('batchtools')
library('ggplot2')
Exercise 1: Getting Data from OpenML
To draw meaningful conclusions from benchmark experiments, a good choice of data sets and tasks is essential. OpenML is an open-source online platform that facilitates the sharing of machine learning research data, algorithms, and experimental results in a standardized format. Finding data from OpenML is possible via the website or its API. The mlr3oml package offers an elegant connection between mlr3 and OpenML. The function list_oml_tasks() can be used to filter tasks for specific properties. To get started, use it to create a list of tasks with 10-20 features, 500-1000 rows, and a categorical outcome with two classes. From this list, remove duplicate instances with similar names (sometimes, different versions of more or less the same data set are produced). For example, you could do this by removing instances where the first 3 letters of the name column match those of other instances. Further, exclude instances where the minority class makes up less than 10% of the overall number of observations.
Exercise 2: Working with OpenML data
Notably, list_oml_tasks() only retrieves relevant information about the tasks, not the data itself. Convert this list of tasks directly into mlr3 tasks by applying otsk() to each data_id. Find out how you can load and inspect the data from a single task in the list.
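A sketch of the conversion, assuming the filtered list from Exercise 1 is stored in otasks; note that depending on your mlr3oml version, otsk() may expect a task id (task_id column) rather than a data id:

```r
library(mlr3)
library(mlr3oml)

# Convert every entry into an mlr3 task
# (otsk() may expect task_id instead of data_id, depending on the version)
tasks = lapply(otasks$data_id, function(id) as_task(otsk(id)))

# Inspect the data behind a single task
t1 = tasks[[1]]
t1$head()           # first rows of the underlying data
summary(t1$data())  # distribution of each column
```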
Exercise 3: Batch Tools
Exercise 3.1: OpenML Task and Learners
For this task, look at the German credit data set. Download it from OpenML (task id: 31) and create a task. Define a tree, an SVM, and a random forest learner; they will be benchmarked later.
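One way to set this up. The learner keys below assume that mlr3verse (which loads mlr3learners) is attached and that the e1071 and ranger backend packages are installed:

```r
library(mlr3verse)
library(mlr3oml)

# German credit data via OpenML task 31
task = as_task(otsk(31))

# Learners to be benchmarked later
learners = list(
  lrn("classif.rpart"),   # decision tree
  lrn("classif.svm"),     # support vector machine (needs e1071)
  lrn("classif.ranger")   # random forest (needs ranger)
)
```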
Exercise 3.2: Batchtools experiment registry
Create and configure a Batchtools experiment registry:
reg = makeExperimentRegistry(
  file.dir = "mlr3_experiments",
  packages = c("mlr3", "mlr3verse"),
  seed = 1
)
You can add problems and algorithms to the registry using the following code:
addProblem("task", data = task)
addAlgorithm(name = "mlr3", fun = function(job, data, instance, learner) {
  learner = lrn(learner)
  task = as_task(data)
  learner$train(task)
})
Define the design of experiments:
prob_design = list(task = data.table())
algo_design = list(mlr3 = data.frame(learner = sapply(learners, function(x) x$id),
                                     stringsAsFactors = FALSE))
addExperiments(prob.designs = prob_design, algo.designs = algo_design)
summarizeExperiments()
problem algorithm .count
<char> <char> <int>
1: task mlr3 2
Test a single job to ensure it works correctly:
testJob(1)
### [bt]: Generating problem instance for problem 'task' ...
### [bt]: Applying algorithm 'mlr3' on problem 'task' for job 1 (seed = 2) ...
- Add at least two more learners to the benchmark experiment. Choose any classification learners from the mlr3learners package.
- Configure and run a resampling strategy (e.g., 10-fold cross-validation) instead of using the whole dataset for training and testing.
Exercise 3.3: Benchmark
Submit jobs to be executed in parallel:
submitJobs()
waitForJobs() # Wait for the jobs to finish
Collect and analyze the results:
res = reduceResultsList()
print(res)
Plot the performance metrics of the different learners using the ggplot2 package.
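A possible sketch, assuming each job returns a single classification error (as in the resampling variant of the algorithm); reduceResultsDataTable() and getJobPars() are batchtools functions that collect results and job parameters keyed by job.id, and unwrap() flattens their list columns:

```r
library(batchtools)
library(ggplot2)

# One row per job: its parameters plus the returned error rate
res  = unwrap(reduceResultsDataTable(fun = function(x) list(classif.ce = x)))
pars = unwrap(getJobPars())
plot_data = merge(pars, res, by = "job.id")

# Compare learners by their cross-validated error
ggplot(plot_data, aes(x = learner, y = classif.ce)) +
  geom_boxplot() +
  labs(x = "Learner", y = "Classification error")
```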
Summary
We downloaded various data sets from OpenML and used batchtools to parallelize a benchmark study.