library("mlr3verse")
= lrn("classif.xgboost") learner
Scope
Increasingly large data sets and search spaces make hyperparameter optimization a time-consuming task. Hyperband (Li et al. 2018) solves this by approximating the performance of a configuration on a simplified version of the problem, for example on a small subset of the training data, with only a few training epochs of a neural network, or with only a small number of iterations of a gradient-boosting model. After evaluating randomly sampled configurations on a small budget, Hyperband iteratively allocates more resources to promising configurations and terminates low-performing ones. This type of optimization is called multi-fidelity optimization. The fidelity parameter is part of the search space and controls the tradeoff between the runtime and the accuracy of the performance approximation.

In this post, we optimize XGBoost and use the number of boosting iterations as the fidelity parameter, which means that Hyperband allocates more boosting iterations to well-performing configurations. Increasing the number of boosting iterations increases the training time and improves the performance until the model starts to overfit the training data, which makes it a suitable fidelity parameter.

We assume that you are already familiar with tuning in the mlr3 ecosystem. If not, you should start with the book chapter on optimization or the Hyperparameter Optimization on the Palmer Penguins Data Set post. This is the first part of the Hyperband series. The second part can be found here: Hyperband Series - Data Set Subsampling.
Hyperband
Hyperband is an advancement of the Successive Halving algorithm by Jamieson and Talwalkar (2016). Successive Halving is initialized with the number of starting configurations \(n\), the reduction factor \(\eta\) which controls the proportion of configurations discarded in each stage, and the minimum \(r_{min}\) and maximum \(r_{max}\) budget of a single evaluation. The algorithm starts by sampling \(n\) random configurations and allocating the minimum budget \(r_{min}\) to them. The configurations are evaluated and all but the best \(\frac{1}{\eta}\) of the configurations are discarded. The remaining configurations are promoted to the next stage and evaluated on a larger budget. This continues until one or more configurations are evaluated on the maximum budget \(r_{max}\) and the best-performing configuration is selected. The number of stages is calculated so that each stage consumes approximately the same overall budget, which sometimes requires the algorithm to adjust the minimum budget slightly. Successive Halving has the disadvantage that it is not clear whether we should choose a large \(n\) and try many configurations on a small budget or choose a small \(n\) and train fewer configurations on the full budget.
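To make the mechanics concrete, here is a minimal sketch of Successive Halving in plain R on a toy problem. The evaluate() helper, the noise model, and all numbers are hypothetical stand-ins for training a real model with a given budget; they are not part of mlr3hyperband.

# Toy Successive Halving: each configuration has an unknown "true" score and
# evaluations become less noisy as the budget grows.
set.seed(1)

n     = 16   # number of starting configurations
eta   = 2    # keep the best 1 / eta of the configurations in each stage
r_min = 1    # minimum budget of a single evaluation
r_max = 8    # maximum budget of a single evaluation

configs = data.frame(id = seq_len(n), true_score = runif(n))

# hypothetical evaluation: observed score = true score + noise that shrinks with the budget
evaluate = function(true_score, budget) {
  true_score + rnorm(length(true_score), sd = 1 / budget)
}

budget = r_min
while (budget <= r_max && nrow(configs) > 1) {
  configs$observed = evaluate(configs$true_score, budget)
  # promote only the best 1 / eta of the configurations to the next stage
  keep = order(configs$observed, decreasing = TRUE)[seq_len(max(1, floor(nrow(configs) / eta)))]
  configs = configs[keep, ]
  budget = budget * eta
}
configs  # configuration(s) remaining after the final stage

With \(n = 16\), \(\eta = 2\), and budgets from 1 to 8, the sketch runs four stages (budgets 1, 2, 4, 8) and discards half of the configurations in each stage, mirroring the description above.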
Hyperband solves this problem by running Successive Halving with different numbers of starting configurations. The algorithm is initialized with the same parameters as Successive Halving but without \(n\). Each run of Successive Halving is called a bracket and starts with a different budget \(r_{0}\). A smaller starting budget means that more configurations can be tried out. The most explorative bracket allocates the minimum budget \(r_{min}\). The next bracket increases the starting budget by a factor of \(\eta\). In each bracket, the starting budget increases further until the last bracket \(s = 0\) essentially performs a random search with the full budget \(r_{max}\). The number of brackets \(s_{max} + 1\) is calculated with \(s_{max} = \log_\eta \frac{r_{max}}{r_{min}}\). Under the condition that \(r_{0}\) increases by a factor of \(\eta\) with each bracket, \(r_{min}\) sometimes has to be adjusted slightly in order not to use more than \(r_{max}\) resources in the last bracket. The number of configurations in the base stages is calculated so that each bracket uses approximately the same amount of budget. The following table shows a full run of the Hyperband algorithm. The bracket \(s = 3\) is the most explorative bracket and \(s = 0\) performs a random search on the full budget.
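For the setting used later in this post, \(r_{min} = 16\), \(r_{max} = 128\), and \(\eta = 2\), the number of brackets works out to

\[
s_{max} = \log_\eta \frac{r_{max}}{r_{min}} = \log_2 \frac{128}{16} = 3,
\]

so Hyperband runs \(s_{max} + 1 = 4\) brackets with starting budgets \(r_{0} = 16, 32, 64, 128\) for the brackets \(s = 3, 2, 1, 0\).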
The Hyperband implementation in mlr3hyperband evaluates configurations with the same budget in parallel. This results in all brackets finishing at approximately the same time. The colors in Figure 1 indicate batches that are evaluated in parallel.
Hyperparameter Optimization
In this practical example, we will optimize the hyperparameters of XGBoost on the Spam data set. We begin by loading the XGBoost learner.

library("mlr3verse")

learner = lrn("classif.xgboost")
The next thing we do is define the search space. The nrounds parameter controls the number of boosting iterations. We set a range from 16 to 128 boosting iterations. This is used as \(r_{min}\) and \(r_{max}\) by the Hyperband algorithm. We need to tag the parameter with "budget" to identify it as a fidelity parameter. For the other hyperparameters, we take the search space for XGBoost from the Bischl et al. (2021) article. This search space works for a wide range of data sets.
learner$param_set$set_values(
  nrounds = to_tune(p_int(16, 128, tags = "budget")),
  eta = to_tune(1e-4, 1, logscale = TRUE),
  max_depth = to_tune(1, 20),
  colsample_bytree = to_tune(1e-1, 1),
  colsample_bylevel = to_tune(1e-1, 1),
  lambda = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha = to_tune(1e-3, 1e3, logscale = TRUE),
  subsample = to_tune(1e-1, 1)
)
We construct the tuning instance. We use the "none" terminator because Hyperband terminates itself when all brackets are evaluated.
instance = ti(
  task = tsk("spam"),
  learner = learner,
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  terminator = trm("none")
)
instance
<TuningInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveTuning:classif.xgboost_on_spam>
* Search Space:
id class lower upper nlevels
<char> <char> <num> <num> <num>
1: nrounds ParamInt 16.000000 128.000000 113
2: eta ParamDbl -9.210340 0.000000 Inf
3: max_depth ParamInt 1.000000 20.000000 20
4: colsample_bytree ParamDbl 0.100000 1.000000 Inf
5: colsample_bylevel ParamDbl 0.100000 1.000000 Inf
6: lambda ParamDbl -6.907755 6.907755 Inf
7: alpha ParamDbl -6.907755 6.907755 Inf
8: subsample ParamDbl 0.100000 1.000000 Inf
* Terminator: <TerminatorNone>
We load the Hyperband tuner and set eta = 2. Hyperband can start all over from the beginning when the last bracket is evaluated. We control the number of Hyperband runs with the repetitions argument. The setting repetitions = Inf is useful when a terminator should stop the optimization instead (a sketch of this variant follows the next code chunk).
library("mlr3hyperband")
= tnr("hyperband", eta = 2, repetitions = 1) tuner
Configurations with the same budget are evaluated in parallel, so you can think of the optimization as going diagonally through Figure 1. Using eta = 2 and a range from 16 to 128 boosting iterations results in the following schedule.
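The schedule table is not reproduced here, but mlr3hyperband provides a helper to compute it; to the best of our knowledge it is called hyperband_schedule(), and a call along the following lines prints the bracket, stage, budget, and number of configurations.

# compute the Hyperband schedule for the chosen budget range and eta
hyperband_schedule(r_min = 16, r_max = 128, eta = 2)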
Now we are ready to start the tuning.
tuner$optimize(instance)
The result of a run is the configuration with the best performance. This does not necessarily have to be a configuration evaluated with the highest budget since we can overfit the data with too many boosting iterations.
instance$result[, .(nrounds, eta, max_depth, colsample_bytree, colsample_bylevel, lambda, alpha, subsample)]
nrounds eta max_depth colsample_bytree colsample_bylevel lambda alpha subsample
<num> <num> <int> <num> <num> <num> <num> <num>
1: 128 -1.985313 8 0.4828647 0.3001509 -3.816118 -4.998944 0.4464856
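As a sketch of how the result could be used, which is not part of the tuning code above, the selected hyperparameters can be copied to a fresh learner and a final model trained on the full task; final_learner is a hypothetical name.

# apply the best configuration to a new learner and fit it on the complete data set
final_learner = lrn("classif.xgboost")
final_learner$param_set$values = instance$result_learner_param_vals
final_learner$train(tsk("spam"))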
The archive of a Hyperband run has the additional columns "bracket" and "stage".
as.data.table(instance$archive)[, .(bracket, stage, classif.ce, eta, max_depth, colsample_bytree)]
bracket stage classif.ce eta max_depth colsample_bytree
<int> <num> <num> <num> <int> <num>
1: 3 0 0.09061278 -3.6110452 20 0.2671318
2: 3 0 0.06388527 -1.9853130 8 0.4828647
3: 3 0 0.11147327 -2.8428844 3 0.3152709
4: 3 0 0.07301173 -3.9228523 19 0.5369803
5: 3 0 0.07953064 -3.6486183 17 0.2039727
---
31: 0 0 0.06258149 -1.8578637 9 0.7343156
32: 3 3 0.03976532 -1.9853130 8 0.4828647
33: 2 2 0.03976532 -0.9898459 4 0.2534038
34: 1 1 0.05345502 -2.5345314 13 0.5260946
35: 1 1 0.06714472 -8.5597326 16 0.9890136
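These columns can be used for quick diagnostics; for example, the following hedged one-liner uses data.table to show the best classification error observed in each bracket.

# best classification error observed within each bracket
as.data.table(instance$archive)[, .(best_classif.ce = min(classif.ce)), by = bracket]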
Conclusion
The handling of Hyperband in mlr3tuning is very similar to that of other tuners. We only have to select an additional fidelity parameter and tag it with "budget". We have tried to keep the runtime of the example low. For your optimization, you should use cross-validation and increase the maximum number of boosting rounds. The Bischl et al. (2021) search space suggests 5000 boosting rounds. Check out our next post on Hyperband, which uses the size of the training data set as the fidelity parameter.