Hyperband Series - Iterative Training

Optimize the hyperparameters of an XGBoost model with Hyperband.


January 15, 2023


Increasingly large data sets and search spaces make hyperparameter optimization a time-consuming task. Hyperband (Li et al. 2018) solves this by approximating the performance of a configuration on a simplified version of the problem, e.g. on a small subset of the training data, with just a few training epochs of a neural network, or with only a small number of iterations of a gradient-boosting model. After starting with randomly sampled configurations, Hyperband iteratively allocates more resources to promising configurations and terminates low-performing ones. This type of optimization is called multi-fidelity optimization. The fidelity parameter is part of the search space and controls the trade-off between the runtime and the accuracy of the performance approximation.

In this post, we optimize XGBoost and use the number of boosting iterations as the fidelity parameter, i.e. Hyperband allocates more boosting iterations to well-performing configurations. Increasing the number of boosting iterations increases the training time and improves the performance until the model overfits the training data, which makes it a suitable fidelity parameter.

We assume that you are already familiar with tuning in the mlr3 ecosystem. If not, you should start with the book chapter on optimization or the Hyperparameter Optimization on the Palmer Penguins Data Set post. This is the first part of the Hyperband series. The second part can be found here: Hyperband Series - Data Set Subsampling.


Hyperband is an advancement of the Successive Halving algorithm by Jamieson and Talwalkar (2016). Successive Halving is initialized with the number of starting configurations \(n\), the fraction \(\frac{1}{\eta}\) of configurations kept in each stage, and the minimum \(r_{min}\) and maximum \(r_{max}\) budget of a single evaluation. The algorithm starts by sampling \(n\) random configurations and allocating the minimum budget \(r_{min}\) to them. The configurations are evaluated and all but the best-performing \(\frac{1}{\eta}\) of them are discarded. The remaining configurations are promoted to the next stage and evaluated on an \(\eta\)-times larger budget. This continues until one or more configurations are evaluated on the maximum budget \(r_{max}\) and the best-performing configuration is selected. The number of stages is calculated so that each stage consumes approximately the same total budget; this sometimes requires the algorithm to adjust the minimum budget slightly. Successive Halving has the disadvantage that it is not clear whether we should choose a large \(n\) and try many configurations on a small budget or choose a small \(n\) and train fewer configurations on the full budget.
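The loop described above can be sketched in a few lines of R. This is a minimal illustration of Successive Halving, not the mlr3hyperband implementation; `sample_config()` and `evaluate()` are hypothetical helpers that draw a random configuration and return its loss at a given budget.

```r
# Sketch of Successive Halving. The helpers are assumptions:
# sample_config() draws one random configuration,
# evaluate(config, budget) returns a loss for that configuration at a budget.
successive_halving = function(n, eta, r_min, r_max, sample_config, evaluate) {
  configs = replicate(n, sample_config(), simplify = FALSE)
  budget = r_min
  repeat {
    losses = vapply(configs, evaluate, numeric(1), budget = budget)
    if (budget >= r_max || length(configs) == 1) break
    # promote the best 1 / eta of the configurations to the next stage
    n_keep = max(1, floor(length(configs) / eta))
    configs = configs[order(losses)[seq_len(n_keep)]]
    # the next stage evaluates on an eta-times larger budget
    budget = min(budget * eta, r_max)
  }
  # return the best configuration of the final stage
  configs[[which.min(losses)]]
}
```

Note that a real implementation also balances the stage sizes so that each stage consumes approximately the same total budget.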

Hyperband solves this problem by running Successive Halving with different numbers of starting configurations. The algorithm is initialized with the same parameters as Successive Halving but without \(n\). Each run of Successive Halving is called a bracket and starts with a different budget \(r_0\). A smaller starting budget means that more configurations can be tried out. The most explorative bracket allocates the minimum budget \(r_{min}\). The next bracket increases the starting budget by a factor of \(\eta\). In each bracket, the starting budget increases further until the last bracket \(s = 0\) essentially performs a random search with the full budget \(r_{max}\). The number of brackets \(s_{max} + 1\) is calculated with \(s_{max} = \log_\eta \frac{r_{max}}{r_{min}}\). Under the condition that \(r_0\) increases by a factor of \(\eta\) with each bracket, \(r_{min}\) sometimes has to be adjusted slightly in order not to use more than \(r_{max}\) resources in the last bracket. The number of configurations in the base stages is calculated so that each bracket uses approximately the same total budget. The following table shows a full run of the Hyperband algorithm. The bracket \(s = 3\) is the most explorative bracket and \(s = 0\) performs a random search on the full budget.

Figure 1: Hyperband schedule with \(\eta = 2\), \(r_{min} = 1\) and \(r_{max} = 8\)

The Hyperband implementation in mlr3hyperband evaluates configurations with the same budget in parallel. This results in all brackets finishing at approximately the same time. The colors in Figure 1 indicate batches that are evaluated in parallel.
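The schedule in Figure 1 can be derived from the formulas above. The following sketch computes, for each bracket \(s\), the number of starting configurations and the budget of each stage; the function name and layout are ours, not part of mlr3hyperband.

```r
# Derive a Hyperband schedule from eta, r_min and r_max
# (a minimal sketch of the formulas above, not the mlr3hyperband implementation)
hyperband_brackets = function(eta, r_min, r_max) {
  s_max = floor(log(r_max / r_min, base = eta))
  schedule = list()
  for (s in s_max:0) {
    # number of starting configurations in bracket s
    n = ceiling((s_max + 1) / (s + 1) * eta^s)
    for (i in 0:s) {
      schedule[[length(schedule) + 1]] = data.frame(
        bracket   = s,
        stage     = i,
        budget    = r_min * eta^(s_max - s + i),
        n_configs = floor(n / eta^i)
      )
    }
  }
  do.call(rbind, schedule)
}

# Reproduces the schedule of Figure 1
hyperband_brackets(eta = 2, r_min = 1, r_max = 8)
```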

Hyperparameter Optimization

In this practical example, we will optimize the hyperparameters of XGBoost on the Spam data set. We begin by loading the XGBoost learner.


library(mlr3verse)

learner = lrn("classif.xgboost")

The next thing we do is define the search space. The nrounds parameter controls the number of boosting iterations. We set a range from 16 to 128 boosting iterations. This is used as \(r_{min}\) and \(r_{max}\) by the Hyperband algorithm. We need to tag the parameter with "budget" to identify it as a fidelity parameter. For the other hyperparameters, we take the search space for XGBoost from the Bischl et al. (2021) article. This search space works for a wide range of data sets.

learner$param_set$set_values(
  nrounds           = to_tune(p_int(16, 128, tags = "budget")),
  eta               = to_tune(1e-4, 1, logscale = TRUE),
  max_depth         = to_tune(1, 20),
  colsample_bytree  = to_tune(1e-1, 1),
  colsample_bylevel = to_tune(1e-1, 1),
  lambda            = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha             = to_tune(1e-3, 1e3, logscale = TRUE),
  subsample         = to_tune(1e-1, 1)
)

We construct the tuning instance. We use the "none" terminator because Hyperband terminates itself when all brackets are evaluated.

instance = ti(
  task = tsk("spam"),
  learner = learner,
  resampling = rsmp("holdout"),
  measures = msr("classif.ce"),
  terminator = trm("none")
)
instance
* State:  Not optimized
* Objective: <ObjectiveTuning:classif.xgboost_on_spam>
* Search Space:
                  id    class     lower      upper nlevels
              <char>   <char>     <num>      <num>   <num>
1:           nrounds ParamInt 16.000000 128.000000     113
2:               eta ParamDbl -9.210340   0.000000     Inf
3:         max_depth ParamInt  1.000000  20.000000      20
4:  colsample_bytree ParamDbl  0.100000   1.000000     Inf
5: colsample_bylevel ParamDbl  0.100000   1.000000     Inf
6:            lambda ParamDbl -6.907755   6.907755     Inf
7:             alpha ParamDbl -6.907755   6.907755     Inf
8:         subsample ParamDbl  0.100000   1.000000     Inf
* Terminator: <TerminatorNone>

We load the Hyperband tuner and set eta = 2. Hyperband can restart from the beginning once the last bracket has been evaluated. We control the number of Hyperband runs with the repetitions argument. The setting repetitions = Inf is useful when a terminator should stop the optimization.


tuner = tnr("hyperband", eta = 2, repetitions = 1)

Using eta = 2 and a range from 16 to 128 boosting iterations results in the following schedule. Since configurations with the same budget are evaluated in parallel, you can think of the evaluation order as going diagonally through Figure 1.
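To inspect the schedule before tuning, mlr3hyperband provides a helper that prints the brackets and stages for a given budget range; we show the call as we understand it, so please check the package documentation for the exact signature.

```r
library(mlr3hyperband)

# Print the bracket/stage schedule for our budget range of 16 to 128
# boosting iterations with eta = 2 (4 brackets, 35 evaluations in total)
hyperband_schedule(r_min = 16, r_max = 128, eta = 2)
```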

Now we are ready to start the tuning.
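Tuning is started by passing the instance to the tuner:

```r
# Run Hyperband; the tuner terminates itself after one repetition
tuner$optimize(instance)
```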


The result of a run is the configuration with the best performance. This does not necessarily have to be a configuration evaluated with the highest budget since we can overfit the data with too many boosting iterations.

instance$result[, .(nrounds, eta, max_depth, colsample_bytree, colsample_bylevel, lambda, alpha, subsample)]
   nrounds       eta max_depth colsample_bytree colsample_bylevel    lambda     alpha subsample
     <num>     <num>     <int>            <num>             <num>     <num>     <num>     <num>
1:     128 -1.985313         8        0.4828647         0.3001509 -3.816118 -4.998944 0.4464856
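The optimized hyperparameters can then be set on the learner to fit a final model on the full task. This is a minimal sketch using the instance's result_learner_param_vals field.

```r
# Apply the best configuration and train a final model on the full task
learner$param_set$set_values(.values = instance$result_learner_param_vals)
learner$train(tsk("spam"))
```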

The archive of a Hyperband run has the additional columns "bracket" and "stage".

as.data.table(instance$archive)[, .(bracket, stage, classif.ce, eta, max_depth, colsample_bytree)]
    bracket stage classif.ce        eta max_depth colsample_bytree
      <int> <num>      <num>      <num>     <int>            <num>
 1:       3     0 0.09061278 -3.6110452        20        0.2671318
 2:       3     0 0.06388527 -1.9853130         8        0.4828647
 3:       3     0 0.11147327 -2.8428844         3        0.3152709
 4:       3     0 0.07301173 -3.9228523        19        0.5369803
 5:       3     0 0.07953064 -3.6486183        17        0.2039727
31:       0     0 0.06258149 -1.8578637         9        0.7343156
32:       3     3 0.03976532 -1.9853130         8        0.4828647
33:       2     2 0.03976532 -0.9898459         4        0.2534038
34:       1     1 0.05345502 -2.5345314        13        0.5260946
35:       1     1 0.06714472 -8.5597326        16        0.9890136
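These columns make it easy to analyze the run, for example to retrieve the best evaluation of each bracket. The following is a small data.table sketch.

```r
library(data.table)

# Best (lowest classification error) evaluation per bracket
as.data.table(instance$archive)[, .SD[which.min(classif.ce)], by = bracket]
```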


The handling of Hyperband in mlr3tuning is very similar to that of other tuners. We only have to select an additional fidelity parameter and tag it with "budget". We have tried to keep the runtime of the example low. For your optimization, you should use cross-validation and increase the maximum number of boosting rounds. The Bischl et al. (2021) search space suggests 5000 boosting rounds. Check out our next post on Hyperband which uses the size of the training data set as the fidelity parameter.


Bischl, Bernd, Martin Binder, Michel Lang, Tobias Pielok, Jakob Richter, Stefan Coors, Janek Thomas, et al. 2021. “Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges.” arXiv:2107.05847 [Cs, Stat], July. http://arxiv.org/abs/2107.05847.
Jamieson, Kevin, and Ameet Talwalkar. 2016. “Non-Stochastic Best Arm Identification and Hyperparameter Optimization.” In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, edited by Arthur Gretton and Christian C. Robert, 51:240–48. Proceedings of Machine Learning Research. Cadiz, Spain: PMLR. http://proceedings.mlr.press/v51/jamieson16.html.
Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” Journal of Machine Learning Research 18 (185): 1–52. https://jmlr.org/papers/v18/16-558.html.