Hyperband Series - Data Set Subsampling

Optimize the hyperparameters of a Support Vector Machine with Hyperband.

Authors
Published

January 16, 2023

Scope

We continue working with the Hyperband optimization algorithm (Li et al. 2018). The previous post used the number of boosting iterations of an XGBoost model as the resource. However, Hyperband is not limited to machine learning algorithms that are trained iteratively. The resource can also be the number of features, the training time of a model, or the size of the training data set. In this post, we tune a support vector machine and use the size of the training data set as the fidelity parameter. Both the training time of a support vector machine and its performance increase with the size of the data set, which makes the data set size a suitable fidelity parameter for Hyperband.

This is the second part of the Hyperband series. The first part can be found here: Hyperband Series - Iterative Training. If you don’t know much about Hyperband yet, check out the first post, which explains the algorithm in detail. We assume that you are already familiar with tuning in the mlr3 ecosystem. If not, you should start with the book chapter on optimization or the Hyperparameter Optimization on the Palmer Penguins Data Set post. A little knowledge about mlr3pipelines is beneficial but not necessary to understand the example.

Hyperparameter Optimization

In this post, we optimize the hyperparameters of a support vector machine on the Sonar data set. We begin by constructing a classification support vector machine, setting type to "C-classification".

library("mlr3verse")

learner = lrn("classif.svm", id = "svm", type = "C-classification")

The mlr3pipelines package features a PipeOp for subsampling.

po("subsample")
PipeOp: <subsample> (not trained)
values: <frac=0.6321, stratify=FALSE, replace=FALSE>
Input channels <name [train type, predict type]>:
  input [Task,Task]
Output channels <name [train type, predict type]>:
  output [Task,Task]
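
As a quick sanity check, the PipeOp can also be applied to a task directly; the frac value of 0.5 and the object names below are chosen purely for illustration.

task = tsk("sonar")
task$nrow

# keep a random 50% of the rows; a PipeOp's train() takes and returns a list of tasks
po_half = po("subsample", frac = 0.5)
po_half$train(list(task))[[1]]$nrow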

The PipeOp controls the size of the training data set with the frac parameter. We connect the PipeOp with the learner and get a GraphLearner.

graph_learner = as_learner(
  po("subsample") %>>%
  learner
)

The graph learner subsamples and then fits a support vector machine on the data subset. The parameter set of the graph learner is a combination of the parameter sets of the PipeOp and learner.

as.data.table(graph_learner$param_set)[, .(id, lower, upper, levels)]
                    id lower upper                             levels
 1:     subsample.frac     0   Inf                                   
 2: subsample.stratify    NA    NA                         TRUE,FALSE
 3:  subsample.replace    NA    NA                         TRUE,FALSE
 4:      svm.cachesize  -Inf   Inf                                   
 5:  svm.class.weights    NA    NA                                   
---                                                                  
15:             svm.nu  -Inf   Inf                                   
16:          svm.scale    NA    NA                                   
17:      svm.shrinking    NA    NA                         TRUE,FALSE
18:      svm.tolerance     0   Inf                                   
19:           svm.type    NA    NA C-classification,nu-classification

Next, we create the search space. We use TuneToken to mark which hyperparameters should be tuned and prefix the hyperparameters with the ids of the corresponding PipeOps. The subsample.frac parameter is the fidelity parameter and must be tagged with "budget" in the search space. The data set size is increased from 3.7% to 100%. For the other hyperparameters, we take the search space for support vector machines from the Kuehn et al. (2018) article, which works well for a wide range of data sets.

graph_learner$param_set$set_values(
  subsample.frac  = to_tune(p_dbl(3^-3, 1, tags = "budget")),
  svm.kernel      = to_tune(c("linear", "polynomial", "radial")),
  svm.cost        = to_tune(1e-4, 1e3, logscale = TRUE),
  svm.gamma       = to_tune(1e-4, 1e3, logscale = TRUE),
  svm.tolerance   = to_tune(1e-4, 2, logscale = TRUE),
  svm.degree      = to_tune(2, 5)
)

Support vector machines often crash or never finish the training with certain hyperparameter configurations. We set a timeout of 30 seconds and a fallback learner to handle these cases.

graph_learner$encapsulate = c(train = "evaluate", predict = "evaluate")
graph_learner$timeout = c(train = 30, predict = 30)
graph_learner$fallback = lrn("classif.featureless")

Let’s create the tuning instance. We use the "none" terminator because Hyperband controls the termination itself.

instance = ti(
  task = tsk("sonar"),
  learner = graph_learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  terminator = trm("none")
)
instance
<TuningInstanceSingleCrit>
* State:  Not optimized
* Objective: <ObjectiveTuning:subsample.svm_on_sonar>
* Search Space:
               id    class       lower     upper nlevels
1: subsample.frac ParamDbl  0.03703704 1.0000000     Inf
2:       svm.cost ParamDbl -9.21034037 6.9077553     Inf
3:     svm.degree ParamInt  2.00000000 5.0000000       4
4:      svm.gamma ParamDbl -9.21034037 6.9077553     Inf
5:     svm.kernel ParamFct          NA        NA       3
6:  svm.tolerance ParamDbl -9.21034037 0.6931472     Inf
* Terminator: <TerminatorNone>

We load the Hyperband tuner and set eta = 3.

library("mlr3hyperband")

tuner = tnr("hyperband", eta = 3)

Using eta = 3 and a lower bound of 3.7% for the data set size results in the schedule below. Configurations with the same data set size are evaluated in parallel.
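
The schedule can be computed with the hyperband_schedule() helper from mlr3hyperband; the call below is a sketch that plugs in the budget bounds and the eta value used in this post.

# brackets and stages for a budget ranging from 3.7% to 100% of the data set
hyperband_schedule(r_min = 3^-3, r_max = 1, eta = 3)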

Now we are ready to start the tuning.

tuner$optimize(instance)

The best model is a support vector machine with a polynomial kernel.

instance$result[, .(subsample.frac, svm.cost, svm.degree, svm.gamma, svm.kernel, svm.tolerance, classif.ce)]
   subsample.frac svm.cost svm.degree svm.gamma svm.kernel svm.tolerance classif.ce
1:              1 1.871535          3  -2.60663 polynomial     -4.573951  0.1491373
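
To reuse this configuration, the optimal hyperparameters can be written back to the graph learner, which is then trained on the complete task. The snippet below is a sketch of this common mlr3 pattern.

# set the optimal hyperparameters and train on the full Sonar task
graph_learner$param_set$values = instance$result_learner_param_vals
graph_learner$train(tsk("sonar"))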

The archive contains all evaluated configurations. We look at the 8 configurations that were evaluated on the complete data set (see the snippet below). The configuration with the best classification error on the full data set was sampled in bracket 2. Its classification error was estimated at 26% on 33% of the data set and improved to 19% on the full data set (see the green line in Figure 1).
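
These rows can be extracted from the archive by filtering for a subsample.frac of 1; the column selection mirrors the result table above.

# all configurations evaluated on 100% of the data set
as.data.table(instance$archive)[subsample.frac == 1,
  .(subsample.frac, svm.cost, svm.degree, svm.gamma, svm.kernel, svm.tolerance, classif.ce)]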

Figure 1: Optimization path of the 8 configurations evaluated on the complete data set.

Conclusion

Using the data set size as the budget parameter in Hyperband allows the tuning of machine learning models that are not trained iteratively. We have tried to keep the runtime of the example low. For your own optimization, you should use a more robust resampling strategy, such as repeated cross-validation, and run multiple iterations of Hyperband.
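
Multiple Hyperband runs can be requested through the repetitions parameter of the tuner; the value of 2 below is a sketch, not a recommendation.

# repeat the full Hyperband schedule twice
tuner = tnr("hyperband", eta = 3, repetitions = 2)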

References

Kuehn, Daniel, Philipp Probst, Janek Thomas, and Bernd Bischl. 2018. “Automatic Exploration of Machine Learning Experiments on OpenML.” https://arxiv.org/abs/1806.10961.
Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2018. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization.” Journal of Machine Learning Research 18 (185): 1–52. https://jmlr.org/papers/v18/16-558.html.