library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(rpart.plot)
library(mlr3viz)
library(bbotk)
library(mlr3misc)
= tsk("german_credit")
task
set.seed(2409)
Goal
Learn how to do ensembling and model averaging with mlr3pipelines and how to optimize ensemble weights with bbotk.
German Credit Data
Description
- Data from 1973 to 1975 from a large regional bank in southern Germany, classifying credits described by a set of attributes as good or bad credit risks.
- Stratified sample of 1000 credits (300 bad ones and 700 good ones).
- Customers with good credit risks perfectly complied with the conditions of the contract while customers with bad credit risks did not comply with the contract as required.
- Available via tsk("german_credit").
Data Dictionary
n = 1,000 observations of credits
- credit_risk: Has the credit contract been complied with (good) or not (bad)?
- age: Age of debtor in years
- amount: Credit amount in DM
- credit_history: History of compliance with previous or concurrent credit contracts
- duration: Credit duration in months
- employment_duration: Duration of debtor’s employment with current employer
- foreign_worker: Whether the debtor is a foreign worker
- housing: Type of housing the debtor lives in
- installment_rate: Credit installments as a percentage of debtor’s disposable income
- job: Quality of debtor’s job
- number_credits: Number of credits, including the current one, the debtor has (or had) at this bank
- other_debtors: Whether there is another debtor or a guarantor for the credit
- other_installment_plans: Installment plans from providers other than the credit-giving bank
- people_liable: Number of persons who financially depend on the debtor
- personal_status_sex: Combined information on sex and marital status
- present_residence: Length of time (in years) the debtor lives in the present residence
- property: The debtor’s most valuable property
- purpose: Purpose for which the credit is needed
- savings: Debtor’s savings
- status: Status of the debtor’s checking account with the bank
- telephone: Whether there is a telephone landline registered on the debtor’s name
Prerequisites
1 Build a “Random Forest” From Scratch
1.1 Create three Bagged Trees
Use the classifavg PipeOp to combine the predictions of decision trees trained on different subsamples of the german_credit task, similar to a Random Forest. Use 3 decision trees with a maximum depth of 3. To create subsamples of the data, the subsample PipeOp comes in handy. Plot the graph learner representing your ensemble. Plot each of the 3 decision trees of the ensemble after training on all data. Compare this to a single decision tree with a maximum depth of 3 trained on all data. What is missing to actually mimic a Random Forest?
Hint 1:
Subsample with the subsample PipeOp and pass the output into the decision tree; this is one pipeline part. You can then greplicate this part three times (creating a disjoint graph union of copies of a graph) to obtain an actual graph, and combine the outputs of the decision trees via the classifavg PipeOp. To plot the trees, the rpart.plot package is helpful. If you are unsure how the different parts fit together, plot the intermediate graphs you construct via graph$plot().
Hint 2:
= rsmp("cv", folds = 10L)$instantiate(task)
resampling = po("...") %>>% po("...", learner = ...)
pl = ppl("greplicate", ...)
gr = gr %>>% po("classifavg")
gr = as_learner(gr)
grl $id = "bagged_trees"
grl$graph$plot()
grl$train(task)
grl
= lrn("classif.rpart")
dt $id = "tree"
dt$train(task)
dt
rpart.plot(grl$...)
...
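For orientation, a hedged sketch of how the pieces could fit together. The subsample fraction is left at its default, and the pipeop ids after greplicate are assumed to follow the "_1", "_2", "_3" suffix scheme; inspect the trained graph if they differ:

pl = po("subsample") %>>% po("learner", learner = lrn("classif.rpart", maxdepth = 3))
gr = ppl("greplicate", pl, n = 3) %>>% po("classifavg")
grl = as_learner(gr)
grl$id = "bagged_trees"
grl$train(task)

# plot the first of the three fitted trees via the trained graph
rpart.plot(grl$graph_model$pipeops$classif.rpart_1$learner_model$model)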
1.2 Reset Maximum Depth and compare to Random Forest
Reset the maximum depth hyperparameter values for each tree of your ensemble and for the single decision tree. Proceed to benchmark the ensemble of three trees against the decision tree and an actual ranger Random Forest with 3, 10 and 100 trees. Use 10-fold CV to evaluate the ROC AUC of the models. As a follow-up question: would it make sense to change the weights used by the classifavg PipeOp to combine the predictions?
Hint 1:
Prior lectures, where you already benchmarked different learners, should be helpful.
Hint 2:
dt$param_set$set_values(.values = list(maxdepth = NULL))
grl$...
...

ranger3 = lrn("classif.ranger", ...)
ranger3$id = "rf_3"
ranger10 = lrn("classif.ranger", ...)
ranger10$id = "rf_10"
ranger100 = lrn("classif.ranger", ...)
ranger100$id = "rf_100"

grl$predict_type = "prob"
...

bg = benchmark_grid(...)
b = benchmark(...)
autoplot(..., measure = ...)
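A possible completion of this skeleton, using the instantiated resampling and task from above; the num.trees parameter is the ranger argument for the number of trees, and all learners need predict_type "prob" for the ROC AUC:

ranger3 = lrn("classif.ranger", num.trees = 3L, predict_type = "prob")
ranger3$id = "rf_3"
ranger10 = lrn("classif.ranger", num.trees = 10L, predict_type = "prob")
ranger10$id = "rf_10"
ranger100 = lrn("classif.ranger", num.trees = 100L, predict_type = "prob")
ranger100$id = "rf_100"

grl$predict_type = "prob"
dt$predict_type = "prob"

bg = benchmark_grid(task, list(grl, dt, ranger3, ranger10, ranger100), resampling)
b = benchmark(bg)
autoplot(b, measure = msr("classif.auc"))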
2 Model Averaging
Use the classifavg PipeOp to combine the predictions of a decision tree, a k-NN (with k = 7) and a logistic regression. Benchmark the ensemble against each individual learner and a featureless learner. Use 10-fold CV to evaluate the ROC AUC of the models. By default, classifavg uses equal weights to combine the predictions of the models. Can you manually find a weighting scheme that results in better performance than equal weights?
Hint 1:
If you are not familiar with the k-NN learner, you can catch up here: https://slds-lmu.github.io/i2ml/chapters/05_knn/
You can largely reuse parts of the graph learner you previously constructed (excluding the subsampling part). To manually set a weight vector as a hyperparameter of the graph learner, inspect its $param_set and make use of the $set_values() function.
Hint 2:
dt = lrn(...)
dt$predict_type = "prob"
kknn = lrn(...)
kknn$predict_type = "prob"
log_reg = lrn(...)
log_reg$predict_type = "prob"
featureless = lrn(...)
featureless$predict_type = "prob"

gr = gunion(...) %>>% po(...)
grl = as_learner(gr)

bg = benchmark_grid(...)
b = benchmark(...)
b$aggregate(...)

grl_weights_adjusted = grl$clone(deep = TRUE)
grl_weights_adjusted$param_set$set_values(...)
grl_weights_adjusted$id = paste0(grl_weights_adjusted$id, ".weights_adjusted")

bg = benchmark_grid(...)
b = benchmark(...)
b$aggregate(...)
3 Optimizing Weights
Building upon the previous exercise, we now want to numerically optimize the weights of the ensemble via bbotk. To do so, we have to construct an OptimInstanceSingleCrit, to which we pass a domain, a search space, a codomain and the actual objective function that is optimized. First, we implement a naive solution: we change the weighting scheme in the objective function and evaluate the ensemble based on a resampling. Note that optimizing three weights (one for each model) is in essence a constrained optimization problem with only two degrees of freedom: given the first two weights and the constraint that all three weights must sum to 1, the third weight is determined. We will ignore this in the following and simply optimize each weight on a scale from 0 to 1, normalizing the weights to sum to 1 within the objective function, as sketched below.
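For example, the sum-to-1 normalization inside the objective can be a single line (the raw weights here are arbitrary):

weights = c(0.2, 0.5, 0.8)
weights = weights / sum(weights) # 0.1333, 0.3333, 0.5333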
To construct the OptimInstanceSingleCrit, do the following:
- create the objective_function (a standard R function) accepting xs (a list) as input:
  - xs will be the weights in the form of a list
  - extract the weights and use them within the ensemble (e.g., clone the graph learner from the previous exercise and set the classifavg.weights)
  - use resample on this weighted ensemble
  - extract the ROC AUC and return it in a list
- create the domain (the space we optimize over):
  - we optimize over the three numeric weight parameters with values from 0 to 1 (have a look at ?p_dbl)
- create the codomain (describing the output space):
  - we maximize the numeric ROC AUC value
  - to make sure that we maximize instead of minimize (the bbotk default), set tags = "maximize" for this p_dbl
- collect everything in an OptimInstanceSingleCrit
Use random search as the optimizer and terminate after 10 function evaluations. Why is our approach (i.e., how we constructed the objective function) inefficient?
Hint 1:
If you are not yet familiar with the bbotk package, good starting points are https://mlr3book.mlr-org.com/chapters/chapter5/advanced_tuning_methods_and_black_box_optimization.html#sec-black-box-optimization and https://cran.r-project.org/web/packages/bbotk/vignettes/bbotk.html
Hint 2:
grl_weights_optimized = grl$clone(deep = TRUE)
grl_weights_optimized$id = paste0(grl_weights_optimized$id, ".weights_optimized_naive")

objective_function = function(xs) {
  weights = unlist(xs)
  weights = ... # sum-to-1 normalization
  grl_weights_optimized$param_set$set_values(...)
  rr = resample(...)
  # returning the normalized weights as the second element of the list allows us to also store them in the archive
  list(classif.auc = ..., weights = list(weights))
}

domain = ps(w_dt = ..., w_kknn = ..., w_log_reg = ...)
codomain = ps(classif.auc = ...) # make sure to specify `tags = "maximize"`

objective = ObjectiveRFun$new(
  fun = ...,
  domain = ...,
  codomain = ...,
  id = "optimize_grl_weights_random"
)

instance = OptimInstanceSingleCrit$new(
  objective = ...,
  terminator = trm(...)
)

optimizer = opt(...)
optimizer$optimize(instance)
4 Optimizing Weights Efficiently
In the previous exercise, we optimized the weights of our ensemble, but very inefficiently. In this exercise we want to do better. Rewrite the objective function to operate directly on the cross-validated predictions and combine the predicted probabilities as in model averaging. Construct an OptimInstanceSingleCrit, optimize it via CMA-ES and terminate after 100 function evaluations.
Note that you can reuse most of the logic from the previous exercise; the only interesting part is how to rewrite the objective function.
Hint 1:
Store the resampling result of each learner externally and use these results in the objective function. For each fold, weight the probability predictions and average them. Then construct a new PredictionClassif, which allows you to calculate the ROC AUC for each fold. Finally, return the average ROC AUC over the folds.
Hint 2:
k = resampling$param_set$get_values()[["folds"]]

dt_pred = resample(...)
kknn_pred = resample(...)
log_reg_pred = resample(...)

objective_function = function(xs) {
  weights = unlist(xs)
  weights = ... # sum-to-1 normalization
  aucs = map_dbl(seq_len(k), function(fold) {
    dt_p = dt_pred$...
    kknn_p = kknn_pred$...
    log_reg_p = log_reg_pred$...
    row_ids = dt_p$row_ids
    stopifnot(all(row_ids == kknn_p$row_ids) && all(row_ids == log_reg_p$row_ids))
    truth = dt_p$truth
    weighted_probs = Reduce("+", list(...))
    weighted_response = ...
    weighted_p = PredictionClassif$new(row_ids = row_ids, truth = truth, response = weighted_response, prob = weighted_probs)
    weighted_p$score(...)
  })
  list(classif.auc = ..., weights = list(weights))
}

domain = ps(...)
codomain = ps(...)

objective = ObjectiveRFun$new(
  fun = objective_function,
  domain = domain,
  codomain = codomain,
  id = "optimize_grl_weights_cmaes"
)

instance = OptimInstanceSingleCrit$new(
  objective = ...,
  terminator = ...
)

optimizer = opt(...)
optimizer$optimize(instance)
Summary
We built a bagged ensemble of trees from scratch and compared its performance to a single tree and to actual random forests with different numbers of trees. We then performed model averaging of a decision tree, a k-NN and a logistic regression. Choosing weights manually is cumbersome, so we optimized them numerically: first in a straightforward but inefficient way, then in a slightly more involved but efficient one.