Goal
The goal for this exercise is to familiarize yourself with two very important machine learning methods, the decision tree and random forest. After this exercise, you should be able to train these models and extract important information to understand the model internals.
Exercises
Fit a decision tree
Use task = tsk("german_credit") to create the classification task for the german_credit data and create a decision tree learner (e.g., a CART learner). Train the decision tree on the german_credit classification task. Look at the output of the trained decision tree (you have to access the raw model object).
Hint 1:
The learner we are focusing on here is a decision tree implemented in rpart. The corresponding mlr3 learner key is "classif.rpart". For this exercise, we use the learner with the default hyperparameters. The raw model object can be accessed from the $model slot of the trained learner.
Hint 2:
library(mlr3)
task = tsk(...)
lrn_rpart = lrn(...) # create the learner
lrn_rpart$train(...) # train the learner on the task
lrn_rpart$... # access the raw model object that was fitted
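For orientation, here is one way the blanks could be filled in (a possible solution sketch, not the only one; printing the model shows the learned split rules):
library(mlr3)

task = tsk("german_credit")
lrn_rpart = lrn("classif.rpart") # CART decision tree with default hyperparameters
lrn_rpart$train(task) # train the learner on the task
lrn_rpart$model # print the raw rpart model object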
Visualize the tree structure
To interpret the model and to gain more insight into how its predictions are made, we decide to take a closer look at the decision tree structure by visualizing it.
Hint 1:
See the code example in the help page ?rpart::plot.rpart, which shows how to use the plot and text functions on the rpart model object. Note that different packages exist to plot the decision tree structure in a visually more appealing way:
- The rpart.plot function from the equally named package rpart.plot, which is applied to the raw rpart model object.
- The plot.party function from the package partykit, which is applied to an rpart model object after converting it into a party model object using the as.party function.
- The ggparty function from the equally named package ggparty, which is applied after converting the rpart model object into a party model object using the as.party function.
Hint 2:
library("rpart")
...(lrn_rpart$...)
text(lrn_rpart$...)
# Alternative using e.g. the rpart.plot package
library("rpart.plot")
...(lrn_rpart$...)
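Filled in, this could look as follows (one possible sketch; plot() draws the tree skeleton, text() adds the split labels, and rpart.plot() produces a more polished figure):
library("rpart")
plot(lrn_rpart$model) # tree skeleton
text(lrn_rpart$model) # add split labels

# Alternative using e.g. the rpart.plot package
library("rpart.plot")
rpart.plot(lrn_rpart$model)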
Fit a random forest
To get a more powerful learner we decide to also fit a random forest. Therefore, fit a random forest with default hyperparameters to the german_credit task.
Reminder
One of the drawbacks of using trees is the instability of the predictor: small changes in the data may lead to a very different model and therefore to a high variance of the predictions. The random forest takes advantage of this and reduces the variance by applying bagging to decision trees.
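As a quick illustration of this instability (a toy sketch, not part of the exercise), one can train the same tree learner on two bootstrap samples of the data and compare the resulting models:
library(mlr3)
set.seed(1L)
task = tsk("german_credit")
ids1 = sample(task$nrow, replace = TRUE) # first bootstrap sample
ids2 = sample(task$nrow, replace = TRUE) # second bootstrap sample
tree1 = lrn("classif.rpart")$train(task, row_ids = ids1)
tree2 = lrn("classif.rpart")$train(task, row_ids = ids2)
tree1$model # the chosen splits ...
tree2$model # ... can differ noticeably between the two fits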
Hint 1:
Use the mlr3 learner classif.ranger, which uses the ranger implementation to train a random forest.
Hint 2:
library(mlr3)
library(mlr3learners)
lrn_ranger = lrn(...) # create the learner
lrn_ranger$...(...) # train the learner on the task
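A filled-in version might look like this (a sketch; printing the fitted ranger object also reports the out-of-bag prediction error):
library(mlr3)
library(mlr3learners)

lrn_ranger = lrn("classif.ranger") # random forest with default hyperparameters
lrn_ranger$train(task) # train on the german_credit task from above
lrn_ranger$model # the raw ranger object, incl. OOB prediction error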
ROC Analysis
The bank wants to use a tree-based model to predict the credit risk. Conduct a simple benchmark to assess whether a decision tree or a random forest works better for these purposes. Specifically, among the credit applications the system predicts to be "good", the bank wants at most 10% to actually be "bad". Simultaneously, the bank aims at correctly classifying 90% or more of all applications that are truly "good". Visualize the benchmark results in a way that helps answer this question. Can the bank expect the model to fulfil their requirements? Which model performs better?
Hint 1:
A benchmark requires three arguments: a task, a list of learners, and a resampling object. Note that ROC curves require learners that predict probabilities, i.e., learners constructed with predict_type = "prob".
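A minimal sketch of such a benchmark (assuming mlr3learners and mlr3viz are installed; type = "roc" draws one averaged ROC curve per learner):
library(mlr3)
library(mlr3learners)
library(mlr3viz)

task = tsk("german_credit")
lrns = list(
  lrn("classif.rpart", predict_type = "prob"),
  lrn("classif.ranger", predict_type = "prob"))
bmr = benchmark(benchmark_grid(task, lrns, rsmp("cv", folds = 5)))
autoplot(bmr, type = "roc") # ROC curves for both learners
Since the bank's requirements translate into a positive predictive value of at least 0.9 among predicted "good" applications and a true positive rate of at least 0.9, a precision-recall curve (autoplot(bmr, type = "prc")) can also help answer the question.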
Understand hyperparameters
Use task = tsk("german_credit") to create the classification task for the german_credit data. In this exercise, we want to fit decision trees and random forests with different hyperparameters (which can have a significant impact on the performance). Each learner implemented in R (e.g., ranger or rpart) has a lot of control settings that directly influence the model fitting (the so-called hyperparameters). Here, we will consider the hyperparameters mtry for the ranger learner and maxdepth for the rpart learner.
Your task is to manually create a list containing multiple rpart and ranger learners with different hyperparameter values (e.g., try out increasing maxdepth values for rpart). In the next step, we will use this list to see how the model performance changes for different hyperparameter values.
The help page of ranger (?ranger) gives a detailed explanation of the hyperparameters:
- mtry: Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number of variables. Alternatively, a single argument function returning an integer, given the number of independent variables.
NOTE: In a ranger learner created with mlr3, you have the possibility to set mtry.ratio instead of mtry, which allows you to set the fraction of variables to be used instead of having to set their number.
For rpart, we have to dig a bit deeper: the help page ?rpart contains no description of the hyperparameters. To get further information, we have to open ?rpart.control:
- maxdepth: Set the maximum depth of any node of the final tree, with the root node counted as depth 0. Values greater than 30 rpart will give nonsense results on 32-bit machines.
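For example (a sketch; the german_credit task has 20 features, so the two calls below request the same number of candidate variables per split):
library(mlr3)
library(mlr3learners)

lrn("classif.ranger", mtry = 4) # 4 of the 20 features per split
lrn("classif.ranger", mtry.ratio = 0.2) # 0.2 * 20 features = 4 per split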
Hint 1:
The learners we are focusing on here are a decision tree implemented in rpart and a random forest implemented in ranger. The corresponding mlr3 learner keys are "classif.rpart" and "classif.ranger". In mlr3, we can get an overview of all hyperparameters in the $param_set slot. With an mlr3 learner, it is also possible to open the help page of the underlying method by using the $help() method (e.g., lrn_ranger$help()):
lrn("classif.rpart")$help()
lrn("classif.ranger")$help()
Alternatively, you can look directly at ?rpart::rpart.control and ?ranger::ranger.
Hint 2:
The possible choices for the hyperparameters can also be viewed with $param_set. Setting the hyperparameters can be done directly in the lrn() call:
# Define a list of learners for the benchmark:
lrns = list(
  lrn("classif.rpart", ...),
  lrn("classif.rpart", ...),
  lrn("classif.rpart", ...),
  lrn("classif.ranger", ...),
  lrn("classif.ranger", ...),
  lrn("classif.ranger", ...))
Comparison of trees and random forests
Does it make a difference w.r.t. model performance if we use different hyperparameters? Use the learners from the previous exercise and compare them in a benchmark. Use 5-fold cross-validation as the resampling technique and the classification error as the performance measure. Visualize the results of the benchmark.
Hint 1:
The function to conduct the benchmark is benchmark(). It requires defining the resampling with rsmp() and the benchmark grid with benchmark_grid().
Hint 2:
set.seed(31415L)
lrns = list(
  lrn("classif.rpart", maxdepth = 1),
  lrn("classif.rpart", maxdepth = 5),
  lrn("classif.rpart", maxdepth = 20),
  lrn("classif.ranger", mtry.ratio = 0.2),
  lrn("classif.ranger", mtry.ratio = 0.5),
  lrn("classif.ranger", mtry.ratio = 0.8))

cv5 = rsmp(..., folds = ...)
cv5$instantiate(...)

bmr = ...(...(task, lrns, cv5))

mlr3viz::autoplot(bmr, measure = msr("classif.ce"))
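The blanks could be filled in as follows (one possible sketch; instantiating the resampling fixes the folds so that all learners are compared on identical splits):
cv5 = rsmp("cv", folds = 5)
cv5$instantiate(task)

bmr = benchmark(benchmark_grid(task, lrns, cv5))

mlr3viz::autoplot(bmr, measure = msr("classif.ce"))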
Summary
- We learned how to use two of the most widely used learners: a decision tree built with rpart and a random forest built with ranger.
- Finally, we looked at different hyperparameters and how they affect the performance in a benchmark.
- The next step would be to use an algorithm to automatically search for good hyperparameter configurations.
Further information
Tree implementations: One of the longest paragraphs in the CRAN Task View about Machine Learning and Statistical Learning gives an overview of existing tree implementations:
“[…] Tree-structured models for regression, classification and survival analysis, following the ideas in the CART book, are implemented in rpart (shipped with base R) and tree. Package rpart is recommended for computing CART-like trees. A rich toolbox of partitioning algorithms is available in Weka, package RWeka provides an interface to this implementation, including the J4.8-variant of C4.5 and M5. The Cubist package fits rule-based models (similar to trees) with linear regression models in the terminal leaves, instance-based corrections and boosting. The C50 package can fit C5.0 classification trees, rule-based models, and boosted versions of these. pre can fit rule-based models for a wider range of response variable types. […]”