library("mlr3verse")
library("mlr3learners")
library("mlr3tuning")
library("data.table")
library("ggplot2")
lgr::get_logger("mlr3")$set_threshold("warn")
Intro
This is the first part in a series of tutorials. The other parts of this series can be found here:
We will walk through this tutorial interactively. The text is kept short to be followed in real time.
Prerequisites
Ensure all packages used in this tutorial are installed. This includes the mlr3verse package, as well as other packages for data handling, cleaning and visualization which we are going to use (data.table, ggplot2, rchallenge, and skimr).
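If any of these packages are missing on your system, they can be installed from CRAN; a minimal sketch (the package list simply mirrors the prerequisites named above):

# one-time setup, only needed if the packages are not installed yet
install.packages(c("mlr3verse", "data.table", "ggplot2", "rchallenge", "skimr"))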
Then, load the main packages we are going to use:
Machine Learning Use Case: German Credit Data
The German credit data was originally donated in 1994 by Prof. Dr. Hans Hoffman of the University of Hamburg. A description can be found at the UCI repository. The goal is to classify people by their credit risk (good or bad) using 20 personal, demographic and financial features:
Feature Name | Description |
---|---|
age | age in years |
amount | amount asked by applicant |
credit_history | past credit history of applicant at this bank |
duration | duration of the credit in months |
employment_duration | present employment since |
foreign_worker | is applicant foreign worker? |
housing | type of apartment rented, owned, for free / no payment |
installment_rate | installment rate in percentage of disposable income |
job | current job information |
number_credits | number of existing credits at this bank |
other_debtors | other debtors/guarantors present? |
other_installment_plans | other installment plans the applicant is paying |
people_liable | number of people being liable to provide maintenance |
personal_status_sex | combination of sex and personal status of applicant |
present_residence | present residence since |
property | properties that applicant has |
purpose | reason customer is applying for a loan |
savings | savings accounts/bonds at this bank |
status | status/balance of checking account at this bank |
telephone | is there any telephone registered for this customer? |
Importing the Data
The dataset we are going to use is a transformed version of this German credit dataset, as provided by the rchallenge package (this transformed dataset was proposed by Ulrike Grömping, with factors instead of dummy variables and corrected features):
data("german", package = "rchallenge")
First, we’ll do a thorough investigation of the dataset.
Exploring the Data
We can get a quick overview of our dataset using base R functions such as dim() and str():
dim(german)
[1] 1000 21
str(german)
'data.frame': 1000 obs. of 21 variables:
$ status : Factor w/ 4 levels "no checking account",..: 1 1 2 1 1 1 1 1 4 2 ...
$ duration : int 18 9 12 12 12 10 8 6 18 24 ...
$ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 5 3 5 5 5 5 5 5 3 ...
$ purpose : Factor w/ 11 levels "others","car (new)",..: 3 1 10 1 1 1 1 1 4 4 ...
$ amount : int 1049 2799 841 2122 2171 2241 3398 1361 1098 3758 ...
$ savings : Factor w/ 5 levels "unknown/no savings account",..: 1 1 2 1 1 1 1 1 1 3 ...
$ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 2 3 4 3 3 2 4 2 1 1 ...
$ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 3 4 1 1 2 4 1 ...
$ personal_status_sex : Factor w/ 4 levels "male : divorced/separated",..: 2 3 2 3 3 3 3 3 2 2 ...
$ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 1 1 1 1 1 1 1 ...
$ present_residence : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 4 2 4 3 4 4 4 4 ...
$ property : Factor w/ 4 levels "unknown / no property",..: 2 1 1 1 2 1 1 1 3 4 ...
$ age : int 21 36 23 39 38 48 39 40 65 23 ...
$ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 1 3 3 3 3 3 ...
$ housing : Factor w/ 3 levels "for free","rent",..: 1 1 1 1 2 1 2 2 2 1 ...
$ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 1 2 1 2 2 2 2 1 2 1 ...
$ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 2 2 2 2 2 1 1 ...
$ people_liable : Factor w/ 2 levels "3 or more","0 to 2": 2 1 2 1 2 1 2 1 2 2 ...
$ telephone : Factor w/ 2 levels "no","yes (under customer name)": 1 1 1 1 1 1 1 1 1 1 ...
$ foreign_worker : Factor w/ 2 levels "yes","no": 2 2 2 1 1 1 1 1 2 2 ...
$ credit_risk : Factor w/ 2 levels "bad","good": 2 2 2 2 2 2 2 2 2 2 ...
Our dataset has 1000 observations and 21 columns. The variable we want to predict is credit_risk
(either good or bad), i.e., we aim to classify people by their credit risk.
We also recommend the skimr package, as it creates easily readable and understandable overviews:
skimr::skim(german)
Name | german |
Number of rows | 1000 |
Number of columns | 21 |
_______________________ | |
Column type frequency: | |
factor | 18 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
status | 0 | 1 | FALSE | 4 | …: 394, no : 274, …: 269, 0<=: 63 |
credit_history | 0 | 1 | FALSE | 5 | no : 530, all: 293, exi: 88, cri: 49 |
purpose | 0 | 1 | FALSE | 10 | fur: 280, oth: 234, car: 181, car: 103 |
savings | 0 | 1 | FALSE | 5 | unk: 603, …: 183, …: 103, 100: 63 |
employment_duration | 0 | 1 | FALSE | 5 | 1 <: 339, >= : 253, 4 <: 174, < 1: 172 |
installment_rate | 0 | 1 | TRUE | 4 | < 2: 476, 25 : 231, 20 : 157, >= : 136 |
personal_status_sex | 0 | 1 | FALSE | 4 | mal: 548, fem: 310, fem: 92, mal: 50 |
other_debtors | 0 | 1 | FALSE | 3 | non: 907, gua: 52, co-: 41 |
present_residence | 0 | 1 | TRUE | 4 | >= : 413, 1 <: 308, 4 <: 149, < 1: 130 |
property | 0 | 1 | FALSE | 4 | bui: 332, unk: 282, car: 232, rea: 154 |
other_installment_plans | 0 | 1 | FALSE | 3 | non: 814, ban: 139, sto: 47 |
housing | 0 | 1 | FALSE | 3 | ren: 714, for: 179, own: 107 |
number_credits | 0 | 1 | TRUE | 4 | 1: 633, 2-3: 333, 4-5: 28, >= : 6 |
job | 0 | 1 | FALSE | 4 | ski: 630, uns: 200, man: 148, une: 22 |
people_liable | 0 | 1 | FALSE | 2 | 0 t: 845, 3 o: 155 |
telephone | 0 | 1 | FALSE | 2 | no: 596, yes: 404 |
foreign_worker | 0 | 1 | FALSE | 2 | no: 963, yes: 37 |
credit_risk | 0 | 1 | FALSE | 2 | goo: 700, bad: 300 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
duration | 0 | 1 | 20.90 | 12.06 | 4 | 12.0 | 18.0 | 24.00 | 72 | ▇▇▂▁▁ |
amount | 0 | 1 | 3271.25 | 2822.75 | 250 | 1365.5 | 2319.5 | 3972.25 | 18424 | ▇▂▁▁▁ |
age | 0 | 1 | 35.54 | 11.35 | 19 | 27.0 | 33.0 | 42.00 | 75 | ▇▆▃▁▁ |
During an exploratory analysis, meaningful discoveries could include:
- Skewed distributions
- Missing values
- Empty / rare factor variables
An exploratory analysis is crucial to get a feeling for your data. It also serves to validate the data: implausible values can be investigated and outliers can be removed.
Once we feel confident about the data, we can start modeling.
Modeling
How we are going to tackle the problem of classifying credit risk relates closely to which mlr3 entities we will use.
The typical questions that arise when building a machine learning workflow are:
- What is the problem we are trying to solve?
- What are appropriate learning algorithms?
- How do we evaluate “good” performance?
More systematically in mlr3 they can be expressed via five components:
- The Task definition.
- The Learner definition.
- The training.
- The prediction.
- The evaluation via one or multiple Measures.
Task Definition
First, we are interested in the target which we want to model. Most supervised machine learning problems are regression or classification problems. However, note that other problems include unsupervised learning or time-to-event data (covered in mlr3proba).
Within mlr3, to distinguish between these problems, we define Tasks. If we want to solve a classification problem, we define a classification task – TaskClassif. For a regression problem, we define a regression task – TaskRegr.
In our case it is clearly our objective to model or predict the binary factor variable credit_risk. Thus, we define a TaskClassif:
task = as_task_classif(german, id = "GermanCredit", target = "credit_risk")
Note that the German credit data is also given as an example task which ships with the mlr3 package. Thus, you actually don't need to construct it yourself; just call tsk("german_credit") to retrieve the object from the dictionary mlr_tasks.
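For illustration, a short sketch of retrieving the built-in task (task_builtin is just an illustrative name; the built-in task uses the id "german_credit" rather than the "GermanCredit" id we chose above):

task_builtin = tsk("german_credit")
task_builtin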
Learner Definition
After having decided what should be modeled, we need to decide on how. This means we need to decide which learning algorithms, or Learners, are appropriate. Using prior knowledge (e.g. knowing that it is a classification task or assuming that the classes are linearly separable), one ends up with one or more suitable Learners.
Many learners can be obtained via the mlr3learners package. Additionally, many more learners are provided by the mlr3extralearners package, which is hosted on GitHub. These two resources combined account for a large fraction of standard learning algorithms. As mlr3 usually only wraps learners from other packages, it is even easy to create a formal Learner yourself. You may find the section about extending mlr3 in the mlr3book very helpful. If you happen to write your own Learner in mlr3, we would be happy if you shared it with the mlr3 community.
All available Learners (i.e. all which you have installed from mlr3, mlr3learners, mlr3extralearners, or self-written ones) are registered in the dictionary mlr_learners:
mlr_learners
<DictionaryLearner> with 134 stored values
Keys: classif.AdaBoostM1, classif.bart, classif.C50, classif.catboost, classif.cforest, classif.ctree,
classif.cv_glmnet, classif.debug, classif.earth, classif.featureless, classif.fnn, classif.gam,
classif.gamboost, classif.gausspr, classif.gbm, classif.glmboost, classif.glmer, classif.glmnet,
classif.IBk, classif.J48, classif.JRip, classif.kknn, classif.ksvm, classif.lda, classif.liblinear,
classif.lightgbm, classif.LMT, classif.log_reg, classif.lssvm, classif.mob, classif.multinom,
classif.naive_bayes, classif.nnet, classif.OneR, classif.PART, classif.qda, classif.randomForest,
classif.ranger, classif.rfsrc, classif.rpart, classif.svm, classif.xgboost, clust.agnes, clust.ap,
clust.cmeans, clust.cobweb, clust.dbscan, clust.diana, clust.em, clust.fanny, clust.featureless,
clust.ff, clust.hclust, clust.kkmeans, clust.kmeans, clust.MBatchKMeans, clust.mclust, clust.meanshift,
clust.pam, clust.SimpleKMeans, clust.xmeans, dens.kde_ks, dens.locfit, dens.logspline, dens.mixed,
dens.nonpar, dens.pen, dens.plug, dens.spline, regr.bart, regr.catboost, regr.cforest, regr.ctree,
regr.cubist, regr.cv_glmnet, regr.debug, regr.earth, regr.featureless, regr.fnn, regr.gam, regr.gamboost,
regr.gausspr, regr.gbm, regr.glm, regr.glmboost, regr.glmnet, regr.IBk, regr.kknn, regr.km, regr.ksvm,
regr.liblinear, regr.lightgbm, regr.lm, regr.lmer, regr.M5Rules, regr.mars, regr.mob, regr.nnet,
regr.randomForest, regr.ranger, regr.rfsrc, regr.rpart, regr.rsm, regr.rvm, regr.svm, regr.xgboost,
surv.akritas, surv.aorsf, surv.blackboost, surv.cforest, surv.coxboost, surv.coxtime, surv.ctree,
surv.cv_coxboost, surv.cv_glmnet, surv.deephit, surv.deepsurv, surv.dnnsurv, surv.flexible,
surv.gamboost, surv.gbm, surv.glmboost, surv.glmnet, surv.loghaz, surv.mboost, surv.nelson,
surv.obliqueRSF, surv.parametric, surv.pchazard, surv.penalized, surv.ranger, surv.rfsrc, surv.svm,
surv.xgboost
For our problem, a suitable learner could be one of the following: Logistic regression, CART, random forest (or many more).
A learner can be initialized with the lrn() function and the name of the learner, e.g., lrn("classif.xxx"). Use ?mlr_learners_xxx to open the help page of a learner named xxx.
For example, a logistic regression can be initialized in the following manner (logistic regression uses R's glm() function and is provided by the mlr3learners package):
library("mlr3learners")
= lrn("classif.log_reg")
learner_logreg print(learner_logreg)
<LearnerClassifLogReg:classif.log_reg>
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3learners, stats
* Predict Types: [response], prob
* Feature Types: logical, integer, numeric, character, factor, ordered
* Properties: loglik, twoclass
Training
Training is the procedure where a model is fitted to the (training) data.
Logistic Regression
We start with the example of the logistic regression. However, you will immediately see that the procedure generalizes to any learner very easily.
An initialized learner can be trained on data using $train()
:
learner_logreg$train(task)
Typically, in machine learning, one does not use the full available data but only a subset of it, the so-called training data. To efficiently perform such a split, one could do the following:
train_set = sample(task$row_ids, 0.8 * task$nrow)
test_set = setdiff(task$row_ids, train_set)
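Since sample() draws randomly, the resulting split changes between runs. A minimal sketch for making it reproducible, assuming we fix an arbitrary seed (42) beforehand:

set.seed(42) # arbitrary seed, only for reproducibility
train_set = sample(task$row_ids, 0.8 * task$nrow)
test_set = setdiff(task$row_ids, train_set)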
80 percent of the data is used for training. The remaining 20 percent is used for evaluation at a later point in time. train_set is an integer vector referring to the selected rows of the original dataset:
head(train_set)
[1] 26 691 588 967 272 694
In mlr3, training with a subset of the data can be declared by the additional argument row_ids = train_set:
learner_logreg$train(task, row_ids = train_set)
The fitted model can be accessed via:
learner_logreg$model
Call: stats::glm(formula = task$formula(), family = "binomial", data = data,
model = FALSE)
Coefficients:
(Intercept) age
3.998e-01 -9.061e-03
amount credit_historycritical account/other credits elsewhere
1.189e-04 1.980e-01
credit_historyno credits taken/all credits paid back duly credit_historyexisting credits paid back duly till now
-4.476e-01 -9.791e-01
credit_historyall credits at this bank paid back duly duration
-1.207e+00 2.830e-02
employment_duration< 1 yr employment_duration1 <= ... < 4 yrs
-1.564e-01 -3.380e-01
employment_duration4 <= ... < 7 yrs employment_duration>= 7 yrs
-1.088e+00 -2.863e-01
foreign_workerno housingrent
1.518e+00 -7.129e-01
housingown installment_rate.L
-6.193e-01 5.243e-01
installment_rate.Q installment_rate.C
2.194e-01 6.818e-02
jobunskilled - resident jobskilled employee/official
5.896e-01 7.024e-01
jobmanager/self-empl./highly qualif. employee number_credits.L
5.014e-01 -4.982e-02
number_credits.Q number_credits.C
2.883e-01 7.662e-01
other_debtorsco-applicant other_debtorsguarantor
5.126e-01 -8.283e-01
other_installment_plansstores other_installment_plansnone
-1.015e-01 -5.465e-01
people_liable0 to 2 personal_status_sexfemale : non-single or male : single
-2.614e-01 -2.256e-01
personal_status_sexmale : married/widowed personal_status_sexfemale : single
-7.768e-01 -2.646e-01
present_residence.L present_residence.Q
1.534e-01 -4.138e-01
present_residence.C propertycar or other
1.635e-01 2.310e-01
propertybuilding soc. savings agr./life insurance propertyreal estate
1.537e-01 5.691e-01
purposecar (new) purposecar (used)
-1.858e+00 -1.010e+00
purposefurniture/equipment purposeradio/television
-8.447e-01 -5.071e-01
purposedomestic appliances purposerepairs
4.512e-03 -1.160e-01
purposevacation purposeretraining
-1.526e+01 -6.349e-01
purposebusiness savings... < 100 DM
-1.544e+00 -3.511e-01
savings100 <= ... < 500 DM savings500 <= ... < 1000 DM
-5.633e-01 -1.803e+00
savings... >= 1000 DM status... < 0 DM
-7.896e-01 -6.488e-01
status0<= ... < 200 DM status... >= 200 DM / salary for at least 1 year
-1.332e+00 -1.851e+00
telephoneyes (under customer name)
-2.815e-01
Degrees of Freedom: 799 Total (i.e. Null); 745 Residual
Null Deviance: 954.3
Residual Deviance: 695.1 AIC: 805.1
The stored object is a normal glm object and all its S3 methods work as expected:
class(learner_logreg$model)
[1] "glm" "lm"
summary(learner_logreg$model)
Call:
stats::glm(formula = task$formula(), family = "binomial", data = data,
model = FALSE)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3394 -0.7001 -0.3625 0.6415 2.6176
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.998e-01 1.380e+00 0.290 0.771982
age -9.061e-03 1.049e-02 -0.863 0.387886
amount 1.189e-04 5.322e-05 2.233 0.025525 *
credit_historycritical account/other credits elsewhere 1.980e-01 6.421e-01 0.308 0.757827
credit_historyno credits taken/all credits paid back duly -4.476e-01 4.900e-01 -0.913 0.361028
credit_historyexisting credits paid back duly till now -9.791e-01 5.333e-01 -1.836 0.066374 .
credit_historyall credits at this bank paid back duly -1.207e+00 4.935e-01 -2.446 0.014438 *
duration 2.830e-02 1.106e-02 2.559 0.010482 *
employment_duration< 1 yr -1.564e-01 5.218e-01 -0.300 0.764385
employment_duration1 <= ... < 4 yrs -3.380e-01 4.999e-01 -0.676 0.498961
employment_duration4 <= ... < 7 yrs -1.088e+00 5.514e-01 -1.974 0.048385 *
employment_duration>= 7 yrs -2.863e-01 4.971e-01 -0.576 0.564724
foreign_workerno 1.518e+00 7.142e-01 2.126 0.033533 *
housingrent -7.129e-01 2.711e-01 -2.629 0.008561 **
housingown -6.193e-01 5.337e-01 -1.160 0.245964
installment_rate.L 5.243e-01 2.451e-01 2.139 0.032455 *
installment_rate.Q 2.194e-01 2.236e-01 0.981 0.326392
installment_rate.C 6.818e-02 2.342e-01 0.291 0.771009
jobunskilled - resident 5.896e-01 8.630e-01 0.683 0.494489
jobskilled employee/official 7.024e-01 8.307e-01 0.846 0.397781
jobmanager/self-empl./highly qualif. employee 5.014e-01 8.351e-01 0.600 0.548256
number_credits.L -4.982e-02 9.791e-01 -0.051 0.959414
number_credits.Q 2.883e-01 8.303e-01 0.347 0.728370
number_credits.C 7.662e-01 6.017e-01 1.273 0.202913
other_debtorsco-applicant 5.126e-01 5.006e-01 1.024 0.305843
other_debtorsguarantor -8.283e-01 4.640e-01 -1.785 0.074231 .
other_installment_plansstores -1.015e-01 4.775e-01 -0.213 0.831591
other_installment_plansnone -5.465e-01 2.818e-01 -1.939 0.052450 .
people_liable0 to 2 -2.614e-01 2.888e-01 -0.905 0.365368
personal_status_sexfemale : non-single or male : single -2.256e-01 4.454e-01 -0.506 0.612529
personal_status_sexmale : married/widowed -7.768e-01 4.383e-01 -1.772 0.076351 .
personal_status_sexfemale : single -2.646e-01 5.314e-01 -0.498 0.618529
present_residence.L 1.534e-01 2.448e-01 0.627 0.530811
present_residence.Q -4.138e-01 2.277e-01 -1.817 0.069212 .
present_residence.C 1.635e-01 2.330e-01 0.702 0.482846
propertycar or other 2.310e-01 2.937e-01 0.786 0.431605
propertybuilding soc. savings agr./life insurance 1.537e-01 2.739e-01 0.561 0.574698
propertyreal estate 5.691e-01 4.692e-01 1.213 0.225134
purposecar (new) -1.858e+00 4.360e-01 -4.263 2.02e-05 ***
purposecar (used) -1.010e+00 3.054e-01 -3.306 0.000945 ***
purposefurniture/equipment -8.447e-01 2.849e-01 -2.964 0.003034 **
purposeradio/television -5.071e-01 8.005e-01 -0.634 0.526379
purposedomestic appliances 4.512e-03 6.036e-01 0.007 0.994035
purposerepairs -1.160e-01 4.544e-01 -0.255 0.798468
purposevacation -1.526e+01 5.110e+02 -0.030 0.976178
purposeretraining -6.349e-01 3.867e-01 -1.642 0.100601
purposebusiness -1.544e+00 9.434e-01 -1.636 0.101737
savings... < 100 DM -3.511e-01 3.281e-01 -1.070 0.284627
savings100 <= ... < 500 DM -5.633e-01 4.704e-01 -1.197 0.231141
savings500 <= ... < 1000 DM -1.803e+00 6.409e-01 -2.813 0.004903 **
savings... >= 1000 DM -7.896e-01 2.919e-01 -2.705 0.006822 **
status... < 0 DM -6.488e-01 2.535e-01 -2.559 0.010500 *
status0<= ... < 200 DM -1.332e+00 4.432e-01 -3.004 0.002663 **
status... >= 200 DM / salary for at least 1 year -1.851e+00 2.659e-01 -6.961 3.39e-12 ***
telephoneyes (under customer name) -2.815e-01 2.326e-01 -1.210 0.226125
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 954.34 on 799 degrees of freedom
Residual deviance: 695.12 on 745 degrees of freedom
AIC: 805.12
Number of Fisher Scoring iterations: 14
Random Forest
Just like the logistic regression, we could train a random forest instead. We use the fast implementation from the ranger package. For this, we first need to define the learner and then actually train it.
We now additionally supply the importance argument (importance = "permutation"). Doing so, we override the default and let the learner determine feature importance based on permutation feature importance:
= lrn("classif.ranger", importance = "permutation")
learner_rf $train(task, row_ids = train_set) learner_rf
We can access the importance values using $importance():
learner_rf$importance()
status duration amount credit_history age
3.346907e-02 1.908317e-02 1.230587e-02 7.596011e-03 7.282276e-03
property savings employment_duration other_debtors installment_rate
6.292598e-03 5.483269e-03 4.902358e-03 4.239555e-03 3.603431e-03
purpose present_residence housing number_credits job
3.584035e-03 3.303205e-03 1.744761e-03 1.335679e-03 8.116002e-04
telephone personal_status_sex other_installment_plans people_liable foreign_worker
7.429906e-04 6.619249e-04 6.346839e-04 5.833024e-04 8.777591e-05
In order to obtain a plot for the importance values, we convert the importance to a data.table and then process it with ggplot2:
importance = as.data.table(learner_rf$importance(), keep.rownames = TRUE)
colnames(importance) = c("Feature", "Importance")
ggplot(importance, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_col() + coord_flip() + xlab("")
Prediction
Let’s see what the models predict.
After training a model, the model can be used for prediction. Usually, prediction is the main purpose of machine learning models.
In our case, the model can be used to classify new credit applicants w.r.t. their associated credit risk (good vs. bad) on the basis of the features. Typically, machine learning models predict numeric values. In the regression case this is very natural. For classification, most models predict scores or probabilities. Based on these values, one can derive class predictions.
Predict Classes
First, we directly predict classes:
prediction_logreg = learner_logreg$predict(task, row_ids = test_set)
prediction_rf = learner_rf$predict(task, row_ids = test_set)
prediction_logreg
<PredictionClassif> for 200 observations:
row_ids truth response
2 good bad
7 good good
11 good bad
---
988 bad good
990 bad bad
992 bad bad
prediction_rf
<PredictionClassif> for 200 observations:
row_ids truth response
2 good good
7 good good
11 good good
---
988 bad good
990 bad bad
992 bad bad
The $predict() method returns a Prediction object. It can be converted to a data.table if one wants to use it downstream.
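A minimal sketch of this conversion (prediction_dt is just an illustrative name):

prediction_dt = as.data.table(prediction_logreg)
head(prediction_dt)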
We can also display the prediction results aggregated in a confusion matrix:
prediction_logreg$confusion
truth
response bad good
bad 37 15
good 36 112
prediction_rf$confusion
truth
response bad good
bad 30 7
good 43 120
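A Prediction object can also be scored directly with one or more measures; a short sketch using classification accuracy (the choice of measure here is only for illustration):

prediction_logreg$score(msr("classif.acc"))
prediction_rf$score(msr("classif.acc"))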
Predict Probabilities
Most learners may not only predict a class variable ("response"), but also their degree of "belief" / "uncertainty" in a given response. Typically, we achieve this by setting the $predict_type slot of a Learner to "prob". Sometimes this needs to be done before the learner is trained. Alternatively, we can directly create the learner with this option: lrn("classif.log_reg", predict_type = "prob").
$predict_type = "prob" learner_logreg
$predict(task, row_ids = test_set) learner_logreg
<PredictionClassif> for 200 observations:
row_ids truth response prob.bad prob.good
2 good bad 0.62128424 0.3787158
7 good good 0.03659718 0.9634028
11 good bad 0.66442309 0.3355769
---
988 bad good 0.21573591 0.7842641
990 bad bad 0.67057802 0.3294220
992 bad bad 0.58760749 0.4123925
Note that sometimes one needs to be cautious when dealing with the probability interpretation of the predictions.
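For example, class predictions can be derived from the probabilities with a custom decision threshold via $set_threshold(). A hedged sketch, assuming "bad" is the positive class (the first factor level here) and using an arbitrary cutoff of 0.3 purely for illustration:

prediction_prob = learner_logreg$predict(task, row_ids = test_set)
prediction_prob$set_threshold(0.3) # predict "bad" already at a probability of 0.3
prediction_prob$confusion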
Performance Evaluation
To measure the performance of a learner on new unseen data, we usually mimic the scenario of unseen data by splitting up the data into training and test set. The training set is used for training the learner, and the test set is only used for predicting and evaluating the performance of the trained learner. Numerous resampling methods (cross-validation, bootstrap) repeat the splitting process in different ways.
Within mlr3, we need to specify the resampling strategy using the rsmp() function:
= rsmp("holdout", ratio = 2/3)
resampling print(resampling)
<ResamplingHoldout>: Holdout
* Iterations: 1
* Instantiated: FALSE
* Parameters: ratio=0.6667
Here, we use "holdout", a simple train-test split (with just one iteration). We use the resample() function to undertake the resampling calculation:
res = resample(task, learner = learner_logreg, resampling = resampling)
res
<ResampleResult> of 1 iterations
* Task: GermanCredit
* Learner: classif.log_reg
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations
The default score of the measure is included in the $aggregate() slot:
res$aggregate()
classif.ce
0.2402402
The default measure in this scenario is the classification error. Lower is better.
We can easily run different resampling strategies, e.g. repeated holdout ("subsampling") or cross-validation. Most methods perform repeated train/predict cycles on different data subsets and aggregate the result (usually as the mean). Doing this manually would require us to write loops. mlr3 does the job for us:
= rsmp("subsampling", repeats = 10)
resampling = resample(task, learner = learner_logreg, resampling = resampling)
rr $aggregate() rr
classif.ce
0.2603604
Instead, we could also run cross-validation:
resampling = rsmp("cv", folds = 10)
rr = resample(task, learner = learner_logreg, resampling = resampling)
rr$aggregate()
classif.ce
0.25
mlr3 features scores for many more measures. Here, we apply mlr_measures_classif.fpr for the false positive rate and mlr_measures_classif.fnr for the false negative rate. Multiple measures can be provided as a list of measures (which can directly be constructed via msrs()):
# false positive rate
$aggregate(msr("classif.fpr")) rr
classif.fpr
0.1395529
# false positive rate and false negative rate
measures = msrs(c("classif.fpr", "classif.fnr"))
rr$aggregate(measures)
classif.fpr classif.fnr
0.1395529 0.5070723
There are a few more resampling methods, and quite a few more measures (implemented in mlr3measures). They are automatically registered in the respective dictionaries:
mlr_resamplings
<DictionaryResampling> with 9 stored values
Keys: bootstrap, custom, custom_cv, cv, holdout, insample, loo, repeated_cv, subsampling
mlr_measures
<DictionaryMeasure> with 67 stored values
Keys: aic, bic, classif.acc, classif.auc, classif.bacc, classif.bbrier, classif.ce, classif.costs,
classif.dor, classif.fbeta, classif.fdr, classif.fn, classif.fnr, classif.fomr, classif.fp, classif.fpr,
classif.logloss, classif.mauc_au1p, classif.mauc_au1u, classif.mauc_aunp, classif.mauc_aunu,
classif.mbrier, classif.mcc, classif.npv, classif.ppv, classif.prauc, classif.precision, classif.recall,
classif.sensitivity, classif.specificity, classif.tn, classif.tnr, classif.tp, classif.tpr, clust.ch,
clust.db, clust.dunn, clust.silhouette, clust.wss, debug, oob_error, regr.bias, regr.ktau, regr.mae,
regr.mape, regr.maxae, regr.medae, regr.medse, regr.mse, regr.msle, regr.pbias, regr.rae, regr.rmse,
regr.rmsle, regr.rrse, regr.rse, regr.rsq, regr.sae, regr.smape, regr.srho, regr.sse, selected_features,
sim.jaccard, sim.phi, time_both, time_predict, time_train
To get help on a resampling method, use ?mlr_resamplings_xxx; for a measure, use ?mlr_measures_xxx. You can also browse the mlr3 reference online.
Note that some measures, for example AUC, require the prediction of probabilities.
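A quick sketch of computing AUC (learner_prob and rr_auc are illustrative names; the learner is constructed with predict_type = "prob" so that probabilities are available):

learner_prob = lrn("classif.log_reg", predict_type = "prob")
rr_auc = resample(task, learner = learner_prob, resampling = rsmp("cv", folds = 10))
rr_auc$aggregate(msr("classif.auc"))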
Performance Comparison and Benchmarks
We could compare Learners by evaluating resample() for each of them manually. However, benchmark() automatically performs resampling evaluations for multiple learners and tasks. benchmark_grid() creates fully crossed designs: multiple Learners for multiple Tasks are compared w.r.t. multiple Resamplings.
= lrns(c("classif.log_reg", "classif.ranger"), predict_type = "prob")
learners = benchmark_grid(
grid tasks = task,
learners = learners,
resamplings = rsmp("cv", folds = 10)
)= benchmark(grid) bmr
Careful, large benchmarks may take a long time! This one should take less than a minute, however. In general, we want to use parallelization to speed things up on multi-core machines. For parallelization, mlr3 relies on the future package:
# future::plan("multicore") # uncomment for parallelization
In the benchmark we can compare different measures. Here, we look at the misclassification rate and the AUC:
= msrs(c("classif.ce", "classif.auc"))
measures = bmr$aggregate(measures)
performances c("learner_id", "classif.ce", "classif.auc")] performances[,
learner_id classif.ce classif.auc
1: classif.log_reg 0.252 0.7794092
2: classif.ranger 0.230 0.8037874
We see that the two models perform very similarly.
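Since both learners predict probabilities, we can also compare them visually, e.g. via ROC curves. A hedged sketch, assuming the mlr3viz and precrec packages are installed:

library("mlr3viz")
autoplot(bmr, type = "roc")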
Deviating from hyperparameter defaults
The previously shown techniques build the backbone of an mlr3-featured machine learning workflow. However, in most cases one would not proceed in exactly the way we did. While many R packages have carefully selected default settings, they will not perform optimally in every scenario. Typically, we can select the values of such hyperparameters. The (hyper)parameters of a Learner can be accessed and set via its ParamSet, $param_set:
learner_rf$param_set
<ParamSet>
id class lower upper nlevels default parents value
1: alpha ParamDbl -Inf Inf Inf 0.5
2: always.split.variables ParamUty NA NA Inf <NoDefault[3]>
3: class.weights ParamUty NA NA Inf
4: holdout ParamLgl NA NA 2 FALSE
5: importance ParamFct NA NA 4 <NoDefault[3]> permutation
6: keep.inbag ParamLgl NA NA 2 FALSE
7: max.depth ParamInt 0 Inf Inf
8: min.node.size ParamInt 1 Inf Inf
9: min.prop ParamDbl -Inf Inf Inf 0.1
10: minprop ParamDbl -Inf Inf Inf 0.1
11: mtry ParamInt 1 Inf Inf <NoDefault[3]>
12: mtry.ratio ParamDbl 0 1 Inf <NoDefault[3]>
13: num.random.splits ParamInt 1 Inf Inf 1 splitrule
14: num.threads ParamInt 1 Inf Inf 1 1
15: num.trees ParamInt 1 Inf Inf 500
16: oob.error ParamLgl NA NA 2 TRUE
17: regularization.factor ParamUty NA NA Inf 1
18: regularization.usedepth ParamLgl NA NA 2 FALSE
19: replace ParamLgl NA NA 2 TRUE
20: respect.unordered.factors ParamFct NA NA 3 ignore
21: sample.fraction ParamDbl 0 1 Inf <NoDefault[3]>
22: save.memory ParamLgl NA NA 2 FALSE
23: scale.permutation.importance ParamLgl NA NA 2 FALSE importance
24: se.method ParamFct NA NA 2 infjack
25: seed ParamInt -Inf Inf Inf
26: split.select.weights ParamUty NA NA Inf
27: splitrule ParamFct NA NA 3 gini
28: verbose ParamLgl NA NA 2 TRUE
29: write.forest ParamLgl NA NA 2 TRUE
id class lower upper nlevels default parents value
learner_rf$param_set$values = list(verbose = FALSE)
We can choose parameters for our learners in two distinct manners. If we have prior knowledge on how the learner should be (hyper-)parameterized, the way to go would be manually entering the parameters in the parameter set. In most cases, however, we would want to tune the learner so that it can search “good” model configurations itself. For now, we only want to compare a few models.
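A minimal sketch of the manual route (learner_rf_manual is an illustrative name and num.trees = 100 an arbitrary value, not a recommendation):

learner_rf_manual = lrn("classif.ranger", importance = "permutation")
learner_rf_manual$param_set$values$num.trees = 100 # set a single hyperparameter
# equivalently, pass the value directly at construction time:
learner_rf_manual = lrn("classif.ranger", importance = "permutation", num.trees = 100)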
To get an idea of which parameters can be manipulated, we can investigate the documentation of the original package or look into the parameter set of the learner:
## ?ranger::ranger
as.data.table(learner_rf$param_set)[, .(id, class, lower, upper)]
id class lower upper
1: alpha ParamDbl -Inf Inf
2: always.split.variables ParamUty NA NA
3: class.weights ParamUty NA NA
4: holdout ParamLgl NA NA
5: importance ParamFct NA NA
6: keep.inbag ParamLgl NA NA
7: max.depth ParamInt 0 Inf
8: min.node.size ParamInt 1 Inf
9: min.prop ParamDbl -Inf Inf
10: minprop ParamDbl -Inf Inf
11: mtry ParamInt 1 Inf
12: mtry.ratio ParamDbl 0 1
13: num.random.splits ParamInt 1 Inf
14: num.threads ParamInt 1 Inf
15: num.trees ParamInt 1 Inf
16: oob.error ParamLgl NA NA
17: regularization.factor ParamUty NA NA
18: regularization.usedepth ParamLgl NA NA
19: replace ParamLgl NA NA
20: respect.unordered.factors ParamFct NA NA
21: sample.fraction ParamDbl 0 1
22: save.memory ParamLgl NA NA
23: scale.permutation.importance ParamLgl NA NA
24: se.method ParamFct NA NA
25: seed ParamInt -Inf Inf
26: split.select.weights ParamUty NA NA
27: splitrule ParamFct NA NA
28: verbose ParamLgl NA NA
29: write.forest ParamLgl NA NA
id class lower upper
For the random forest, two meaningful parameters which steer model complexity are num.trees and mtry. num.trees defaults to 500 and mtry to floor(sqrt(ncol(data) - 1)), in our case 4.
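A quick sanity check of this default for our task (20 features, so floor(sqrt(20)) = 4):

floor(sqrt(length(task$feature_names)))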
In the following we aim to train three different learners:
- The default random forest.
- A random forest with low num.trees and low mtry.
- A random forest with high num.trees and high mtry.
We will benchmark their performance on the German credit dataset. For this we construct the three learners and set the parameters accordingly:
= lrn("classif.ranger", id = "med", predict_type = "prob")
rf_med
= lrn("classif.ranger", id = "low", predict_type = "prob",
rf_low num.trees = 5, mtry = 2)
= lrn("classif.ranger", id = "high", predict_type = "prob",
rf_high num.trees = 1000, mtry = 11)
Once the learners are defined, we can benchmark them:
learners = list(rf_low, rf_med, rf_high)
grid = benchmark_grid(
  tasks = task,
  learners = learners,
  resamplings = rsmp("cv", folds = 10)
)

bmr = benchmark(grid)
print(bmr)
<BenchmarkResult> of 30 rows with 3 resampling runs
nr task_id learner_id resampling_id iters warnings errors
1 GermanCredit low cv 10 0 0
2 GermanCredit med cv 10 0 0
3 GermanCredit high cv 10 0 0
We compare misclassification rate and AUC again:
= msrs(c("classif.ce", "classif.auc"))
measures = bmr$aggregate(measures)
performances performances[, .(learner_id, classif.ce, classif.auc)]
learner_id classif.ce classif.auc
1: low 0.269 0.7341043
2: med 0.239 0.7920121
3: high 0.236 0.7889281
autoplot(bmr)
The "low" setting seems to underfit a bit, while the "high" setting is comparable to the default setting "med".
Outlook
This tutorial was a detailed introduction to machine learning workflows within mlr3. Having followed it, you should be able to run your first models yourself. In addition, we dipped into performance evaluation and benchmarking, and showed how to customize learners.
The next parts of the tutorial will go into more depth on additional mlr3 topics:
Part II - Tuning introduces you to the mlr3tuning package
Part III - Pipelines introduces you to the mlr3pipelines package