library(mlr3)
library(mlr3learners)
library(xgboost)
= tsk("german_credit") task
Goal
Learn how to perform preprocessing steps directly on an mlr3 Task
object and how to combine preprocessing with a learner to create a simple linear ML pipeline that first applies the preprocessing and then trains the learner.
German Credit Data
Description
- Data from 1973 to 1975 from a large regional bank in southern Germany classifying credits described by a set of attributes to good or bad credit risks.
- Stratified sample of 1000 credits (300 bad ones and 700 good ones).
- Customers with good credit risks perfectly complied with the conditions of the contract while customers with bad credit risks did not comply with the contract as required.
- Available in tsk("german_credit").
Data Dictionary
n = 1,000 observations of credits
- credit_risk: Has the credit contract been complied with (good) or not (bad)?
- age: Age of debtor in years
- amount: Credit amount in DM
- credit_history: History of compliance with previous or concurrent credit contracts
- duration: Credit duration in months
- employment_duration: Duration of debtor's employment with current employer
- foreign_worker: Whether the debtor is a foreign worker
- housing: Type of housing the debtor lives in
- installment_rate: Credit installments as a percentage of debtor's disposable income
- job: Quality of debtor's job
- number_credits: Number of credits including the current one the debtor has (or had) at this bank
- other_debtors: Whether there is another debtor or a guarantor for the credit
- other_installment_plans: Installment plans from providers other than the credit-giving bank
- people_liable: Number of persons who financially depend on the debtor
- personal_status_sex: Combined information on sex and marital status
- present_residence: Length of time (in years) the debtor lives in the present residence
- property: The debtor's most valuable property
- purpose: Purpose for which the credit is needed
- savings: Debtor's savings
- status: Status of the debtor's checking account with the bank
- telephone: Whether there is a telephone landline registered on the debtor's name
Recap: mlr3 Tasks
An mlr3
Task
encapsulates data with meta-information, such as the name of the target variable and the type of the learning problem (in our example this would be a classification task, where the target is a factor label with relatively few distinct values).
task
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
- fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
telephone
- int (3): age, amount, duration
- ord (3): installment_rate, number_credits, present_residence
We get a short summary of the task: It has 1000 observations and 21 columns, of which 20 are features. 17 features are categorical (14 unordered factors and 3 ordered factors) and 3 features are integer.
By using the $data()
method, we get access to the data (in the form of a data.table
):
str(task$data())
Classes 'data.table' and 'data.frame': 1000 obs. of 21 variables:
$ credit_risk : Factor w/ 2 levels "good","bad": 1 2 1 1 2 1 1 1 1 2 ...
$ age : int 67 22 49 45 53 35 53 35 61 28 ...
$ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
$ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 3 5 3 4 3 3 3 3 5 ...
$ duration : int 6 48 12 42 24 36 24 36 12 30 ...
$ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 5 3 4 4 3 3 5 3 4 1 ...
$ foreign_worker : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
$ housing : Factor w/ 3 levels "for free","rent",..: 2 2 2 3 3 3 2 1 2 2 ...
$ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 2 3 2 3 2 2 4 ...
$ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
$ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 2 1 1 1 2 1 1 1 1 2 ...
$ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
$ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
$ people_liable : Factor w/ 2 levels "0 to 2","3 or more": 1 1 2 2 2 2 1 1 1 1 ...
$ personal_status_sex : Factor w/ 4 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
$ present_residence : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 3 4 4 4 4 2 4 2 ...
$ property : Factor w/ 4 levels "unknown / no property",..: 1 1 1 2 4 4 2 3 1 3 ...
$ purpose : Factor w/ 11 levels "others","car (new)",..: 4 4 7 3 1 7 3 2 4 1 ...
$ savings : Factor w/ 5 levels "unknown/no savings account",..: 5 1 1 1 1 5 3 1 4 1 ...
$ status : Factor w/ 4 levels "no checking account",..: 1 2 4 1 1 4 4 2 4 2 ...
$ telephone : Factor w/ 2 levels "no","yes (under customer name)": 2 1 1 1 1 2 1 2 1 1 ...
- attr(*, ".internal.selfref")=<externalptr>
Note that an mlr3 Task
object comes with plenty of functionality in the form of fields, methods, and active bindings (see ?Task
); e.g., to get a summary of all feature names, you can use:
task$feature_names
[1] "age" "amount" "credit_history" "duration"
[5] "employment_duration" "foreign_worker" "housing" "installment_rate"
[9] "job" "number_credits" "other_debtors" "other_installment_plans"
[13] "people_liable" "personal_status_sex" "present_residence" "property"
[17] "purpose" "savings" "status" "telephone"
To obtain information about the feature types of the task (similar to the data dictionary above), we can inspect the active binding fields of the task object (see ?Task
):
task$feature_types
Key: <id>
id type
<char> <char>
1: age integer
2: amount integer
3: credit_history factor
4: duration integer
5: employment_duration factor
6: foreign_worker factor
7: housing factor
8: installment_rate ordered
9: job factor
10: number_credits ordered
11: other_debtors factor
12: other_installment_plans factor
13: people_liable factor
14: personal_status_sex factor
15: present_residence ordered
16: property factor
17: purpose factor
18: savings factor
19: status factor
20: telephone factor
id type
1 Preprocess a Task (with One-Hot Encoding)
Use the one-hot encoding PipeOp
to convert all categorical features from the german_credit
task into a preprocessed task containing 0-1 indicator variables for each category level instead of categorical features.
Hint 1:
Load the mlr3pipelines
package and get an overview of the available PipeOps
that can be used for different preprocessing steps by printing mlr_pipeops
or the first two columns of the corresponding table as.data.table(mlr_pipeops)[,1:2]
. Look for a factor encoding and pass the corresponding key
for factor encoding to the po()
function (see also the help page ?PipeOpEncode
). Then, use the $train()
method of the PipeOp
object which expects a list containing the task to be converted as input and produces a list containing the converted task.
Hint 2:
library(mlr3pipelines)
# Create a PipeOp object that applies one-hot encoding
poe = po(...)
# Apply the created PipeOp to e.g. preprocess an input
encoded_task = poe$train(input = ...)$output
str(encoded_task$data())
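For reference, one possible solution could look as follows. This is a sketch, assuming the key "encode" with method = "one-hot" (as documented in ?PipeOpEncode) is the intended one-hot encoding PipeOp:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("german_credit")

# PipeOp for one-hot encoding: replaces each categorical feature
# by 0-1 indicator columns, one per factor level
poe = po("encode", method = "one-hot")

# $train() expects a list of tasks and returns a list of processed tasks
encoded_task = poe$train(list(task))$output
str(encoded_task$data())
```

The preprocessed task now contains only numeric (indicator) features plus the target.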
2 Create a Simple ML Pipeline (with One-Hot Encoding)
Some learners cannot handle categorical features, such as the xgboost
learner (which gives an error message when applied to a task containing categorical features):
library(mlr3verse)
= lrn("classif.xgboost")
lrnxg $train(task) lrnxg
Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
$predict(task) lrnxg
Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
Combine the xgboost
learner with a preprocessing step that applies one-hot encoding to create an ML pipeline that first converts all categorical features to 0-1 indicator variables and then applies the xgboost
learner. Train the ML pipeline on the german_credit
task and make predictions on the training data.
Hint 1:
You can create a Graph
that combines a PipeOp
object with a learner object (or further PipeOp
objects) by concatenating them using the %>>%
operator. The Graph
contains all information of a sequential ML pipeline. Convert the Graph
into a GraphLearner
to be able to run the whole ML pipeline like a usual learner object with which we can train, predict, resample, and benchmark the GraphLearner
as we have learned. See also the help page ?GraphLearner
.
Hint 2:
library(mlr3verse)
= lrn("classif.xgboost")
lrnxg = po(...)
poe = ...
graph
= as_learner(...)
glrn $train(...)
...$predict(...) ...
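One possible solution sketch, again assuming the hypothetical "encode" key with one-hot encoding from Exercise 1:

```r
library(mlr3verse)

task = tsk("german_credit")
lrnxg = lrn("classif.xgboost")
poe = po("encode", method = "one-hot")

# Concatenate PipeOp and learner into a Graph with the %>>% operator
graph = poe %>>% lrnxg

# Wrap the Graph so it can be used like a regular learner
glrn = as_learner(graph)
glrn$train(task)
glrn$predict(task)
```

Because the encoding is applied inside the pipeline, training and prediction now succeed on the raw task containing categorical features.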
3 Feature Transformation for Decision Trees
The structure of a decision tree is insensitive to monotonic transformations of the features (and scaling is a monotonic transformation). This means that although the scaled features differ from the non-scaled features, the decision tree will have the same structure (the values of the split points for numeric features might be different, as the numeric features will have a different scale, but the structure of the decision tree will stay the same).
3.1 Preprocessing
Use the PipeOp
to scale all numeric features from the german_credit
task and create a preprocessed task with the scaled numeric features. Do this for standard scaling (i.e., normalization by centering and scaling) and for the Yeo-Johnson transformation (i.e., a power transformation to make data more Gaussian-like). You can look up the corresponding keys by inspecting the table as.data.table(mlr_pipeops)[,1:2]
. Create the preprocessed tasks task_scaled
and task_yeojohnson
and check the values of the numeric features. You may have to first install the bestNormalize
package for the Yeo-Johnson transformation.
Hint:
Proceed as in Exercise 1, but use scale
and yeojohnson
instead of encode
as keys in the po()
function. If installing the bestNormalize
package does not work, you can also select a different scaling approach such as scalemaxabs
or scalerange
. Yeo-Johnson transformation is a generalization of the Box-Cox transformation that can be applied to both positive and negative values, while Box-Cox transformation is only applicable to non-negative values.
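A sketch of one possible solution, following the same $train() pattern as in Exercise 1 (the inspected columns are the three integer features of the task):

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("german_credit")

# Standard scaling: center and scale all numeric features
task_scaled = po("scale")$train(list(task))$output
summary(task_scaled$data(cols = c("age", "amount", "duration")))

# Yeo-Johnson transformation (requires the bestNormalize package)
task_yeojohnson = po("yeojohnson")$train(list(task))$output
summary(task_yeojohnson$data(cols = c("age", "amount", "duration")))
```

After standard scaling, each numeric feature should have mean 0 and standard deviation 1; the Yeo-Johnson transformed features will look more symmetric.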
3.2 Visual Comparison
Create two ML pipelines, one that combines the classif.rpart
learner with the standard scaling and another one that combines the classif.rpart
learner with the Yeo-Johnson scaling. Then use the classif.rpart
learner and the two ML pipelines on the german_credit
task to fit 3 different decision trees (one trained on the raw task and the other two trained on the scaled and Yeo-Johnson transformed task). Visualize the decision tree structure using the rpart.plot
function from the rpart.plot
package.
Hint:
Proceed as in Exercise 2 to create two GraphLearner
s, one with po("scale")
and the other one with po("yeojohnson")
. Then, train the classif.rpart
learner and the two GraphLearner
s on the german_credit
task. Apply the rpart.plot
function to the trained model objects to compare the structure of the decision trees. Note: While for the classif.rpart
learner, the model object is directly contained in the $model
slot of the learner after training, the $model
slot of the two GraphLearners
is a list and you have to access the trained model via $model$classif.rpart$model
.
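One possible solution sketch (note the different model access for plain learners vs. GraphLearners, as described in the hint):

```r
library(mlr3verse)
library(rpart.plot)

task = tsk("german_credit")

lrn_rpart = lrn("classif.rpart")
glrn_scale = as_learner(po("scale") %>>% lrn("classif.rpart"))
glrn_yj = as_learner(po("yeojohnson") %>>% lrn("classif.rpart"))

lrn_rpart$train(task)
glrn_scale$train(task)
glrn_yj$train(task)

# Plain learner: rpart model sits directly in $model;
# GraphLearners: the trained rpart model is nested one level deeper
rpart.plot(lrn_rpart$model)
rpart.plot(glrn_scale$model$classif.rpart$model)
rpart.plot(glrn_yj$model$classif.rpart$model)
```

All three plots should show the same tree structure; only the numeric split point values differ for the transformed features.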
4 Benchmark k-NN and Decision Tree with Scaling and Yeo-Johnson Transformation
In the previous exercise we saw that scaling does not affect the structure of a decision tree. That is, scaling the numeric features will not have any (strong) effect on a decision tree's performance. However, for some learners, scaling numeric features is important, especially for learners based on computing distances, such as the k-NN learner (because scaling brings all numeric features onto a comparable scale).
In this exercise we want to conduct a benchmark that illustrates these claims. Consider the k-NN learner without scaling lrn("classif.kknn", scale = FALSE)
and the decision tree lrn("classif.rpart")
. Combine these two learners once with po("scale")
(for normalization, i.e., subtracting the mean and dividing by the standard deviation) and once with po("yeojohnson")
for Yeo-Johnson transformation of the numeric features. Then, set up a benchmark to compare their performance (including the non-scaled k-NN lrn("classif.kknn", scale = FALSE)
and decision tree lrn("classif.rpart")
) using 10-fold cross-validation. In total, you will benchmark 6 learners: the 4 ML pipelines and the 2 plain learners. For reproducibility, set the seed with set.seed(2023)
.
Hint:
library(mlr3pipelines)
set.seed(2023)
lrns = list(
  lrn("classif.kknn", scale = FALSE),
  po("scale") %>>% ...,
  po("yeojohnson") %>>% ...,
  lrn("classif.rpart"),
  ... %>>% lrn(...),
  ... %>>% ...
)
design = benchmark_grid(...)
bmr = benchmark(...)
bmr$aggregate()
autoplot(bmr)
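Filling in the blanks, one possible solution could look as follows (wrapping each pipeline with as_learner() so all six entries behave like regular learners):

```r
library(mlr3verse)
set.seed(2023)

task = tsk("german_credit")

lrns = list(
  lrn("classif.kknn", scale = FALSE),
  as_learner(po("scale") %>>% lrn("classif.kknn", scale = FALSE)),
  as_learner(po("yeojohnson") %>>% lrn("classif.kknn", scale = FALSE)),
  lrn("classif.rpart"),
  as_learner(po("scale") %>>% lrn("classif.rpart")),
  as_learner(po("yeojohnson") %>>% lrn("classif.rpart"))
)

# Cross all learners with the task and a 10-fold CV resampling
design = benchmark_grid(task, lrns, rsmp("cv", folds = 10))
bmr = benchmark(design)
bmr$aggregate()
autoplot(bmr)
```

The results should show a clear performance gain from scaling for k-NN, while the decision tree's performance is essentially unchanged.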
Summary
We learned how to apply preprocessing steps such as factor encoding, standard scaling, or the Yeo-Johnson transformation directly to a task. Furthermore, we have seen how to create a GraphLearner
which applies an ML pipeline to a task: it first performs all preprocessing steps defined in the Graph
and then trains a learner on the preprocessed task. We also saw that scaling is important for the k-NN learner but not for a decision tree, as neither the tree structure nor the performance of the decision tree changes.