Encoding and Scaling

Create a pipeline to do feature preprocessing (one-hot-encoding, Yeo-Johnson transformation) for the german credit task.

Goal

Learn how to do preprocessing steps directly on a mlr3 Task object and how to combine a preprocessing with a learner to create a simple linear ML pipeline that first applies the preprocessing and then trains a learner.

German Credit Data

Description

  • Data from 1973 to 1975 from a large regional bank in southern Germany; each credit is described by a set of attributes and classified as a good or bad credit risk.
  • Stratified sample of 1000 credits (300 bad ones and 700 good ones).
  • Customers with good credit risks perfectly complied with the conditions of the contract while customers with bad credit risks did not comply with the contract as required.
  • Available in tsk("german_credit").

Data Dictionary

n = 1,000 observations of credits

  • credit_risk: Has the credit contract been complied with (good) or not (bad)?
  • age: Age of debtor in years
  • amount: Credit amount in DM
  • credit_history: History of compliance with previous or concurrent credit contracts
  • duration: Credit duration in months
  • employment_duration: Duration of debtor’s employment with current employer
  • foreign_worker: Whether the debtor is a foreign worker
  • housing: Type of housing the debtor lives in
  • installment_rate: Credit installments as a percentage of debtor’s disposable income
  • job: Quality of debtor’s job
  • number_credits: Number of credits including the current one the debtor has (or had) at this bank
  • other_debtors: Whether there is another debtor or a guarantor for the credit
  • other_installment_plans: Installment plans from providers other than the credit-giving bank
  • people_liable: Number of persons who financially depend on the debtor
  • personal_status_sex: Combined information on sex and marital status
  • present_residence: Length of time (in years) the debtor lives in the present residence
  • property: The debtor’s most valuable property
  • purpose: Purpose for which the credit is needed
  • savings: Debtor’s savings
  • status: Status of the debtor’s checking account with the bank
  • telephone: Whether there is a telephone landline registered in the debtor’s name

library(mlr3)
library(mlr3learners)
library(xgboost)
task = tsk("german_credit")

Recap: mlr3 Tasks

An mlr3 Task encapsulates data with meta-information, such as the name of the target variable and the type of the learning problem (in our example this would be a classification task, where the target is a factor label with relatively few distinct values).

task
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
    other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
    telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

We get a short summary of the task: it has 1000 observations and 21 columns, of which 20 are features. 17 features are categorical (14 unordered factors and 3 ordered factors) and 3 features are integer.

By using the $data() method, we get access to the data (in the form of a data.table):

str(task$data())
Classes 'data.table' and 'data.frame':  1000 obs. of  21 variables:
 $ credit_risk            : Factor w/ 2 levels "good","bad": 1 2 1 1 2 1 1 1 1 2 ...
 $ age                    : int  67 22 49 45 53 35 53 35 61 28 ...
 $ amount                 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ credit_history         : Factor w/ 5 levels "delay in paying off in the past",..: 5 3 5 3 4 3 3 3 3 5 ...
 $ duration               : int  6 48 12 42 24 36 24 36 12 30 ...
 $ employment_duration    : Factor w/ 5 levels "unemployed","< 1 yr",..: 5 3 4 4 3 3 5 3 4 1 ...
 $ foreign_worker         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
 $ housing                : Factor w/ 3 levels "for free","rent",..: 2 2 2 3 3 3 2 1 2 2 ...
 $ installment_rate       : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 2 3 2 3 2 2 4 ...
 $ job                    : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
 $ number_credits         : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 2 1 1 1 2 1 1 1 1 2 ...
 $ other_debtors          : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
 $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ people_liable          : Factor w/ 2 levels "0 to 2","3 or more": 1 1 2 2 2 2 1 1 1 1 ...
 $ personal_status_sex    : Factor w/ 4 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
 $ present_residence      : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 3 4 4 4 4 2 4 2 ...
 $ property               : Factor w/ 4 levels "unknown / no property",..: 1 1 1 2 4 4 2 3 1 3 ...
 $ purpose                : Factor w/ 11 levels "others","car (new)",..: 4 4 7 3 1 7 3 2 4 1 ...
 $ savings                : Factor w/ 5 levels "unknown/no savings account",..: 5 1 1 1 1 5 3 1 4 1 ...
 $ status                 : Factor w/ 4 levels "no checking account",..: 1 2 4 1 1 4 4 2 4 2 ...
 $ telephone              : Factor w/ 2 levels "no","yes (under customer name)": 2 1 1 1 1 2 1 2 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Note that an mlr3 Task object comes with plenty of functionality in the form of fields, methods, and active bindings (see ?Task). For example, to get a summary of all feature names, you can use:

task$feature_names
 [1] "age"                     "amount"                  "credit_history"          "duration"               
 [5] "employment_duration"     "foreign_worker"          "housing"                 "installment_rate"       
 [9] "job"                     "number_credits"          "other_debtors"           "other_installment_plans"
[13] "people_liable"           "personal_status_sex"     "present_residence"       "property"               
[17] "purpose"                 "savings"                 "status"                  "telephone"              

To obtain information about the feature types of the task (similar to the data dictionary above), we can inspect the active bindings of the task object (see ?Task):

task$feature_types
Key: <id>
                         id    type
                     <char>  <char>
 1:                     age integer
 2:                  amount integer
 3:          credit_history  factor
 4:                duration integer
 5:     employment_duration  factor
 6:          foreign_worker  factor
 7:                 housing  factor
 8:        installment_rate ordered
 9:                     job  factor
10:          number_credits ordered
11:           other_debtors  factor
12: other_installment_plans  factor
13:           people_liable  factor
14:     personal_status_sex  factor
15:       present_residence ordered
16:                property  factor
17:                 purpose  factor
18:                 savings  factor
19:                  status  factor
20:               telephone  factor
                         id    type

1 Preprocess a Task (with One-Hot Encoding)

Use the one-hot encoding PipeOp to convert all categorical features from the german_credit task into a preprocessed task containing 0-1 indicator variables for each category level instead of categorical features.
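For intuition about what one-hot encoding produces, the same idea can be sketched in base R with model.matrix, independently of mlr3 (the factor levels here are made up for illustration):

```r
# One-hot encoding by hand: one 0-1 indicator column per factor level
x = factor(c("rent", "own", "for free", "rent"))
m = model.matrix(~ x - 1)  # "- 1" drops the intercept so all levels are kept
colnames(m)
```

Each row of m contains exactly one 1, marking the level of the original factor.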

Hint 1:

Load the mlr3pipelines package and get an overview of the PipeOps available for different preprocessing steps by printing mlr_pipeops, or just the first two columns of the corresponding table with as.data.table(mlr_pipeops)[,1:2]. Look for a factor encoding and pass its key to the po() function (see also the help page ?PipeOpEncode). Then, use the $train() method of the PipeOp object, which expects a list containing the task to be converted as input and produces a list containing the converted task.

Hint 2:
library(mlr3pipelines)
# Create a PipeOp object that applies one-hot encoding
poe = po(...) 
# Apply a created PipeOp to e.g. preprocess an input
encoded_task = poe$train(input = ...)$output
str(...$data())
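One way the skeleton above could be filled in ("one-hot" is the default method of po("encode"); it is spelled out here for clarity):

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("german_credit")
# PipeOp that replaces each categorical feature by 0-1 indicator columns
poe = po("encode", method = "one-hot")
# $train() takes a list of inputs and returns a list of outputs
encoded_task = poe$train(input = list(task))$output
str(encoded_task$data())
```

After this step, the preprocessed task contains no factor or ordered features anymore.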

2 Create a Simple ML Pipeline (with One-Hot Encoding)

Some learners cannot handle categorical features, such as the xgboost learner (which gives an error message when applied to a task containing categorical features):

library(mlr3verse)
lrnxg = lrn("classif.xgboost")
lrnxg$train(task)
Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
lrnxg$predict(task)
Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered

Combine the xgboost learner with a preprocessing step that applies one-hot encoding to create an ML pipeline that first converts all categorical features to 0-1 indicator variables and then applies the xgboost learner. Train the ML pipeline on the german_credit task and make predictions on the training data.

Hint 1:

You can create a Graph that combines a PipeOp object with a learner object (or further PipeOp objects) by concatenating them using the %>>% operator. The Graph contains all information of a sequential ML pipeline. Convert the Graph into a GraphLearner to be able to run the whole ML pipeline like a usual learner object with which we can train, predict, resample, and benchmark the GraphLearner as we have learned. See also the help page ?GraphLearner.

Hint 2:
library(mlr3verse)
lrnxg = lrn("classif.xgboost")
poe = po(...)
graph = ...

glrn = as_learner(...) 
...$train(...)
...$predict(...)
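A possible completion of this skeleton (the accuracy measure at the end is one way to inspect the training-set predictions, not prescribed by the exercise):

```r
library(mlr3verse)

task = tsk("german_credit")
lrnxg = lrn("classif.xgboost")
poe = po("encode", method = "one-hot")

# %>>% chains PipeOps and learners into a Graph: encode first, then xgboost
graph = poe %>>% lrnxg
glrn = as_learner(graph)  # wrap the Graph so it behaves like a usual Learner

glrn$train(task)
pred = glrn$predict(task)
pred$score(msr("classif.acc"))
```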

3 Feature Transformation for Decision Trees

The structure of a decision tree is insensitive to monotonic transformations of the features (and scaling is a monotonic transformation). This means that although the scaled features differ from the non-scaled features, the decision tree will have the same structure (the values of the split points for numeric features might differ, as the numeric features will be on a different scale, but the structure of the decision tree will stay the same).
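This key property can be checked directly in base R: a monotonic transformation preserves the ordering of the feature values, and threshold splits of the form x < c depend only on that ordering:

```r
# Monotonic transforms (sqrt, standard scaling) preserve the order of values,
# which is all a decision tree's threshold splits depend on
x = c(5, 1, 40, 12)
stopifnot(identical(order(x), order(sqrt(x))))        # sqrt on x >= 0
stopifnot(identical(order(x), order(scale(x)[, 1])))  # centering and scaling
```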

3.1 Preprocessing

Use a PipeOp to scale all numeric features of the german_credit task and create a preprocessed task containing the scaled numeric features. Do this for standard scaling (i.e., normalization by centering and scaling) and for the Yeo-Johnson transformation (i.e., a power transformation to make data more Gaussian-like). You can look up the corresponding keys by inspecting the table as.data.table(mlr_pipeops)[,1:2]. Create the preprocessed tasks task_scaled and task_yeojohnson and check the values of the numeric features. You may have to first install the bestNormalize package for the Yeo-Johnson transformation.

Hint: Proceed as in Exercise 1, but use scale and yeojohnson instead of encode as keys in the po() function. If installing the bestNormalize package does not work, you can also select a different scaling approach such as scalemaxabs or scalerange. Yeo-Johnson transformation is a generalization of the Box-Cox transformation that can be applied to both positive and negative values, while Box-Cox transformation is only applicable to non-negative values.
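For intuition, the Yeo-Johnson transformation itself is easy to write down. A minimal base-R sketch with a fixed λ (bestNormalize additionally estimates λ from the data, which this sketch does not do):

```r
# Piecewise Yeo-Johnson transform for a single, fixed lambda
yeo_johnson = function(x, lambda) {
  pos = if (lambda != 0) ((x + 1)^lambda - 1) / lambda else log(x + 1)
  neg = if (lambda != 2) -((-x + 1)^(2 - lambda) - 1) / (2 - lambda) else -log(-x + 1)
  ifelse(x >= 0, pos, neg)
}
yeo_johnson(c(-2, 0, 3), lambda = 1)  # lambda = 1 is the identity
```

Unlike Box-Cox, both branches are defined for negative inputs, which is why the transformation works on arbitrary numeric features.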

3.2 Visual Comparison

Create two ML pipelines, one that combines the classif.rpart learner with standard scaling and another that combines the classif.rpart learner with the Yeo-Johnson transformation. Then use the classif.rpart learner and the two ML pipelines on the german_credit task to fit 3 different decision trees (one trained on the raw task and the other two trained on the scaled and Yeo-Johnson transformed task). Visualize the decision tree structure using the rpart.plot function from the rpart.plot package.

Hint: Proceed as in Exercise 2 to create two GraphLearners, one with po("scale") and the other one with po("yeojohnson"). Then, train the classif.rpart learner and the two GraphLearners on the german_credit task. Apply the rpart.plot function to the trained model objects to compare the structure of the decision trees. Note: While for the classif.rpart learner, the model object is directly contained in the $model slot of the learner after training, the $model slot of the two GraphLearners is a list and you have to access the trained model via $model$classif.rpart$model.
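Following this hint, the comparison could be sketched as below (roundint = FALSE merely suppresses an rpart.plot warning about integer-valued features):

```r
library(mlr3verse)
library(rpart.plot)

task = tsk("german_credit")
lrn_raw = lrn("classif.rpart")
glrn_sc = as_learner(po("scale") %>>% lrn("classif.rpart"))
glrn_yj = as_learner(po("yeojohnson") %>>% lrn("classif.rpart"))

lrn_raw$train(task)
glrn_sc$train(task)
glrn_yj$train(task)

# The tree structure should coincide; only the numeric split points differ
rpart.plot(lrn_raw$model, roundint = FALSE)
rpart.plot(glrn_sc$model$classif.rpart$model, roundint = FALSE)
rpart.plot(glrn_yj$model$classif.rpart$model, roundint = FALSE)
```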

4 Benchmark k-NN and Decision Tree with Scaling and Yeo-Johnson Transformation

In the previous exercise we saw that scaling does not affect the structure of a decision tree. That is, scaling the numeric features will not have any (strong) effect on the performance of a decision tree. However, for some learners, scaling numeric features is important, especially for learners based on computing distances, such as the k-NN learner (because scaling puts all numeric features on a comparable scale).

In this exercise we want to conduct a benchmark that illustrates these claims. Consider the k-NN learner without scaling lrn("classif.kknn", scale = FALSE) and the decision tree lrn("classif.rpart"). Combine these two learners once with po("scale") (for normalization, i.e., subtracting the mean and dividing by the standard deviation) and once with po("yeojohnson") for the Yeo-Johnson transformation of the numeric features. Then, set up a benchmark to compare their performance (including the non-scaled k-NN lrn("classif.kknn", scale = FALSE) and decision tree lrn("classif.rpart")) using 10-fold cross-validation. In total, you will benchmark 6 learners: the 4 ML pipelines and the 2 plain learners. For reproducibility, use the seed set.seed(2023).

Hint:
library(mlr3pipelines)

set.seed(2023)
lrns = list(
  lrn("classif.kknn", scale = FALSE),
  po("scale") %>>% ...,
  po("yeojohnson") %>>% ...,
  lrn("classif.rpart"),
  ... %>>% lrn(...),
  ... %>>% ...
)

design = benchmark_grid(...)
bmr = benchmark(...)
bmr$aggregate()
autoplot(bmr)
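A possible completion of this skeleton (classif.ce as the evaluation measure is one sensible choice, not prescribed by the exercise):

```r
library(mlr3verse)

set.seed(2023)
task = tsk("german_credit")
lrns = list(
  lrn("classif.kknn", scale = FALSE),
  as_learner(po("scale") %>>% lrn("classif.kknn", scale = FALSE)),
  as_learner(po("yeojohnson") %>>% lrn("classif.kknn", scale = FALSE)),
  lrn("classif.rpart"),
  as_learner(po("scale") %>>% lrn("classif.rpart")),
  as_learner(po("yeojohnson") %>>% lrn("classif.rpart"))
)
# One row per task/learner/resampling combination
design = benchmark_grid(task, lrns, rsmp("cv", folds = 10))
bmr = benchmark(design)
bmr$aggregate(msr("classif.ce"))
autoplot(bmr)
```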

Summary

We learned how to apply preprocessing steps such as factor encoding, standard scaling, or the Yeo-Johnson transformation directly on a task. Furthermore, we have seen how to create a GraphLearner that runs an ML pipeline on a task: it first applies all preprocessing steps defined in the Graph and then trains a learner on the preprocessed task. We also saw that scaling is important for the k-NN learner but not for a decision tree, as neither the tree structure nor the performance of the decision tree changes.