Goal

Our goal for this exercise sheet is to learn the basics of mlr3 for supervised learning by training a first simple model on training data and by evaluating its performance on hold-out/test data.

German Credit Dataset

The German credit dataset was donated by Prof. Dr. Hans Hofmann of the University of Hamburg in 1994 and contains 1000 data points reflecting bank customers. The goal is to classify people as a good or bad credit risk based on 20 personal, demographic and financial features. The dataset is available at the UCI repository as Statlog (German Credit Data) Data Set.

Motivation of Risk Prediction

Customers who do not repay the distributed loan on time represent an enormous risk for a bank: First, because they create an unintended gap in the bank’s planning, and second, because collecting the repayment amount costs the bank additional time and money.

On the other hand, loans (and the interest paid on them) are an important revenue stream for banks. If a person’s loan application is rejected, even though they would have met the repayment deadlines, revenue is lost, as well as potential upselling opportunities.

Banks are therefore highly interested in a risk prediction model that accurately predicts the risk of future customers. This is where supervised learning models come into play.

Data Overview

n = 1,000 observations of bank customers

credit_risk: is the customer a good or bad credit risk?
age: age in years
amount: amount asked by applicant
credit_history: past credit history of applicant at this bank
duration: duration of the credit in months
employment_duration: present employment since
foreign_worker: is applicant foreign worker?
housing: type of housing (rented, owned, or for free / no payment)
installment_rate: installment rate in percentage of disposable income
job: current job information
number_credits: number of existing credits at this bank
other_debtors: other debtors/guarantors present?
other_installment_plans: other installment plans the applicant is paying
people_liable: number of people being liable to provide maintenance
personal_status_sex: combination of sex and personal status of applicant
present_residence: present residence since
property: properties that applicant has
purpose: reason customer is applying for a loan
savings: savings accounts/bonds at this bank
status: status/balance of checking account at this bank
telephone: is there any telephone registered for this customer?

Preprocessing

We first load the data from the rchallenge package (you may need to install it first) and get a brief overview.

# install.packages("rchallenge")
library("rchallenge")
data("german")
skimr::skim(german)

Data summary
Name	german
Number of rows	1000
Number of columns	21
_______________________
Column type frequency:
factor	18
numeric	3
________________________
Group variables	None

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
status	1	FALSE	4	…: 394, no : 274, …: 269, 0<=: 63
credit_history	1	FALSE	5	no : 530, all: 293, exi: 88, cri: 49
purpose	1	FALSE	10	fur: 280, oth: 234, car: 181, car: 103
savings	1	FALSE	5	unk: 603, …: 183, …: 103, 100: 63
employment_duration	1	FALSE	5	1 <: 339, >= : 253, 4 <: 174, < 1: 172
installment_rate	1	TRUE	4	< 2: 476, 25 : 231, 20 : 157, >= : 136
personal_status_sex	1	FALSE	4	mal: 548, fem: 310, fem: 92, mal: 50
other_debtors	1	FALSE	3	non: 907, gua: 52, co-: 41
present_residence	1	TRUE	4	>= : 413, 1 <: 308, 4 <: 149, < 1: 130
property	1	FALSE	4	bui: 332, unk: 282, car: 232, rea: 154
other_installment_plans	1	FALSE	3	non: 814, ban: 139, sto: 47
housing	1	FALSE	3	ren: 714, for: 179, own: 107
number_credits	1	TRUE	4	1: 633, 2-3: 333, 4-5: 28, >= : 6
job	1	FALSE	4	ski: 630, uns: 200, man: 148, une: 22
people_liable	1	FALSE	2	0 t: 845, 3 o: 155
telephone	1	FALSE	2	no: 596, yes: 404
foreign_worker	1	FALSE	2	no: 963, yes: 37
credit_risk	1	FALSE	2	goo: 700, bad: 300

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
duration	1	20.90	12.06	4	12.0	18.0	24.00	72	▇▇▂▁▁
amount	1	3271.25	2822.75	250	1365.5	2319.5	3972.25	18424	▇▂▁▁▁
age	1	35.54	11.35	19	27.0	33.0	42.00	75	▇▆▃▁▁

Exercises:

Now, we can start building a model. To do so, we need to address the following questions:

What is the problem we are trying to solve?
What is an appropriate learning algorithm?
How do we evaluate “good” performance?

More systematically, in mlr3 these questions are expressed via five components:

The Task definition.
The Learner definition.
The training via $train().
The prediction via $predict().
The evaluation via $score().
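
The five components can be sketched end to end on a small synthetic dataset. This is only an illustrative sketch under stated assumptions: the data frame toy, its columns x1, x2 and y, and the seed are made up and are not part of the exercise, and mlr3 and mlr3learners are assumed to be installed. Unlike the exercise, the sketch trains and predicts on row ids of a single task instead of two separate data frames.

```r
library(mlr3)
library(mlr3learners)

# Hypothetical toy data: two numeric features and a noisy binary target
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))

# 1. Task: data, target column and positive class
task = as_task_classif(toy, target = "y", positive = "good")

# 2. Learner: logistic regression
learner = lrn("classif.log_reg")

# Row ids for a 70 % / 30 % split
train_ids = sample(task$row_ids, round(0.7 * task$nrow))
test_ids = setdiff(task$row_ids, train_ids)

# 3. Train on the training rows only
learner$train(task, row_ids = train_ids)

# 4. Predict on the held-out rows
pred = learner$predict(task, row_ids = test_ids)

# 5. Evaluate with the classification error
score = pred$score(msr("classif.ce"))
score
```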

Split Data in Training and Test Data

Your task is to split the german dataset into 70 % training data and 30 % test data by randomly sampling rows. Later, we will use the training data to learn an ML model and use the test data to assess its performance.

Recap: Why do we need train and test data?

We use part of the available data (the training data) to train our model. The remaining/hold-out data (test data) is used to evaluate the trained model. This is exactly how we anticipate using the model in practice: We want to fit the model to existing data and then make predictions on new, unseen data points for which we do not know the outcome/target values.

Note: Hold-out splitting requires a dataset that is sufficiently large such that both the training and test dataset are suitable representations of the target population. What “sufficiently large” means depends on the dataset at hand and the complexity of the problem.

The ratio of training to test data is also context dependent. In practice, a 70% to 30% (~ 2:1) ratio is a good starting point.
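
The mechanics of a random hold-out split can be illustrated with base R alone. The following sketch uses the built-in mtcars data purely as a stand-in (it is not the dataset of this exercise); the 70 % ratio and the seed are the values suggested in this sheet.

```r
# Stand-in dataset shipped with R, used only to illustrate the mechanics
set.seed(100L)
ids = row.names(mtcars)

# Draw 70 % of the row ids at random for training; the rest is test data
train_ids = sample(ids, round(0.7 * length(ids)))
test_ids = setdiff(ids, train_ids)

# Build the two datasets by indexing with the sampled ids
train_set = mtcars[train_ids, ]
test_set = mtcars[test_ids, ]

# Every row ends up in exactly one of the two sets
c(train = nrow(train_set), test = nrow(test_set))
```

Because the ids are drawn without replacement and the test ids are the set difference, the two sets cannot overlap.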

Hint 1:

Use sample() to sample 70 % of the data ids as training data ids from row.names(german). The remaining row ids are obtained via setdiff(). Based on the ids, set up two datasets, one for training and one for testing/evaluating.

Set a seed (e.g., set.seed(100L)) to make your results reproducible.

Hint 2:

# Sample ids for training and test split
set.seed(100L)
train_ids = sample(row.names(german), 0.7*nrow(...))
test_ids = setdiff(..., train_ids)

# Create two datasets based on ids
train_set = german[...,]
test_set = german[...,]

Create a Classification Task

Install and load the mlr3verse package, which is a collection of multiple add-on packages in the mlr3 universe (if installing mlr3verse fails, try to install and load only the mlr3 and mlr3learners packages). Then, create a classification task using the training data as input and credit_risk as the target variable. By defining an mlr3 task, we conceptualize the ML problem we want to solve, here a classification problem. Make sure you properly specify the positive class, i.e., the class label for which we would like to predict probabilities. Use good as the positive class if you are interested in predicting the probability that a customer is creditworthy.

Hint 1:

Use e.g. as_task_classif() to create a classification task.

Hint 2:

library(mlr3verse)
task = as_task_classif(x = ..., target = ..., ... = "good")
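
As a hedged illustration of what a classification task stores, the sketch below builds a task from a made-up data frame (toy, with columns x1, x2 and y, all hypothetical) and inspects a few of its fields. It assumes the mlr3 package is installed and does not use the german data.

```r
library(mlr3)

# Hypothetical toy data, not the german dataset
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))

# Wrap the data in a task and declare which class counts as positive
toy_task = as_task_classif(toy, target = "y", positive = "good")

toy_task$target_names   # name of the target column
toy_task$feature_names  # all remaining columns are used as features
toy_task$positive       # class that predicted probabilities refer to
toy_task$negative       # the other class
toy_task$nrow           # number of observations
```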

Train a Model on the Training Dataset

The created Task contains the data we want to work with. Now that we have conceptualized the ML task (i.e., classification) in a Task object, it is time to train our first supervised learning method. We start with a simple classifier: a logistic regression model. During this course, you will, of course, also gain experience with more complex models.

Fit a logistic regression model to the german_credit training task.

Hint 1:

Use lrn() to initialize a Learner object. The key for logistic regression, which you pass to lrn() as input, is "classif.log_reg".

To train a model, use the $train() method of your instantiated learner with the task of the previous exercise as an input.

Hint 2:

logreg = lrn("classif.log_reg")
logreg$train(...)
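
To see what training changes, the following sketch compares a learner before and after $train() on made-up data (the data frame toy and its columns are hypothetical). It assumes mlr3 and mlr3learners are installed.

```r
library(mlr3)
library(mlr3learners)

# Hypothetical toy data, not the german dataset
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))
toy_task = as_task_classif(toy, target = "y", positive = "good")

toy_learner = lrn("classif.log_reg")

# Before training, the learner holds no fitted model
untrained = is.null(toy_learner$model)

# $train() fits the model and stores it inside the learner object
toy_learner$train(toy_task)

# After training, the fitted object is available in the model field
class(toy_learner$model)
```

The learner is modified in place, so there is no need to assign the result of $train() to a new variable.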

Inspect the Model

Have a look at the coefficients by using summary(). Name at least two features that have a significant effect on the outcome.

Hint 1:

Use the summary() method on the model field of the trained learner. By looking at task$positive, we can see which of the two classes good or bad is used as the positive class (i.e., the class to which the model predictions will refer).

Hint 2:

summary(yourmodel$model)
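
One possible way to read significance off the summary is to extract the coefficient table and filter it by p-value. The sketch below does this on made-up data: the data frame toy, the feature named noise, and the 0.05 threshold are assumptions chosen for illustration. It assumes mlr3 and mlr3learners are installed.

```r
library(mlr3)
library(mlr3learners)

# Hypothetical toy data: x1 and x2 drive the target, noise does not
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))
toy_task = as_task_classif(toy, target = "y", positive = "good")

toy_learner = lrn("classif.log_reg")
toy_learner$train(toy_task)

# Coefficient table: estimate, standard error, z value and p-value per term
coefs = coef(summary(toy_learner$model))

# Keep only the terms whose p-value is below the assumed 5 % level
significant = coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]
significant
```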

Predict on the Test Dataset

Use the trained model to predict on the hold-out/test dataset.

Hint 1:

Since we have a new tabular dataset as an input (and not a task), we need to use $predict_newdata() (instead of $predict()) to derive a PredictionClassif object.

Hint 2:

pred = yourmodel$predict_newdata(...)
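
The difference between predicting on a task and predicting on a plain data frame can be sketched on made-up data. The data frame toy and the fixed row ranges used for the split are hypothetical; mlr3 and mlr3learners are assumed to be installed.

```r
library(mlr3)
library(mlr3learners)

# Hypothetical toy data, split by position into training and test rows
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))
toy_train = toy[1:140, ]
toy_test = toy[141:200, ]

# The task only knows the training rows
toy_task = as_task_classif(toy_train, target = "y", positive = "good")
toy_learner = lrn("classif.log_reg")
toy_learner$train(toy_task)

# The test rows are a plain data.frame, so $predict_newdata() is used
toy_pred = toy_learner$predict_newdata(toy_test)

# Predicted labels, true labels, and their cross-tabulation
head(toy_pred$response)
head(toy_pred$truth)
toy_pred$confusion
```

Because toy_test still contains the target column y, the prediction object also carries the true labels, which is what makes scoring possible later.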

Evaluation

What is the classification error on the test data (300 observations)?

Hint 1:

The classification error gives the rate of observations that were misclassified. Use the $score() method on the corresponding PredictionClassif object of the previous exercise.

Hint 2:

pred$score()
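
As a small worked example of the definition, the classification error can be computed by hand on five made-up labels. The vectors truth and response are hypothetical. The second computation uses mlr3measures::ce(), from the package that mlr3 relies on for its measures, and should give the same number.

```r
# Hypothetical labels for five customers (made up for illustration)
lv = c("good", "bad")
truth = factor(c("good", "good", "bad", "good", "bad"), levels = lv)
response = factor(c("good", "bad", "bad", "good", "good"), levels = lv)

# Classification error: share of observations where prediction and truth differ
ce_manual = mean(truth != response)
ce_manual  # 2 of 5 labels differ, i.e. 0.4

# Same quantity via the mlr3measures package
ce_pkg = mlr3measures::ce(truth, response)
ce_pkg
```

Calling $score() without arguments on a classification prediction uses the measure classif.ce, i.e. this misclassification rate.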

Predicting probabilities instead of labels

Similarly, we can assess the performance of our model using the AUC, which requires predicted probabilities instead of predicted labels. Retrain the model with a learner that returns probabilities and evaluate it using the AUC.

Hint 1:

You can generate predictions with probabilities by specifying a predict_type argument inside the lrn() function call when constructing a learner.

Hint 2:

You can get an overview of performance measures in mlr3 using as.data.table(msr()).
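
A minimal sketch of probability predictions and AUC scoring, again on made-up data (the data frame toy and the positional split are hypothetical; mlr3 and mlr3learners are assumed to be installed):

```r
library(mlr3)
library(mlr3learners)

# Hypothetical toy data, split by position into training and test rows
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))
toy_task = as_task_classif(toy[1:140, ], target = "y", positive = "good")

# Request probabilities instead of labels when constructing the learner
prob_learner = lrn("classif.log_reg", predict_type = "prob")
prob_learner$train(toy_task)
prob_pred = prob_learner$predict_newdata(toy[141:200, ])

# One column of predicted probabilities per class
head(prob_pred$prob)

# The AUC is computed from the probabilities of the positive class
auc = prob_pred$score(msr("classif.auc"))
auc
```

With the default predict_type, the prediction object holds no probabilities, so a probability-based measure such as the AUC cannot be computed from it.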

Summary

In this exercise sheet we learned how to fit a logistic regression model on a training task and how to assess its performance on unseen test data with the help of mlr3. We showed how to split data manually into training and test data, but in most scenarios the splitting is handled by a call to resample() or benchmark(). We will learn more on this in the next sections.
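
As a preview only, the manual hold-out split could look roughly as follows when delegated to resample(). The data frame toy is made up, the ratio of 0.7 mirrors the split used in this sheet, and mlr3 and mlr3learners are assumed to be installed.

```r
library(mlr3)
library(mlr3learners)

# Hypothetical toy data, not the german dataset
set.seed(1)
n = 200
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y = factor(ifelse(toy$x1 - toy$x2 + rnorm(n) > 0, "good", "bad"))
toy_task = as_task_classif(toy, target = "y", positive = "good")

# Hold-out resampling: 70 % of the rows for training, 30 % for testing
holdout = rsmp("holdout", ratio = 0.7)

# resample() performs the split, the training and the prediction in one call
rr = resample(toy_task, lrn("classif.log_reg"), holdout)

# Classification error on the held-out rows
holdout_ce = rr$aggregate(msr("classif.ce"))
holdout_ce
```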