library(mlr3verse)
library(mlr3learners)
Intro
We load the mlr3verse package, which pulls in the most important packages for this example. The mlr3learners package loads additional learners. The data is part of the mlr3data package.
We initialize the random number generator with a fixed seed for reproducibility and decrease the verbosity of the logger to keep the output concise.
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
The titanic data is very interesting to analyze, even though it is part of many tutorials and showcases. This is because it involves many of the steps often required in real-world applications of machine learning techniques, such as missing value imputation and the handling of factor variables.
The following features are illustrated in this use case section:
- Summarizing the data set
- Visualizing data
- Splitting data into train and test data sets
- Defining a task and a learner
Exploratory Data Analysis
The dataset comes with an explanation of the meanings of the different variables:
Variable | Description |
---|---|
survived | Survival |
name | Name |
age | Age |
sex | Sex |
sib_sp | Number of siblings / spouses aboard |
parch | Number of parents / children aboard |
fare | Amount paid for the ticket |
pclass | Passenger class |
embarked | Port of embarkation |
ticket | Ticket number |
cabin | Cabin |
We can use the skimr package in order to get a first overview of the data:
data("titanic", package = "mlr3data")
skimr::skim(titanic)
Name | titanic |
---|---|
Number of rows | 1309 |
Number of columns | 11 |
Column type frequency: | |
character | 3 |
factor | 4 |
numeric | 4 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 12 | 82 | 0 | 1307 | 0 |
ticket | 0 | 1.00 | 3 | 18 | 0 | 929 | 0 |
cabin | 1014 | 0.23 | 1 | 15 | 0 | 186 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
survived | 418 | 0.68 | FALSE | 2 | no: 549, yes: 342 |
pclass | 0 | 1.00 | TRUE | 3 | 3: 709, 1: 323, 2: 277 |
sex | 0 | 1.00 | FALSE | 2 | mal: 843, fem: 466 |
embarked | 2 | 1.00 | FALSE | 3 | S: 914, C: 270, Q: 123 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
age | 263 | 0.8 | 29.88 | 14.41 | 0.17 | 21.0 | 28.00 | 39.00 | 80.00 | ▂▇▅▂▁ |
sib_sp | 0 | 1.0 | 0.50 | 1.04 | 0.00 | 0.0 | 0.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
parch | 0 | 1.0 | 0.39 | 0.87 | 0.00 | 0.0 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
fare | 1 | 1.0 | 33.30 | 51.76 | 0.00 | 7.9 | 14.45 | 31.27 | 512.33 | ▇▁▁▁▁ |
We can now create a Task from our data. As we want to classify whether the person survived or not, we will create a TaskClassif. We'll ignore the 'titanic_test' data for now and come back to it later.
A first model
In order to obtain results comparable to official leaderboards, such as the one hosted on Kaggle, we split the data into train and validation sets before doing any further analysis. Here we are using the predefined split used by Kaggle.
task = as_task_classif(titanic, target = "survived", positive = "yes")
task$set_row_roles(892:1309, "holdout")
task
<TaskClassif:titanic> (891 x 11)
* Target: survived
* Properties: twoclass
* Features (10):
- chr (3): cabin, name, ticket
- dbl (2): age, fare
- fct (2): embarked, sex
- int (2): parch, sib_sp
- ord (1): pclass
Our Task currently has 3 features of type character, which we don't really know how to handle: "cabin", "name" and "ticket". Additionally, from our skimr::skim() of the data, we have seen that they have many unique values (up to 891).
We’ll drop them for now and see how we can deal with them later on.
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))
Additionally, we create a resampling instance that allows us to compare different learners on identical train/test splits.
= rsmp("cv", folds = 3L)$instantiate(task) cv3
To get a first impression of what performance we can expect, we fit a simple decision tree:
learner = mlr_learners$get("classif.rpart")
# or shorter:
learner = lrn("classif.rpart")

rr = resample(task, learner, cv3, store_models = TRUE)

rr$aggregate(msr("classif.acc"))
classif.acc
0.8013468
So any model we build should achieve an accuracy above 0.80 in order to improve over the simple decision tree. To improve further, we might need to do some feature engineering.
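Beyond the aggregated accuracy, the ResampleResult lets us inspect the predictions pooled over all folds; for instance (output not shown here), the confusion matrix indicates whether the tree errs more on survivors or non-survivors:

rr$prediction()$confusion # pooled confusion matrix over the three folds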
Optimizing the model
If we now try to fit a 'ranger' random forest model, we will get an error, as 'ranger' models cannot natively handle missing values.
= lrn("classif.ranger", num.trees = 250, min.node.size = 4)
learner
= resample(task, learner, cv3, store_models = TRUE) rr
Error: Task 'titanic' has missing values in column(s) 'age', 'embarked', but learner 'classif.ranger' does not support this
This means we have to find a way to impute the missing values. The mlr3pipelines package provides preprocessing operators for exactly this; see the mlr3pipelines documentation to learn about its more advanced features.
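As a minimal sketch of one possible approach (the specific imputation operators chosen here are an assumption, not necessarily what the original tutorial uses), we can impute numeric features with their median and factor features with their mode, and wrap the whole pipeline as a single learner:

library(mlr3pipelines)

# Median-impute numeric columns (age), mode-impute factor columns (embarked),
# then pass the completed data on to the random forest:
graph_learner = as_learner(
  po("imputemedian") %>>%
    po("imputemode") %>>%
    lrn("classif.ranger", num.trees = 250, min.node.size = 4)
)

rr = resample(task, graph_learner, cv3, store_models = TRUE)
rr$aggregate(msr("classif.acc"))

Because the imputation steps are part of the GraphLearner, they are re-fitted within each resampling fold, which avoids leaking information from the test folds into the imputation.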