Intro

We load the mlr3verse package which pulls in the most important packages for this example. The mlr3learners package loads additional learners. The data is part of the mlr3data package.

library(mlr3verse)
library(mlr3learners)

We initialize the random number generator with a fixed seed for reproducibility, and raise the logger threshold to "warn" to keep the output concise.

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")

The titanic data remains interesting to analyze, even though it appears in many tutorials and showcases. It requires several steps that are common in real-world machine learning applications, such as missing value imputation and factor handling.

The following features are illustrated in this use case section:

Summarizing the data set
Visualizing data
Splitting data into train and test data sets
Defining a task and a learner

Exploratory Data Analysis

The dataset comes with a description of its variables:

Variables	Description
`survived`	Survival
`name`	Name
`age`	Age
`sex`	Sex
`sib_sp`	Number of siblings / spouses aboard
`parch`	Number of parents / children aboard
`fare`	Amount paid for the ticket
`pclass`	Passenger class
`embarked`	Port of embarkation
`ticket`	Ticket number
`cabin`	Cabin

We can use the skimr package in order to get a first overview of the data:

data("titanic", package = "mlr3data")

skimr::skim(titanic)

Data summary
Name	titanic
Number of rows	1309
Number of columns	11
_______________________
Column type frequency:
character	3
factor	4
numeric	4
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
name	0	1.00	12	82	1307
ticket	0	1.00	3	18	929
cabin	1014	0.23	1	15	186

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
survived	418	0.68	FALSE	2	no: 549, yes: 342
pclass	0	1.00	TRUE	3	3: 709, 1: 323, 2: 277
sex	0	1.00	FALSE	2	mal: 843, fem: 466
embarked	2	1.00	FALSE	3	S: 914, C: 270, Q: 123

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
age	263	0.8	29.88	14.41	0.17	21.0	28.00	39.00	80.00	▂▇▅▂▁
sib_sp	0	1.0	0.50	1.04	0.00	0.0	0.00	1.00	8.00	▇▁▁▁▁
parch	0	1.0	0.39	0.87	0.00	0.0	0.00	0.00	9.00	▇▁▁▁▁
fare	1	1.0	33.30	51.76	0.00	7.9	14.45	31.27	512.33	▇▁▁▁▁
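The missing-value counts reported in the summary can also be checked directly in base R. The following is a minimal sketch, assuming the mlr3data package is installed; the variable name `n_missing` is introduced here only for illustration.

```r
# Load the data exactly as in the tutorial.
data("titanic", package = "mlr3data")

# Count missing values per column.
n_missing = colSums(is.na(titanic))

# Show only the columns that actually contain missing values.
n_missing[n_missing > 0]
```

The columns listed here are the ones that need attention before fitting learners that cannot handle missing values.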

We can now create a Task from our data. As we want to classify whether the person survived or not, we will create a TaskClassif. We’ll ignore the ‘titanic_test’ data for now and come back to it later.

A first model

To obtain results comparable to official leaderboards, such as the ones on Kaggle, we split the data into a training set and a held-out set before doing any further analysis. Here we use the predefined split used by Kaggle: the first 891 rows are used for training, and the remaining rows, for which `survived` is missing, are set aside.

task = as_task_classif(titanic, target = "survived", positive = "yes")
task$set_row_roles(892:1309, "holdout")
task

<TaskClassif:titanic> (891 x 11)
* Target: survived
* Properties: twoclass
* Features (10):
  - chr (3): cabin, name, ticket
  - dbl (2): age, fare
  - fct (2): embarked, sex
  - int (2): parch, sib_sp
  - ord (1): pclass

Our Task currently has \(3\) features of type character, which we don’t really know how to handle yet: `cabin`, `name` and `ticket`. Additionally, from our skimr::skim() of the data, we have seen that they have many unique values (up to 1307 in the full data).

We’ll drop them for now and see how we can deal with them later on.

task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

Additionally, we create a resampling instance so that different learners can be compared on exactly the same train/test splits.

cv3 = rsmp("cv", folds = 3L)$instantiate(task)
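To see what an instantiated resampling holds, the row ids of each fold can be inspected. The sketch below is an illustration only: it builds a separate task (`task_demo`, a hypothetical name) from the first 891 rows, assuming these are the labelled Kaggle training rows, rather than reusing the `task` object above.

```r
library(mlr3)
data("titanic", package = "mlr3data")

# Assumption: rows 1 to 891 are the labelled training rows.
task_demo = as_task_classif(titanic[1:891, ], target = "survived",
  positive = "yes", id = "titanic_demo")

# Instantiating fixes the assignment of rows to folds.
cv_demo = rsmp("cv", folds = 3L)$instantiate(task_demo)

# Row ids used for training and testing in the first iteration.
train_ids = cv_demo$train_set(1)
test_ids = cv_demo$test_set(1)

length(train_ids)
length(test_ids)

# Train and test sets of one iteration do not overlap.
length(intersect(train_ids, test_ids))
```

Because the instance is fixed, passing it to several `resample()` calls evaluates each learner on identical splits.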

To get a first impression of the performance we can expect, we fit a simple decision tree:

learner = mlr_learners$get("classif.rpart")
# or shorter:
learner = lrn("classif.rpart")

rr = resample(task, learner, cv3, store_models = TRUE)

rr$aggregate(msr("classif.acc"))

classif.acc 
  0.8013468

So any further model should reach an accuracy above 0.80 to improve over the simple decision tree. To improve beyond that, we might need to do some feature engineering.
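The aggregated accuracy averages over the three folds, so it can help to look at the per-fold scores as well. This is a minimal sketch under the same assumptions as the tutorial code (first 891 rows as training data, character columns dropped); `task_demo`, `rr_demo` and `scores` are names introduced here for illustration, and the exact values depend on the random fold assignment.

```r
library(mlr3)
data("titanic", package = "mlr3data")
set.seed(7832)

# Assumption: rows 1 to 891 are the labelled training rows.
task_demo = as_task_classif(titanic[1:891, ], target = "survived",
  positive = "yes", id = "titanic_demo")

# Drop the character features, which rpart cannot use directly.
task_demo$select(setdiff(task_demo$feature_names, c("cabin", "name", "ticket")))

# Resample a decision tree with 3-fold cross-validation.
rr_demo = resample(task_demo, lrn("classif.rpart"), rsmp("cv", folds = 3L))

# One accuracy value per fold, instead of the aggregated mean.
scores = rr_demo$score(msr("classif.acc"))
scores$classif.acc
```

A large spread between folds would suggest that small differences in aggregated accuracy between learners should be interpreted with care.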

Optimizing the model

If we now try to fit a ‘ranger’ random forest model, we get an error, because this learner cannot handle missing values.

learner = lrn("classif.ranger", num.trees = 250, min.node.size = 4)

rr = resample(task, learner, cv3, store_models = TRUE)

Error: Task 'titanic' has missing values in column(s) 'age', 'embarked', but learner 'classif.ranger' does not support this
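Before resampling, one can check which features contain missing values and whether a learner declares support for them. The following is a sketch, again on a separately built task (`task_demo`, a hypothetical name) from the first 891 rows; it uses `classif.rpart`, which ships with mlr3, as the example learner.

```r
library(mlr3)
data("titanic", package = "mlr3data")

# Assumption: rows 1 to 891 are the labelled training rows.
task_demo = as_task_classif(titanic[1:891, ], target = "survived",
  positive = "yes", id = "titanic_demo")
task_demo$select(setdiff(task_demo$feature_names, c("cabin", "name", "ticket")))

# Number of missing values per column of the task.
miss = task_demo$missings()
miss[miss > 0]

# Learners that can handle missing values list "missings" among their properties.
rpart_ok = "missings" %in% lrn("classif.rpart")$properties
rpart_ok
```

The same property check can be run on `lrn("classif.ranger")` once mlr3learners is loaded; whether "missings" is listed may depend on the installed package versions.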

This means we have to find a way to impute the missing values. To learn how to use more advanced commands of the mlr3pipelines package see:

Part II - Pipelines