A Pipeline for the Titanic Data Set - Basics

Build a graph.

Author

Florian Pfisterer

Published

March 12, 2020

Intro

We load the mlr3verse package, which pulls in the most important packages for this example. The mlr3learners package provides additional learners. The data is part of the mlr3data package.

library(mlr3verse)
library(mlr3learners)

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output concise.

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")

The Titanic data is interesting to analyze, even though it is part of many tutorials and showcases. This is because it requires many of the steps that real-world applications of machine learning typically need, such as missing value imputation and the handling of factors.

The following features are illustrated in this use case section:

  • Summarizing the data set
  • Visualizing data
  • Splitting data into train and test data sets
  • Defining a task and a learner

Exploratory Data Analysis

With the dataset, we get an explanation of the meanings of the different variables:

Variables Description
survived Survival
name Name
age Age
sex Sex
sib_sp Number of siblings / spouses aboard
parch Number of parents / children aboard
fare Amount paid for the ticket
pclass Passenger class
embarked Port of embarkation
ticket Ticket number
cabin Cabin

We can use the skimr package in order to get a first overview of the data:

data("titanic", package = "mlr3data")

skimr::skim(titanic)
Data summary
Name titanic
Number of rows 1309
Number of columns 11
_______________________
Column type frequency:
character 3
factor 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 12 82 0 1307 0
ticket 0 1.00 3 18 0 929 0
cabin 1014 0.23 1 15 0 186 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
survived 418 0.68 FALSE 2 no: 549, yes: 342
pclass 0 1.00 TRUE 3 3: 709, 1: 323, 2: 277
sex 0 1.00 FALSE 2 mal: 843, fem: 466
embarked 2 1.00 FALSE 3 S: 914, C: 270, Q: 123

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
age 263 0.8 29.88 14.41 0.17 21.0 28.00 39.00 80.00 ▂▇▅▂▁
sib_sp 0 1.0 0.50 1.04 0.00 0.0 0.00 1.00 8.00 ▇▁▁▁▁
parch 0 1.0 0.39 0.87 0.00 0.0 0.00 0.00 9.00 ▇▁▁▁▁
fare 1 1.0 33.30 51.76 0.00 7.9 14.45 31.27 512.33 ▇▁▁▁▁
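The n_missing column above can be cross-checked without skimr. The following is a minimal sketch using base R only; the variable name na_counts is chosen here for illustration and is not part of the original post.

```r
# Load the data shipped with mlr3data (same call as above).
data("titanic", package = "mlr3data")

# Count missing values per column using base R only.
na_counts = colSums(is.na(titanic))

# Show only the columns that actually contain missing values.
na_counts[na_counts > 0]
```

The counts should agree with the n_missing values reported by skimr::skim(), for example 263 for age and 1014 for cabin.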

We can now create a Task from our data. As we want to classify whether the person survived or not, we will create a TaskClassif. We’ll set aside the rows without a survival label (the Kaggle test set) for now and come back to them later.

A first model

In order to obtain solutions comparable to official leaderboards, such as the ones available from Kaggle, we split the data into a training and a validation set before doing any further analysis. Here we use the predefined split used by Kaggle.

task = as_task_classif(titanic, target = "survived", positive = "yes")
task$set_row_roles(892:1309, "holdout")
task
<TaskClassif:titanic> (891 x 11)
* Target: survived
* Properties: twoclass
* Features (10):
  - chr (3): cabin, name, ticket
  - dbl (2): age, fare
  - fct (2): embarked, sex
  - int (2): parch, sib_sp
  - ord (1): pclass
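Depending on the installed mlr3 version, the "holdout" row role may not be available. As a hedged alternative, the sketch below restricts the task to the labelled rows with $filter() instead. The object name task_alt is an assumption made for illustration, and unlike a row role this approach removes rows 892 to 1309 from the task rather than keeping them aside.

```r
library(mlr3)
data("titanic", package = "mlr3data")

# Illustrative alternative to the "holdout" row role: keep only rows 1 to 891.
task_alt = as_task_classif(titanic, target = "survived", positive = "yes")
task_alt$filter(1:891)

# Number of rows left in the task.
task_alt$nrow

# Number of missing target values among the remaining rows.
task_alt$missings("survived")
```

If the split matches the one used by Kaggle, no missing values should remain in the target after filtering.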

Our Task currently has 3 features of type character, which we don’t really know how to handle: “cabin”, “name” and “ticket”. Additionally, our skimr::skim() of the data has shown that they have many unique values (up to 891).

We’ll drop them for now and see how we can deal with them later on.

task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))
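Instead of listing the columns by name, the character features can also be identified by their type. This is a small sketch under the assumption that all character columns should be dropped; task_chr, ft and keep are names introduced here for illustration.

```r
library(mlr3)
data("titanic", package = "mlr3data")
task_chr = as_task_classif(titanic, target = "survived", positive = "yes")

# feature_types holds one row per feature, with the columns id and type.
ft = task_chr$feature_types

# Keep every feature that is not stored as a character column.
keep = ft$id[ft$type != "character"]
task_chr$select(keep)

task_chr$feature_names
```

Selecting by type yields the same seven features as the name-based call, and it keeps working if further character columns are added to the data.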

Additionally, we create a resampling instance, so that all learners are compared on the same train/test splits.

cv3 = rsmp("cv", folds = 3L)$instantiate(task)
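To see what instantiating a resampling does, the row ids of a single fold can be inspected. The sketch below is self-contained and therefore rebuilds the task; it uses $filter(1:891) rather than a row role, and the names task_cv and cv3_demo are assumptions made for this illustration.

```r
library(mlr3)
data("titanic", package = "mlr3data")

task_cv = as_task_classif(titanic, target = "survived", positive = "yes")
task_cv$filter(1:891)

set.seed(7832)
cv3_demo = rsmp("cv", folds = 3L)$instantiate(task_cv)

# Row ids used for training and testing in the first fold.
train_ids = cv3_demo$train_set(1)
test_ids = cv3_demo$test_set(1)

length(train_ids)
length(test_ids)

# Train and test rows of one fold must not overlap.
length(intersect(train_ids, test_ids))
```

Because the splits are fixed once the resampling is instantiated, every learner passed to resample() with this object is evaluated on identical folds.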

To get a first impression of what performance we can expect, we fit a simple decision tree:

learner = mlr_learners$get("classif.rpart")
# or shorter:
learner = lrn("classif.rpart")

rr = resample(task, learner, cv3, store_models = TRUE)

rr$aggregate(msr("classif.acc"))
classif.acc 
  0.8013468 

So any further model should reach an accuracy above 0.80 in order to improve over the simple decision tree. To improve further, we might need to do some feature engineering.
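To put the accuracy of the tree into context, it can help to compare it against a learner that ignores all features. The sketch below uses classif.featureless, which by default predicts the most frequent class; since 549 of the 891 labelled passengers did not survive, its accuracy should be close to 0.62. The object names (task_bl, cv3_bl, rr_bl, rr_tree) are illustrative, and the task is rebuilt with $filter() so that the block runs on its own.

```r
library(mlr3)
data("titanic", package = "mlr3data")

task_bl = as_task_classif(titanic, target = "survived", positive = "yes")
task_bl$filter(1:891)
task_bl$select(c("age", "fare", "embarked", "sex", "parch", "sib_sp", "pclass"))

set.seed(7832)
cv3_bl = rsmp("cv", folds = 3L)$instantiate(task_bl)

# Baseline: always predict the majority class of the training fold.
rr_bl = resample(task_bl, lrn("classif.featureless"), cv3_bl)
acc_bl = rr_bl$aggregate(msr("classif.acc"))
acc_bl

# Decision tree on the same folds, with the accuracy of each fold.
rr_tree = resample(task_bl, lrn("classif.rpart"), cv3_bl)
scores = rr_tree$score(msr("classif.acc"))
scores$classif.acc
```

Looking at the per-fold values also gives a rough idea of how much the aggregated accuracy varies between splits.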

Optimizing the model

If we now try to fit a ‘ranger’ random forest model, we will get an error, as ‘ranger’ models cannot natively handle missing values.

learner = lrn("classif.ranger", num.trees = 250, min.node.size = 4)

rr = resample(task, learner, cv3, store_models = TRUE)
Error: Task 'titanic' has missing values in column(s) 'age', 'embarked', but learner 'classif.ranger' does not support this
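The error can be anticipated, because each learner declares what it supports in its $properties field. The following sketch checks for the "missings" property; the variable names are illustrative, and the result for ranger depends on the installed versions of mlr3learners and ranger, so it may differ from the behaviour shown above.

```r
library(mlr3)
library(mlr3learners)

# rpart handles missing values itself (via surrogate splits).
rpart_ok = "missings" %in% lrn("classif.rpart")$properties
rpart_ok

# For ranger, the answer depends on the installed package versions.
ranger_ok = "missings" %in% lrn("classif.ranger")$properties
ranger_ok
```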

This means we have to find a way to impute the missing values. To learn how to use more advanced commands of the mlr3pipelines package see: