# A Pipeline for the Titanic Data Set - Basics

Build a graph.

Author

Florian Pfisterer

Published

March 12, 2020

## Intro

We load the mlr3verse package which pulls in the most important packages for this example. The mlr3learners package loads additional learners. The data is part of the mlr3data package.

library(mlr3verse)
library(mlr3learners)

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output uncluttered.

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")

The Titanic data set is interesting to analyze even though it appears in many tutorials and showcases, because it requires many steps that are common in real-world applications of machine learning, such as missing value imputation and handling factors.

The following features are illustrated in this use case section:

• Summarizing the data set
• Visualizing data
• Splitting data into train and test data sets
• Defining a task and a learner

## Exploratory Data Analysis

With the dataset, we get an explanation of the meanings of the different variables:

| Variable | Description |
|---|---|
| survived | Survival |
| name | Name |
| age | Age |
| sex | Sex |
| sib_sp | Number of siblings / spouses aboard |
| parch | Number of parents / children aboard |
| fare | Amount paid for the ticket |
| pclass | Passenger class |
| embarked | Port of embarkation |
| ticket | Ticket number |
| cabin | Cabin |

We can use the skimr package to get a first overview of the data:

data("titanic", package = "mlr3data")
skimr::skim(titanic)

Data summary:

| | |
|---|---|
| Name | titanic |
| Number of rows | 1309 |
| Number of columns | 11 |
| Column type frequency: | |
| character | 3 |
| factor | 4 |
| numeric | 4 |
| Group variables | None |

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 12 | 82 | 0 | 1307 | 0 |
| ticket | 0 | 1.00 | 3 | 18 | 0 | 929 | 0 |
| cabin | 1014 | 0.23 | 1 | 15 | 0 | 186 | 0 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| survived | 418 | 0.68 | FALSE | 2 | no: 549, yes: 342 |
| pclass | 0 | 1.00 | TRUE | 3 | 3: 709, 1: 323, 2: 277 |
| sex | 0 | 1.00 | FALSE | 2 | mal: 843, fem: 466 |
| embarked | 2 | 1.00 | FALSE | 3 | S: 914, C: 270, Q: 123 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 263 | 0.8 | 29.88 | 14.41 | 0.17 | 21.0 | 28.00 | 39.00 | 80.00 | ▂▇▅▂▁ |
| sib_sp | 0 | 1.0 | 0.50 | 1.04 | 0.00 | 0.0 | 0.00 | 1.00 | 8.00 | ▇▁▁▁▁ |
| parch | 0 | 1.0 | 0.39 | 0.87 | 0.00 | 0.0 | 0.00 | 0.00 | 9.00 | ▇▁▁▁▁ |
| fare | 1 | 1.0 | 33.30 | 51.76 | 0.00 | 7.9 | 14.45 | 31.27 | 512.33 | ▇▁▁▁▁ |

We can now create a Task from our data. As we want to classify whether the person survived or not, we will create a TaskClassif. We’ll ignore the ‘titanic_test’ data for now and come back to it later.

## A first model

In order to obtain solutions comparable to official leaderboards, such as the ones available from Kaggle, we split the data into a train and a validation set before doing any further analysis. Here we are using the predefined split used by Kaggle.

task = as_task_classif(titanic, target = "survived", positive = "yes")
task$set_row_roles(892:1309, "holdout")
task
<TaskClassif:titanic> (891 x 11)
* Target: survived
* Properties: twoclass
* Features (10):
- chr (3): cabin, name, ticket
- dbl (2): age, fare
- fct (2): embarked, sex
- int (2): parch, sib_sp
- ord (1): pclass
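The predefined split can be checked directly on the data. The following is a minimal sketch in base R, assuming the mlr3data package is installed; the variable names `labelled`, `train_rows` and `holdout_rows` are illustrative and not part of the original analysis. It relies on the fact that the rows without a `survived` label are the ones moved to the holdout role above.

```r
# Assumes the mlr3data package is installed
data("titanic", package = "mlr3data")

# Rows with a known outcome form the training data,
# rows with a missing outcome form the holdout data
labelled = !is.na(titanic$survived)
train_rows = which(labelled)
holdout_rows = which(!labelled)

# First and last row index of each part
range(train_rows)
range(holdout_rows)

# Class balance among the labelled rows
table(titanic$survived[train_rows])
```

If the two ranges do not overlap and the holdout range matches `892:1309`, the row roles set above agree with the labels in the data.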

Our Task currently has 3 features of type character, which we don’t really know how to handle: “cabin”, “name” and “ticket”. Additionally, from our skimr::skim() of the data, we have seen that they have many unique values (up to 1307 for “name”).

We’ll drop them for now and see how we can deal with them later on.

task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

Additionally, we create a resampling instance so that different learners can be compared on the same train/test splits.

cv3 = rsmp("cv", folds = 3L)$instantiate(task)

To get a first impression of what performance we can expect, we fit a simple decision tree:

learner = mlr_learners$get("classif.rpart")
# or shorter:
learner = lrn("classif.rpart")

rr = resample(task, learner, cv3, store_models = TRUE)

rr$aggregate(msr("classif.acc"))
classif.acc
0.8013468 

So a model needs an accuracy above 0.80 in order to improve over the simple decision tree. To improve further, we might need to do some feature engineering.
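To make the role of the instantiated resampling more concrete, here is a small self-contained sketch, assuming mlr3 and mlr3data are installed. It rebuilds the task from the first 891 rows only (a simplification of the holdout role used above; the names `train`, `fold_sizes` and `all_test_rows` are illustrative) and inspects the fixed folds. Because the splits are stored in the instance, every learner resampled with it sees the same rows.

```r
library(mlr3)

data("titanic", package = "mlr3data")

# Simplification: keep only the labelled rows instead of setting row roles
train = titanic[1:891, ]
task = as_task_classif(train, target = "survived", positive = "yes", id = "titanic")

# Instantiating fixes the assignment of rows to folds
set.seed(7832)
cv3 = rsmp("cv", folds = 3L)$instantiate(task)

# Size of the test set in each of the three folds
fold_sizes = vapply(1:3, function(i) length(cv3$test_set(i)), integer(1))
fold_sizes

# Across all folds, each row is used for testing exactly once
all_test_rows = unlist(lapply(1:3, function(i) cv3$test_set(i)))
setequal(all_test_rows, task$row_ids)
```

Passing the same `cv3` object to several `resample()` calls is what makes the resulting accuracies directly comparable.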

## Optimizing the model

If we now try to fit a ‘ranger’ random forest model, we will get an error, as ‘ranger’ models cannot natively handle missing values.

learner = lrn("classif.ranger", num.trees = 250, min.node.size = 4)

rr = resample(task, learner, cv3, store_models = TRUE)
Error: Task 'titanic' has missing values in column(s) 'age', 'embarked', but learner 'classif.ranger' does not support this
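The columns named in the error can also be found before fitting. The sketch below assumes mlr3 and mlr3data are installed and again rebuilds the task from the first 891 rows as a simplification; `missing_counts`, `cols_with_na` and `rpart_handles_na` are illustrative names. It counts missing values per column with `task$missings()` and checks whether a learner declares the `"missings"` property, which is how the decision tree used earlier signals that it accepts missing values.

```r
library(mlr3)

data("titanic", package = "mlr3data")

# Simplification: keep only the labelled rows instead of setting row roles
train = titanic[1:891, ]
task = as_task_classif(train, target = "survived", positive = "yes", id = "titanic")
task$select(setdiff(task$feature_names, c("cabin", "name", "ticket")))

# Number of missing values per column (the target is included)
missing_counts = task$missings()
cols_with_na = names(missing_counts)[missing_counts > 0]
cols_with_na

# Learners that accept missing values list "missings" among their properties
rpart_handles_na = "missings" %in% lrn("classif.rpart")$properties
rpart_handles_na
```

Whether a given learner lists this property can depend on the installed package versions, so it is worth checking rather than assuming.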

This means we have to find a way to impute the missing values. To learn how to use more advanced commands of the mlr3pipelines package see: