set.seed(124)
library(mlr3verse)
library("mlr3tuningspaces")
data("kc_housing", package = "mlr3data")
Goal
Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.
House Prices in King County
In this exercise, we want to model house sale prices in King County in the state of Washington, USA.
We do some simple feature pre-processing first:
# Transform time to numeric variable:
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))
# Scale prices:
kc_housing$price = kc_housing$price / 1000
# For this task, delete columns containing NAs:
kc_housing[, c(13, 15)] = NULL
# Create factor columns:
kc_housing[, c(8, 14)] = lapply(c(8, 14), function(x) {as.factor(kc_housing[, x])})
# Get an overview:
str(kc_housing)
'data.frame': 21613 obs. of 18 variables:
$ date : num 164 221 299 221 292 ...
$ price : num 222 538 180 604 510 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
- attr(*, "index")= int(0)
Train-test Split
Before we train a model, let’s reserve some data for evaluating our model later on:
task = as_task_regr(kc_housing, target = "price")
split = partition(task, ratio = 0.6)

tasktrain = task$clone()
tasktrain$filter(split$train)
tasktrain
<TaskRegr:kc_housing> (12968 x 18)
* Target: price
* Properties: -
* Features (17):
- int (10): bedrooms, condition, grade, sqft_above, sqft_living, sqft_living15, sqft_lot, sqft_lot15,
view, yr_built
- dbl (5): bathrooms, date, floors, lat, long
- fct (2): waterfront, zipcode
tasktest = task$clone()
tasktest$filter(split$test)
tasktest
<TaskRegr:kc_housing> (8645 x 18)
* Target: price
* Properties: -
* Features (17):
- int (10): bedrooms, condition, grade, sqft_above, sqft_living, sqft_living15, sqft_lot, sqft_lot15,
view, yr_built
- dbl (5): bathrooms, date, floors, lat, long
- fct (2): waterfront, zipcode
XGBoost
XGBoost (Chen and Guestrin, 2016) is a highly performant library for gradient-boosted trees. Like some other ML learners, it cannot handle categorical data, so categorical features must be encoded as numerical variables. In the King County data, there are two categorical features encoded as factors:
ft = task$feature_types
ft[ft[[2]] == "factor"]
Key: <id>
id type
<char> <char>
1: waterfront factor
2: zipcode factor
Categorical features can be grouped by their cardinality, i.e., the number of levels they contain: binary features (two levels), low-cardinality features, and high-cardinality features; there is no universal threshold for when a feature should be considered high-cardinality, and this threshold can even be tuned. Low-cardinality features can be handled by one-hot encoding, which converts a categorical feature into a binary representation where each possible category becomes a separate binary feature. Theoretically, it is sufficient to create one binary feature fewer than the number of levels. This is typically called dummy or treatment encoding and is required if the learner is a generalized linear model (GLM) or generalized additive model (GAM). For now, let's check the cardinality of waterfront and zipcode:
lengths(task$levels())
waterfront zipcode
2 70
Obviously, waterfront is a low-cardinality feature suitable for dummy (also called treatment) encoding, and zipcode is a very high-cardinality feature. Some learners support handling categorical features but may still crash for high-cardinality features if they internally apply encodings that are only suitable for low-cardinality features, such as one-hot encoding.
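As an optional aside, here is a minimal sketch of treatment encoding with the encoding operator from mlr3pipelines (attached via mlr3verse); it encodes only the binary waterfront feature and is not required for the exercises:
# Treatment-encode only the binary 'waterfront' feature and inspect the new columns:
poe = po("encode", method = "treatment", affect_columns = selector_name("waterfront"))
task_enc = poe$train(list(task))[[1]]
task_enc$feature_names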
Impact encoding
Impact encoding (Micci-Barreca 2001) is a good approach for handling high-cardinality features. It converts categorical features into numeric values: the idea is to use the target variable to create a mapping between the categorical feature and a numerical value that reflects its importance in predicting the target. Impact encoding involves the following steps:
- Group the target variable by the categorical feature.
- Compute the mean of the target variable for each group.
- Compute the global mean of the target variable.
- Compute the impact score for each group as the difference between the mean of the target variable for the group and the global mean of the target variable.
- Replace the categorical feature with the impact scores.
Impact encoding preserves the information of the categorical feature while creating a numerical representation that reflects its importance in predicting the target. Compared to one-hot encoding, the main advantage is that only a single numeric feature is created regardless of the number of levels of the categorical feature, which makes it especially useful for high-cardinality features. As information from the target is used to compute the impact scores, the encoding process must be embedded in cross-validation to avoid leakage between training and test data.
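To make these steps concrete, here is a minimal sketch (illustration only, not part of the exercise) that computes impact scores for zipcode by hand on the training data; within a pipeline, the po("encodeimpact") operator from mlr3pipelines performs this encoding for you:
# Manual impact scores for 'zipcode' on the training data (illustration only):
library(data.table)
dt = tasktrain$data()                 # training data as a data.table
global_mean = mean(dt$price)          # global mean of the target
impact = dt[, .(impact = mean(price) - global_mean), by = zipcode]
head(impact, 3)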
Exercises
Exercise 1: Create a pipeline
Create a pipeline that pre-processes each factor variable with impact encoding. The pipeline should run an autotuner that automatically conducts hyperparameter optimization (HPO) with an XGBoost learner that learns on the pre-processed features, using random search and the MSE as performance measure. You can use CV with a suitable number of folds as the resampling strategy. For the search space, you can use lts("regr.xgboost.default") from the mlr3tuningspaces package. This constructs a search space customized for XGBoost, based on theoretically and empirically validated considerations on which variables to tune or not. However, you should set the parameter nrounds = 100 for speed reasons. Further, set nthread = parallel::detectCores() to prepare multi-core computing later on.
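If you want to see which hyperparameters this predefined space covers before you start, you can simply print it (a quick, optional sketch):
# Inspect the predefined XGBoost search space from mlr3tuningspaces:
lts("regr.xgboost.default")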
Hint 1:
The pipeline must be embedded in the autotuner: the learner supplied to the autotuner must include both the feature pre-processing and the XGBoost learner.
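The general pattern looks like the following sketch (regr.rpart is only a placeholder learner, not the exercise solution): chain the pre-processing PipeOp and the learner with %>>%, then wrap the graph with as_learner() so it can be passed to the autotuner.
# Pattern: pre-processing PipeOp %>>% learner, wrapped as a single GraphLearner
glrn_example = as_learner(po("encodeimpact") %>>% lrn("regr.rpart"))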
Hint 2:
# Create xgboost learner:
xgb = lrn(...)

# Set search space from mlr3tuningspaces:
xgb_ts = ...

# Set nrounds and nthread:
xgb_ts$... = ...
xgb_ts$... = ...

# Combine xgb_ts with impact encoding:
xgb_ts_impact = as_learner(...)

# Use random search:
tuner = tnr(...)

# Autotuner pipeline component:
at = auto_tuner(
  tuner = ...,
  learner = ...,
  search_space = ...,
  resampling = ...,
  measure = ...,
  term_time = ...) # Maximum allowed time in seconds.

# Combine pipeline:
glrn_xgb_impact = as_learner(...)
glrn_xgb_impact$id = "XGB_enc_impact"
Exercise 2: Benchmark a pipeline
Benchmark your impact encoding pipeline from the previous task against a simple one-hot encoding pipeline that uses one-hot encoding for all factor variables. Use the same autotuner setup for both. Use two-fold CV as the resampling strategy for the benchmark. Afterwards, evaluate the benchmark with the MSE. Finally, assess the performance via the "untouched test set principle" by training both autotuners on tasktrain and evaluating their performance on tasktest.
Hint 1:
To conduct the benchmark, use benchmark(benchmark_grid(...)).
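A sketch of this pattern (the learner names are placeholders for your two tuned pipelines; here the benchmark runs on tasktrain):
# Benchmark design: one task, two pipeline learners, two-fold CV
design = benchmark_grid(
  tasks       = tasktrain,
  learners    = list(glrn_xgb_impact, glrn_xgb_onehot),  # placeholder names
  resamplings = rsmp("cv", folds = 2)
)
bmr = benchmark(design)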
Hint 2:
To conduct performance evaluation, use $aggregate() on the benchmark object.
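Continuing the sketch from Hint 1, with the MSE as measure:
# Aggregate the benchmark result with the mean squared error:
bmr$aggregate(msr("regr.mse"))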
Summary
We learned how to apply pre-processing steps together with tuning to construct refined pipelines for benchmark experiments.