set.seed(124)
library(mlr3verse)
library("mlr3tuningspaces")
data("kc_housing", package = "mlr3data")
Goal
Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.
House Prices in King County
In this exercise, we want to model house sale prices in King County in the state of Washington, USA.
We do some simple feature pre-processing first:
# Transform time to numeric variable:
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))
# Scale prices:
kc_housing$price = kc_housing$price / 1000
# For this task, delete columns containing NAs:
kc_housing[, c(13, 15)] = NULL
# Create factor columns:
kc_housing[, c(8, 14)] = lapply(c(8, 14), function(x) {as.factor(kc_housing[, x])})
# Get an overview:
str(kc_housing)
'data.frame': 21613 obs. of 18 variables:
$ date : num 164 221 299 221 292 ...
$ price : num 222 538 180 604 510 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
- attr(*, "index")= int(0)
Train-test Split
Before we train a model, let’s reserve some data for evaluating our model later on:
task = as_task_regr(kc_housing, target = "price")
split = partition(task, ratio = 0.6)

tasktrain = task$clone()
tasktrain$filter(split$train)
tasktrain
<TaskRegr:kc_housing> (12968 x 18)
* Target: price
* Properties: -
* Features (17):
- int (10): bedrooms, condition, grade, sqft_above, sqft_living, sqft_living15, sqft_lot, sqft_lot15,
view, yr_built
- dbl (5): bathrooms, date, floors, lat, long
- fct (2): waterfront, zipcode
tasktest = task$clone()
tasktest$filter(split$test)
tasktest
<TaskRegr:kc_housing> (8645 x 18)
* Target: price
* Properties: -
* Features (17):
- int (10): bedrooms, condition, grade, sqft_above, sqft_living, sqft_living15, sqft_lot, sqft_lot15,
view, yr_built
- dbl (5): bathrooms, date, floors, lat, long
- fct (2): waterfront, zipcode
XGBoost
XGBoost (Chen and Guestrin, 2016) is a highly performant library for gradient-boosted trees. Like some other ML learners, it cannot handle categorical data, so categorical features must be encoded as numerical variables. In the King County data, there are two categorical features encoded as factors:
ft = task$feature_types
ft[ft[[2]] == "factor"]
Key: <id>
id type
<char> <char>
1: waterfront factor
2: zipcode factor
Categorical features can be grouped by their cardinality, i.e., the number of levels they contain: binary features (two levels), low-cardinality features, and high-cardinality features; there is no universal threshold for when a feature should be considered high-cardinality, and this threshold can even be tuned. Low-cardinality features can be handled by one-hot encoding, which converts a categorical feature into a binary representation where each possible category becomes a separate binary feature. Theoretically, it is sufficient to create one binary feature fewer than the number of levels. This is typically called dummy or treatment encoding and is required if the learner is a generalized linear model (GLM) or generalized additive model (GAM). For now, let's check the cardinality of waterfront and zipcode:
lengths(task$levels())
waterfront zipcode
2 70
Obviously, waterfront is a low-cardinality feature suitable for dummy (also called treatment) encoding, and zipcode is a very high-cardinality feature. Some learners support handling categorical features but may still crash for high-cardinality features if they internally apply encodings that are only suitable for low-cardinality features, such as one-hot encoding.
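As an optional aside, here is a minimal sketch of treatment encoding with the encoding operator from mlr3pipelines (attached via mlr3verse); it encodes only the binary waterfront feature and is not required for the exercises:
# Treatment-encode only the binary 'waterfront' feature and inspect the new columns:
poe = po("encode", method = "treatment", affect_columns = selector_name("waterfront"))
task_enc = poe$train(list(task))[[1]]
task_enc$feature_names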
Impact encoding
Impact encoding (Micci-Barreca 2001) is a good approach for handling high-cardinality features. It converts categorical features into numeric values: the idea is to use the target variable to create a mapping between the categorical feature and a numerical value that reflects its importance in predicting the target. Impact encoding involves the following steps:
- Group the target variable by the categorical feature.
- Compute the mean of the target variable for each group.
- Compute the global mean of the target variable.
- Compute the impact score for each group as the difference between the mean of the target variable for the group and the global mean of the target variable.
- Replace the categorical feature with the impact scores.
Impact encoding preserves the information of the categorical feature while creating a numerical representation that reflects its importance in predicting the target. Compared to one-hot encoding, the main advantage is that only a single numeric feature is created regardless of the number of levels of the categorical feature, which makes it especially useful for high-cardinality features. As information from the target is used to compute the impact scores, the encoding process must be embedded in cross-validation to avoid leakage between training and test data.
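To make these steps concrete, here is a minimal sketch (illustration only, not part of the exercise) that computes impact scores for zipcode by hand on the training data; within a pipeline, the po("encodeimpact") operator from mlr3pipelines performs this encoding for you:
# Manual impact scores for 'zipcode' on the training data (illustration only):
library(data.table)
dt = tasktrain$data()                 # training data as a data.table
global_mean = mean(dt$price)          # global mean of the target
impact = dt[, .(impact = mean(price) - global_mean), by = zipcode]
head(impact, 3)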
Exercises
Exercise 1: Create a pipeline
Create a pipeline that pre-processes each factor variable with impact encoding. The pipeline should run an autotuner that automatically conducts hyperparameter optimization (HPO) with an XGBoost learner that learns on the pre-processed features, using random search and the MSE as performance measure. You can use CV with a suitable number of folds as the resampling strategy. For the search space, you can use lts("regr.xgboost.default") from the mlr3tuningspaces package. This constructs a search space customized for XGBoost, based on theoretically and empirically validated considerations on which variables to tune or not. However, you should set the parameter nrounds = 100 for speed reasons. Further, set nthread = parallel::detectCores() to prepare multi-core computing later on.
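If you want to see which hyperparameters this predefined space covers before you start, you can simply print it (a quick, optional sketch):
# Inspect the predefined XGBoost search space from mlr3tuningspaces:
lts("regr.xgboost.default")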
Hint 1:
The pipeline must be embedded in the autotuner: the learner supplied to the autotuner must include both the feature pre-processing and the XGBoost learner.
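The general pattern looks like the following sketch (regr.rpart is only a placeholder learner, not the exercise solution): chain the pre-processing PipeOp and the learner with %>>%, then wrap the graph with as_learner() so it can be passed to the autotuner.
# Pattern: pre-processing PipeOp %>>% learner, wrapped as a single GraphLearner
glrn_example = as_learner(po("encodeimpact") %>>% lrn("regr.rpart"))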
Hint 2:
# Create xgboost learner:
xgb = lrn(...)

# Set search space from mlr3tuningspaces:
xgb_ts = ...

# Set nrounds and nthread:
xgb_ts$... = ...
xgb_ts$... = ...

# Combine xgb_ts with impact encoding:
xgb_ts_impact = as_learner(...)

# Use random search:
tuner = tnr(...)

# Autotuner pipeline component:
at = auto_tuner(
  tuner = ...,
  learner = ...,
  search_space = ...,
  resampling = ...,
  measure = ...,
  term_time = ...) # Maximum allowed time in seconds.

# Combine pipeline:
glrn_xgb_impact = as_learner(...)
glrn_xgb_impact$id = "XGB_enc_impact"
Exercise 2: Benchmark a pipeline
Benchmark your impact encoding pipeline from the previous task against a simple one-hot encoding pipeline that uses one-hot encoding for all factor variables. Use the same autotuner setup for both. Use two-fold CV as the resampling strategy for the benchmark. Afterwards, evaluate the benchmark with the MSE. Finally, assess the performance via the "untouched test set principle" by training both autotuners on tasktrain and evaluating their performance on tasktest.
Hint 1:
To conduct the benchmark, use benchmark(benchmark_grid(...)).
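A sketch of this pattern (the learner names are placeholders for your two tuned pipelines; here the benchmark runs on tasktrain):
# Benchmark design: one task, two pipeline learners, two-fold CV
design = benchmark_grid(
  tasks       = tasktrain,
  learners    = list(glrn_xgb_impact, glrn_xgb_onehot),  # placeholder names
  resamplings = rsmp("cv", folds = 2)
)
bmr = benchmark(design)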
Hint 2:
To conduct performance evaluation, use $aggregate() on the benchmark object.
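Continuing the sketch from Hint 1, with the MSE as measure:
# Aggregate the benchmark result with the mean squared error:
bmr$aggregate(msr("regr.mse"))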
Summary
We learned how to apply pre-processing steps together with tuning to construct refined pipelines for benchmark experiments.