set.seed(124)
library(mlr3verse)
library(mlr3tuningspaces)
data("kc_housing", package = "mlr3data")
Goal
Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.
House Prices in King County
In this exercise, we want to model house sale prices in King County in the state of Washington, USA.
We do some simple feature pre-processing first:
# Transform time to numeric variable:
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))
# Scale prices:
kc_housing$price = kc_housing$price / 1000
# For this task, delete columns containing NAs:
yr_renovated = kc_housing$yr_renovated
sqft_basement = kc_housing$sqft_basement
kc_housing[, c(13, 15)] = NULL
# Create factor columns:
kc_housing[, c(8, 14)] = lapply(c(8, 14), function(x) {as.factor(kc_housing[, x])})
# Get an overview:
str(kc_housing)
'data.frame': 21613 obs. of 18 variables:
$ date : num 164 221 299 221 292 ...
$ price : num 222 538 180 604 510 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ zipcode : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
- attr(*, "index")= int(0)
Train-test Split
Before we train a model, let’s reserve some data for evaluating our model later on:
task = as_task_regr(kc_housing, target = "price")
split = partition(task, ratio = 0.6)
tasktrain = task$clone()
tasktrain$filter(split$train)
tasktest = task$clone()
tasktest$filter(split$test)
Conditional Encoding
In the King County data, there are two categorical features encoded as factor:
lengths(task$levels())
waterfront zipcode
2 70
Clearly, waterfront is a low-cardinality feature suitable for one-hot encoding, while zipcode is a very high-cardinality feature. It therefore makes sense to create a pipeline that pre-processes each factor variable with either impact or one-hot encoding, depending on the feature's cardinality.
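The idea of cardinality-dependent encoding can be sketched on toy data. This is a minimal illustration, not the exercise solution: the data frame toy and the threshold of 10 levels are assumptions chosen for the example. Factors with more than 10 levels are impact-encoded first, and the remaining factor columns are then one-hot encoded.

```r
library(mlr3)
library(mlr3pipelines)

# Toy regression data (hypothetical): one 2-level and one 15-level factor
toy = data.frame(
  y = seq_len(60) / 10,
  low = factor(rep(c("a", "b"), times = 30)),
  high = factor(rep(letters[1:15], times = 4))
)
task_toy = as_task_regr(toy, target = "y")

# Impact-encode factors with more than 10 levels (assumed threshold),
# then one-hot encode whatever factor columns are left
encoding = po("encodeimpact", affect_columns = selector_cardinality_greater_than(10)) %>>%
  po("encode", method = "one-hot")

# Training the graph on the task returns a list with the transformed task
encoded = encoding$train(task_toy)[[1]]
encoded$feature_types
```

Because the impact encoder runs first, the high-cardinality column is already numeric when the one-hot encoder sees the data, so only the low-cardinality factor gets expanded into indicator columns.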
Filters
Filter algorithms select features by assigning a numeric score to each feature (e.g. its correlation with the target variable), ranking the features by score, and selecting a subset based on this ranking. Features with lower scores are omitted from subsequent modeling steps. All filters are implemented in the package mlr3filters. A very simple filter approach could look like this:
- Calculate the correlation coefficient between each feature and a numeric target variable
- Select the 10 features with the highest correlation for further modeling steps.
A different strategy could entail selecting only features above a certain threshold of correlation with the outcome. For a full list of all implemented filter methods, take a look at https://mlr3filters.mlr-org.com.
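Both strategies can be sketched with mlr3filters and the filter PipeOp. The example below is a hedged illustration on the built-in mtcars task (an assumption made so that all features are numeric). It keeps 3 features rather than 10 because that task only has 10 features in total, and the cutoff of 0.8 is likewise an arbitrary illustrative value.

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)

# Built-in regression task with numeric features only (target: mpg)
task_cars = tsk("mtcars")

# Score each feature by its absolute Pearson correlation with the target
filter_cor = flt("correlation")
filter_cor$calculate(task_cars)
head(as.data.table(filter_cor), 3)

# Strategy 1: keep a fixed number of top-ranked features
po_top = po("filter", filter = flt("correlation"), filter.nfeat = 3)
task_top = po_top$train(list(task_cars))[[1]]
task_top$feature_names

# Strategy 2: keep features whose score passes a cutoff (assumed: 0.8)
po_cut = po("filter", filter = flt("correlation"), filter.cutoff = 0.8)
task_cut = po_cut$train(list(task_cars))[[1]]
task_cut$feature_names
```

Wrapping the filter in a PipeOp means the scores are computed on the training data only, which matters once the filter becomes part of a tuned or resampled pipeline.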
Exercises
Exercise 1: Create a complex pipeline
Create a pipeline with the following sequence of elements:
- Each factor variable gets pre-processed with either one-hot or impact encoding, depending on the cardinality of the feature.
- A filter selector is applied to the features, sorting them by their Pearson correlation coefficient and selecting the 3 features with the highest correlation.
- A random forest (regr.ranger) is trained.
The pipeline should be tuned within an autotuner with random search, two-fold CV and MSE as performance measure, and a search space from mlr3tuningspaces but without tuning the hyperparameter replace. Train the autotuner on the training data, and evaluate the performance on the holdout test data.
Hint 1:
Check out the help page of lts from mlr3tuningspaces.
Hint 2:
Since we want to work with the search space right away, it’s recommended to insert the Learner directly. Ensure that the learner uses the default value for the replace hyperparameter.
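As a rough sketch of what the two hints amount to (one possible approach, not necessarily the intended solution), lts() can be called either with a key to inspect a predefined search space, or with a learner to get that learner back with tune tokens already set. The key "regr.ranger.default" and the value TRUE for replace (ranger's documented default) are the assumptions here.

```r
library(mlr3verse)
library(mlr3tuningspaces)

# Inspect the predefined default search space for ranger
lts("regr.ranger.default")

# Passing a learner to lts() returns the learner with tune tokens set
learner = lts(lrn("regr.ranger"))

# Overwrite the tune token for replace with ranger's default (TRUE),
# so this hyperparameter is no longer part of the search space
learner$param_set$set_values(replace = TRUE)
learner$param_set$values
```

A learner prepared this way can be placed in a pipeline as usual, and the remaining tune tokens then define the search space of the autotuner.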
Exercise 2: Information gain
An alternative filter method is information gain (https://mlr3filters.mlr-org.com/reference/mlr_filters_information_gain.html). Recreate the pipeline from exercise 1, but use information gain as filter. Again, select the three features with the highest information gain. Train the autotuner on the training data, and evaluate the performance on the holdout test data.
Exercise 3: Pearson correlation vs. Information gain
We receive the following performance scores for the two filter methods:
score_rf_cor
regr.mse
24229.17
score_rf_info
regr.mse
30589.75
As you can see, the Pearson correlation filter seems to select features that result in a better model. To investigate why that may have happened, inspect the trained autotuners. Which features have been selected? Given the selected features, reason about which of the two filter methods is more helpful for choosing features for model training, and why.
Exercise 4: Imputation
In the previous three exercises, we excluded two variables with missing values from the kc_housing data. Let’s add them back to the data set and try to impute missing values automatically with a pipeline.
kc_housing$yr_renovated = yr_renovated
kc_housing$sqft_basement = sqft_basement
task = as_task_regr(kc_housing, target = "price")
# Check again which features in the task have NAs:
names(which(task$missings() > 0))
[1] "sqft_basement" "yr_renovated"
Further, we use a simple train-test split as before:
split = partition(task, ratio = 0.6)
tasktrain = task$clone()
tasktrain$filter(split$train)
tasktest = task$clone()
tasktest$filter(split$test)
As the two features with missing values are both numeric, we can compare two simple strategies for imputation: mean imputation and histogram imputation. While the former replaces NAs with the mean value of a feature, the latter samples from the empirical distribution of the non-NA values of the feature, so that the marginal distribution is approximately preserved. For each imputation method, construct a simple pipeline that trains a random forest without any hyperparameter optimization (HPO) on tasktrain, and evaluate the model on tasktest using MSE.
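To see what the two imputation strategies do before building the full pipelines, they can be applied to a small toy task. This is only an illustrative sketch: the data frame toy is made up for the example, and the imputation PipeOps are used on their own, without a learner.

```r
library(mlr3)
library(mlr3pipelines)

set.seed(1)

# Toy regression data (hypothetical) with two missing values in x
toy = data.frame(y = c(1, 2, 3, 4, 5, 6), x = c(1, NA, 3, NA, 5, 6))
task_toy = as_task_regr(toy, target = "y")

# Mean imputation: every NA is replaced by the mean of the observed values
imputed_mean = po("imputemean")$train(list(task_toy))[[1]]
imputed_mean$data()

# Histogram imputation: every NA is replaced by a random draw from a
# histogram fitted to the observed values
imputed_hist = po("imputehist")$train(list(task_toy))[[1]]
imputed_hist$data()
```

Mean imputation fills both gaps with the same value, whereas histogram imputation generally fills them with different values, which is why only the latter roughly keeps the spread of the feature.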
Summary
We learned about more complex pipelines, including pre-processing methods for imputation, variable encoding and feature filtering.