Filters

Use pipelines for efficient pre-processing and model training on the kc_housing task.

Goal

Apply what you have learned about using pipelines for efficient pre-processing and model training on a regression problem.

House Prices in King County

In this exercise, we want to model house sale prices in King County in the state of Washington, USA.

set.seed(124)
library(mlr3verse)
library(mlr3tuningspaces)
data("kc_housing", package = "mlr3data")

We do some simple feature pre-processing first:

# Transform time to numeric variable:
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))
# Scale prices:
kc_housing$price = kc_housing$price / 1000
# For this task, delete columns containing NAs:
yr_renovated = kc_housing$yr_renovated
sqft_basement = kc_housing$sqft_basement
kc_housing[,c(13, 15)] = NULL
# Create factor columns:
kc_housing[,c(8, 14)] = lapply(c(8, 14), function(x) {as.factor(kc_housing[,x])})
# Get an overview:
str(kc_housing)
'data.frame':   21613 obs. of  18 variables:
 $ date         : num  164 221 299 221 292 ...
 $ price        : num  222 538 180 604 510 ...
 $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
 $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
 $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
 $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
 $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
 $ waterfront   : Factor w/ 2 levels "FALSE","TRUE": 1 1 1 1 1 1 1 1 1 1 ...
 $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
 $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
 $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
 $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
 $ zipcode      : Factor w/ 70 levels "98001","98002",..: 67 56 17 59 38 30 3 69 61 24 ...
 $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
 $ long         : num  -122 -122 -122 -122 -122 ...
 $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
 $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
 - attr(*, "index")= int(0) 

Add uncorrelated features to the data

To test different strategies for feature selection in this exercise, we create two artificial features that are (mostly) uncorrelated with the outcome price:

# Uncorrelated feature x1:
kc_housing$x1 <- runif(n = nrow(kc_housing))
cor(kc_housing$x1, kc_housing$price)
[1] 0.002249436
# Uncorrelated feature x2:
kc_housing$x2 <- sin(0.01*kc_housing$price*kc_housing$grade)
cor(kc_housing$x2, kc_housing$price)
[1] 0.01329962

Train-Test Split

Before we train a model, let’s reserve some data for evaluating our model later on:

task = as_task_regr(kc_housing, target = "price")
split = partition(task, ratio = 0.6)

tasktrain = task$clone()
tasktrain$filter(split$train)

tasktest = task$clone()
tasktest$filter(split$test)

Conditional Encoding

In the King County data, there are two categorical features encoded as factors:

lengths(task$levels())
waterfront    zipcode 
         2         70 

Obviously, waterfront is a low-cardinality feature suitable for one-hot encoding, whereas zipcode is a very high-cardinality feature for which impact encoding is more appropriate. Therefore, it makes sense to create a pipeline that first pre-processes each factor variable with either impact or one-hot encoding, depending on the feature's cardinality.
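
One possible sketch of such a conditional encoding step, assuming an illustrative cardinality threshold of 10 levels, could look like this (mlr3pipelines is loaded via mlr3verse above):

# Impact-encode high-cardinality factors (here: zipcode):
po_impact = po("encodeimpact",
  affect_columns = selector_cardinality_greater_than(10))
# One-hot encode the remaining low-cardinality factors (here: waterfront):
po_onehot = po("encode", method = "one-hot",
  affect_columns = selector_invert(selector_cardinality_greater_than(10)))
# Chain both encoders into a small pre-processing graph:
graph_encode = po_impact %>>% po_onehot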

Filters

Filter algorithms select features by assigning a numeric score to each feature, e.g. the correlation between the feature and the target variable, ranking the features by these scores, and selecting a feature subset based on the ranking. Features with lower scores are then omitted in subsequent modeling steps. All filters are implemented in the package mlr3filters. A very simple filter approach could look like this:

  1. Calculate the correlation coefficient between each feature and a numeric target variable.
  2. Select the 10 features with the highest correlation for further modeling steps.

A different strategy could entail selecting only features above a certain threshold of correlation with the outcome. For a full list of all implemented filter methods, take a look at https://mlr3filters.mlr-org.com.
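
For illustration, a correlation filter could be computed on the training task as sketched below, using the 10-feature rule from above. Since the correlation filter only handles numeric features, we drop the two factor features here; in the exercises below, the encoding step takes care of them.

# The correlation filter only supports numeric features, so drop the factor columns:
task_num = tasktrain$clone()
task_num$select(setdiff(task_num$feature_names, c("waterfront", "zipcode")))
filter_cor = flt("correlation", method = "pearson")
filter_cor$calculate(task_num)
# Features ranked by their correlation score with the target:
head(as.data.table(filter_cor), 10)
# Inside a pipeline, the same filter can be used via po("filter"):
po("filter", filter = flt("correlation", method = "pearson"), filter.nfeat = 10)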

Exercises

Exercise 1: Create a complex pipeline

Create a pipeline with the following sequence of elements:

  1. Each factor variable gets pre-processed with either one-hot or impact encoding, depending on the cardinality of the feature.
  2. A filter is applied to the features, ranking them by their Pearson correlation with the target variable and selecting the 3 features with the highest correlation.
  3. A random forest (regr.ranger) is trained.

The pipeline should be tuned within an autotuner using random search, two-fold CV, and the MSE as performance measure, with a search space from mlr3tuningspaces but without tuning the hyperparameter replace. Train the autotuner on the training data, and evaluate the performance on the holdout test data.

Hint 1:

Check out the help page of lts from mlr3tuningspaces.

Hint 2:

Since we want to work with the search space right away, it is easiest to apply lts() directly to the learner. Make sure that the learner keeps the default value for the replace hyperparameter.
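
As a rough sketch (not the full solution), obtaining the learner together with its default search space and fixing replace could look like this, assuming that TRUE is the default value of replace in regr.ranger:

# Attach the default tuning space from mlr3tuningspaces directly to the learner:
lrn_ranger = lts(lrn("regr.ranger"))
# Overwrite the tuning token so that replace stays at its default instead of being tuned:
lrn_ranger$param_set$values$replace = TRUE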

Exercise 2: Information gain

An alternative filter method is information gain (https://mlr3filters.mlr-org.com/reference/mlr_filters_information_gain.html). Recreate the pipeline from exercise 1, but use information gain as the filter method. Again, select the three features with the highest information gain. Train the autotuner on the training data, and evaluate the performance on the holdout test data.
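
The corresponding filter PipeOp could, for instance, be constructed as sketched here (the information gain filter is backed by the FSelectorRcpp package):

# Select the 3 features with the highest information gain:
po("filter", filter = flt("information_gain"), filter.nfeat = 3)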

Exercise 3: Pearson correlation vs. Information gain

We receive the following performance scores for the two filter methods:

score_rf_cor
regr.mse 
24229.17 
score_rf_info
regr.mse 
30589.75 

As you can see, the Pearson correlation filter seems to select features that result in a better model. To investigate why that may have happened, inspect the trained autotuners: which features have been selected? Given the selected features, reason about which filter method may be more helpful for deciding which features to include in model training, and why.
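
As a starting point, you could also compute the information gain scores directly on the training task and contrast them with the correlation ranking from above (a sketch; unlike the correlation filter, the information gain filter also supports factor features):

filter_info = flt("information_gain")
filter_info$calculate(tasktrain)
# Top-ranked features according to information gain:
head(as.data.table(filter_info), 5)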

Exercise 4: Imputation

In the three previous exercises, we excluded two variables with missing values from the kc_housing data. Let’s add them back to the data set and try to impute the missing values automatically within a pipeline.

kc_housing$yr_renovated = yr_renovated
kc_housing$sqft_basement = sqft_basement
task = as_task_regr(kc_housing, target = "price")
# Check again which features in the task have NAs:
names(which(task$missings() > 0))
[1] "sqft_basement" "yr_renovated" 

Further, we use a simple train-test split as before:

split = partition(task, ratio = 0.6)

tasktrain = task$clone()
tasktrain$filter(split$train)

tasktest = task$clone()
tasktest$filter(split$test)

As the two features with missing values are both numeric, we can compare two simple strategies for imputation: mean imputation and histogram imputation. While the former replaces NAs with the mean value of a feature, the latter samples from the empirical distribution of the non-NA values of that feature, which preserves the marginal distribution. For each imputation method, construct a simple pipeline that trains a random forest (without any HPO) on tasktrain, and evaluate the model on tasktest using the MSE.
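
A minimal sketch of one of the two pipelines, assuming default ranger settings, could look like this; the histogram variant would simply swap po("imputemean") for po("imputehist"):

# Mean imputation followed by an untuned random forest:
glrn_mean = as_learner(po("imputemean") %>>% lrn("regr.ranger"))
glrn_mean$train(tasktrain)
glrn_mean$predict(tasktest)$score(msr("regr.mse"))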

Summary

We learned about more complex pipelines, including pre-processing methods for imputation, variable encoding and feature filtering.