We show how to do various kinds of target transformations using pipelines.

Transforming the target variable can often improve predictive performance and is a widely used tool. Typical examples are the \(\log\) transformation of the target, which aims to reduce (right) skewness, and the Box-Cox and Yeo-Johnson transformations, which are more flexible but pursue a similar goal.

One option to perform, e.g., a \(\log\) transformation would be to manually transform the target prior to training a `Learner` (and also predicting from it) and then manually invert this transformation via \(\exp\) after predicting from the `Learner`. This is quite cumbersome, especially if a transformation and inverse transformation require information about both the training and prediction data.

In this post, we show how to do various kinds of target transformations using mlr3pipelines and explain the design of the target transformation and inversion `PipeOp`s.

You will:

- learn how to do simple target transformations using `PipeOpTargetMutate`
- be introduced to the abstract base class to implement custom target transformations, `PipeOpTargetTrafo`
- implement a custom target transformation `PipeOp`, `PipeOpTargetTrafoBoxCox`

As a prerequisite, you should be quite familiar with mlr3pipelines, i.e., know about the `$state` field of `PipeOp`s, input and output channels, as well as `Graph`s. We will start with a `PipeOp` for simple target transformations, `PipeOpTargetMutate`.

We load the most important packages for this example.
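A minimal sketch of the setup, assuming the package list below. The list is inferred from the functions called in the rest of this post (`lrn("regr.lm")`, `ppl()`, `ps()`, `R6Class()`) and assumes all of these packages are installed:

```r
# Assumed package list, inferred from the functions used later in the post
library(mlr3)           # tasks, learners, resampling, benchmark()
library(mlr3learners)   # provides lrn("regr.lm")
library(mlr3pipelines)  # PipeOps, Graph, GraphLearner, ppl()
library(paradox)        # ps(), p_lgl(), p_dbl()
library(R6)             # R6Class()
```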

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output concise.

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
```

In all sections we will use the `mtcars` regression `Task` with `mpg` being a numerical, positive target:

```
mpg
Min. :10.40
1st Qu.:15.43
Median :19.20
Mean :20.09
3rd Qu.:22.80
Max. :33.90
```
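The `task` object used in the remaining code can be obtained from the built-in task dictionary of mlr3. A minimal sketch, assuming the `"mtcars"` dictionary key and assuming that the summary above comes from calling `summary()` on the target column:

```r
library(mlr3)

# Retrieve the built-in mtcars regression task (target: mpg)
task = tsk("mtcars")

# Summarise the target column
summary(task$data(cols = task$target_names))
```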

Moreover, as a `Learner` we will use an ordinary linear regression learner:

```
learner_lm = lrn("regr.lm")
```

The term *simple* refers to transformations that are given by a function of the target, relying on no other arguments (constants are of course allowed). The most prominent example is given by the \(\log\) transformation which we can later invert by applying the \(\exp\) transformation.

If you are only interested in doing such a transformation and you do not have the time to read more of this post, simply use the following syntactic sugar:
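A minimal sketch of this shortcut, assuming mlr3learners is installed so that `regr.lm` is available. The parameter names `targetmutate.trafo` and `targetmutate.inverter` follow from the default id of `PipeOpTargetMutate`, and `g_ppl` is just an illustrative variable name:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

learner_lm = lrn("regr.lm")

# Build the target transformation pipeline around the learner
g_ppl = ppl("targettrafo", graph = learner_lm)

# log transform the target before training, exp transform the predictions
g_ppl$param_set$values$targetmutate.trafo = function(x) log(x)
g_ppl$param_set$values$targetmutate.inverter = function(x) list(response = exp(x$response))
```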

This constructs a `Graph` that will \(\log\) transform the target prior to training the linear regression learner (or predicting from it) and \(\exp\) transform the predictions after predicting from it. Note that you can supply any other `Learner` or even a whole `Graph` as the `graph` argument.

Now, we will go into more detail about how this actually works:

We can perform a \(\log\) transformation of our numerical, positive target, `mpg`, using `PipeOpTargetMutate` (by default, `ppl("targettrafo")` uses this target transformation `PipeOp`):
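A minimal sketch of constructing this `PipeOp`, assuming the `po("targetmutate")` shorthand and the variable name `trafo` that the later code refers to:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("mtcars")

# Target transformation PipeOp with a log transformation of the target
trafo = po("targetmutate", param_vals = list(trafo = function(x) log(x)))

# Training returns the task with the transformed target in the "output" channel
head(trafo$train(list(task))$output$data(cols = task$target_names), 3)
```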

We have to specify the `trafo` parameter as a function of `x` (which will internally be evaluated to be the target of the `Task`): `trafo = function(x) log(x)`. In principle, this is all that is needed to transform the target prior to training a `Learner` (or predicting from it), i.e., if we now train this `PipeOp`, we see that the target is transformed as specified:

```
trafo$train(list(task))$output$data(cols = task$target_names)
```

```
mpg
1: 3.044522
2: 3.044522
3: 3.126761
4: 3.063391
5: 2.928524
6: 2.895912
7: 2.660260
8: 3.194583
9: 3.126761
10: 2.954910
11: 2.879198
12: 2.797281
13: 2.850707
14: 2.721295
15: 2.341806
16: 2.341806
17: 2.687847
18: 3.478158
19: 3.414443
20: 3.523415
21: 3.068053
22: 2.740840
23: 2.721295
24: 2.587764
25: 2.954910
26: 3.306887
27: 3.258097
28: 3.414443
29: 2.760010
30: 2.980619
31: 2.708050
32: 3.063391
mpg
```

After having predicted from the `Learner` we could then proceed to apply the inverse transformation function in a similar manner. However, in mlr3pipelines, we decided to go with a more unified design of handling target transformations. In all target transformation `PipeOp`s, the inverse transformation function of the target also has to be specified. Therefore, in `PipeOpTargetMutate`, the parameter `inverter` also has to be correctly specified:
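A minimal sketch of setting the `inverter`, assuming `trafo` is the `PipeOpTargetMutate` constructed via `po("targetmutate")`. The toy values at the end are made up for illustration and are not real predictions:

```r
library(mlr3)
library(mlr3pipelines)

trafo = po("targetmutate", param_vals = list(trafo = function(x) log(x)))

# The inverter receives the prediction columns and returns a named list;
# here only the response is back-transformed via exp()
trafo$param_set$values$inverter = function(x) list(response = exp(x$response))

# Quick check on toy values (log-scale inputs, original-scale outputs)
trafo$param_set$values$inverter(data.frame(response = log(c(21, 22.8))))
```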

Internally, this function will be applied to the `data.table` representation of a `Prediction` object without the `$row_ids` and `$truth` columns, and we specify that the `$response` column should be transformed. Note that applying the inverse transformation will typically only be done to the `$response` column, because transforming standard errors or probabilities is often not straightforward.

To actually carry out the inverse transformation function after predicting from the `Learner`, we then rely on `PipeOpTargetInvert`. `PipeOpTargetInvert` has an empty `ParamSet` and its sole purpose is to apply the inverse transformation function after having predicted from a `Learner`. (This whole design of target transformations may seem somewhat over-engineered at first glance; its advantages will become clear in the advanced section.)

`PipeOpTargetInvert` has two input channels named `"fun"` and `"prediction"`. During training, both take `NULL` as input (because this is what a `Learner`’s `"output"` output and `PipeOpTargetMutate`’s `"fun"` output will return during training). During prediction, the `"prediction"` input takes a `Prediction`, and the `"fun"` input takes the `"fun"` output from `PipeOpTargetMutate` (you may have noticed already that `PipeOpTargetMutate` actually has two outputs, `"fun"` and `"output"`, with `"fun"` returning `NULL` during training and a function during prediction, while `"output"` always returns the transformed input `Task`). We can see this if we look at:

```
trafo$output
```

```
     name train  predict
1:    fun  NULL function
2: output  Task     Task
```

```
trafo$predict(list(task))
```

```
$fun
function (inputs)
{
assert_list(inputs, len = 1L, types = "Prediction")
list(private$.invert(inputs[[1L]], predict_phase_state))
}
<bytecode: 0x7e1da390>
<environment: 0x7e1dacf8>
$output
<TaskRegr:mtcars> (32 x 11)
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
```
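The channel layout of `PipeOpTargetInvert` described above can be inspected in the same way. A small sketch, assuming the `po("targetinvert")` shorthand and an illustrative variable name `inv`:

```r
library(mlr3)
library(mlr3pipelines)

# Construct the inversion PipeOp and list its input and output channels
inv = po("targetinvert")
inv$input
inv$output
```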

We will talk more about such technical details in the advanced section. For now, to finally construct our target transformation pipeline, we build a `Graph`:
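A minimal sketch of this step, assuming the three nodes are added to an empty `Graph` named `g` and that the edges are connected afterwards:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

learner_lm = lrn("regr.lm")
trafo = po("targetmutate", param_vals = list(
  trafo = function(x) log(x),
  inverter = function(x) list(response = exp(x$response))
))

# Start with an empty Graph and add the three nodes; no edges exist yet
g = Graph$new()
g$add_pipeop(trafo)
g$add_pipeop(learner_lm)          # a Learner is wrapped into a PipeOpLearner
g$add_pipeop(po("targetinvert"))
```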

Manually connecting the edges is quite cumbersome. First we connect the `"output"` output of `"targetmutate"` to the `"input"` input of `"regr.lm"`:

```
g$add_edge(src_id = "targetmutate", dst_id = "regr.lm",
  src_channel = 2, dst_channel = 1)
```

Then we connect the `"output"` output of `"regr.lm"` to the `"prediction"` input of `"targetinvert"`:

```
g$add_edge(src_id = "regr.lm", dst_id = "targetinvert",
  src_channel = 1, dst_channel = 2)
```

Finally, we connect the `"fun"` output of `"targetmutate"` to the `"fun"` input of `"targetinvert"`:

```
g$add_edge(src_id = "targetmutate", dst_id = "targetinvert",
  src_channel = 1, dst_channel = 1)
```

This graph (which is conceptually the same graph as constructed via the `ppl("targettrafo")` syntactic sugar above) looks like the following:

```
g$plot()
```

We can then finally call `$train()` and `$predict()` (prior to this we wrap the `Graph` in a `GraphLearner`):

```
gl = GraphLearner$new(g)
gl$train(task)
gl$state
```

```
$model
$model$targetmutate
list()
$model$regr.lm
$model$regr.lm$model
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) am carb cyl disp drat gear hp qsec
2.776e+00 4.738e-02 -2.012e-02 7.657e-03 4.989e-05 2.220e-02 5.925e-02 -8.964e-04 3.077e-02
vs wt
-2.874e-03 -1.723e-01
$model$regr.lm$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$model$regr.lm$train_time
[1] 0.007
$model$regr.lm$param_vals
named list()
$model$regr.lm$task_hash
[1] "6ca8c90cdf732078"
$model$regr.lm$data_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$model$regr.lm$task_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$model$regr.lm$mlr3_version
[1] '0.13.2'
$model$regr.lm$train_task
<TaskRegr:mtcars> (32 x 11)
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
$model$targetinvert
list()
$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$train_time
[1] 0.058
$param_vals
$param_vals$targetmutate.trafo
function(x) log(x)
<environment: 0xd995bff0>
$param_vals$targetmutate.inverter
function(x) list(response = exp(x$response))
<environment: 0xd995bff0>
$task_hash
[1] "58a137d2055e8406"
$data_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$task_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$mlr3_version
[1] '0.13.2'
$train_task
<TaskRegr:mtcars> (32 x 11)
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
```

```
gl$predict(task)
```

```
<PredictionRegr> for 32 observations:
row_ids truth response
1 21.0 21.67976
2 21.0 21.10831
3 22.8 25.73690
---
30 19.7 19.58533
31 15.0 14.11015
32 21.4 23.11105
```

We can contrast this with `$train()` and `$predict()` of the naive linear regression learner (also look at the estimated coefficients of the linear regression contained in `$state$model`):

```
learner_lm$train(task)
learner_lm$state
```

```
$model
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) am carb cyl disp drat gear hp qsec
12.30337 2.52023 -0.19942 -0.11144 0.01334 0.78711 0.65541 -0.02148 0.82104
vs wt
0.31776 -3.71530
$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$train_time
[1] 0.004
$param_vals
named list()
$task_hash
[1] "58a137d2055e8406"
$data_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$task_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$mlr3_version
[1] '0.13.2'
$train_task
<TaskRegr:mtcars> (32 x 11)
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
```

```
learner_lm$predict(task)
```

```
<PredictionRegr> for 32 observations:
row_ids truth response
1 21.0 22.59951
2 21.0 22.11189
3 22.8 26.25064
---
30 19.7 19.69383
31 15.0 13.94112
32 21.4 24.36827
```

You should continue reading if you are interested in more advanced target transformations, i.e., where the transformation and inverse transformation require information about both the training and prediction data.

First we will introduce the abstract base class for doing target transformations, `PipeOpTargetTrafo`, from which `PipeOpTargetMutate` inherits.

No matter how “complicated” the actual target transformation and inverse transformation may be, applying the inverse transformation function after having predicted from a `Learner` will always be done via `PipeOpTargetInvert` (as already outlined above, `PipeOpTargetInvert` has an empty `ParamSet` and its sole purpose is to apply the inverse transformation function after having predicted from a `Learner`). All `Graph`s for doing target transformations will therefore look similar to the simple one above, i.e., a target transformation `PipeOp` followed by some `Learner` or a whole `Graph`, followed by `PipeOpTargetInvert`. Therefore, using `ppl("targettrafo")` to construct such `Graph`s is highly recommended.

To allow for more advanced target transformations, we now have a closer look at the abstract base class, `PipeOpTargetTrafo`:

`PipeOpTargetTrafo` has one input channel, named `"input"`, taking a `Task` both during training and prediction. Its two output channels are named `"fun"` and `"output"`. During training `"fun"` returns `NULL` and during prediction `"fun"` returns a function that will be used by `PipeOpTargetInvert` to perform the inverse target transformation on `PipeOpTargetInvert`’s `"prediction"` input. `"output"` returns the modified input `Task` both during training and prediction.

Subclasses can overload up to four functions:

- `.get_state()` takes the input `Task` and returns a `list()` which will internally be used to set the `$state`. Typically it is sensible to make use of the `$state` during `.transform()` and `.train_invert()`. The base implementation returns `list()` and should be overloaded if setting the state is desired.
- `.transform()` takes the input `Task` and returns a modified `Task` (i.e., the `Task` with the transformed target). This is the main function for doing the actual target transformation. Note that `.get_state()` is evaluated a single time during training right before `.transform()` and therefore, you can rely on the `$state` that has been set. To update the input `Task` with respect to the transformed target, subclasses should make use of the `convert_task()` function and drop the original target from the `Task`. `.transform()` also accepts a `phase` argument that will receive `"train"` during training and `"predict"` during prediction. This can be used to enable different behavior during training and prediction. `.transform()` should always be overloaded by subclasses.
- `.train_invert()` takes the input `Task` and returns a `predict_phase_state` object. This can be anything. Note that `.train_invert()` should not modify the input `Task`. The base implementation returns a list with a single element, the `$truth` column of the input `Task`, and should be overloaded if a more training-phase-dependent state is desired.
- `.invert()` takes a `Prediction` and a `predict_phase_state` object as inputs and returns a `Prediction`. This is the main function for specifying the actual inverse target transformation that will later be carried out by `PipeOpTargetInvert`. Internally, a private helper function, `.invert_help()`, will construct the function that will be returned by the `"fun"` output of `PipeOpTargetTrafo` so that `PipeOpTargetInvert` can later simply dispatch this inverse target transformation on its `"prediction"` input.
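The interplay of these four functions can be mimicked in plain R without mlr3pipelines. The sketch below uses standardization (subtract the mean, divide by the standard deviation) as the transformation. All function names ending in `_fn` are hypothetical stand-ins for the private methods, and the "learner" is assumed to predict perfectly so that only the data flow is visible:

```r
# Hypothetical plain-R stand-ins for the four overloadable functions
get_state_fn = function(y) list(mean = mean(y), sd = sd(y))
transform_fn = function(y, state) (y - state$mean) / state$sd
train_invert_fn = function(y, state) list(truth = y, state = state)
invert_fn = function(response, predict_phase_state) {
  s = predict_phase_state$state
  response * s$sd + s$mean
}

# Training: the state is computed once, right before transforming the target
y_train = c(10, 20, 30)
state = get_state_fn(y_train)
z_train = transform_fn(y_train, state)

# Prediction: keep what the inversion needs, then transform and invert
y_new = c(15, 25)
pps = train_invert_fn(y_new, state)
z_pred = transform_fn(y_new, state)   # pretend the learner predicts perfectly
invert_fn(z_pred, pps)                # back on the original scale: 15 25
```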

The intended workflow of a class inherited from `PipeOpTargetTrafo` is given in the following figure:

To solidify our understanding we will design a new target transformation `PipeOp` in the next section: `PipeOpTargetTrafoBoxCox`.

The Box-Cox transformation of a target \(y_{i}\) is given as:

\[y_{i}(\lambda) = \begin{cases} \frac{y_{i}^{\lambda} - 1}{\lambda} & \text{if}~\lambda \neq 0; \\ \log(y_{i}) & \text{if}~\lambda = 0 \end{cases}\]
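To make the formula concrete, here is a small base R sketch of the transformation and its inverse. The inverse follows from solving the definition for \(y_{i}\). The function names and the `eps` tolerance for treating \(\lambda\) as zero are assumptions made for illustration; this is not the implementation used by `bestNormalize`:

```r
# Box-Cox transformation as defined above; eps is a hypothetical tolerance
boxcox_trafo = function(y, lambda, eps = 1e-8) {
  if (abs(lambda) < eps) log(y) else (y^lambda - 1) / lambda
}

# Inverse transformation, obtained by solving the definition for y
boxcox_inverse = function(z, lambda, eps = 1e-8) {
  if (abs(lambda) < eps) exp(z) else (lambda * z + 1)^(1 / lambda)
}

y = c(10.4, 19.2, 33.9)
z = boxcox_trafo(y, lambda = 0.5)
boxcox_inverse(z, lambda = 0.5)   # recovers y up to floating point error
```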

mlr3pipelines already supports the Box-Cox transformation for numerical, positive features, see `?PipeOpBoxCox`.

Here we will design a `PipeOp` to apply the Box-Cox transformation as a target transformation. The \(\lambda\) parameter of the transformation is estimated during training and used for both the training and prediction transformation. After predicting from a `Learner` we will as always apply the inverse transformation function. To do the actual transformation we will use `bestNormalize::boxcox()`.

First, we inherit from `PipeOpTargetTrafo` and overload the `initialize()` function:

```
PipeOpTargetTrafoBoxCox = R6Class("PipeOpTargetTrafoBoxCox",
  inherit = PipeOpTargetTrafo,
  public = list(
    initialize = function(id = "targettrafoboxcox", param_vals = list()) {
      param_set = ps(
        standardize = p_lgl(default = TRUE, tags = c("train", "boxcox")),
        eps = p_dbl(default = 0.001, lower = 0, tags = c("train", "boxcox")),
        lower = p_dbl(default = -1L, tags = c("train", "boxcox")),
        upper = p_dbl(default = 2L, tags = c("train", "boxcox"))
      )
      super$initialize(id = id, param_set = param_set, param_vals = param_vals,
        packages = "bestNormalize", task_type_in = "TaskRegr",
        task_type_out = "TaskRegr")
    }
  ),
  private = list(
    .get_state = function(task) {
      ...
    },
    .transform = function(task, phase) {
      ...
    },
    .train_invert = function(task) {
      ...
    },
    .invert = function(prediction, predict_phase_state) {
      ...
    }
  )
)
```

As parameters, we allow `"standardize"` (whether to center and scale the transformed values to attempt a standard normal distribution), `"eps"` (tolerance parameter to identify if the \(\lambda\) parameter is equal to zero), `"lower"` (lower value for the estimation of the \(\lambda\) parameter) and `"upper"` (upper value for the estimation of the \(\lambda\) parameter). Note that we set `task_type_in = "TaskRegr"` and `task_type_out = "TaskRegr"` to specify that this `PipeOp` only works for regression `Task`s.

Second, we overload the four functions as mentioned above.

We start with `.get_state()`. We extract the target and apply the Box-Cox transformation to the target. This yields an object of class `"boxcox"` which we will wrap in a `list()` and set as the `$state` (`bc$x.t = NULL` and `bc$x = NULL` is done to save some memory because we do not need the transformed original data and original data later):

```
.get_state = function(task) {
  target = task$data(cols = task$target_names)[[1L]]
  bc = mlr3misc::invoke(bestNormalize::boxcox, target,
    .args = self$param_set$get_values(tags = "boxcox"))
  bc$x.t = NULL
  bc$x = NULL
  list(bc = bc)
},
```

Next, we tackle `.transform()`. This is quite straightforward, because objects of class `"boxcox"` have their own predict method which we can use here to carry out the actual Box-Cox transformation based on the learned \(\lambda\) parameter as stored in the `"boxcox"` object in the `$state` (both during training and prediction). We then rename the target, add it to the task and finally update the task with respect to this new target:

```
.transform = function(task, phase) {
  target = task$data(cols = task$target_names)[[1L]]
  new_target = as.data.table(predict(self$state$bc, newdata = target))
  colnames(new_target) = paste0(task$target_names, ".bc")
  task$cbind(new_target)
  convert_task(task, target = colnames(new_target),
    drop_original_target = TRUE)
},
```

Time to overload `.train_invert()`. This is even more straightforward, because the prediction method for objects of class `"boxcox"` directly allows for inverting the transformation via setting the argument `inverse = TRUE`. Therefore, we only need the `"boxcox"` object stored in the `$state` along with the `$truth` column of the input `Task` (remember that this list will later be available as the `predict_phase_state` object):

```
.train_invert = function(task) {
  list(truth = task$truth(), bc = self$state$bc)
},
```

Finally, we overload `.invert()`. We extract the truth from the `predict_phase_state` and the response from the `Prediction`. We then apply the inverse Box-Cox transformation to the response based on the \(\lambda\) parameter and the mean and standard deviation learned during training, relying on the `predict_phase_state` object. Finally, we construct a new `Prediction` object:

```
.invert = function(prediction, predict_phase_state) {
  truth = predict_phase_state$truth
  response = predict(predict_phase_state$bc, newdata = prediction$response,
    inverse = TRUE)
  PredictionRegr$new(row_ids = prediction$row_ids, truth = truth,
    response = response)
}
```

Note that this `PipeOp` is ill-equipped to handle the case of `predict_type = "se"`, i.e., we always only return a `response` prediction (as outlined above, this is the case for most target transformations, because transforming standard errors or probabilities of a prediction is often not straightforward). We could of course check whether the `predict_type` is set to `"se"` and if this is the case, return `NA` as the standard errors.
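For convenience, the pieces above can be assembled into one definition. The sketch below only combines the snippets shown in this section; the package list is an assumption, and the bestNormalize package needs to be installed for training to work:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(paradox)
library(R6)

PipeOpTargetTrafoBoxCox = R6Class("PipeOpTargetTrafoBoxCox",
  inherit = PipeOpTargetTrafo,
  public = list(
    initialize = function(id = "targettrafoboxcox", param_vals = list()) {
      param_set = ps(
        standardize = p_lgl(default = TRUE, tags = c("train", "boxcox")),
        eps = p_dbl(default = 0.001, lower = 0, tags = c("train", "boxcox")),
        lower = p_dbl(default = -1L, tags = c("train", "boxcox")),
        upper = p_dbl(default = 2L, tags = c("train", "boxcox"))
      )
      super$initialize(id = id, param_set = param_set, param_vals = param_vals,
        packages = "bestNormalize", task_type_in = "TaskRegr",
        task_type_out = "TaskRegr")
    }
  ),
  private = list(
    # Estimate lambda (and mean/sd) on the training target and store it
    .get_state = function(task) {
      target = task$data(cols = task$target_names)[[1L]]
      bc = mlr3misc::invoke(bestNormalize::boxcox, target,
        .args = self$param_set$get_values(tags = "boxcox"))
      bc$x.t = NULL
      bc$x = NULL
      list(bc = bc)
    },
    # Replace the target by its Box-Cox transformed version
    .transform = function(task, phase) {
      target = task$data(cols = task$target_names)[[1L]]
      new_target = as.data.table(predict(self$state$bc, newdata = target))
      colnames(new_target) = paste0(task$target_names, ".bc")
      task$cbind(new_target)
      convert_task(task, target = colnames(new_target),
        drop_original_target = TRUE)
    },
    # Keep the truth and the fitted "boxcox" object for the inversion
    .train_invert = function(task) {
      list(truth = task$truth(), bc = self$state$bc)
    },
    # Map the predicted response back to the original scale
    .invert = function(prediction, predict_phase_state) {
      truth = predict_phase_state$truth
      response = predict(predict_phase_state$bc, newdata = prediction$response,
        inverse = TRUE)
      PredictionRegr$new(row_ids = prediction$row_ids, truth = truth,
        response = response)
    }
  )
)

# Train the PipeOp on its own and inspect the name of the new target
op = PipeOpTargetTrafoBoxCox$new()
op$train(list(tsk("mtcars")))$output$target_names
```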

To construct our final target transformation `Graph` with our linear regression learner, we again simply make use of `ppl("targettrafo")`:

```
g_bc = ppl("targettrafo", graph = learner_lm,
  trafo_pipeop = PipeOpTargetTrafoBoxCox$new())
```

The following plot should already look quite familiar:

```
g_bc$plot()
```

Finally we `$train()` and `$predict()` on the task (again, we wrap the `Graph` in a `GraphLearner`):

```
gl_bc = GraphLearner$new(g_bc)
gl_bc$train(task)
gl_bc$state
```

```
$model
$model$regr.lm
$model$regr.lm$model
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) am carb cyl disp drat gear hp qsec
-0.6272999 0.1670950 -0.0663126 0.0237529 0.0002376 0.0759944 0.1963335 -0.0030367 0.1043210
vs wt
-0.0080166 -0.5800635
$model$regr.lm$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$model$regr.lm$train_time
[1] 0.006
$model$regr.lm$param_vals
named list()
$model$regr.lm$task_hash
[1] "612ab4e0ad596159"
$model$regr.lm$data_prototype
Empty data.table (0 rows and 11 cols): mpg.bc,am,carb,cyl,disp,drat...
$model$regr.lm$task_prototype
Empty data.table (0 rows and 11 cols): mpg.bc,am,carb,cyl,disp,drat...
$model$regr.lm$mlr3_version
[1] '0.13.2'
$model$regr.lm$train_task
<TaskRegr:mtcars> (32 x 11)
* Target: mpg.bc
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
$model$targettrafoboxcox
$model$targettrafoboxcox$bc
Standardized Box Cox Transformation with 32 nonmissing obs.:
Estimated statistics:
- lambda = 0.02955701
- mean (before standardization) = 3.092016
- sd (before standardization) = 0.324959
$model$targetinvert
list()
$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$train_time
[1] 0.061
$param_vals
named list()
$task_hash
[1] "58a137d2055e8406"
$data_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$task_prototype
Empty data.table (0 rows and 11 cols): mpg,am,carb,cyl,disp,drat...
$mlr3_version
[1] '0.13.2'
$train_task
<TaskRegr:mtcars> (32 x 11)
* Target: mpg
* Properties: -
* Features (10):
- dbl (10): am, carb, cyl, disp, drat, gear, hp, qsec, vs, wt
```

```
gl_bc$predict(task)
```

```
<PredictionRegr> for 32 observations:
row_ids truth response
1 21.0 21.70854
2 21.0 21.13946
3 22.8 25.75242
---
30 19.7 19.58934
31 15.0 14.10658
32 21.4 23.15263
```

We could now proceed to benchmark our different target transformations:
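A sketch of how such a benchmark might be set up, assuming 10-fold cross-validation as suggested by the `cv` resampling with 10 iterations in the results. To stay self-contained, it compares only the plain learner with the \(\log\) pipeline; the Box-Cox `GraphLearner` (`gl_bc`) could be appended to the `learners` list. The names `g_log`, `gl_log`, `design` and `bmr_sketch` are illustrative, and the resulting numbers depend on the seed and package versions:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

lgr::get_logger("mlr3")$set_threshold("warn")
set.seed(7832)

task = tsk("mtcars")
learner_lm = lrn("regr.lm")

# Log transformation pipeline built via the syntactic sugar
g_log = ppl("targettrafo", graph = learner_lm)
g_log$param_set$values$targetmutate.trafo = function(x) log(x)
g_log$param_set$values$targetmutate.inverter = function(x) list(response = exp(x$response))
gl_log = GraphLearner$new(g_log)

# Every learner is evaluated on the same 10-fold cross-validation splits
design = benchmark_grid(
  tasks = list(task),
  learners = list(learner_lm, gl_log),
  resamplings = list(rsmp("cv", folds = 10))
)
bmr_sketch = benchmark(design)
bmr_sketch$aggregate(msr("regr.mse"))
```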

```
bmr$aggregate(msr("regr.mse"))
```

```
   nr      resample_result task_id                             learner_id resampling_id iters  regr.mse
1:  1 <ResampleResult[22]>  mtcars                                regr.lm            cv    10 11.866071
2:  2 <ResampleResult[22]>  mtcars      targetmutate.regr.lm.targetinvert            cv    10  7.793303
3:  3 <ResampleResult[22]>  mtcars targettrafoboxcox.regr.lm.targetinvert            cv    10  8.230192
```

For attribution, please cite this work as

Schneider (2020, June 15). mlr-org: Target Transformations via Pipelines. Retrieved from https://mlr-org.github.io/mlr-org-website/gallery/2020-06-15-target-transformations-via-pipelines/

BibTeX citation

@misc{schneider2020target, author = {Schneider, Lennart}, title = {mlr-org: Target Transformations via Pipelines}, url = {https://mlr-org.github.io/mlr-org-website/gallery/2020-06-15-target-transformations-via-pipelines/}, year = {2020} }