Encode factor variables in a task.
Michel Lang
January 31, 2020
The package xgboost unfortunately does not support handling of categorical features. Therefore, factor columns have to be converted into numerical dummy features manually. We show how to use mlr3pipelines to augment the xgboost learner with an automatic factor encoding.
We load the mlr3verse package which pulls in the most important packages for this example.
We initialize the random number generator with a fixed seed for reproducibility, and reduce the verbosity of the logger to keep the output concise.
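A minimal sketch of this setup; the seed value is a placeholder, since the value used originally is not shown in the output.

library(mlr3verse)

# fix the seed for reproducibility (the concrete value here is arbitrary)
set.seed(1)

# lower the logging threshold of mlr3 to suppress info messages
lgr::get_logger("mlr3")$set_threshold("warn")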
First, we take an example task with factors (german_credit) and create the xgboost learner:
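A sketch of the two objects, assuming the usual sugar functions tsk() and lrn(); nrounds = 100 mirrors the hyperparameter shown in the printed learner below, the remaining values are defaults.

# example task with factor features and the xgboost learner
task = tsk("german_credit")
learner = lrn("classif.xgboost", nrounds = 100)

task
learner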
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
- fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
telephone
- int (3): age, amount, duration
- ord (3): installment_rate, number_credits, present_residence
<LearnerClassifXgboost:classif.xgboost>
* Model: -
* Parameters: nrounds=100, nthread=1, verbose=0, early_stopping_set=none
* Packages: mlr3, mlr3learners, xgboost
* Predict Types: [response], prob
* Feature Types: logical, integer, numeric
* Properties: hotstart_forward, importance, missings, multiclass, twoclass, weights
We now compare the feature types of the task with the feature types supported by the learner:
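One way to make this comparison, using the task's $feature_types table and the learner's $feature_types field; the three outputs below are the types present in the task, the types the learner supports, and the unsupported types, respectively.

# feature types present in the task
unique(task$feature_types$type)

# feature types supported by the learner
learner$feature_types

# feature types the learner cannot handle
setdiff(unique(task$feature_types$type), learner$feature_types)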
[1] "integer" "factor" "ordered"
[1] "logical" "integer" "numeric"
[1] "factor" "ordered"
In this example, we have to convert factors and ordered factors to numeric columns to apply the xgboost learner. Because xgboost is based on decision trees (at least in its default settings), it is perfectly fine to convert the ordered factors to integers. Unordered factors, however, must still be encoded.
The factor encoder’s man page can be found under mlr_pipeops_encode. Here, we decide to use “treatment” encoding (the first factor level serves as baseline, and there will be a new binary column for each additional level). We restrict the operator to factor columns using the respective Selector selector_type():
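A sketch of the construction, assuming the po() shortcut; the object name fencoder is the one referred to later in the text.

# treatment-encode unordered factor columns only
fencoder = po("encode",
  method = "treatment",
  affect_columns = selector_type("factor")
)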
We can manually trigger the PipeOp to test the operator on our task:
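A PipeOp's $train() method takes a list of inputs and returns a named list of outputs, which is why the result below is printed under $output:

fencoder$train(list(task))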
$output
<TaskClassif:german_credit> (1000 x 50): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (49):
- dbl (43): credit_history.all.credits.at.this.bank.paid.back.duly,
credit_history.critical.account.other.credits.elsewhere,
credit_history.existing.credits.paid.back.duly.till.now,
credit_history.no.credits.taken.all.credits.paid.back.duly, employment_duration....7.yrs,
employment_duration...1.yr, employment_duration.1..........4.yrs, employment_duration.4..........7.yrs,
foreign_worker.yes, housing.own, housing.rent, job.manager.self.empl.highly.qualif..employee,
job.skilled.employee.official, job.unskilled...resident, other_debtors.co.applicant,
other_debtors.guarantor, other_installment_plans.none, other_installment_plans.stores,
people_liable.3.or.more, personal_status_sex.female...non.single.or.male...single,
personal_status_sex.female...single, personal_status_sex.male...married.widowed,
property.building.soc..savings.agr....life.insurance, property.car.or.other, property.real.estate,
purpose.business, purpose.car..new., purpose.car..used., purpose.domestic.appliances,
purpose.education, purpose.furniture.equipment, purpose.radio.television, purpose.repairs,
purpose.retraining, purpose.vacation, savings........1000.DM, savings.......100.DM,
savings.100..........500.DM, savings.500..........1000.DM,
status........200.DM...salary.for.at.least.1.year, status.......0.DM, status.0.........200.DM,
telephone.yes..under.customer.name.
- int (3): age, amount, duration
- ord (3): installment_rate, number_credits, present_residence
The ordered factors remained untouched, while all other factors have been converted to numeric columns. To also convert the ordered variables installment_rate, number_credits, and present_residence, we construct the colapply operator with the converter as.integer():
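A sketch of this operator; the applicator parameter and the restriction to ordered columns via selector_type() are chosen to match the result shown below, and the object name ord_to_int is the one used later.

# convert ordered factor columns to integers
ord_to_int = po("colapply",
  applicator = as.integer,
  affect_columns = selector_type("ordered")
)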
Applied to the original task, it converts the ordered factor columns to integer:
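Again, we can trigger the PipeOp manually to inspect its effect:

ord_to_int$train(list(task))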
$output
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
- fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
telephone
- int (6): age, amount, duration, installment_rate, number_credits, present_residence
Finally, we construct a linear pipeline consisting of fencoder, ord_to_int, and the xgboost learner:
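A sketch using the %>>% concatenation operator from mlr3pipelines; printing the resulting Graph gives the overview below (the two preprocessing PipeOps already carry a state because we trained them manually above, while the learner is still untrained).

# chain the two preprocessing steps and the learner into a Graph
graph = fencoder %>>% ord_to_int %>>% learner
graph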
Graph with 3 PipeOps:
             ID         State        sccssors prdcssors
         encode        <list>        colapply
       colapply        <list> classif.xgboost    encode
classif.xgboost <<UNTRAINED>>                  colapply
The pipeline is wrapped in a GraphLearner so that it behaves like a regular learner:
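Assuming the as_learner() converter:

graph_learner = as_learner(graph)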
We can now apply the new learner to the task, here with 3-fold cross-validation:
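A sketch of the resampling call; evaluating with the classification error measure is an assumption here, since the original output is not reproduced.

# 3-fold cross-validation of the pipeline learner
rr = resample(task, graph_learner, rsmp("cv", folds = 3))

# aggregate the classification error across the folds
rr$aggregate(msr("classif.ce"))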
Success! We augmented xgboost with handling of factors and ordered factors. If we combine this learner with a tuner from mlr3tuning, we get a universal and competitive learner.