Encode Factor Levels for xgboost

Encode factor variables in a task.

Author

Michel Lang

Published

January 31, 2020

The xgboost package unfortunately does not support categorical features. Therefore, factor columns must be converted to numeric dummy features manually before training. We show how to use mlr3pipelines to augment the xgboost learner with an automatic factor encoding.

We load the mlr3verse package which pulls in the most important packages for this example.

library(mlr3verse)
Loading required package: mlr3

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output concise.

set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")

Construct the Base Objects

First, we take an example task with factors (german_credit) and create the xgboost learner:

library(mlr3learners)

task = tsk("german_credit")
print(task)
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
    other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
    telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence
learner = lrn("classif.xgboost", nrounds = 100)
print(learner)
<LearnerClassifXgboost:classif.xgboost>
* Model: -
* Parameters: nrounds=100, nthread=1, verbose=0, early_stopping_set=none
* Packages: mlr3, mlr3learners, xgboost
* Predict Types:  [response], prob
* Feature Types: logical, integer, numeric
* Properties: hotstart_forward, importance, missings, multiclass, twoclass, weights

We now compare the feature types of the task and the supported feature types:

unique(task$feature_types$type)
[1] "integer" "factor"  "ordered"
learner$feature_types
[1] "logical" "integer" "numeric"
setdiff(task$feature_types$type, learner$feature_types)
[1] "factor"  "ordered"

In this example, we have to convert factors and ordered factors to numeric columns to apply the xgboost learner. Because xgboost is based on decision trees (at least in its default settings), it is perfectly fine to convert the ordered factors to integer: tree splits depend only on the ordering of the values, which the integer codes preserve. Unordered factors must still be encoded though.
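
To see why this conversion is safe, consider a small base-R example (independent of mlr3): the integer codes of an ordered factor follow the level order, so any threshold split on the integers corresponds to a split on the ordered levels.

```r
# ordered factor with an explicit level order
f = factor(c("low", "high", "medium"),
  levels = c("low", "medium", "high"), ordered = TRUE)

as.integer(f)
# [1] 1 3 2  -- codes respect the level order: low < medium < high
```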

Construct Operators

The factor encoder’s man page can be found under mlr_pipeops_encode. Here, we decide to use “treatment” encoding (first factor level serves as baseline, and there will be a new binary column for each additional level). We restrict the operator to factor columns using the respective Selector selector_type():

fencoder = po("encode", method = "treatment", affect_columns = selector_type("factor"))
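
To illustrate what treatment encoding produces, here is a small base-R sketch (not part of the pipeline itself): model.matrix() with the default treatment contrasts uses the first level as baseline and creates one binary column per remaining level.

```r
x = factor(c("a", "b", "c", "a"))

# drop the intercept column to keep only the dummy columns
model.matrix(~ x)[, -1]
#   xb xc
# 1  0  0
# 2  1  0
# 3  0  1
# 4  0  0
```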

We can manually trigger the PipeOp to test the operator on our task:

fencoder$train(list(task))
$output
<TaskClassif:german_credit> (1000 x 50): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (49):
  - dbl (43): credit_history.all.credits.at.this.bank.paid.back.duly,
    credit_history.critical.account.other.credits.elsewhere,
    credit_history.existing.credits.paid.back.duly.till.now,
    credit_history.no.credits.taken.all.credits.paid.back.duly, employment_duration....7.yrs,
    employment_duration...1.yr, employment_duration.1..........4.yrs, employment_duration.4..........7.yrs,
    foreign_worker.yes, housing.own, housing.rent, job.manager.self.empl.highly.qualif..employee,
    job.skilled.employee.official, job.unskilled...resident, other_debtors.co.applicant,
    other_debtors.guarantor, other_installment_plans.none, other_installment_plans.stores,
    people_liable.3.or.more, personal_status_sex.female...non.single.or.male...single,
    personal_status_sex.female...single, personal_status_sex.male...married.widowed,
    property.building.soc..savings.agr....life.insurance, property.car.or.other, property.real.estate,
    purpose.business, purpose.car..new., purpose.car..used., purpose.domestic.appliances,
    purpose.education, purpose.furniture.equipment, purpose.radio.television, purpose.repairs,
    purpose.retraining, purpose.vacation, savings........1000.DM, savings.......100.DM,
    savings.100..........500.DM, savings.500..........1000.DM,
    status........200.DM...salary.for.at.least.1.year, status.......0.DM, status.0.........200.DM,
    telephone.yes..under.customer.name.
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

The ordered factors remained untouched, while all unordered factors have been converted to numeric columns. To also convert the ordered variables installment_rate, number_credits, and present_residence, we construct the colapply operator with the converter as.integer():

ord_to_int = po("colapply", applicator = as.integer, affect_columns = selector_type("ordered"))

Applied on the original task, it converts the ordered factor columns to integer:

ord_to_int$train(list(task))
$output
<TaskClassif:german_credit> (1000 x 21): German Credit
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
    other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
    telephone
  - int (6): age, amount, duration, installment_rate, number_credits, present_residence

Construct Pipeline

Finally, we construct a linear pipeline consisting of

  1. the factor encoder fencoder,
  2. the ordered factor converter ord_to_int, and
  3. the xgboost base learner.
graph = fencoder %>>% ord_to_int %>>% learner
print(graph)
Graph with 3 PipeOps:
              ID         State        sccssors prdcssors
          encode        <list>        colapply          
        colapply        <list> classif.xgboost    encode
 classif.xgboost <<UNTRAINED>>                  colapply

The pipeline is wrapped in a GraphLearner so that it behaves like a regular learner:

graph_learner = as_learner(graph)
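
As a quick sketch of the resulting API (assuming the task from above is still in scope), the GraphLearner exposes the usual train/predict interface; note that predicting on the training data here only demonstrates the calls, not generalization performance:

```r
# train the whole pipeline (encoding + conversion + xgboost) in one call
graph_learner$train(task)

# predictions pass through the same preprocessing automatically
prediction = graph_learner$predict(task)
prediction$score(msr("classif.acc"))
```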

We can now apply the new learner to the task, here with a 3-fold cross-validation:

rr = resample(task, graph_learner, rsmp("cv", folds = 3))
rr$aggregate()
classif.ce 
 0.2620435 

Success! We augmented xgboost with handling of factors and ordered factors. If we combine this learner with a tuner from mlr3tuning, we get a universal and competitive learner.
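
As a sketch of that last step (using the mlr3tuning helpers auto_tuner() and to_tune(); exact helper names may differ between package versions), one could tune two common xgboost hyperparameters of the pipeline:

```r
library(mlr3tuning)

# tune learning rate and tree depth of the xgboost step;
# parameter names are prefixed with the PipeOp id "classif.xgboost"
graph_learner$param_set$set_values(
  classif.xgboost.eta       = to_tune(0.01, 0.3),
  classif.xgboost.max_depth = to_tune(1L, 10L)
)

at = auto_tuner(
  tuner      = tnr("random_search"),
  learner    = graph_learner,
  resampling = rsmp("cv", folds = 3),
  measure    = msr("classif.ce"),
  term_evals = 20
)
at$train(task)
```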