Encode Factor Levels for xgboost


We show how to encode factor levels for the xgboost learner with mlr3pipelines.

Michel Lang

The classif.xgboost learner accepts only logical, integer, and numeric features, so xgboost cannot consume factor columns directly. Factor columns therefore have to be converted to numeric dummy features first. We show how to use mlr3pipelines to augment the xgboost learner with automatic factor encoding.

We load the mlr3verse package which pulls in the most important packages for this example.

We initialize the random number generator with a fixed seed for reproducibility and reduce the verbosity of the logger to keep the output concise.
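The setup described above might look as follows. This is a minimal sketch: the seed value is an arbitrary placeholder chosen for illustration, and "warn" is assumed as the logging threshold.

```r
# Load mlr3 together with mlr3pipelines, mlr3learners and friends
library(mlr3verse)

# Arbitrary seed (an assumption, any fixed integer works)
set.seed(1)

# Only show warnings and errors from the mlr3 logger
lgr::get_logger("mlr3")$set_threshold("warn")
```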


Construct the Base Objects

First, we take an example task with factors (german_credit) and create the xgboost learner:


task = tsk("german_credit")
print(task)
<TaskClassif:german_credit> (1000 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
    other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

learner = lrn("classif.xgboost", nrounds = 100)
print(learner)
* Model: -
* Parameters: nrounds=100, nthread=1, verbose=0
* Packages: mlr3, mlr3learners, xgboost
* Predict Type: response
* Feature types: logical, integer, numeric
* Properties: hotstart_forward, importance, missings, multiclass, twoclass, weights

We now compare the feature types of the task and the supported feature types:

unique(task$feature_types$type)
[1] "integer" "factor"  "ordered"
learner$feature_types
[1] "logical" "integer" "numeric"
setdiff(task$feature_types$type, learner$feature_types)
[1] "factor"  "ordered"

In this example, we have to convert factors and ordered factors to numeric columns to apply the xgboost learner. Because xgboost is based on decision trees (at least in its default settings), it is perfectly fine to convert the ordered factors to integer. Unordered factors must still be encoded though.
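To illustrate why the integer conversion is harmless for tree-based models, here is a small base R sketch. The factor `rate` and its levels are made up for illustration and are not taken from german_credit.

```r
# Hypothetical ordered factor with three levels
rate = factor(c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"), ordered = TRUE)

# as.integer() returns the position of each value in the level order
codes = as.integer(rate)
print(codes)  # 1 3 2 1

# Sorting by the integer codes gives the same ordering as sorting the factor
stopifnot(identical(order(rate), order(codes)))
```

A tree split only asks whether a value lies below or above a threshold, so any order-preserving recoding leads to the same candidate partitions of the data.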

Construct Operators

The factor encoder’s man page can be found under mlr_pipeops_encode. Here, we decide to use “treatment” encoding (first factor level serves as baseline, and there will be a new binary column for each additional level). We restrict the operator to factor columns using the respective Selector selector_type():

fencoder = po("encode", method = "treatment", affect_columns = selector_type("factor"))
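The idea behind treatment encoding can be sketched in base R, independently of mlr3pipelines. The toy factor `housing` below is hypothetical, and the column names produced by `model.matrix()` differ from the ones the PipeOp generates; the sketch only shows the shape of the result.

```r
# Hypothetical toy factor; "for free" is the first level and thus the baseline
toy = data.frame(housing = factor(c("rent", "own", "for free", "own"),
  levels = c("for free", "rent", "own")))

# Treatment contrasts: one 0/1 column per non-baseline level.
# The intercept column is dropped because it carries no level information.
dummies = model.matrix(~ housing, data = toy,
  contrasts.arg = list(housing = "contr.treatment"))[, -1, drop = FALSE]
print(dummies)
```

A factor with k levels yields k - 1 columns, and rows belonging to the baseline level contain only zeros.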

We can manually trigger the PipeOp to test the operator on our task:

fencoder$train(list(task))
<TaskClassif:german_credit> (1000 x 50)
* Target: credit_risk
* Properties: twoclass
* Features (49):
  - dbl (43): credit_history.all.credits.at.this.bank.paid.back.duly,
    credit_history.no.credits.taken.all.credits.paid.back.duly, employment_duration....7.yrs,
    employment_duration...1.yr, employment_duration.1..........4.yrs, employment_duration.4..........7.yrs,
    foreign_worker.yes, housing.own, housing.rent, job.manager.self.empl.highly.qualif..employee,
    job.skilled.employee.official, job.unskilled...resident, other_debtors.co.applicant,
    other_debtors.guarantor, other_installment_plans.none, other_installment_plans.stores,
    people_liable.3.or.more, personal_status_sex.female...non.single.or.male...single,
    personal_status_sex.female...single, personal_status_sex.male...married.widowed,
    property.building.soc..savings.agr....life.insurance, property.car.or.other, property.real.estate,
    purpose.business, purpose.car..new., purpose.car..used., purpose.domestic.appliances,
    purpose.education, purpose.furniture.equipment, purpose.radio.television, purpose.repairs,
    purpose.retraining, purpose.vacation, savings........1000.DM, savings.......100.DM,
    savings.100..........500.DM, savings.500..........1000.DM,
    status........200.DM...salary.for.at.least.1.year, status.......0.DM, status.0.........200.DM,
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

The ordered factors remained untouched; all other factors have been converted to numeric columns. To also convert the ordered variables installment_rate, number_credits, and present_residence, we construct the colapply operator with the converter as.integer():

ord_to_int = po("colapply", applicator = as.integer, affect_columns = selector_type("ordered"))

Applied to the original task, it converts the ordered factor columns to integer and leaves the unordered factors as they are:

ord_to_int$train(list(task))
<TaskClassif:german_credit> (1000 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
    other_installment_plans, people_liable, personal_status_sex, property, purpose, savings, status,
  - int (6): age, amount, duration, installment_rate, number_credits, present_residence

Construct Pipeline

Finally, we construct a linear pipeline consisting of

  1. the factor encoder fencoder,
  2. the ordered factor converter ord_to_int, and
  3. the xgboost base learner.

graph = fencoder %>>% ord_to_int %>>% learner
print(graph)
Graph with 3 PipeOps:
              ID         State        sccssors prdcssors
          encode        <list>        colapply          
        colapply        <list> classif.xgboost    encode
 classif.xgboost <<UNTRAINED>>                  colapply

The pipeline is wrapped in a GraphLearner so that it behaves like a regular learner:

graph_learner = as_learner(graph)

We can now apply the new learner to the task, here with 3-fold cross-validation:

rr = resample(task, graph_learner, rsmp("cv", folds = 3))
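One way to inspect the resampling result is to aggregate a performance measure over the folds. The following self-contained sketch repeats the steps from above and assumes the classification error ("classif.ce") as the measure and an arbitrary seed; neither choice is prescribed by this post.

```r
library(mlr3verse)

set.seed(1)  # arbitrary seed (an assumption)
lgr::get_logger("mlr3")$set_threshold("warn")

task = tsk("german_credit")

# Encode unordered factors, convert ordered factors, then fit xgboost
graph = po("encode", method = "treatment", affect_columns = selector_type("factor")) %>>%
  po("colapply", applicator = as.integer, affect_columns = selector_type("ordered")) %>>%
  lrn("classif.xgboost", nrounds = 100)
graph_learner = as_learner(graph)

rr = resample(task, graph_learner, rsmp("cv", folds = 3))

# Mean classification error across the three folds
ce = rr$aggregate(msr("classif.ce"))
print(ce)
```

The value depends on the seed and the installed xgboost version, so no specific number is given here.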

Success! We augmented xgboost with handling of factors and ordered factors. If we combine this learner with a tuner from mlr3tuning, we get a universal and competitive learner.


For attribution, please cite this work as

Lang (2020, Jan. 31). mlr-org: Encode Factor Levels for xgboost. Retrieved from https://mlr-org.github.io/mlr-org-website/gallery/2020-01-31-encode-factors-for-xgboost/

BibTeX citation

  author = {Lang, Michel},
  title = {mlr-org: Encode Factor Levels for xgboost},
  url = {https://mlr-org.github.io/mlr-org-website/gallery/2020-01-31-encode-factors-for-xgboost/},
  year = {2020}