We just released mlr v2.15.0 to CRAN. This version includes some breaking changes and the usual bug fixes from the last three months.
We made good progress on our goal of cleaning up the GitHub repo. We processed nearly all open pull requests (around 40). Over the coming months we will focus on cleaning up the issue tracker, even though most of our time will go into improving the successor package mlr3 and its extension packages.
Unless there are active contributions from the user side, we do not expect many feature additions in the next version(s) of mlr.
The benchmark() function no longer stores the tuning results (kept in the $extract slot) by default.
This change was made to prevent BenchmarkResult (BMR) objects from growing to several gigabytes when multiple models are compared with extensive tuning.
Unless you want to analyze the effects of tuning, you do not need the tuning results to compare the performance of the algorithms.
Huge BMR objects can cause various problems.
One of them (the initial reason for this change) appears when benchmarking is done on an HPC system using multiple workers.
Each worker has a limited amount of memory, and having to account for a huge BMR might limit the number of workers that can be spawned.
In addition, loading the large resulting BMR into the global environment (or merging it using mergeBenchmarkResults()) for post-analysis becomes a pain.
To save users from all of these troubles in the first place, we decided to change the default.
To store the tuning results, you now have to explicitly set keep.extract = TRUE.
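As a minimal sketch of opting back in, the snippet below benchmarks a tuned learner with keep.extract = TRUE and then retrieves the tuning results. The learner, the parameter grid, and the resampling setup are illustrative assumptions and not part of the release notes.

```r
library(mlr)

# Illustrative assumption: tune rpart's complexity parameter over a tiny grid.
ps = makeParamSet(makeDiscreteParam("cp", values = c(0.01, 0.05, 0.1)))
lrn = makeTuneWrapper("classif.rpart",
  resampling = makeResampleDesc("Holdout"),
  par.set = ps,
  control = makeTuneControlGrid(),
  show.info = FALSE)

# Outer resampling used to compare learners.
outer = makeResampleDesc("CV", iters = 2)

# keep.extract = TRUE asks benchmark() to keep the tuning results.
bmr = benchmark(lrn, iris.task, outer, keep.extract = TRUE, show.info = FALSE)

# Nested list: task -> learner -> one tuning result per outer iteration.
tune.res = getBMRTuneResults(bmr)
```

With the default keep.extract = FALSE, the same call returns a smaller object that still contains everything needed to compare performance.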
Not storing the tuning results was already implicitly the default in resample(), since the user had to set the extract argument manually to save certain results (tuning, feature importance).
With this change, the behavior of the two functions is consistent.
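For comparison, here is a small sketch of how tuning results are requested in resample() through the extract argument. The tuned rpart learner and the grid are again illustrative assumptions.

```r
library(mlr)

# Illustrative assumption: a small tuning setup for rpart.
ps = makeParamSet(makeDiscreteParam("cp", values = c(0.01, 0.1)))
lrn = makeTuneWrapper("classif.rpart",
  resampling = makeResampleDesc("Holdout"),
  par.set = ps,
  control = makeTuneControlGrid(),
  show.info = FALSE)

# Without extract = getTuneResult, no tuning results are kept.
r = resample(lrn, iris.task, makeResampleDesc("CV", iters = 2),
  extract = getTuneResult, show.info = FALSE)

# r$extract holds one tuning result per resampling iteration.
r$extract
```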
Changes to Filters
New ensemble filters
With this release it is possible to calculate ensemble filters with mlr (Seijo-Pardo et al. 2017).
“Ensemble filters” are similar to ensemble models in that multiple filters are used to generate the ranking of features.
Multiple aggregation functions are supported (median(), “Borda”), with the latter being the most widely used in the literature at the time of writing.
To our knowledge, no other package or framework in R currently supports ensemble filters the way mlr does. Since mlr makes it possible to use filters from a variety of different packages, the user is able to create powerful ensemble filters. Note, however, that you currently cannot tune the selection of simple filters, since tuning a character vector param is not supported by ParamHelpers. See this discussion for more information.
Here is a simple toy example of how to create an ensemble filter in mlr:
    library(mlr)
    ## Loading required package: ParamHelpers

    filterFeatures(iris.task,
      method = "E-min",
      base.methods = c("FSelectorRcpp_gain.ratio", "FSelectorRcpp_information.gain"),
      abs = 2)
    ## Supervised task: iris-example
    ## Type: classif
    ## Target: Species
    ## Observations: 150
    ## Features:
    ##    numerics     factors     ordered functionals
    ##           2           0           0           0
    ## Missings: FALSE
    ## Has weights: FALSE
    ## Has blocking: FALSE
    ## Has coordinates: FALSE
    ## Classes: 3
    ##     setosa versicolor  virginica
    ##         50         50         50
    ## Positive class: NA
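To illustrate the idea behind rank-based aggregation, here is a rough base R sketch of a Borda-style count. It uses the gain ratio and information gain values reported for iris.task in this post. This is not mlr's implementation, and details such as tie handling are assumptions.

```r
# Filter scores for the four iris features (values as reported in this post).
scores = data.frame(
  gain.ratio       = c(0.4196464, 0.2472972, 0.8584937, 0.8713692),
  information.gain = c(0.4521286, 0.2672750, 0.9402853, 0.9554360),
  row.names = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
)

# Rank the features within each filter (rank 1 = highest score).
ranks = sapply(scores, function(x) rank(-x))
rownames(ranks) = rownames(scores)

# Borda-style aggregation (assumption): sum the ranks across filters,
# so that a lower total means a better overall position.
borda = sort(rowSums(ranks))

# The two best-ranked features across both filters.
names(borda)[1:2]
```

Both filters agree on the ordering here, so the aggregated ranking selects the two petal features, which matches the two numeric features kept by the ensemble filter above.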
New return structure for filter values
With the added support for ensemble filters we also changed the return structure of calculated filter values.
The new structure makes it easier to apply post-analysis tasks like grouping and filtering. The “method” of each row is now stored in one column and the filter values are stored in a separate one. We also added a default sorting of the results by the “value” of each “method”.
Below is a comparison of the old and new output:
    # new
    generateFilterValuesData(iris.task,
      method = c("FSelectorRcpp_gain.ratio", "FSelectorRcpp_information.gain"))
    ## FilterValues:
    ## Task: iris-example
    ##           name    type                         method     value
    ## 4  Petal.Width numeric       FSelectorRcpp_gain.ratio 0.8713692
    ## 3 Petal.Length numeric       FSelectorRcpp_gain.ratio 0.8584937
    ## 1 Sepal.Length numeric       FSelectorRcpp_gain.ratio 0.4196464
    ## 2  Sepal.Width numeric       FSelectorRcpp_gain.ratio 0.2472972
    ## 8  Petal.Width numeric FSelectorRcpp_information.gain 0.9554360
    ## 7 Petal.Length numeric FSelectorRcpp_information.gain 0.9402853
    ## 5 Sepal.Length numeric FSelectorRcpp_information.gain 0.4521286
    ## 6  Sepal.Width numeric FSelectorRcpp_information.gain 0.2672750
    # old
    generateFilterValuesData(iris.task,
      method = c("gain.ratio", "information.gain"))
    ## FilterValues:
    ## Task: iris-example
    ##           name    type gain.ratio information.gain
    ## 1 Sepal.Length numeric  0.4196464        0.4521286
    ## 2  Sepal.Width numeric  0.2472972        0.2672750
    ## 3 Petal.Length numeric  0.8584937        0.9402853
    ## 4  Petal.Width numeric  0.8713692        0.9554360
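As a rough illustration of why the long format is convenient, the base R sketch below rebuilds the new output by hand as a plain data frame and applies typical filtering and grouping steps. The data frame is constructed manually so the example is self-contained; with mlr you would work on the table inside the returned object instead (assumed here to be its data element). The threshold of 0.5 is arbitrary.

```r
# Long-format filter values, copied from the new output shown above.
fv = data.frame(
  name = rep(c("Petal.Width", "Petal.Length", "Sepal.Length", "Sepal.Width"),
    times = 2),
  method = rep(c("FSelectorRcpp_gain.ratio", "FSelectorRcpp_information.gain"),
    each = 4),
  value = c(0.8713692, 0.8584937, 0.4196464, 0.2472972,
            0.9554360, 0.9402853, 0.4521286, 0.2672750),
  stringsAsFactors = FALSE
)

# Filtering: keep rows whose score exceeds an arbitrary threshold.
strong = fv[fv$value > 0.5, ]

# Grouping: mean score per feature across both methods.
mean.by.feature = aggregate(value ~ name, data = fv, FUN = mean)

# Grouping: best-scoring feature within each method.
best = do.call(rbind, lapply(split(fv, fv$method), function(d) {
  d[which.max(d$value), ]
}))
```

With the old wide format, each of these steps would first require reshaping the table or looping over the method columns.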
Besides the integration of new learners and some added options for integrated ones (check the NEWS file), we fixed a bug that caused an incorrect aggregation of probabilities in certain cases.
This bug went undetected for quite some time and was only revealed by a change in the data.table package.
Thankfully, @danielhorn reported the issue and we were able to fix it within a few days.
Another notable change is that the commonly used e1071::svm() learner now uses the formula interface internally only if factors are present in the data.
This aims to prevent the “stack overflow” problems that some users encountered with large datasets.
With PR #1784 we added more support for estimating standard errors using the internal methods of the “Random Forest” algorithm. Please check the NEWS file for more detailed information about the implemented RF learners.
Seijo-Pardo, B., I. Porto-Díaz, V. Bolón-Canedo, and A. Alonso-Betanzos. 2017. “Ensemble Feature Selection: Homogeneous and Heterogeneous Approaches.” Knowledge-Based Systems 118 (February): 124–39. https://doi.org/10/f9qgrv.