Most Popular Learners in mlr

An attempt to assess the popularity of mlr learners

For the development of mlr, as well as for a “machine learning expert”, it can be handy to know which learners are the most popular. Not necessarily to see which methods perform best, but to see what is used “out there” in the real world. Thanks to the nice little package cranlogs from metacran, you can at least get a rough estimate, as I will show in the following…

First we need to install the cranlogs package using devtools:

devtools::install_github("metacran/cranlogs")

Now let’s load all the packages we will need:

library(mlr)
library(stringi)
library(cranlogs)
library(data.table)

To obtain a neat table of all available learners in mlr, we can call listLearners(). This table also contains a column listing the packages each learner needs, separated by commas.

# obtain used packages for all learners
lrns = as.data.table(listLearners())
all.pkgs = stri_split(lrns$package, fixed = ",")

Note: You might get some warnings here because you likely did not install all packages that mlr suggests – which is totally fine.
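To make the splitting step concrete, here is a minimal sketch on made-up data. It uses base R's strsplit() instead of stri_split() so that it runs without extra packages; for simple inputs like these both should give the same result. The names pkg.col, toy.pkgs and toy.unique are only for illustration.

```r
# Toy stand-in for the `package` column of listLearners()
# (hypothetical values, chosen to include multi-package learners).
pkg.col = c("e1071", "e1071,clue", "party,modeltools")

# Split each entry at the commas: one character vector per learner.
toy.pkgs = strsplit(pkg.col, ",", fixed = TRUE)

# Flatten and deduplicate to get the set of packages to look up.
toy.unique = unique(unlist(toy.pkgs))

print(toy.pkgs)
print(toy.unique)
```

The list keeps one entry per learner, which matters later when the download counts are mapped back to the learners.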

Now we can obtain the download counts of the last month from the RStudio CRAN mirror. We use data.table to easily sum up the daily download counts of each package.

all.downloads = cran_downloads(packages = unique(unlist(all.pkgs)), 
                               when = "last-month")
all.downloads = as.data.table(all.downloads)
monthly.downloads = all.downloads[, list(monthly = sum(count)), by = package]
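The aggregation can be tried without network access on a small made-up log. The sketch below assumes the daily log has the columns date, count and package, and uses base R's aggregate() as a counterpart to the data.table call; toy.daily and toy.monthly are illustrative names and the numbers are invented.

```r
# Toy daily download log with made-up counts for two hypothetical packages.
toy.daily = data.frame(
  date = as.Date("2017-03-01") + c(0, 1, 2, 0, 1, 2),
  count = c(10, 20, 30, 1, 2, 3),
  package = rep(c("pkgA", "pkgB"), each = 3),
  stringsAsFactors = FALSE
)

# Sum the daily counts per package.
toy.monthly = aggregate(count ~ package, data = toy.daily, FUN = sum)
names(toy.monthly)[2] = "monthly"

print(toy.monthly)
```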

As some learners need multiple packages, we will use the download count of the package with the fewest downloads.

lrn.downloads = sapply(all.pkgs, function(pkgs) {
  monthly.downloads[package %in% pkgs, min(monthly)]
})
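Here is a small sketch of the "least downloaded package" rule on invented numbers, written in base R. It also guards one edge case: if none of a learner's packages has a download record, min() on an empty vector would return Inf with a warning. Whether that can actually happen with cran_downloads() is not verified here, so treat the NA handling as an assumption. All names starting with toy. and the package "notOnCran" are hypothetical.

```r
# Toy monthly totals (made-up numbers), stored as a named vector.
toy.monthly = c(e1071 = 400, clue = 40, party = 30)

# Toy package lists, one element per learner.
toy.pkgs = list("e1071", c("e1071", "clue"), c("party", "notOnCran"))

# For each learner, take the minimum over the packages that have a record.
# Learners without any recorded package get NA instead of Inf.
toy.lrn.downloads = sapply(toy.pkgs, function(pkgs) {
  counts = toy.monthly[intersect(pkgs, names(toy.monthly))]
  if (length(counts) == 0) NA_real_ else min(counts)
})

print(toy.lrn.downloads)
# 400 40 30
```

Taking the minimum is a conservative choice: a learner cannot be used more often than its rarest dependency is installed.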

Let’s put these numbers in our table:

lrns$downloads = lrn.downloads
lrns = lrns[order(downloads, decreasing = TRUE),]
lrns[, .(class, name, package, downloads)]

Here are the first 5 rows of the table:

| class | name | package | downloads |
|---|---|---|---|
| classif.naiveBayes | Naive Bayes | e1071 | 415683 |
| classif.svm | Support Vector Machines (libsvm) | e1071 | 415683 |
| regr.svm | Support Vector Machines (libsvm) | e1071 | 415683 |
| surv.coxph | Cox Proportional Hazard Model | survival | 238143 |
| classif.lda | Linear Discriminant Analysis | MASS | 226022 |

Now let’s get rid of the duplicates introduced by the distinction between the learner types (classif, regr, and so on), and we already have our…

Nearly final table

lrns.small = lrns[, .SD[1,], by = .(name, package)]
lrns.small[, .(class, name, package, downloads)]

The top 20 according to the RStudio CRAN mirror:

| class | name | package | downloads |
|---|---|---|---|
| classif.naiveBayes | Naive Bayes | e1071 | 415683 |
| classif.svm | Support Vector Machines (libsvm) | e1071 | 415683 |
| surv.coxph | Cox Proportional Hazard Model | survival | 238143 |
| classif.lda | Linear Discriminant Analysis | MASS | 226022 |
| classif.qda | Quadratic Discriminant Analysis | MASS | 226022 |
| classif.rpart | Decision Tree | rpart | 109348 |
| surv.rpart | Survival Tree | rpart | 109348 |
| classif.cvglmnet | GLM with Lasso or Elasticnet Regularization (Cross Validated Lambda) | glmnet | 103373 |
| classif.glmnet | GLM with Lasso or Elasticnet Regularization | glmnet | 103373 |
| surv.cvglmnet | GLM with Regularization (Cross Validated Lambda) | glmnet | 103373 |
| surv.glmnet | GLM with Regularization | glmnet | 103373 |
| classif.ranger | Random Forests | ranger | 101836 |
| classif.xgboost | eXtreme Gradient Boosting | xgboost | 94824 |
| classif.randomForest | Random Forest | randomForest | 87654 |
| classif.gausspr | Gaussian Processes | kernlab | 82866 |
| classif.ksvm | Support Vector Machines | kernlab | 82866 |
| classif.lssvm | Least Squares Support Vector Machine | kernlab | 82866 |
| cluster.kkmeans | Kernel K-Means | kernlab | 82866 |
| regr.rvm | Relevance Vector Machine | kernlab | 82866 |
| classif.multinom | Multinomial Regression | nnet | 80540 |
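The deduplication keeps the first row of every (name, package) group. A minimal sketch of the same idea in base R on a made-up three-row table (toy.lrns and toy.small are illustrative names, not part of the analysis above):

```r
# Toy learner table: the SVM appears once per task type.
toy.lrns = data.frame(
  class = c("classif.svm", "regr.svm", "classif.lda"),
  name = c("Support Vector Machines (libsvm)",
           "Support Vector Machines (libsvm)",
           "Linear Discriminant Analysis"),
  package = c("e1071", "e1071", "MASS"),
  stringsAsFactors = FALSE
)

# Keep only the first row for each combination of name and package.
toy.small = toy.lrns[!duplicated(toy.lrns[, c("name", "package")]), ]

print(toy.small$class)
# "classif.svm" "classif.lda"
```

Note that only learners sharing both name and package collapse into one row, which is why, for example, classif.glmnet and surv.glmnet both remain in the table above: their names differ.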

As we are only interested in the packages, let’s compress the table a bit further and come to our…

Final table

lrns.pgks = lrns[,list(learners = paste(class, collapse = ",")),
                 by = .(package, downloads)]
lrns.pgks

Here are the first 20 rows of the table:

| package | downloads | learners |
|---|---|---|
| e1071 | 415683 | classif.naiveBayes,classif.svm,regr.svm |
| survival | 238143 | surv.coxph |
| MASS | 226022 | classif.lda,classif.qda |
| rpart | 109348 | classif.rpart,regr.rpart,surv.rpart |
| glmnet | 103373 | classif.cvglmnet,classif.glmnet,regr.cvglmnet,regr.glmnet,surv.cvglmnet,surv.glmnet |
| ranger | 101836 | classif.ranger,regr.ranger,surv.ranger |
| xgboost | 94824 | classif.xgboost,regr.xgboost |
| randomForest | 87654 | classif.randomForest,regr.randomForest |
| kernlab | 82866 | classif.gausspr,classif.ksvm,classif.lssvm,cluster.kkmeans,regr.gausspr,regr.ksvm,regr.rvm |
| nnet | 80540 | classif.multinom,classif.nnet,regr.nnet |
| class | 78541 | classif.knn,classif.lvq1 |
| FNN | 77727 | classif.fnn,regr.fnn |
| GPfit | 49688 | regr.GPfit |
| e1071,clue | 41688 | cluster.cmeans |
| klaR | 40686 | classif.rda |
| gbm | 40096 | classif.gbm,regr.gbm,surv.gbm |
| caret,pls | 32999 | classif.plsdaCaret |
| pls | 32999 | regr.pcr,regr.plsr |
| party | 30504 | classif.cforest,classif.ctree,multilabel.cforest,regr.cforest,regr.ctree |
| party,modeltools | 30504 | regr.mob |
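The grouping behind this table pastes all learner classes that share a package entry and download count into one string. A rough base-R sketch of that step on invented data (toy.lrns and toy.pkg.tab are illustrative names, the download numbers are made up):

```r
# Toy learner table with made-up download counts.
toy.lrns = data.frame(
  class = c("classif.svm", "regr.svm", "classif.lda"),
  package = c("e1071", "e1071", "MASS"),
  downloads = c(400, 400, 200),
  stringsAsFactors = FALSE
)

# One row per (package, downloads) pair; learner classes joined by commas.
toy.pkg.tab = aggregate(class ~ package + downloads, data = toy.lrns,
                        FUN = paste, collapse = ",")

# Most downloaded package first.
toy.pkg.tab = toy.pkg.tab[order(toy.pkg.tab$downloads, decreasing = TRUE), ]

print(toy.pkg.tab)
```

Because the grouping key is the full package entry, a combination such as "e1071,clue" forms its own row instead of being merged with "e1071", as seen in the table above.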

And of course we want to have a small visualization:

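library(ggplot2)
library(forcats)
# fix the order of the learner strings as they appear in the table
lrns.pgks$learners = factor(lrns.pgks$learners, lrns.pgks$learners)
# take the top 20 in reverse order so the most downloaded package ends up
# on top after coord_flip(); labels are truncated to 64 characters
g = ggplot(lrns.pgks[20:1], aes(x = fct_inorder(stri_sub(
  paste0(package, ": ", learners), 1, 64)), y = downloads, fill = downloads))
g + geom_col() +
  coord_flip() +
  xlab("") +
  # guide = FALSE is deprecated in recent ggplot2 versions
  scale_fill_continuous(guide = "none")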

Remarks

This is not really representative of how popular each learner is, as some packages serve multiple purposes (e.g. they provide multiple learners). Furthermore, it would be great to have access to the trending list. Also, the most-starred repositories on GitHub give a better view of what developers are interested in. Looking for machine learning packages there, we see e.g. xgboost, h2o and tensorflow.

Citation

For attribution, please cite this work as

Richter (2017, March 30). mlr-org: Most Popular Learners in mlr. Retrieved from https://mlr-org.github.io/mlr-org-website/posts/2017-03-30-mostpopularlearnersinmlr/

BibTeX citation

@misc{richter2017most,
  author = {Richter, Jakob},
  title = {mlr-org: Most Popular Learners in mlr},
  url = {https://mlr-org.github.io/mlr-org-website/posts/2017-03-30-mostpopularlearnersinmlr/},
  year = {2017}
}