library(data.table)
library(mlr3)
library(mlr3pipelines)
Production with R has come a long way. In this tutorial, we give a brief example of how to write a REST API and deploy it (relying on the mlr3 ecosystem for the actual training and predicting). Most of this tutorial was inspired by other excellent posts and vignettes:
- R can API and So Can You!
- Using docker to deploy an R plumber API
- The AzureContainers vignette
All files presented in this tutorial are available here.
Modeling Background
We will use a subset of the boston_housing Task. Our goal is to predict the median value of owner-occupied homes in USD 1000’s (target cmedv, the corrected version of medv), using the features crim, tax and town (just to have factor, integer, and numeric feature types):
data = tsk("boston_housing")$data()
data = data[, c("cmedv", "crim", "tax", "town")]
task = TaskRegr$new("boston", backend = data, target = "cmedv")
Let’s create a toy pipeline:
Regarding modeling, we will keep it very simple and use the rpart learner. Missing numerical features (which could happen during prediction) will be imputed by their median via PipeOpImputeMedian, while missing factor features will be imputed using a new level via PipeOpImputeOOR. As PipeOpImputeOOR will introduce a new level, ".MISSING", to impute missing values, we also use PipeOpFixFactors:
g = po("imputemedian") %>>%
  po("imputeoor") %>>%
  po("fixfactors") %>>%
  lrn("regr.rpart")
We wrap this Graph in a GraphLearner and can train on the Task:
gl = GraphLearner$new(g)
gl$train(task)
We can inspect the trained pipeline by looking at:
gl$model
Furthermore, we can save the trained pipeline, e.g., as "gl.rds":
saveRDS(gl, "gl.rds")
We will also store some information regarding the features, namely the feature names, types and levels (you will see later why we need this):
feature_info = list(
  feature_names = task$feature_names,
  feature_types = task$feature_types,
  levels = task$levels()
)
saveRDS(feature_info, "feature_info.rds")
Putting everything in a file, train_gl.R looks like the following, which we can then source before moving on:
# train_gl.R
library(mlr3)
library(mlr3pipelines)

# same subset and target column as in the step-by-step code above
data = tsk("boston_housing")$data()
data = data[, c("cmedv", "crim", "tax", "town")]
task = TaskRegr$new("boston", backend = data, target = "cmedv")

g = po("imputemedian") %>>%
  po("imputeoor") %>>%
  po("fixfactors") %>>%
  lrn("regr.rpart")

gl = GraphLearner$new(g)

gl$train(task)

saveRDS(gl, "gl.rds")

feature_info = list(
  feature_names = task$feature_names,
  feature_types = task$feature_types,
  levels = task$levels()
)

saveRDS(feature_info, "feature_info.rds")
The goal of our REST (representational state transfer) API (application programming interface) will be to predict the medv of a new observation, i.e., it should do something like the following:
newdata = data.table(crim = 3.14, tax = 691, town = "Newton")
gl$predict_newdata(newdata)
<PredictionRegr> for 1 observations:
row_ids truth response
1 NA 33.30106
However, in our REST API, the newdata will be received at an endpoint that accepts a particular input. In the next section we will use plumber to set up our web service.
Using plumber to set up our REST API
The package plumber allows us to create a REST API by simply commenting existing R code. plumber makes use of these comments to define the web service. Running plumber::plumb on the commented R file then results in a runnable web service that other systems can interact with over a network.
As an endpoint for predicting the medv, we will use a POST request. This will allow us to enclose data in the body of the request message. More precisely, we assume that the data will be provided in the JSON format.
When a POST request containing the data (in JSON format) is received, our code must then:
- convert the input (in JSON format) to a data.table with all feature columns matching their feature type
- predict the medv based on the input using our trained pipeline and provide an output that can be understood by the client
To make sure that all features match their feature type, we will later use the following function stored in the R file fix_feature_types.R:
# fix_feature_types.R
fix_feature_types = function(feature, feature_name, feature_info) {
  id = match(feature_name, feature_info$feature_names)
  feature_type = feature_info$feature_types$type[id]
  switch(feature_type,
    "logical" = as.logical(feature),
    "integer" = as.integer(feature),
    "numeric" = as.numeric(feature),
    "character" = as.character(feature),
    "factor" = factor(feature, levels = feature_info$levels[[feature_name]],
      ordered = FALSE),
    "ordered" = factor(feature, levels = feature_info$levels[[feature_name]],
      ordered = TRUE),
    "POSIXct" = as.POSIXct(feature)
  )
}
fix_feature_types() can later be applied to the newdata and will make sure that all incoming features are converted to their expected feature type as in the original Task we used for training our pipeline (this is why we stored the information about the features earlier). Note that in our tutorial we only have factor, integer, and numeric features, but fix_feature_types() should also work for all other supported feature_types listed in mlr_reflections$task_feature_types. However, it may need some customization depending on your own production environment to make the conversions meaningful.
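To see what these conversions do in isolation, here is a minimal sketch that applies a trimmed copy of fix_feature_types() to a few raw values. The feature_info list below is hypothetical and only mimics the structure stored earlier: a plain data.frame stands in for task$feature_types, and the two town levels are made up for illustration.

```r
# Trimmed copy of fix_feature_types() (only the three types used in this
# tutorial), repeated here so that the block runs on its own
fix_feature_types = function(feature, feature_name, feature_info) {
  id = match(feature_name, feature_info$feature_names)
  feature_type = feature_info$feature_types$type[id]
  switch(feature_type,
    "integer" = as.integer(feature),
    "numeric" = as.numeric(feature),
    "factor" = factor(feature, levels = feature_info$levels[[feature_name]],
      ordered = FALSE)
  )
}

# Hypothetical stand-in for the list saved as feature_info.rds
feature_info = list(
  feature_names = c("crim", "tax", "town"),
  feature_types = data.frame(
    id = c("crim", "tax", "town"),
    type = c("numeric", "integer", "factor"),
    stringsAsFactors = FALSE
  ),
  levels = list(town = c("Boston", "Newton"))
)

# A string that is not a number becomes NA (as.numeric() warns about this)
crim = suppressWarnings(fix_feature_types("not_a_number", "crim", feature_info))

# A fractional value is truncated to an integer
tax = fix_feature_types(3.14, "tax", feature_info)

# A level that is not among the stored levels becomes NA
town = fix_feature_types("Munich", "town", feature_info)
```

Values that cannot be converted end up as NA, which is the case the imputation steps of the pipeline are there for.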
The following R file, predict_gl.R, loads our trained pipeline and feature information and provides an endpoint for a POST request, "/predict_medv". The incoming data is then converted using jsonlite::fromJSON. We expect the incoming data to be either JSON objects in an array or nested JSON objects, and therefore we bind the converted vectors row-wise to a data.table using data.table::rbindlist. We then convert all features to their expected feature_types (using the fix_feature_types() function as defined above) and can finally predict the medv using our trained pipeline. As no default serialization from R6 objects to JSON objects exists (yet), we wrap the Prediction in a data.table (of course we could also only return the numeric prediction values):
# predict_gl.R
library(data.table)
library(jsonlite)
library(mlr3)
library(mlr3pipelines)

source("fix_feature_types.R")

gl = readRDS("gl.rds")

feature_info = readRDS("feature_info.rds")

#* @post /predict_medv
function(req) {
  # get the JSON string from the post body
  newdata = fromJSON(req$postBody, simplifyVector = FALSE)
  # expect either JSON objects in an array or nested JSON objects
  newdata = rbindlist(newdata, use.names = TRUE)
  # convert all features in place to their expected feature_type
  newdata[, colnames(newdata) := mlr3misc::pmap(
    list(.SD, colnames(newdata)),
    fix_feature_types,
    feature_info = feature_info)]
  # predict and return as a data.table
  as.data.table(gl$predict_newdata(newdata))
  # or only the numeric values
  # gl$predict_newdata(newdata)$response
}
Note that the only difference to a regular R file is the comment #* @post /predict_medv telling plumber to construct the endpoint "/predict_medv" for a POST request.
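The parsing step of predict_gl.R can also be tried without a running web service. The following is a minimal sketch with a hypothetical request body (the second observation and its values are made up), showing how the parsed JSON objects are bound row-wise into a data.table.

```r
library(data.table)
library(jsonlite)

# Hypothetical request body: two observations with differing key order
body = '[{"crim":3.14, "tax":691, "town":"Newton"},
         {"town":"Boston", "crim":0.1, "tax":300}]'

# simplifyVector = FALSE keeps every JSON object as a named list
parsed = fromJSON(body, simplifyVector = FALSE)

# use.names = TRUE matches the columns by name rather than by position
newdata = rbindlist(parsed, use.names = TRUE)

# town is still a character column at this point; converting it to a factor
# with the training levels is the job of fix_feature_types()
str(newdata)
```

Binding by name means that clients do not have to send the features in a fixed order.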
We can then run plumber::plumb. The following code sets up the web service locally on your personal machine at port 1030 (we use such a high number because some systems require administrator rights to allow processes to listen on lower ports):
library(plumber)
r = plumb("predict_gl.R")
r$run(port = 1030, host = "0.0.0.0")
Congratulations, your first REST API is running on your local machine. We can test it by providing some data, using curl via the command line:
curl --data '[{"crim":3.14, "tax":691, "town":"Newton"}]' "http://127.0.0.1:1030/predict_medv"
This should return the predicted medv:
[{"row_id":1,"response":"32.2329"}]
Alternatively, we can also use the httr::POST function within R:
newdata = '[{"crim":3.14, "tax":691, "town":"Newton"}]'
resp = httr::POST(url = "http://127.0.0.1:1030/predict_medv",
  body = newdata, encode = "json")
httr::content(resp)
We can play around a bit more by providing more than a single new observation, and also check whether our feature type conversion and missing value imputation work:
newdata = '[
  {"crim":3.14, "tax":691, "town":"Newton"},
  {"crim":"not_a_number", "tax":3.14, "town":"Munich"},
  {"tax":"not_a_number", "town":31, "crim":99}
]'
resp = httr::POST(url = "http://127.0.0.1:1030/predict_medv",
  body = newdata, encode = "json")
httr::content(resp)
Note that you can also use jsonlite::toJSON to convert a data.frame to JSON data for your toy examples here.
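As a minimal sketch of this conversion, the following builds a small data.frame and encodes it as JSON. The observations are hypothetical toy values; only the column names follow the features used in this tutorial.

```r
library(jsonlite)

# Hypothetical toy observations
toy = data.frame(
  crim = c(3.14, 0.5),
  tax = c(691L, 300L),
  town = c("Newton", "Boston"),
  stringsAsFactors = FALSE
)

# By default, toJSON() encodes a data.frame row-wise,
# i.e., as an array with one JSON object per row
newdata = toJSON(toy)

# Plain character string that can be used as the request body
body = as.character(newdata)
```

The resulting string has the same array-of-objects shape as the hand-written request bodies above.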
In the following final section we want to use Docker to run a virtual machine as a container (an instance of a snapshot of a machine at a moment in time).
Using Docker to Deploy our REST API
A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application. Suppose we want to run our REST API on an Amazon Web Service or Microsoft’s Azure Cloud. Then we can use a Docker container to easily set up our web service without going through the hassle of manually configuring our hosting instance.
We are going to need two things: an image and a container. An image defines the OS and software, while the container is the actual running instance of the image. To build a Docker image we have to specify a Dockerfile. Note that it is sensible to set up the whole project in its own directory, e.g., ~/mlr3_api.
Every Dockerfile starts with a FROM statement describing the image we are building our image from. In our case we want an R based image that ideally already has plumber and its dependencies installed. Luckily, the trestletech/plumber image exists:
FROM trestletech/plumber
We then install the R packages needed to set up our REST API (note that we can skip jsonlite, because plumber already depends on it):
RUN R -e 'install.packages(c("data.table", "mlr3", "mlr3misc", "mlr3pipelines"))'
Next, we copy our trained pipeline (gl.rds), our stored feature information (feature_info.rds), our R file to convert features (fix_feature_types.R) and our R file to predict (predict_gl.R) to a new directory /data and set this as the working directory:
RUN mkdir /data
COPY gl.rds /data
COPY feature_info.rds /data
COPY fix_feature_types.R /data
COPY predict_gl.R /data
WORKDIR /data
Finally, we listen on port 1030 and start the server (this is analogous to manually calling plumber::plumb on the predict_gl.R file and running it):
EXPOSE 1030
ENTRYPOINT ["R", "-e", \
"r = plumber::plumb('/data/predict_gl.R'); r$run(port = 1030, host = '0.0.0.0')"]
The complete Dockerfile looks like the following:
FROM trestletech/plumber
RUN R -e 'install.packages(c("data.table", "mlr3", "mlr3misc", "mlr3pipelines"))'
RUN mkdir /data
COPY gl.rds /data
COPY feature_info.rds /data
COPY fix_feature_types.R /data
COPY predict_gl.R /data
WORKDIR /data
EXPOSE 1030
ENTRYPOINT ["R", "-e", \
"r = plumber::plumb('/data/predict_gl.R'); r$run(port = 1030, host = '0.0.0.0')"]
To build the image we open a terminal in the mlr3_api directory and run:
docker build -t mlr3-plumber-demo .
This may take quite some time.
To finally run the container, simply use:
docker run --rm -p 1030:1030 mlr3-plumber-demo
You can then proceed to provide some data via curl or httr::POST (to the same local address, because the Docker container is still running on your local machine).
To stop all running containers use:
docker stop $(docker ps -a -q)
Finally, you can proceed to deploy your container to an Amazon Web Service or an Azure Cloud. For the latter, the package AzureContainers is especially helpful. If you plan to do this, note that the plumber service above is exposed over HTTP and has no authentication layer, which makes it insecure. You may think about adding a layer of authentication and restricting the service to HTTPS.
Session Information
sessioninfo::session_info(info = "packages")
═ Session info ═══════════════════════════════════════════════════════════════════════════════════════════════════════
─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
backports 1.5.0 2024-05-23 [1] CRAN (R 4.4.1)
checkmate 2.3.2 2024-07-29 [1] CRAN (R 4.4.1)
cli 3.6.3 2024-06-21 [1] CRAN (R 4.4.1)
P codetools 0.2-20 2024-03-31 [?] CRAN (R 4.4.0)
crayon 1.5.3 2024-06-20 [1] CRAN (R 4.4.1)
data.table * 1.16.2 2024-10-10 [1] CRAN (R 4.4.1)
digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
evaluate 1.0.1 2024-10-10 [1] CRAN (R 4.4.1)
fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.1)
future 1.34.0 2024-07-29 [1] CRAN (R 4.4.1)
globals 0.16.3 2024-03-08 [1] CRAN (R 4.4.1)
htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
htmlwidgets 1.6.4 2023-12-06 [1] CRAN (R 4.4.1)
jsonlite 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)
knitr 1.48 2024-07-07 [1] CRAN (R 4.4.1)
lgr 0.4.4 2022-09-05 [1] CRAN (R 4.4.1)
listenv 0.9.1 2024-01-29 [1] CRAN (R 4.4.1)
mlr3 * 0.21.0 2024-09-24 [1] CRAN (R 4.4.1)
mlr3misc 0.15.1 2024-06-24 [1] CRAN (R 4.4.1)
mlr3pipelines * 0.7.0 2024-09-24 [1] CRAN (R 4.4.1)
mlr3website * 0.0.0.9000 2024-10-18 [1] Github (mlr-org/mlr3website@20d1ddf)
palmerpenguins 0.1.1 2022-08-15 [1] CRAN (R 4.4.1)
paradox 1.0.1 2024-07-09 [1] CRAN (R 4.4.1)
parallelly 1.38.0 2024-07-27 [1] CRAN (R 4.4.1)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.4.1)
renv 1.0.11 2024-10-12 [1] CRAN (R 4.4.1)
rlang 1.1.4 2024-06-04 [1] CRAN (R 4.4.1)
rmarkdown 2.28 2024-08-17 [1] CRAN (R 4.4.1)
P rpart 4.1.23 2023-12-05 [?] CRAN (R 4.4.0)
sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.1)
stringi 1.8.4 2024-05-06 [1] CRAN (R 4.4.1)
uuid 1.2-1 2024-07-29 [1] CRAN (R 4.4.1)
withr 3.0.1 2024-07-31 [1] CRAN (R 4.4.1)
xfun 0.48 2024-10-03 [1] CRAN (R 4.4.1)
yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.1)
[1] /home/marc/repositories/mlr3website/mlr-org/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
[2] /home/marc/.cache/R/renv/sandbox/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu/9a444a72
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────