ds/dx - a data science & ml engineering blog

Dockerizing an R machine learning model with S3 connection and end-to-end tests on Travis

In this post I want to give a short introduction to setting up a dockerized R process or script that reads data from and writes data to AWS S3 and is tested end to end via Travis. As an example process we will set up a simple random forest model, which we will use to compute the importance of features for a classification problem.

Parallelized batch time series forecasting and forecast blending with purrr and multidplyr

Remark from May 2022: This post has been updated to work with newer versions of dplyr (1.0.9), multidplyr (0.1.1) and forecast (8.16), but may use some features flagged for deprecation. As of May 2022, there are better alternatives for batch forecasting, such as fable in combination with tsibble. This post is another quick intro to two magical R packages: purrr and multidplyr. The first can be used to iterate over the rows of a data frame in a functional way, whereas multidplyr allows data frame mutations to run in parallel across multiple cores.

Visualizing Berlin population using leaflet and sf

In this article I want to give a quick intro to two packages: leaflet and sf. The first can be used to draw interactive maps, the second to handle geometries dplyr-style, following the simple features specification. In combination they make it easy to create simple maps from processed geodata.