ds/dx - a data science & ml engineering blog
As an avid fan of Spotify’s Discover Weekly playlist, I always wanted a scheduled, automated, self-controlled, lean and cheap way of backing up the weekly generated tracks. This toy project uses Python (including the spotipy module), Google Cloud Functions, Cloud Scheduler and Secret Manager, as well as Terraform and GitHub Actions, to bring the code and infrastructure to life.
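To give a flavour of the post, here is a minimal sketch of the fetching step with spotipy; the playlist ID and the environment variable names are placeholders, not the post's actual configuration:

```python
# Minimal sketch: fetch the current Discover Weekly tracks with spotipy.
# Credentials and DISCOVER_WEEKLY_ID are hypothetical environment variables.
import os

import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(
    auth_manager=SpotifyOAuth(
        client_id=os.environ["SPOTIFY_CLIENT_ID"],
        client_secret=os.environ["SPOTIFY_CLIENT_SECRET"],
        redirect_uri="http://localhost:8080",
        scope="playlist-read-private",
    )
)

items = sp.playlist_items(os.environ["DISCOVER_WEEKLY_ID"])["items"]
tracks = [
    {
        "name": item["track"]["name"],
        "artists": [a["name"] for a in item["track"]["artists"]],
        "added_at": item["added_at"],
    }
    for item in items
]
```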
In this post we will once more set up serverless infrastructure via Terraform: an Airflow deployment using Amazon Managed Workflows for Apache Airflow (MWAA), plus GitHub Actions to automatically sync the DAG code to S3. As a baseline, we will fork off Claudio Bizzotto’s great repository claudiobizzotto/aws-mwaa-terraform.
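As an illustration of the kind of DAG code the GitHub Actions workflow would sync to the MWAA bucket, here is a minimal sketch; the DAG ID, task and schedule are illustrative only:

```python
# Minimal sketch of a DAG file that would be synced to the dags/ prefix in S3.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello from MWAA")


with DAG(
    dag_id="hello_mwaa",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```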
The goal of this post is to set up serverless infrastructure, managed in code, to serve batch predictions of a machine learning model, or any other lightweight computation, in an asynchronous way: a Google Cloud Run service will listen for new files in a Cloud Storage bucket via a Pub/Sub topic, trigger a computational process and put the resulting data into another bucket.
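The core of such a service boils down to one push endpoint. Below is a minimal sketch of what that handler could look like; the output bucket variable and the process() function are placeholders, not the post's actual code:

```python
# Minimal sketch of the Cloud Run handler: Pub/Sub pushes a message describing
# the new object, we download it, compute, and upload the result.
# OUTPUT_BUCKET and process() are placeholders.
import base64
import json
import os

from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)
client = storage.Client()


def process(data: bytes) -> bytes:
    # placeholder for the actual batch prediction / computation
    return data


@app.route("/", methods=["POST"])
def handle_pubsub():
    envelope = request.get_json()
    # Cloud Storage notifications arrive as base64-encoded JSON object metadata
    message = json.loads(base64.b64decode(envelope["message"]["data"]))
    bucket_name, blob_name = message["bucket"], message["name"]

    blob = client.bucket(bucket_name).blob(blob_name)
    result = process(blob.download_as_bytes())

    client.bucket(os.environ["OUTPUT_BUCKET"]).blob(blob_name).upload_from_string(result)
    return ("", 204)
```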
The goal of this post is to set up serverless infrastructure, managed in code, to serve predictions of a containerized machine learning model via a REST API. We will make use of Terraform to manage our infrastructure, including AWS ECR, S3, Lambda and API Gateway.
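The piece that ties the container to the REST API is the Lambda handler. Here is a minimal sketch under the assumption of an API Gateway proxy integration and a joblib-serialized model; the model path and payload format are placeholders:

```python
# Minimal sketch of the Lambda handler inside the model container:
# API Gateway proxies a JSON body with feature values, we return a prediction.
import json

import joblib

model = joblib.load("/opt/ml/model.joblib")  # placeholder path, loaded once per container


def handler(event, context):
    payload = json.loads(event["body"])
    prediction = model.predict([payload["features"]]).tolist()
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": prediction}),
    }
```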
Working with linear regression is a sometimes under-appreciated skill in data science. As a generalization of fundamental statistical concepts like t-tests and analysis of variance, it has deep ties to statistics and can serve as a powerful tool to explain variance in data. Since linear models only need to be linear in their parameters, they can also describe polynomial and even multiplicative relationships. But due to its parametric nature, linear regression is also more vulnerable to extreme values and multicollinearity in the data; we want to analyze the latter in more detail, using a simulation.
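A minimal sketch of the kind of simulation the post runs (the exact parameters here are illustrative): generate two predictors with increasing correlation and observe how the variance of the fitted coefficients grows.

```python
# Sketch: as the correlation rho between predictors rises, the standard
# deviation of the estimated coefficients across repeated fits increases.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

for rho in (0.0, 0.9, 0.99):
    coefs = []
    for _ in range(500):
        cov = [[1.0, rho], [rho, 1.0]]
        X = rng.multivariate_normal([0.0, 0.0], cov, size=200)
        y = 1.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 1, size=200)
        coefs.append(LinearRegression().fit(X, y).coef_)
    print(f"rho={rho}: coefficient std = {np.std(coefs, axis=0)}")
```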
a.k.a. writing your own sklearn functions, part 3. If you have worked with sklearn before, you have certainly come across the struggle of choosing between dataframes and arrays as inputs to your transformers and estimators. Both have their advantages and disadvantages. But once you deploy your model, for example as a service, in many cases it will serve single predictions. Max Halford has shown some great examples of how to improve various sklearn transformers and estimators to serve single predictions with an extra performance boost and potential response times in the low millisecond range! In this short post we will build on these tricks and develop a full pipeline.
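To illustrate the idea (following Max Halford's trick, not the post's exact code), here is a minimal sketch of a transformer with a dict-based single-row method; the class and method names are our own convention:

```python
# Sketch: a simplified scaler with a transform_single method that operates on
# a plain dict instead of building a one-row DataFrame, avoiding that overhead.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class SingleScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        X = np.asarray(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        return self

    def transform(self, X):
        return (np.asarray(X) - self.mean_) / self.scale_

    def transform_single(self, x: dict) -> dict:
        # assumes the dict keys are in the same order as the training columns
        return {
            key: (value - m) / s
            for (key, value), m, s in zip(x.items(), self.mean_, self.scale_)
        }
```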
In the following post I want to give some impressions of how historic travel time data can be used to build a model for travel time estimation. If you are collecting your own characteristic travel data, you might actually be able to beat third-party solutions in terms of accuracy as well as cost. We will see why tree models might be suitable for this use case, and we will do some feature engineering to improve the model's performance.
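As a small taste of the approach, here is a minimal sketch of the feature engineering plus a tree ensemble fit; the file and column names are placeholders for your own data, not the post's dataset:

```python
# Sketch: derive time-based features from the trip timestamp and fit a
# gradient boosting model on historic travel times.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    ts = pd.to_datetime(df["departure_time"])
    return df.assign(
        hour=ts.dt.hour,
        weekday=ts.dt.weekday,
        is_weekend=(ts.dt.weekday >= 5).astype(int),
    )


df = add_time_features(pd.read_csv("trips.csv"))  # placeholder file
features = ["distance_km", "hour", "weekday", "is_weekend"]

model = GradientBoostingRegressor().fit(df[features], df["travel_time_min"])
```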