There is one thing everyone will always tell you if you start down the path of a data-driven project: you will spend most of your time dealing with data quality issues. This is especially true when dealing with data collected from legacy or ad hoc systems. You will suffer…


Traditionally, building a data warehouse requires massive capital investments in infrastructure, tools, and licenses to get insight into your data. As these solutions live and grow, they tend to become time-consuming to maintain and complex or slow to change in response to new business needs.

BigQuery is…


Sometimes you just want to get data from your source into your analytical tool and start experimenting. I have created a tool that can help you with this kind of prototyping.

A common, lazy way to integrate SQL Server and BigQuery (sketched below) is to:

  • Export table to disk
  • Upload CSV…
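
Assuming the exported CSV ends up in a GCS bucket, the final load step can be scripted with the google-cloud-bigquery client. This is only a minimal sketch; the project, bucket, dataset and table names are made up for illustration:

    from google.cloud import bigquery

    # Hypothetical names; replace with your own project, dataset and table.
    client = bigquery.Client(project="my-project")
    table_id = "my-project.staging.sales"

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the header row, if your export includes one
        autodetect=True,       # let BigQuery guess the schema for quick prototyping
        write_disposition="WRITE_TRUNCATE",
    )

    # Start a load job from the CSV already sitting in the bucket and wait for it.
    load_job = client.load_table_from_uri(
        "gs://my-bucket/exports/sales.csv", table_id, job_config=job_config
    )
    load_job.result()
    print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")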


In this post I will go through best practices for using the KubernetesPodOperator, with examples. I will share DAGs and Terraform scripts, so it should be easy to try it out for yourself.
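
To give a flavour of what such a DAG looks like, here is a minimal, illustrative sketch (not the actual example from the post) that launches a container through the KubernetesPodOperator. The image, namespace and schedule are placeholders, and the exact import path depends on your Airflow and provider versions:

    from datetime import datetime
    from airflow import DAG
    # Import path varies by version; older installs use airflow.contrib.operators.kubernetes_pod_operator.
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    with DAG(
        dag_id="pod_operator_example",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Run an arbitrary Docker image as an isolated pod in the cluster.
        hello = KubernetesPodOperator(
            task_id="hello",
            name="hello",
            namespace="default",
            image="python:3.11-slim",
            cmds=["python", "-c"],
            arguments=["print('hello from a pod')"],
        )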

Quite a few of the questions I get when talking about Cloud Composer are about how to use…


Background

In previous posts (scheduling jobs #1, scheduling jobs #2) I have written about how to do workflow scheduling using GCP's Cloud Composer (Airflow).

Something that has been bugging me about Cloud Composer is the steep price ($380/month minimum!). …


Today I had an interesting case from one of our customers. They are running a decent-sized Composer cluster with four n2-highmem-2 machines, plus an additional node pool (with even beefier machines) to run data-science jobs spawned with the KubernetesPodOperator. …
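
For context, pinning such jobs to a dedicated node pool typically comes down to a node selector on the pod the operator spawns. A rough sketch, with a made-up pool name and image (older Airflow versions spell the argument node_selectors):

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

    with DAG(
        dag_id="data_science_jobs",
        start_date=datetime(2024, 1, 1),
        schedule_interval=None,
        catchup=False,
    ) as dag:
        # Schedule the pod onto the dedicated, beefier node pool instead of the Composer workers.
        train_model = KubernetesPodOperator(
            task_id="train_model",
            name="train-model",
            namespace="default",
            image="gcr.io/my-project/data-science-job:latest",  # hypothetical image
            # GKE labels every node with its pool name, so a node selector is enough.
            node_selector={"cloud.google.com/gke-nodepool": "data-science-pool"},
        )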


Today I was at a customer helping them optimize their Cloud Composer setup. Cloud Composer is a managed installation of Apache Airflow, the job orchestration tool originally created by Airbnb, running on Kubernetes. I had previously advised the customer to use Cloud Composer only to run Docker containers (via the KubernetesPodOperator), as that…


In my previous post I explained how to load data from Cloud SQL into BigQuery using command-line tools like gcloud and bq. In this post I will go through an example of how to load data using Apache Airflow operators instead of command-line tools. Doing it this way…
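
In broad strokes, such a DAG chains a Cloud SQL export to GCS with a GCS-to-BigQuery load. The sketch below is only an illustration, not the DAG from the post: it uses operators from the current Google provider package (import paths have moved around between versions), and the instance, database, bucket and table names are invented.

    from datetime import datetime
    from airflow import DAG
    from airflow.providers.google.cloud.operators.cloud_sql import CloudSQLExportInstanceOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

    with DAG(
        dag_id="cloudsql_to_bigquery",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Ask the Cloud SQL Admin API to dump a query result as CSV into a bucket.
        export = CloudSQLExportInstanceOperator(
            task_id="export_to_gcs",
            instance="my-sql-instance",
            body={
                "exportContext": {
                    "fileType": "CSV",
                    "databases": ["my_database"],
                    "uri": "gs://my-bucket/exports/orders.csv",
                    "csvExportOptions": {"selectQuery": "SELECT * FROM orders"},
                }
            },
        )

        # Load the exported CSV into a BigQuery table.
        load = GCSToBigQueryOperator(
            task_id="load_to_bq",
            bucket="my-bucket",
            source_objects=["exports/orders.csv"],
            destination_project_dataset_table="my-project.staging.orders",
            write_disposition="WRITE_TRUNCATE",
        )

        export >> load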


Creating a fileshare of unlimited size as an NFS mount backed by a bucket inside a Kubernetes cluster? Regardless of whether this is a good idea or not, here is a short description of the problem we faced and how we solved it.

Background

Why did I want this as an NFS server in…


TL;DR: link to a code repo at the bottom with an example Airflow DAG.

When Google made Apache Airflow a managed service in GCP last year I was enthusiastic, mostly because I had looked into Airflow before, and I found it nice to have a task scheduler that is written…

Anders Elton

Software developer and cloud enthusiast
