There is one thing everyone will tell you when you start down the path of a data-driven project: you will spend most of your time dealing with data quality issues. This is especially true when working with data collected from legacy or ad hoc systems. You will suffer considerable data quality pains, since more often than not the rules for data input in the source systems are unclear or even undefined. The time spent discovering and creating the data cleansing rules needed to mitigate these quality issues will be a considerable time-sink in any project…


Traditionally, building a data warehouse requires massive capital investments in infrastructure, tools and licenses to get insight into your data. As these solutions live and grow, they tend to become time-consuming to maintain and complex or slow to change in response to new business needs.

BigQuery is a serverless data warehouse solution on Google Cloud Platform. A data warehouse is nothing without data, which is typically provided by a pipeline connected to your data sources.

In this blog post I will focus only on the serverless integration offerings that GCP provides. This means that popular components like…


Sometimes you just want to get data from your source into your analytical tool and start experimenting. I have created a tool that can help you with this kind of prototyping.

A common, lazy way to integrate SQL Server and BigQuery is to:

  • Export the table to disk
  • Upload the CSV file to Cloud Storage
  • Load it into BigQuery with schema autodetection

Currently there is no tool that does this for you in a nice way (the last two steps are sketched in code below).
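For illustration, the upload and load steps can be scripted in a few lines of Python with the Google Cloud client libraries. This is only a sketch: it assumes the table has already been exported to a local CSV file, and the bucket, dataset and table names are placeholders.

```python
# Minimal sketch of the "lazy" path: assumes the table has already been
# exported to data.csv; bucket and table names below are placeholders.
from google.cloud import bigquery, storage

BUCKET = "my-staging-bucket"       # hypothetical bucket
TABLE = "my_dataset.my_table"      # hypothetical BigQuery target table

# 1) Upload the exported CSV to Cloud Storage.
storage.Client().bucket(BUCKET).blob("exports/data.csv").upload_from_filename("data.csv")

# 2) Load it into BigQuery with schema autodetection.
client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition="WRITE_TRUNCATE",
)
load_job = client.load_table_from_uri(
    f"gs://{BUCKET}/exports/data.csv", TABLE, job_config=job_config
)
load_job.result()  # block until the load job finishes
```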

This process can be tedious, and there are usually edge cases where the simple strategy will not work, and you end…


In this post I will go through best practices for using the KubernetesPodOperator, with examples. I will share DAGs and Terraform scripts so it should be easy to test out for yourself.
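As a taste of what such a DAG looks like, here is a minimal KubernetesPodOperator task. This is a sketch only: the image, namespace and IDs are placeholders, and the import path varies between Airflow and provider versions.

```python
from datetime import datetime

from airflow import DAG
# Import path differs between Airflow/provider versions; this assumes the
# cncf.kubernetes provider package is installed.
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG(
    dag_id="pod_operator_example",       # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job = KubernetesPodOperator(
        task_id="run_job",
        name="run-job",
        namespace="default",                          # placeholder namespace
        image="eu.gcr.io/my-project/my-job:latest",   # placeholder image
        cmds=["python", "main.py"],
        get_logs=True,                 # stream container logs back to Airflow
        is_delete_operator_pod=True,   # clean up the pod when the task is done
    )
```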

Quite a few of the questions I get when talking about Cloud Composer are about how to use it for autoscaling job operations. With the (current) out-of-the-box settings Cloud Composer gives you, you are quite limited in the scaling you can achieve.

For example, Cloud Composer is not able to scale horizontally with work demand — so the usual practice is to have a sufficiently sized cluster and…


Background

In previous posts (scheduling jobs #1, scheduling jobs #2) I have written about how to do workflow scheduling using GCP's Cloud Composer (Airflow).

Something that has been bugging me about Cloud Composer is the steep price ($380/month minimum!). For small clusters and a small number of jobs, the spend on dollars and infrastructure does not really add up to the value provided.

A perfect example of this is our in-house Computas application.

It used to run in a Kubernetes cluster as cron jobs. At some point we started getting dependencies between jobs, so cron was not really an option anymore.

Every…


Today I had an interesting case from one of our customers. They are running a decent-sized Composer cluster with four n2-highmem-2 machines, plus an additional node pool to run data-science jobs spawned with the Pod Operator (on even beefier machines). Most of the jobs are extract/load jobs from various databases into BigQuery.

A few days ago, Composer started acting up and missing a lot of its deadlines. Jobs would skip or fail. It was vital to get whatever caused this resolved.

It is important to know that under the hood, Cloud Composer is actually running in a…


Today I was at a customer helping them optimize their Cloud Composer setup. Cloud Composer is a managed installation of Airflow (a job orchestration tool originally created by Airbnb) that runs on Kubernetes. I had previously advised the customer to use Cloud Composer only to run Docker containers (via the KubernetesPodOperator), as that really simplifies the testing and rollout process with Cloud Composer/Airflow. Using PythonOperator to run complex Python programs is something you are bound to regret at some point (both in testing and in package management).

What had started to happen at this customer was that Kubernetes workloads…


In my previous post I explained how to load data from Cloud SQL into BigQuery using command-line tools like gcloud and bq. In this post I will go through an example of how to load data using Apache Airflow operators instead of command-line tools. Doing it this way has a few advantages: cleaner code, fewer hacks needed to get things working, and better fault tolerance. For example, we do not have to worry about the Cloud SQL export job limit or export-to-CSV bugs.
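For a rough idea of what this looks like, here is a sketch of such a DAG using the Google provider operators. Operator names and import paths depend on your Airflow version, and the instance, database, bucket and table names are all placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.cloud_sql import CloudSQLExportInstanceOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

EXPORT_URI = "gs://my-staging-bucket/exports/mytable.csv"  # placeholder bucket/path

with DAG(
    dag_id="cloudsql_to_bigquery",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Ask Cloud SQL to export the result of a query to Cloud Storage as CSV.
    export_csv = CloudSQLExportInstanceOperator(
        task_id="export_csv",
        instance="my-cloudsql-instance",   # placeholder instance name
        body={
            "exportContext": {
                "fileType": "CSV",
                "uri": EXPORT_URI,
                "databases": ["mydb"],
                "csvExportOptions": {"selectQuery": "SELECT * FROM mytable"},
            }
        },
    )

    # Load the exported CSV into BigQuery with schema autodetection.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-staging-bucket",
        source_objects=["exports/mytable.csv"],
        destination_project_dataset_table="my_dataset.mytable",
        source_format="CSV",
        autodetect=True,
        write_disposition="WRITE_TRUNCATE",
    )

    export_csv >> load_to_bq
```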

I am using GCP's managed Airflow that runs in Kubernetes — Cloud Composer…


Creating a file share of unlimited size as an NFS mount backed by a bucket inside a Kubernetes cluster? Regardless of whether this is a good idea or not, here is a short description of the problem we faced and how we solved it.

Background

Why did I want this as an NFS server in the first place? Why not simply mount the bucket in the pod that needs it using gcsfuse?

For my current client, the Kubernetes cluster was a managed Airflow instance (Cloud Composer), and I had already set up an NFS server that was running smoothly inside this cluster (following this great guide). …


TL;DR: link to the code repo at the bottom, with an example Airflow DAG.

When Google made Apache Airflow a managed service on GCP last year I was enthusiastic — mostly because I had looked into Airflow before, and I found it nice to have a task scheduler that is written in actual code instead of point-and-click. Airflow was the first proper task scheduler for GCP; prior to this, if you wanted a scheduler you had to use a third-party service or cron. Cron is fine if you have tasks like "ping that every…
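To make "a task scheduler written in actual code" concrete, here is a minimal sketch of an Airflow DAG with two dependent tasks. The DAG id, schedule and commands are placeholders, and the import path assumes Airflow 2.x.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator  # import path assumes Airflow 2.x

with DAG(
    dag_id="hello_airflow",              # placeholder DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 6 * * *",       # cron-style schedule, expressed in code
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # Unlike plain cron, dependencies between tasks are explicit.
    extract >> load
```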

Anders Elton

Software developer and cloud enthusiast
