Traditionally, building a data warehouse has required massive capital investment in infrastructure, tools and licenses just to gain insight into your data. As these solutions live and grow, they tend to become time-consuming to maintain and complex or slow to change in response to new business needs.
BigQuery is a serverless data warehouse on the Google Cloud Platform. A data warehouse is nothing without data, which is typically delivered by a pipeline connected to your data sources.
Sometimes you just want to get data from your source into your analytical tool and start experimenting. I have created a tool that can help with this kind of prototyping.
A common, lazy way to integrate SQL Server and BigQuery is to:
Currently there is no tool that does this for you in a nice way.
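The post's own step list is cut off above, but the usual "lazy" pattern is: dump the table to CSV, stage the file in GCS, and load it with bq. Here is a minimal sketch of the dump step under that assumption; sqlite3 stands in for a real SQL Server connection, and the bucket, dataset and table names are made up.

```python
# Hypothetical "lazy" pipeline: dump a table to CSV, then stage and load it.
# Only the CSV dump is shown as runnable code; the staging/loading steps are
# noted as the shell commands they would map to.
import csv
import io
import sqlite3  # stand-in for a real SQL Server (pyodbc/pymssql) connection


def table_to_csv(conn, query):
    """Run a query and return the result set as a CSV string with a header row."""
    cur = conn.execute(query)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([col[0] for col in cur.description])  # header from cursor metadata
    writer.writerows(cur.fetchall())
    return buf.getvalue()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 9.5), (2, 3.0)])
    print(table_to_csv(conn, "SELECT * FROM orders"))
    # Remaining (assumed) steps, outside this sketch:
    #   gsutil cp orders.csv gs://my-staging-bucket/orders.csv
    #   bq load --autodetect my_dataset.orders gs://my-staging-bucket/orders.csv
```

This is exactly the kind of glue the quote above is complaining about: it works, but every step is hand-rolled.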
In this post I will go through best practices for using the KubernetesPodOperator, with examples. I will share DAGs and Terraform scripts, so it should be easy to test it out for yourself.
Quite a few of the questions I get when talking about Cloud Composer are about how to use it for autoscaling job operations. With the (current) out-of-the-box settings Cloud Composer gives you, you are quite limited in the scaling you can achieve.
For example, Cloud Composer is not able to scale horizontally with work demand, so the usual practice is to provision a sufficiently large cluster and…
Something that has been bugging me about Cloud Composer is its steep price (a minimum of $380/month!). For small clusters running a small number of jobs, the spend on dollars and infrastructure does not really add up to the value provided.
A perfect example of this is our in-house Computas application.
It used to run in a Kubernetes cluster as cron jobs. At some point we started getting dependencies between jobs, so cron was no longer a real option.
Today I had an interesting case from one of our customers. They run a decent-sized Composer cluster with four n2-highmem-2 machines, plus an additional node pool (with even beefier machines) for data-science jobs spawned with the Pod Operator. Most of the jobs are extract/load jobs from various databases into BigQuery.
A few days ago, Composer started acting up and missing a lot of its deadlines. Jobs would skip or fail. It was vital to find and resolve whatever was causing this.
It is important to know that, under the hood, Cloud Composer is actually running in a…
Today I was at a customer helping them optimize their Cloud Composer setup. Cloud Composer is a managed installation of Airflow, a job orchestration tool originally made by Airbnb, running on Kubernetes. I had previously advised the customer to use Cloud Composer only to run Docker containers (via the KubernetesPodOperator), as that really simplifies the testing and rollout process with Cloud Composer/Airflow. Using the PythonOperator to run complex Python programs is something you are bound to regret at some point (in both testing and package management).
What had started to happen at this customer, was that Kubernetes workloads…
In my previous post I explained how to load data from Cloud SQL into BigQuery using command-line tools like gcloud and bq. In this post I will go through an example of how to load the data using Apache Airflow operators instead of command-line tools. Doing it this way has a few advantages: cleaner code, fewer hacks needed to get things working, and better fault tolerance. For example, we do not have to worry about the Cloud SQL export-job limit or bugs in the CSV export.
I am using GCP's managed Airflow that runs in Kubernetes: Cloud Composer…
Creating a file share of unlimited size as NFS, mounted on a bucket, inside a Kubernetes cluster? Disregarding whether this is a good idea or not, here is a short description of the problem we faced and how we solved it.
Why did I want this as an NFS server in the first place? Why not simply mount the bucket with gcsfuse in the pod that needs it?
For my current client, the Kubernetes cluster was a managed Airflow instance (Cloud Composer), and I had already set up an NFS server that was running smoothly inside this cluster (following this great guide). …
When Google made Apache Airflow a managed service in GCP last year, I was enthusiastic, mostly because I had looked into Airflow before and found it nice to have a task scheduler that is written in actual code instead of point-and-click. Airflow was the first proper task scheduler for GCP; prior to it, if you wanted a scheduler you had to use a third-party service or a cron scheduler. Cron is fine if you have tasks like "ping that every…
When you are writing Jenkins pipeline scripts, especially pipelines for clouds or Kubernetes, you will often end up calling quite a few one-or-two-line bash scripts.
After a while, when you investigate build-and-deploy failures, you realize that Jenkins hides these scripts behind a very generic "Shell script" step name.
If all of these steps happen within a single stage and you have lots of them, it can be very frustrating and hard to see which script is failing. You have to open the logs and work out the context from there.
After searching Stack Overflow…
Software developer and cloud enthusiast