Troubleshooting Cloud Composer
Today I had an interesting case from one of our customers. They are running a decent-sized Composer cluster with four n2-highmem-2 machines, plus an additional node pool (with even beefier machines) to run data-science jobs spawned with the Pod Operator. Most of the jobs are extract/load jobs from various databases into BigQuery.
A few days ago, Composer started acting up and missing a lot of its deadlines. Jobs would be skipped or fail. It was vital to get whatever caused this resolved.
It is important to know that, under the hood, Cloud Composer actually runs in a Kubernetes cluster. And Kubernetes is essentially a set of managed VMs that run Docker containers and coordinate work between them. Composer is a managed service from Google Cloud that is supposed to, well, be managed, so you should not have to worry about whether Composer is running optimally. But you should. There is quite a bit one can do with the configuration to make it behave, well, better.
When looking at the airflow-worker deployment, we noticed that all the workers were running on one of the nodes. That one node was using both of its CPUs at full capacity, while the three other nodes were having a nice break.
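One way to see this kind of skew from the command line is to count the worker pods per node. The helper below is only a sketch: the function name `pods_per_node` is made up for illustration, and the usage lines assume the worker pods' names start with `airflow-worker`.

```shell
# Hypothetical helper: reads "POD_NAME NODE_NAME" lines on stdin and
# prints "<count> <node>" for each node, busiest node first.
pods_per_node() {
  awk '{ count[$2]++ } END { for (node in count) print count[node], node }' | sort -rn
}

# Usage against a live cluster (needs kubectl access; AIRFLOW_WORKER_NS
# is the Composer namespace, looked up later in this post):
#   kubectl get pods -n "$AIRFLOW_WORKER_NS" --no-headers \
#     -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName \
#     | grep '^airflow-worker' | pods_per_node
#   kubectl top nodes   # per-node CPU and memory usage
```

If a single node shows up with all the workers while `kubectl top nodes` reports that node near full CPU, you are likely looking at the same situation.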
It turns out that Composer has seriously misconfigured the airflow-worker by not allocating any resources to it. Kubernetes is a fantastic platform that handles program crashes by restarting them and finds a VM to put them on, without you having to worry too much about it. However, when no resources are requested in the config file, Kubernetes is not smart enough to know that, in Airflow's case, the airflow-worker can become quite memory heavy. In addition, the deployment is written in such a way that when a worker crashes, it does not give Kubernetes any hints on where to place it. So Kubernetes will find the node with the least work to do. And since the airflow-worker does not request any memory or CPU, the workers will eventually all go to the node with the least to do, which eventually turns out to be the same node. I figure you will see this more often if several workers crash (or restart) at about the same time.
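To check whether your own environment has the same gap, you can print the resources block of each container in the deployment. This is a minimal sketch; the function name `show_worker_resources` is hypothetical, and it assumes the deployment is called `airflow-worker` as it was in this cluster.

```shell
# Hypothetical helper: prints each container in the airflow-worker
# deployment followed by its resources block. An empty block next to a
# container means no requests or limits are set for it.
show_worker_resources() {
  # $1 = namespace of the Composer deployment
  kubectl get deployment airflow-worker -n "$1" \
    -o jsonpath='{range .spec.template.spec.containers[*]}{.name}{"\t"}{.resources}{"\n"}{end}'
}

# Usage: show_worker_resources "$AIRFLOW_WORKER_NS"
```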
Fortunately this is something we can fix!
First, connect to the Kubernetes cluster.
Then press the Connect button, and run the gcloud command it generates for you in Cloud Shell.
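The generated command is typically a `gcloud container clusters get-credentials` call. The sketch below wraps it in a function so the placeholders are explicit; the function name and the example cluster, zone, and project values are hypothetical, so copy the real ones from the console.

```shell
# Hypothetical wrapper around the credentials command.
connect_to_composer_cluster() {
  # $1 = GKE cluster name, $2 = zone, $3 = project ID
  gcloud container clusters get-credentials "$1" --zone "$2" --project "$3"
}

# Usage (placeholder values, not taken from the original setup):
#   connect_to_composer_cluster my-composer-gke europe-west1-b my-project
#   kubectl get nodes   # should now list the cluster's nodes
```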
Then create a file named patch.yaml with the following content:
spec:
  template:
    spec:
      containers:
      - name: airflow-worker
        resources:
          requests:
            memory: 6Gi # about half of node capacity, so a node never gets two workers
          limits:
            memory: 10Gi
      - name: gcs-syncd
        resources:
          requests:
            cpu: 10m # default from Composer
            memory: 512Mi
          limits:
            cpu: 10m
            memory: 512Mi
Please adjust the numbers according to your machine type. The important part is to request a bit more than half of the node's memory, so you never get two airflow-workers running on the same machine. You do not have to worry about the CPU allocation.
To apply the patch, run this final command:
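The "a bit more than half" rule can be worked out from a node's allocatable memory, which is lower than the machine type's nominal memory because the node reserves some for itself. The helper below is a sketch: `min_worker_request_mib` is a made-up name, and it assumes the value is reported in Ki, as in the usage example.

```shell
# Hypothetical helper: given a node's allocatable memory in Ki (for
# example "8388608Ki"), print the smallest whole number of Mi that is
# more than half of it.
min_worker_request_mib() {
  alloc_kib=${1%Ki}                      # strip the unit suffix
  echo $(( alloc_kib / 2 / 1024 + 1 ))   # halve, convert Ki to Mi, step past half
}

# Usage against a live cluster:
#   kubectl get nodes -o custom-columns=NAME:.metadata.name,MEM:.status.allocatable.memory
#   min_worker_request_mib 8388608Ki     # prints 4097
```

Two pods that each request this much cannot fit on one node. Other pods on the node also hold memory requests, so a somewhat lower value may already be enough in practice, but that depends on the cluster.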
AIRFLOW_WORKER_NS=$(kubectl get namespaces | grep composer | cut -d ' ' -f1)
kubectl patch deployment airflow-worker -n ${AIRFLOW_WORKER_NS} --patch "$(cat patch.yaml)"
After this patch, the airflow-worker reserves memory, which also means that you can run other Kubernetes jobs inside the cluster without having to worry about whether a job would evict an airflow-worker.
While debugging this issue I remembered an old blog post about Composer autoscaling, which is where I got the idea for how to patch it. Please check it out! https://medium.com/traveloka-engineering/enabling-autoscaling-in-google-cloud-composer-ac84d3ddd60
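To confirm the patch took effect, you can wait for the rollout to finish and then look at where the workers landed. Again a sketch under assumptions: `verify_worker_patch` is a hypothetical name, and the pod names are assumed to contain `airflow-worker`.

```shell
# Hypothetical helper: waits for the patched deployment to finish
# rolling out, then lists the worker pods with the node each runs on.
verify_worker_patch() {
  # $1 = namespace of the Composer deployment
  kubectl rollout status deployment/airflow-worker -n "$1" &&
    kubectl get pods -n "$1" -o wide | grep airflow-worker
}

# Usage: verify_worker_patch "$AIRFLOW_WORKER_NS"
```

If the memory request is large enough, each worker should now show up on a different node.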