Debugging a Python Workload Gone Silent inside Kubernetes

Anders Elton · Published in Compendium · Dec 9, 2019


Today I was at a customer helping them optimize their Cloud Composer setup. Cloud Composer is Google's managed installation of Apache Airflow, the job orchestration tool originally created by Airbnb, and it runs on top of Kubernetes. I had previously advised the customer to use Cloud Composer only to run Docker containers (via the KubernetesPodOperator), as that really simplifies testing and rollout with Cloud Composer/Airflow. Using the PythonOperator to run complex Python programs is something you are bound to regret at some point, both in testing and in package management.

What had started to happen at this customer was that Kubernetes workloads were getting stuck in the Running state. Forever.

Figuring out why Cloud Composer was leaving workloads in the cluster was just a matter of reading the docs. We had not set the is_delete_operator_pod parameter, which means Composer will leave the workload behind after a timeout. This was not really what we wanted, so a simple fix was to apply that parameter, delete the already stuck jobs, and the problem would mostly go away. At least it would be hidden from plain sight.
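For reference, here is a minimal sketch of what that parameter looks like on a KubernetesPodOperator task. The DAG id, image and task name are made up for illustration, and the import path shown is the Airflow 1.10-era one, so it may differ in your environment; is_delete_operator_pod is the parameter discussed above.

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

# Hypothetical DAG for illustration only.
with DAG(
    dag_id="example_read_write",
    schedule_interval="@daily",
    start_date=days_ago(1),
    catchup=False,
) as dag:
    read_write = KubernetesPodOperator(
        task_id="read_write",
        name="read-write",
        namespace="default",
        image="gcr.io/my-project/read-write:latest",  # made-up image
        is_delete_operator_pod=True,  # clean up the pod when the task ends
    )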

It did bug me, however, that a simple read/write job would be stuck like this. There was no business logic to speak of in the program, and basically it was just reading from one source and dumping the data to another target. How could I debug this?

Step #1: attaching to the running container

It is possible to get a shell into a running container, and the simplest way to do this in GKE is to use Cloud Shell. Simply find the running job you want to debug in the Workloads menu and click it. You are then taken to a “Pod details” page. Click the kubectl dropdown menu (top right) → exec → then pick your container (usually there is just one) → Run in Cloud Shell.

You will get a generated command that ends with something like this:

&& kubectl exec <pod_name> --namespace <YOUR_NAMESPACE> -c base -- ls

Add the -it flags and start /bin/bash (or a similar shell). In this case, the Docker images were based on python:3, which is Debian-based.

&& kubectl exec -it <pod_name> --namespace <YOUR_NAMESPACE> -c base -- /bin/bash

Once you press Enter, you have a shell inside the problem container.

Step #2: install debugging tools in the running container

Most likely the container will not contain the tools you need to debug.

gdb is a debugger commonly used to debug C/C++ programs. You will need it, as CPython is implemented in C. We also need a Python layer on top of gdb that can turn its output into something meaningful and inject code into the running interpreter. Pyrasite is such a library, and it should be in every Python developer's toolbox!

These commands were run inside the pod in the SSH session we just started.

apt-get update
apt-get install gdb
pip install pyrasite

Step #3: allow the debugger to work!

Even after installing this, the debugger is not yet allowed to do its magic when running inside Kubernetes. The debugger needs ptrace permissions, which Kubernetes will not grant by default. This is demonstrated by running these commands (inside the pod):

$ ps x
PID TTY STAT TIME COMMAND
1 ? Ssl 0:12 python3 main.py --path=/foo/bar
$ pyrasite
WARNING: ptrace is disabled. Injection will not work.
You can enable it by running the following:
echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
$ echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
tee: /proc/sys/kernel/yama/ptrace_scope: Read-only file system
0

OK, so what now? I am not allowed to change the file I need to change. A solution could be to run the container in privileged mode, but then we would lose the stuck process we are trying to debug. Can anything be done right now to get the debug info I need?

Actually, yes! Kubernetes and Docker run on VMs (in our case Container-Optimized OS), so I located the actual VM the job was running on and SSHed into the VM itself (menu → Compute → Compute Engine → VM instances → SSH on the instance). Once inside the VM I ran

ps aux | grep python

The same Python program that was running inside the Docker container showed up (in addition to the Airflow wrappers). I briefly looked into installing the same tools on this machine, but the OS would require some custom installs and compiles to get what I wanted, and from experience, that can be painful.

Could it be solved another way? Since the Python process was visible on the host, what would happen if I enabled ptrace there? Would that do any good? So I ran…

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope

on the host. This time I got no error message.

Then, going back to the exec session inside the hanging pod, I found this:

$ pyrasite-shell 1
Pyrasite Shell 2.0
Connected to 'python3 main.py --path=/foo/bar --workers=4'
Python 3.8.0 (default, Nov 23 2019, 05:36:56)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(DistantInteractiveConsole)
>>>

Yes! Pyrasite now works!

Pasting this little code snippet into the shell will print the call stack of every thread to stdout.

import sys, traceback

for thread_id, frame in sys._current_frames().items():
    print('Stack for thread {}'.format(thread_id))
    traceback.print_stack(frame)
    print('')

Part of the call stack looked like this (I have obfuscated it on purpose, as the function names and file names of the actual Python program could identify the customer):

File "/foo/bar/baz.py", line xxx, in problem_function
    data = requests.Session().get(url, cookies=info["login_token"], headers=headers, params=querystring).json()

In my case it was easy to see that the requests get call was missing a timeout parameter. This caused .get() to wait forever if the server never responded, and that was causing all this trouble!
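For comparison, this is roughly what the fix looks like. The URL, timeout value and variable names below are placeholders, not the customer's code; the important part is that requests only gives up on a dead server if you actually pass the timeout argument.

import requests

# Placeholder values for illustration only.
url = "https://example.com/api/data"

session = requests.Session()
try:
    # Without timeout=, .get() can block forever on an unresponsive server.
    response = session.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()
except requests.exceptions.Timeout:
    # Fail loudly instead of hanging the whole Airflow task.
    raise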

A stuck program is a gold mine, and simply killing it will only make the problem come back later. You should always try to find the root cause, and getting the call stacks is usually a great help; more often than not, they spell out exactly what the cause is.

This method of getting call stacks can also be applied on bare metal or normal VMs; there you can skip all the complexity of the Kubernetes/Docker layer and its commands.

If you found this post useful, I always appreciate claps :)
