Troubleshooting Thread Leaks in Python
TL;DR
py-spy, grep, and logs saved the day: py-spy to identify the leak point, grep and logs to confirm the leak chain.
Background
Recently, as part of some security hardening work, we introduced a second-party lib. One day after that change, an alert came in about abnormal CPU usage and a file descriptor (fd) leak.
This second-party lib is used to proxy HTTP requests. Through this proxy, we can ensure requests are trusted.
Troubleshooting Process
1. Using py-spy to Identify the Leak Point
py-spy can be used not only for Python performance analysis but also for thread dumps. We use py-spy to identify the leak point.
py-spy dump --pid <pid>
With this command, you can see the stack of every thread in the target process. If the process has a large number of threads with nearly identical stacks, it is probably leaking threads, and the stack that keeps repeating is the leak point.
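The "many similar stacks" check can also be approximated from inside a process you control. The following is a minimal sketch, not how py-spy works (py-spy inspects the process from outside): it groups live threads by the innermost function names of their stacks using `sys._current_frames()`. The names `count_thread_stacks` and `leaky_worker` are made up for illustration.

```python
import sys
import threading
import time
import traceback
from collections import Counter


def count_thread_stacks(depth=3):
    """Count live threads by the innermost `depth` function names of their stack."""
    counts = Counter()
    for _thread_id, frame in sys._current_frames().items():
        stack = traceback.extract_stack(frame)
        # The signature is the tail of the stack, outermost to innermost.
        signature = tuple(entry.name for entry in stack[-depth:])
        counts[signature] += 1
    return counts


def leaky_worker(stop):
    # Hypothetical stand-in for a leaked thread: it blocks until told to stop.
    stop.wait()


if __name__ == "__main__":
    stop = threading.Event()
    workers = [threading.Thread(target=leaky_worker, args=(stop,)) for _ in range(5)]
    for worker in workers:
        worker.start()
    time.sleep(0.2)  # give the workers time to reach their blocking call

    # The most frequent signature is the first suspect for a leak.
    for signature, count in count_thread_stacks().most_common(3):
        print(count, " <- ".join(reversed(signature)))

    stop.set()
    for worker in workers:
        worker.join()
```

A signature whose count keeps growing between two snapshots is a stronger hint than a single large count, since a healthy thread pool also shows many identical stacks.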
2. Using grep and log to Confirm the Leak Chain
Python language servers (LSP) often can't jump straight to the concrete implementation when a call goes through duck typing or a protocol. In that case, we have to fall back on string matching to confirm the leak chain.
- Use grep to recursively search for trigger points within the library
- Use log content to verify whether the call chain hypothesis is correct, such as checking whether there are logs matching the hypothesis along the suspected path
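For environments where grep is not at hand, the recursive search can be sketched in Python. This is only an illustration of the string-matching idea; `search_tree` is a hypothetical helper, and the search term and path in the usage note are placeholders.

```python
import os


def search_tree(root, needle, suffix=".py"):
    """Yield (path, line_number, line) for every line under `root` containing `needle`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(suffix):
                continue
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps the scan going on files with odd encodings.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if needle in line:
                        yield path, line_number, line.rstrip("\n")
```

For example, `list(search_tree("path/to/the/lib", "Thread("))` lists the places where the library creates threads, assuming it creates them through `threading.Thread`.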
Here are some tips:
- Generally, our application has a baseline version that runs normally, and the leak only appears after a certain change. So focusing on the changed code and binary searching for the trigger point will help us clarify the call chain
- grep is quite useful: recursive searching is faster and more accurate than you might expect
- You can approach from both the leak point and the trigger point, converging to find the common call chain
Others
This article is just an introduction. Later I will analyze py-spy’s working mechanism in more depth.