Hunting Memory Leaks in Python

Posted on Sat, 2013-08-31 in coding

The first thing you might say: "Memory leaks in Python? What the hell are you talking about? Python has garbage collection. How is this possible? I don't have to care about memory management!" Well, you don't. Until your Python project blows up in your face because it eats more and more memory and you have no idea why. As you might know, the garbage collector can only free unreferenced objects, so if you allocate lots of objects and keep references to ones you no longer use, those objects won't be garbage collected. This doesn't sound so bad, but it's easy to forget about a reference to an object somewhere in a list or a dictionary.

Such memory leaks hurt most in programs that batch-process a lot of files at once or run for a long time without being restarted (e.g. a server). Recently I was challenged with the first type: batch processing a lot of files. With some shell voodoo magic (find -exec) you could possibly work around the memory-leaking program, but that's not a solution. This blog post consists of short descriptions of the tools I encountered and used during my hunt for the memory leak.

So let's start hunting this memory leak.

The Python Debugger

Most of the time it is sufficient to print()-debug smaller Python scripts, but in my case I needed something more powerful: pdb, the debugger that ships with Python. The look and feel of pdb is like a mixture of the Python shell and gdb. To start the Python debugger from within a Python program use

import pdb
pdb.set_trace()

and you are dropped into the pdb shell in the current context. Of course pdb also supports breakpoints and all the debugger features you would expect. For a tutorial on pdb, see [pdbpymotw]. I also keep this cheatsheet on the wall in front of me in case I forget something: [pdbcheatsheet].

So that's a good start for inspecting the internals of your Python program, but we need something more specialized for inspecting the memory usage of a program. The two tools I found useful are guppy/heapy and the memory_profiler module, as suggested on Stack Overflow [memprofilerso].

The memory_profiler Module

Let's talk about memory_profiler [pymemoryprofiler] first. This module gives us line-by-line memory usage via a simple decorator.

from memory_profiler import profile

@profile
def do_something(a, b, c):
    y = [1, 2, 3] * (2 ** 18)
    x = [b.lower() + "-" + str(t) for t in range(a)]
    del y
    x = [b.lower() + "-" + str(t) for t in xrange(a)]
    z = x[::2]
    del x
    return z[c]

do_something((2**16), "omgwtf-", 5)

which will result in the following output:

Line #    Mem usage    Increment   Line Contents
================================================
     3                             @profile
     4     8.746 MB     0.000 MB   def do_something(a, b, c):
     5    14.750 MB     6.004 MB       y = [1, 2, 3] * (2 ** 18)
     6    22.328 MB     7.578 MB       x = [b.lower() + "-" + str(t) for t in range(a)]
     7    16.324 MB    -6.004 MB       del y
     8    22.973 MB     6.648 MB       x = [b.lower() + "-" + str(t) for t in xrange(a)]
     9    22.973 MB     0.000 MB       z = x[::2]
    10    22.973 MB     0.000 MB       del x
    11    22.973 MB     0.000 MB       return z[c]

So you can easily spot which parts of your code require the most memory and where the memory is allocated. This is a good starting point, but the information is not always sufficient to spot the leak. In my opinion, the output is also not that valuable when the function is called repeatedly.

Another memory_profiler feature I find very useful is the memory_usage function. It lets you gather the memory usage of a process at a single point in time, or over a period of time at specific intervals. This is especially useful for monitoring subprocesses, or other processes if you use multiprocessing. To get the current memory usage of the process:

from memory_profiler import memory_usage
print("current memory usage:", memory_usage(-1)[0])

So to get a first overview of the memory usage of the process I included code that called memory_usage at several points in the code and logged it, so that I could grep out the memory usage afterwards.
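
The logging approach can be sketched roughly like this. The logger name, the MEMUSAGE marker and the log_memory() helper are my own inventions for illustration; only memory_usage(-1)[0] comes from memory_profiler. The measurement function is a parameter so the helper can also be tried without memory_profiler installed.

```python
import logging

logger = logging.getLogger("memcheck")  # hypothetical logger name


def measure_with_memory_profiler():
    """Current memory usage of this process, as reported by memory_profiler."""
    from memory_profiler import memory_usage  # third-party, imported lazily
    return memory_usage(-1)[0]


def log_memory(label, measure=measure_with_memory_profiler):
    """Log one greppable line with the memory usage at this checkpoint."""
    usage = measure()
    logger.info("MEMUSAGE %s %.3f", label, usage)
    return usage
```

With logging configured to write to a file, calling log_memory("after parsing") at a few suspicious places and grepping the log for MEMUSAGE shows where the numbers start to climb.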

To monitor another process (in this case my firefox) over a period of 5 seconds and get the memory usage every 0.5 seconds, pass its pid along with interval and timeout:

>>> usage = memory_usage(pid, interval=0.5, timeout=5)
>>> print(usage)
[2280.01171875, 2280.01171875, 2280.01171875, 2280.5234375, 2293.6015625, 2299.43359375, 2334.91796875, 2163.0078125, 2191.03515625, 2196.421875]

An optional but highly recommended dependency of memory_profiler is psutil. You might want to check out this module if you want to monitor other resources as well [psutilmodule].

guppy/heapy

I found memory_profiler useful for getting information on where and when a lot of memory is allocated, but it can't really tell us what is using the memory. For inspecting this I used the heapy module from the guppy package. I highly recommend checking out this tutorial for heapy: [heapytutorial].

guppy/heapy didn't work out of the box on my Arch Linux box with Python 2.7, so I installed the latest trunk version using pip

pip install https://guppy-pe.svn.sourceforge.net/svnroot/guppy-pe/trunk/guppy

So to inspect the python heap I added something like the following in my code

from guppy import hpy
import pdb

hp = hpy()
print(hp.heap())
pdb.set_trace()

so that I could also play around with heapy a little bit more using the pdb shell. Check out the heapy documentation for more information.
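
If heapy is not available, a much cruder stand-in for the "what is using the memory?" question can be built from the standard library. This is not heapy and not what I used in the hunt: it only counts objects tracked by the garbage collector (containers and class instances, not plain strings or numbers) and says nothing about their size. The count_objects_by_type() helper and the Leaky class are made up for illustration.

```python
import gc
from collections import Counter


def count_objects_by_type(limit=10):
    """Return the `limit` most common types among gc-tracked objects."""
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(limit)


class Leaky:
    """Hypothetical class standing in for whatever your program leaks."""

    def __init__(self):
        self.data = []


# Snapshot before and after the suspicious code, then look at the difference.
before = Counter(dict(count_objects_by_type(limit=None)))
hoard = [Leaky() for _ in range(1000)]
after = Counter(dict(count_objects_by_type(limit=None)))

growth = after - before
print(growth.most_common(3))
```

Comparing two snapshots is usually more telling than a single one, because the types that keep growing between checkpoints are the ones worth a closer look.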

What about C extensions?

It seems that heapy cannot access Python types from third-party libraries that are implemented in C. This is really unfortunate (at least in my case). So what do we do now? Back to good old valgrind :)

massive heap inspection with: massif

valgrind comes with a tool called "massif", which logs the call graph for each allocation on the heap. When looking at the massif output we can differentiate between two cases. If we see a lot of malloc() calls, it's probably the extension itself that's leaking memory, and we can hunt down the leak with the valgrind Memcheck tool. But if we see calls to Python's internal allocator, we are probably dealing with a leak in the Python part of our code: a Python type implemented in the C extension is still referenced somewhere in Python code. So my primary goal was to find out what uses a lot of memory. Unfortunately valgrind doesn't really help in this case, as you can't see where the calls originate.

Conclusion

In my case the hunt for the memory leak was unsuccessful, although I think the culprit is SQLAlchemy. Anyway, most of the time it is enough to monitor when memory usage rises to identify the culprit. Another hint is to enable debug output for the garbage collector; you might get some useful information out of it. So although the tools are not great, they are good enough to solve the occasional memory leak problem in Python.
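
For the garbage collector hint, a minimal sketch using the standard gc module could look like this. The Node class and make_cycle() are made up for illustration. I use gc.DEBUG_SAVEALL here, which keeps everything the collector finds unreachable in gc.garbage instead of freeing it; gc.DEBUG_LEAK includes that flag and additionally prints information about the objects it finds to stderr.

```python
import gc


class Node:
    """Hypothetical object that ends up in a reference cycle."""

    def __init__(self):
        self.other = None


def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a  # a and b keep each other alive


gc.collect()                    # start from a clean slate
gc.set_debug(gc.DEBUG_SAVEALL)  # keep unreachable objects in gc.garbage
make_cycle()
found = gc.collect()            # number of unreachable objects found
leaked_nodes = [obj for obj in gc.garbage if isinstance(obj, Node)]
print(found, len(leaked_nodes))

gc.set_debug(0)                 # restore normal behaviour
gc.garbage.clear()
```

Note that this only surfaces objects caught in reference cycles. Objects that are still referenced from a list or dictionary somewhere are not garbage from the collector's point of view, so they will not show up here.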