Using gdb(1), ps(1) and watch(1) to debug memory leaks<Posted by Michael Shuldman, Inferno Nettverk A/S, Norway on Fri May 9 14:05:10 MEST 2014
Even though we use automated tools for helping us detect when and where memory leaks occur, like the excellent Valgrind system, as well as commercial static analysis systems, none of the external tools used were able to warn about a problem we recently had.
Fortunately the problem was however detected by our in-house test system, rather than a customer when our test system started generating messages like this:
td/server_hup FAIL: socksify:memleak: possible memory leak, see _showcpu.LOG for process output memcheck: warning: possible memory leak in sockd-michaels (pid 6439): increase in 40.00% of 15 values in 123s
While not all that helpful for debugging, it does tell us that a memory usage is increasing for a specific process.
There are a lot of servers in the world where memory usage is expected to go up and down during the lifetime of one or more process.
Due to some high-level design decisions done by us (which one can argue for or against, at another time) this is not something we normally expect in our own software however, not even in that software which runs on vanilla UNIX system rather than custom built embedded devices.
This particular test, the td/server_hup test, was repeatedly sending SIGHUP signals to our server, causing it to reload the same config file over and over again. Since it was the same config file that was reloaded, and nothing else was happening, we would expect the memory usage to remain fairly constant during this test. The same should be the case for most of our other tests, even tests where many things were happening. This test was however chosen for debugging this the problem since debugging should be easier when less things were happening.
Since a considerable increase in memory usage was detected by the test, causing the test to fail, there was likely a problem here.
After making sure neither Valgrind nor the commercial static analyser software we use reported any problems related to this, we had to start with some considerable more primitive debugging.
Debugging: gdb(1), ps(1), and watch(1)
First I verified that I could observe the problem manually with ps(1). I started our server, ran `ps -p <pid of server', and then sent the server some SIGHUP signals, noting that the VSZ and RSS values increased after each SIGHUP.
The RSS value (resident set size) should correspond to how much physical RAM the process is currently using, and is a subset of the VSZ size.
The VSZ (virtual set size) should correspond to the amount of memory the process has allocated, including memory space that it is currently not using and which is thus not resident in RAM (i.e., it is swapped out, or depending on the kernel, perhaps not even allocated yet).
Having done this, I started a fresh server, attached to the server process with gdb(1) and set a breakpoint at the server's sighup() function. In another xterm, I ran watch(1):
watch -n 0.1 -d ps p <server pid> o vsz,rss
The watch(1) program, particularly with the `-d' option, makes it easy to see when something changes.
In a third xterm, I sent the server a SIGHUP, and while single-stepping through the sighup() function with gdb(1), kept an eye on the xterm where I was running watch(1).
While doing this I noted where the VSZ and RSS values changed. The `-n 0.1' option made watch(1) fetch the output from ps(1) frequent enough (10 times a second) for me to be able to single step in the debugger, noticing exactly what lines caused changes to the VSZ or RSS.
After doing this a few times, I had a list of three function calls (called by our sighup() function) that each caused an increase in either the VSZ or RSS value reported by ps(1):
I then did the same for the three functions itself, single-stepping through the functions, and functions called by them (if any were relevant), until I found the errors. The main error being, rather embarrassingly, a misunderstanding on my own part regarding that mmap(2) will not automatically munmap(2) the currently mapped region passed it if one uses MAP_SHARED, but will instead create a new mapping at a new address, and keep the old mapping too.
With the use of two xterm windows visible at the same time, and one program running in each of them:
One is able to step through large amounts of program code in one xterm window, starting with the highest-level functions of the code and noting in another xterm window which functions cause an increase in memory usage. From thereon it is easy to repeat the procedure, using progressively lower-level functions, until one finds the actual lines of code responsible for the increase in memory usage.
Doing this one is might be able to in a fairly automated fashion considerably reduce the amount of code needing careful examination and intelligent thought.