Livedebug: the art of dumping core and dying on live production systemsPosted by Michael Shuldman, Inferno Nettverk A/S, Norway on Sat Aug 3 22:42:07 MEST 2013
When a customer reports a problem with our software, we work hard to figure out what is going on as fast as possible, usually with no access to the system where our software is running in production.
Depending on the SLA we have with the particular customer and how serious the problem is, this may involve waking up the right developer in the middle of the night to start analysing the problem.
Partly due to this, our software has a built-in debug option which we sometimes ask the customer to enable. This option logs a lot of information that is normally of no interest to anyone, except us if and when a problem occurs.
During the course of many years and many debugging sessions, a large amount of debug logging has been added to our software. These debug logs are usually enough for us to figure out where the problem is, though not always.
An unfortunate side effect of all the logging produced when debugging is enabled is that writing the logoutput to disk or the network also takes time, and can slow down the software a little. This slowdown is not something most of us would notice (though our disks may), but some of our software is network servers that are used to transfer network packets as fast as possible.
When you have customers who contacts support for assistance in order to figure out why a particular network packet with stock exchange data was forwarded at time T, and not 100 milliseconds earlier, you also have customers who with good reason do not want to run your software with debug logging enabled for any considerable amount of time; their users depend on getting the data as fast as possible and the operator may not be able to slow things down for his users in order for us to spend time on analysing the problem.
Add to this that we also have cases where debug logging could be enabled, but the extra delays caused by the debug logging prevented the problem from occurring, or prevented it from occurring within a reasonable amount of time.
If we are unable to reproduce the customers problem in our own network lab, and the customer is unable to enable debug logging while running our software, then we have a problem.
Without the debug logs we have little information to work with, and we will almost certainly have to use considerable more time on figuring out what the problem is than would otherwise be needed. It might also force us to go back and forth with our customer repeatedly, taking him away from his other work.
In some cases the problem has been of a nature where we think that if only we were able to get a good coredump at a certain time, a lot could be gained, with or without debug logs.
If only ...
Unfortunately, even if we sent the customer a special version of our software, designed to crash and dump core when a particular event occurred, chances are the customer would not be very happy with a program that crashes on a production system, and who can blame him; it might be bad enough that some packets are delayed by some milliseconds, but it is much worse if a lot of network sessions suddenly disconnect due to our program dumping core and crashing, even if it is a planned event in order to get some crucial information. With little debug information available, it's also far from certain that dumping core and crashing just once will be enough for us to understand the problem completely.
We thus have two problems:
An idea I've been developing to help us with cases like this is a minor feature we have ended up naming livedebug. The idea, and the implementation too, has turned out to be quite simple.
Like many other ideas, it might very well be that this idea has been invented many times over the years by many independent people. I am not aware of any "prior art" in this regard however, so here I describe our implementation. Other people's idea for the same may differ and be better, in which case I'd be happy to hear about it so I could improve our own implementation. Please leave a comment if you are aware of other similar ideas.
Normally the code flow in a program that has support for debug logging goes something like this
<lines of code doing what needs to be done> ... debuglog("finished doing A, status is A'. Next doing B") ... <lines of code doing what needs to be done>
If debug logging is enabled, debuglog() will normally write the appropriate logmessage to the logfile. If on the other hand debug logging is not enabled, the debuglog()-function is a no-op that does nothing.
Profiling things, we know that it's the i/o to disk and the syscalls involved with that, locking included, which takes time, and which sometimes makes it impossible for some of our customers to enable debug logging on our software in production.
What our livedebug-feature does, when enabled, is to save all logmessages, debug or not, to a fixed-size ringbuffer in our applications internal memory. Since the memory is of a fixed size internal to each process, no locking is needed, and no extra systemcalls are normally needed either. Saving the logmessage to our ringbuffer takes only an negligible, not even easily measurable, amount of extra time, and is normally not a problem on even the most busy production server.
Solution to problem 1: ringbuffer
The ringbuffer is basically an area in memory of a fixed size to which we write a copy of all loglines, the lines being debug-lines or not. When that memory is full, we start writing at the start of the memory again, overwriting the oldest loglines, and keeping the most recent. The ringbuffer thus always keeps the last N bytes of loglines, where N is the fixed size of the ringbuffer.
We write mostly network software here, so depending on how much network traffic there is, the ringbuffer may be big enough to hold minutes or even hours worth of loginfo, or only the last few seconds. If the default size turns out to be too small, it's easy to increase the size of course, even though it requires recompiling. Depending on what your software does, you will obviously have to do some thinking and experimenting in order to figure out what a good size is for your own software, obviously also taking into account whether you run on a small embedded device or on a large Sun SPARC server with zillion gigabytes of RAM.
Flushing the ringbuffer
With livedebug is enabled, all loglines, including debug loglines that would normally not be logged anywhere, are saved into the ringbuffer.
How do we get to view the data in the ringbuffer?
Well, that is something one also has to design into ones program of course, and there are many ways one can do that.
Being a strong believer in the "crash often and crash early" mantra, ... sorry, that should perhaps better be just "crash early", my code is usually littered with asserts. If something is wrong, I want to know about it as quickly as possible, and the easiest way to do that is often by raising an assert.
The assert logs the error in a obvious way, and might also crash and create a coredump if there is no safe way to continue running at that point. Either or both of these will hopefully be noticed by the operator, who will then report the problem to us. If livedebug is enabled, our program will also flush the ringbuffer to the regular logfile at this point.
So if I can add some checks to test for certain conditions that indicate the problem the customer reported is currently present, I can either raise an assert when those checks fail, or just directly flush the ringbuffer to the regular logfile at that time. That's nice and solves problem 1, since saving data in the ringbuffer takes almost no time, and flushing the ringbuffer once the problem does occur is acceptable, even though the latter will take some extra time of course.
How about the other problem, problem 2, though? I do not wish to raise an assert in production that would crash the process, bringing down network sessions for a lot of clients or worse, but sometimes a coredump would be more than a little helpful.
Solution to problem 2: fork(2) and die
The solution to the problem of getting a coredump from a live production system without disrupting service is a two step-process and is technically independent of the ringbuffer, though we have integrated them in our solution.
It is also simple to implement since the kernel people have done all the hard work for us:
After implementing the above solutions, we can have both a coredump and a large logfile, hopefully containing enough information for debugging.
At the same time, the production system keeps running as before. The extra delay occurs only at the point in time when the problem is detected (when a child process is forked and the ringbuffer if flushed), rather than constantly up to that point (which may take days or weeks).
If it is not easy to add code that automatically detects when the problem is present, one could of course also reserve, for instance, a special signal (e.g., SIGUSR2), and code into the application that if this signal is received, the ringbuffer should be flushed, and/or core should be dumped. If the problem is then manually or otherwise detected at some point, the operator could send the signal to our program at that time, enabling us to get a full debug log of the last seconds/minutes/hours.