|
Note that all opinions, views, etc. expressed in these blog pages
reflect those of the authors, not necessarily those of
Inferno Nettverk A/S.
| |
Valgrind and uninitialised reads debugging
Posted by Michael Shuldman, Inferno Nettverk A/S, Norway
on Tue Dec 20 16:46:39 2011
Valgrind is a useful tool for discovering many types of programming errors
at runtime, and here at Inferno Nettverk, Valgrind is part of the automated
testing system for our software. That means I don't spend much time
thinking about or running Valgrind myself, but now and then I get an
email from the test system complaining that Valgrind has detected an error.
Since it can be quite a few months between each time that happens, I have
usually forgot how to interpret the Valgrind warnings and how I debugged
them last time. Alas, that was the case the other day, so I thought
that perhaps this time I could spend an extra hour documenting and
thinking a little more about how I analysed and debugged this Valgrind
warning. Hopefully it would make things easier for me next time, and
perhaps somebody else would find it useful too.
One of the things Valgrind can aid in is discovering use of uninitialised
memory. In my case, Valgrind reported this particular error:
==10701== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
==10701== at 0x4111BFE1: sendmsg (socket.S:64)
==10701== by 0x8056A46: send_io (sockd_child.c:1314)
==10701== by 0x805FCFA: flushio (sockd_request.c:2742)
==10701== by 0x805EAB6: dorequest (sockd_request.c:1502)
==10701== by 0x805BB4E: run_request (sockd_request.c:235)
==10701== by 0x80546B5: addchild (sockd_child.c:505)
==10701== by 0x8055037: childcheck (sockd_child.c:748)
==10701== by 0x80504F9: main (sockd.c:444)
==10701== Address 0xbeea4144 is on thread 1's stack
What this means is that some function is reading from the memory address
0xbeea4144, but that address has not been initialised/written to yet.
It also shows that memory is accessed/read by the sendmsg(2) system call,
called from our send_io() function.
Sometimes it's obvious what is uninitialised. E.g., if the only
thing msg.mg_iov[i] contains is memory copied from the integer object
'a', then ok, we're reading 'a' before we've written a value to it.
Excellent and thank you very much, now the only thing we need to do
is to backtrack from that point and see why we've not written to 'a',
yet are trying to read from 'a'.
Unfortunately, things are sometimes a bit more complex, as was the case
this time.
Looking at send_io() I saw that msg.msg_iov is filled in with the
contents of the "io" object passed to send_io() from dorequest() via
flushio(). I spent some time looking around in dorequest(), where the
"io" object is created, but could not see any missing initialisation.
The "io" object is quite large however, perhaps several thousand bytes,
and has dozens of different attributes. Checking each and every one of
the attributes for missing initialisation, as well as checking whether
other attributes used to initialize the io object had themselves been
properly initialised would take forever. I had to disregard that
approach and find a smarter way to do this I though.
So how could I find out which of the many fields in the io object Valgrind
was complaining about?
Fortunately, if you look carefully at the Valgrind warning and think
about it a little, it's often not too hard to find out what parts of the
object the Valgrind error refers to, though it, at least the way I do it,
requires some old-fashioned debug printf's.
Valgrind reports that the bad read is done from the memory address 0xbeea4144,
and from looking at the source code I had already found out that that
address belongs to the io object, so first I added a debug printf to the
send_io() function, printing out the address of the io-object passed it.
This was just to get a starting point, as I had no idea whether address
0xbeea4144 was at the beginning, end, or middle of the memory allocated
for that io object.
This produced the following output:
Dec 1 23:09:58 (1322755798.591805) barefootd[10701]: debug: send_io(): io = 0xbeea4134"
So the io object starts at address 0xbeea4134, while the uninitialised
memory as reported by Valgrind is on address 0xbeea4144. From this I
understood that the uninitialised data resides close to the beginning of
the io object, since memory usually grows towards higher addresses on
the CPU's most of us get to program on today.
I then expanded the debug printf to print the address of the first field
of the io->state object also, and got the following output:
Dec 1 23:09:58 (1322755798.591805) barefootd[10701]:
debug: send_io(): io = 0xbeea4134, io->state.command = 0xbeea4138, io->state.clientcommand = 0xbeea413c
So the uninitialised object existed 8 bytes after the start of the memory
set aside for the io->state.clientcommand starting at address 0xbeea413c.
I looked at the state object definition in the header file, and saw that
the next fields around the clientcommand were these:
int clientcommand;
int protocol;
int proxyprotocol;
int clientprotocol;
I made a quick guess that on my machine an int is four bytes (easy
to check using e.g., the sizeof operator if I'd wanted. Having atleast a
vague notion of what the size of different standard C types are on your
CPU is useful when debugging things like this). If so, the uninitialised
object should be the "proxyprotocol" attribute.
I changed the debug printf to print out the address of proxyprotocol
to verify this, and reran the test.
This time the output was this:
==13659== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
==13659== at 0x4111BFE1: sendmsg (socket.S:64)
==13659== by 0x8056A2C: send_io (sockd_child.c:1313)
==13659== by 0x805FCDE: flushio (sockd_request.c:2742)
==13659== by 0x805EA9A: dorequest (sockd_request.c:1502)
==13659== by 0x805BB32: run_request (sockd_request.c:235)
==13659== by 0x80546B5: addchild (sockd_child.c:505)
==13659== by 0x8055037: childcheck (sockd_child.c:748)
==13659== by 0x80504F9: main (sockd.c:444)
==13659== Address 0xbedfa134 is on thread 1's stack
==13659==
The debug printf now said this:
Dec 1 23:49:55 (1322758195.274201) barefootd[13658]:
debug: send_io(): io = 0xbedfa124, io->state.proxyprotocol = 0xbedfa134
(You might note that the address of the io object changed after this
compilation, but that does not matter. The interesting thing, where
the uninitialised data is relative to the start of the io object, remains
the same.)
So rather than checking every one of the many fields in the io object
trying to find out which were the uninitialised one(s), I now knew the
one field Valgrind was actually complaining about.
This of course makes it much easier to next perform some code review and
find out what part of the code is wrong, as now I only had to look at the
the code pertaining to initialisation of io->state.proxyprotocol.
It was fairly easy to find the problem now that I knew what field
to check, and a possible bug was thus fixed.
I was just about to leave for the day and go celebrate this, feeling a
good days work had now been done, but as I reran this particular test
manually I noticed to my surprise that even after fixing this error,
Valgrind still reported an uninitialised use in the same send_io()
function. At a different address this time, but in the same function
and regarding the same io object. Certainly unconvenient, so I slumped
back into my chair again to look more closely at the Valgrind report.
This new address turned out to be quite a bit further out from the start
of the io object, so I figured I had found another bug.
Well, that's a pity I thought to myself, but sometimes these things
happens, so what can you do? Since I was now warmed up, I might as well
try to fix this bug also now, before committing the fixes for both to
the main source repository.
Valgrind and false positives
This time Valgrind reported the following:
==5148== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
==5148== at 0x4111BFE1: sendmsg (socket.S:64)
==5148== by 0x8056A3E: send_io (sockd_child.c:1318)
==5148== by 0x805FDD5: flushio (sockd_request.c:2743)
==5148== by 0x805EAFD: dorequest (sockd_request.c:1504)
==5148== by 0x805BB46: run_request (sockd_request.c:235)
==5148== by 0x80546B5: addchild (sockd_child.c:505)
==5148== by 0x8055037: childcheck (sockd_child.c:748)
==5148== by 0x80504F9: main (sockd.c:444)
==5148== Address 0xbeced255 is on thread 1's stack
After a little trial and error, I ended up with the following debug printf:
Dec 2 00:55:52 (1322762152.084539) barefootd[5148]: debug: send_io(): io = 0xbeced134,
io->state.proxychain.extaddr.atype = 0xbeced254,
io->state.proxychain.extaddr.addr = 0xbeced258
This turned out to be a little strange, as "extaddr" is a sockshost_t
object that looks like this:
typedef struct sockshost_t {
unsigned char atype;
union socksaddr_t addr;
in_port_t port;
} sockshost_t;
I.e., the atype field is located at address 0xbeced254, and is the size
of one (one C char), while the next field starts at address 0xbeced258.
So, why on earth was Valgrind complaining about something at address
0xbeced255, where there was nothing?
Well, if you've ever had to deal with things at just a slightly
lower-level than you normally need to deal with, you will have heard
off the term "padding" or "padding bits". Padding is basically
some bits ("pad bits") added between different objects or attributes.
E.g., between the different attributes of a C struct.
We don't normally need to know or care about this (though you often do if
you're working with low-level stuff, or sending objects across a network,
or to a process compiled as part of a different runtime, or writing data
to disk, or ... Hmm, so I guess there are quite a few cases you do need
to care about this after all), but in order to understand the Valgrind
error, we need to know at least a little about what bit padding is.
Otherwise the Valgrind warning would not make any sense.
Nobody was trying to access the padding bits between the atype
and addr object in our code, but since Valgrind is brilliant
enough to detect almost anybody reading the uninitialised memory,
including the sendmsg(2) system call which most certainly knows nothing about
the individual members of the io object and will just do a "raw" read
of the given size, Valgrind had correctly detected that somebody
was reading uninitialised memory.
This would not affect our program, as we were just sending the io object
to another process running on the same machine, part of the same runtime
binary, and were only accessing the existing fields, so there was nothing
wrong in our code per se; we would never try to read the padding bits,
but the sendmsg(2) system call needed to read a range of memory and neither
knew nor cared that some of the bits in that range was padding bits.
However, unless we wanted to live with the Valgrind warning forever,
which would increase the risk of us missing out on real errors in the
same area later, we needed to do something to get rid of the Valgrind
warning somehow.
Well, how do you initialise a field/attribute that does not exist?
I would have preferred to bzero just the padbytes, so that if other fields
of the sockshost_t object at some point ended up being uninitialised,
Valgrind would have a chance to catch that also.
I could probably also have solved the problem by adding some quirky code,
checking for the address of the different fields, comparing that with
the size of the different fields, trying to figure out where and how many
pad bytes there were. That would be quite a bit of overkill however,
and I'm not sure how portable it would be either, at least not without
looking up a good C reference, such as Harbison and Steele's "C A
reference manual".
Instead I ended up bzero'ing the whole sockshost_t object at the latest
point I could, as close as possible to the initialisation of the real
fields in the sockshost_t object.
After this, that Valgrind warning also disappeared, and everything
was fine again.
Comments:
Add comment:
| |