inet   Frontpage - Services - Products - Support - Company - Contact - Blog
Note that all opinions, views, etc. expressed in these blog pages reflect those of the authors, not necessarily those of Inferno Nettverk A/S.  

Valgrind and uninitialised reads debugging

Posted by Michael Shuldman, Inferno Nettverk A/S, Norway on Tue Dec 20 16:46:39 2011

Valgrind is a useful tool for discovering many types of programming errors at runtime, and here at Inferno Nettverk, Valgrind is part of the automated testing system for our software. That means I don't spend much time thinking about or running Valgrind myself, but now and then I get an email from the test system complaining that Valgrind has detected an error.

Since it can be quite a few months between each time that happens, I have usually forgot how to interpret the Valgrind warnings and how I debugged them last time. Alas, that was the case the other day, so I thought that perhaps this time I could spend an extra hour documenting and thinking a little more about how I analysed and debugged this Valgrind warning. Hopefully it would make things easier for me next time, and perhaps somebody else would find it useful too.

One of the things Valgrind can aid in is discovering use of uninitialised memory. In my case, Valgrind reported this particular error:
==10701== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
==10701==    at 0x4111BFE1: sendmsg (socket.S:64)
==10701==    by 0x8056A46: send_io (sockd_child.c:1314)
==10701==    by 0x805FCFA: flushio (sockd_request.c:2742)
==10701==    by 0x805EAB6: dorequest (sockd_request.c:1502)
==10701==    by 0x805BB4E: run_request (sockd_request.c:235)
==10701==    by 0x80546B5: addchild (sockd_child.c:505)
==10701==    by 0x8055037: childcheck (sockd_child.c:748)
==10701==    by 0x80504F9: main (sockd.c:444)
==10701==  Address 0xbeea4144 is on thread 1's stack
What this means is that some function is reading from the memory address 0xbeea4144, but that address has not been initialised/written to yet. It also shows that memory is accessed/read by the sendmsg(2) system call, called from our send_io() function.

Sometimes it's obvious what is uninitialised. E.g., if the only thing msg.mg_iov[i] contains is memory copied from the integer object 'a', then ok, we're reading 'a' before we've written a value to it. Excellent and thank you very much, now the only thing we need to do is to backtrack from that point and see why we've not written to 'a', yet are trying to read from 'a'.

Unfortunately, things are sometimes a bit more complex, as was the case this time.

Looking at send_io() I saw that msg.msg_iov is filled in with the contents of the "io" object passed to send_io() from dorequest() via flushio(). I spent some time looking around in dorequest(), where the "io" object is created, but could not see any missing initialisation.

The "io" object is quite large however, perhaps several thousand bytes, and has dozens of different attributes. Checking each and every one of the attributes for missing initialisation, as well as checking whether other attributes used to initialize the io object had themselves been properly initialised would take forever. I had to disregard that approach and find a smarter way to do this I thought.

So how could I find out which of the many fields in the io object Valgrind was complaining about?

Fortunately, if you look carefully at the Valgrind warning and think about it a little, it's often not too hard to find out what parts of the object the Valgrind error refers to, though it, at least the way I do it, requires some old-fashioned debug printf's.

Valgrind reports that the bad read is done from the memory address 0xbeea4144, and from looking at the source code I had already found out that that address belongs to the io object, so first I added a debug printf to the send_io() function, printing out the address of the io-object passed it. This was just to get a starting point, as I had no idea whether address 0xbeea4144 was at the beginning, end, or middle of the memory allocated for that io object.

This produced the following output:
Dec  1 23:09:58 (1322755798.591805) barefootd[10701]: debug: send_io(): io = 0xbeea4134"
So the io object starts at address 0xbeea4134, while the uninitialised memory as reported by Valgrind is on address 0xbeea4144. From this I understood that the uninitialised data resides close to the beginning of the io object.

I then expanded the debug printf to print the address of the first field of the io->state object also, and got the following output:
Dec  1 23:09:58 (1322755798.591805) barefootd[10701]:
debug: send_io(): io = 0xbeea4134, io->state.command = 0xbeea4138, io->state.clientcommand = 0xbeea413c
So the uninitialised object existed 8 bytes after the start of the memory set aside for the io->state.clientcommand starting at address 0xbeea413c.

I looked at the state object definition in the header file, and saw that the next fields around the clientcommand were these:
   int                  clientcommand;
   int                  protocol;
   int                  proxyprotocol;
   int                  clientprotocol;
I made a quick guess that on my machine an int is four bytes (easy to check using e.g., the sizeof operator if I'd wanted. Having atleast a vague notion of what the size of different standard C types are on your CPU is useful when debugging things like this). If so, the uninitialised object should be the "proxyprotocol" attribute.

I changed the debug printf to print out the address of proxyprotocol to verify this, and reran the test.

This time the output was this:
==13659== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
==13659==    at 0x4111BFE1: sendmsg (socket.S:64)
==13659==    by 0x8056A2C: send_io (sockd_child.c:1313)
==13659==    by 0x805FCDE: flushio (sockd_request.c:2742)
==13659==    by 0x805EA9A: dorequest (sockd_request.c:1502)
==13659==    by 0x805BB32: run_request (sockd_request.c:235)
==13659==    by 0x80546B5: addchild (sockd_child.c:505)
==13659==    by 0x8055037: childcheck (sockd_child.c:748)
==13659==    by 0x80504F9: main (sockd.c:444)
==13659==  Address 0xbedfa134 is on thread 1's stack
==13659==
The debug printf now said this:
Dec  1 23:49:55 (1322758195.274201) barefootd[13658]:
debug: send_io(): io = 0xbedfa124, io->state.proxyprotocol = 0xbedfa134
(You might note that the address of the io object changed after this compilation, but that does not matter. The interesting thing, where the uninitialised data is relative to the start of the io object, remains the same.)

So rather than checking every one of the many fields in the io object trying to find out which were the uninitialised one(s), I now knew the one field Valgrind was actually complaining about.

This of course makes it much easier to next perform some code review and find out what part of the code is wrong, as now I only had to look at the the code pertaining to initialisation of io->state.proxyprotocol.

It was fairly easy to find the problem now that I knew what field to check, and a possible bug was thus fixed.

I was just about to leave for the day and go celebrate this, feeling a good days work had now been done, but as I reran this particular test manually I noticed to my surprise that even after fixing this error, Valgrind still reported an uninitialised use in the same send_io() function. At a different address this time, but in the same function and regarding the same io object. Certainly unconvenient, so I slumped back into my chair again to look more closely at the Valgrind report.

This new address turned out to be quite a bit further out from the start of the io object, so I figured I had found another bug.

Well, that's a pity I thought to myself, but sometimes these things happens, so what can you do? Since I was now warmed up, I might as well try to fix this bug also now, before committing the fixes for both to the main source repository.

Valgrind and false positives

This time Valgrind reported the following:
==5148== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to uninitialised byte(s)
==5148==    at 0x4111BFE1: sendmsg (socket.S:64)
==5148==    by 0x8056A3E: send_io (sockd_child.c:1318)
==5148==    by 0x805FDD5: flushio (sockd_request.c:2743)
==5148==    by 0x805EAFD: dorequest (sockd_request.c:1504)
==5148==    by 0x805BB46: run_request (sockd_request.c:235)
==5148==    by 0x80546B5: addchild (sockd_child.c:505)
==5148==    by 0x8055037: childcheck (sockd_child.c:748)
==5148==    by 0x80504F9: main (sockd.c:444)
==5148==  Address 0xbeced255 is on thread 1's stack
After a little trial and error, I ended up with the following debug printf:
Dec  2 00:55:52 (1322762152.084539) barefootd[5148]: debug: send_io(): io = 0xbeced134,
       io->state.proxychain.extaddr.atype = 0xbeced254,
       io->state.proxychain.extaddr.addr = 0xbeced258
This turned out to be a little strange, as "extaddr" is a sockshost_t object that looks like this:
typedef struct sockshost_t {
   unsigned char        atype;
   union socksaddr_t    addr;
   in_port_t            port;
} sockshost_t;
I.e., the atype field is located at address 0xbeced254, and is the size of one (one C char), while the next field starts at address 0xbeced258. So, why on earth was Valgrind complaining about something at address 0xbeced255, where there was nothing?

Well, if you've ever had to deal with things at just a slightly lower-level than you normally need to deal with, you will have heard off the term "padding" or "padding bits". Padding is basically some bits ("pad bits") added between different objects or attributes. E.g., between the different attributes of a C struct.

We don't normally need to know or care about this (though you often do if you're working with low-level stuff, or sending objects across a network, or to a process compiled as part of a different runtime, or writing data to disk, or ... Hmm, so I guess there are quite a few cases you do need to care about this after all), but in order to understand the Valgrind error, we need to know at least a little about what bit padding is. Otherwise the Valgrind warning would not make any sense.

Nobody was trying to access the padding bits between the atype and addr object in our code, but since Valgrind is brilliant enough to detect almost anybody reading the uninitialised memory, including the sendmsg(2) system call which most certainly knows nothing about the individual members of the io object and will just do a "raw" read of the given size, Valgrind had correctly detected that somebody was reading uninitialised memory.

This would not affect our program, as we were just sending the io object to another process running on the same machine, part of the same runtime binary, and were only accessing the existing fields, so there was nothing wrong in our code per se; we would never try to read the padding bits, but the sendmsg(2) system call needed to read a range of memory and neither knew nor cared that some of the bits in that range was padding bits.

However, unless we wanted to live with the Valgrind warning forever, which would increase the risk of us missing out on real errors in the same area later, we needed to do something to get rid of the Valgrind warning somehow.

Well, how do you initialise a field/attribute that does not exist?

I would have preferred to bzero just the padbytes, so that if other fields of the sockshost_t object at some point ended up being uninitialised, Valgrind would have a chance to catch that also.

I could probably also have solved the problem by adding some quirky code, checking for the address of the different fields, comparing that with the size of the different fields, trying to figure out where and how many pad bytes there were. That would be quite a bit of overkill however, and I'm not sure how portable it would be either, at least not without looking up a good C reference, such as Harbison and Steele's "C A reference manual".

Instead I ended up bzero'ing the whole sockshost_t object at the latest point I could, as close as possible to the initialisation of the real fields in the sockshost_t object.

After this, that Valgrind warning also disappeared, and everything was fine again.
Comments:

Name: John
Subject: Thanks!
Date: February 2013
Awesome report! I had the exact error message and it helped me to understand how Valgrind detects errors and why I was getting the error. Thanks so much for the thorough explanation!
Name: mMontu
Subject: Thanks!
Date: August 2014
I was stuck with false positives, this post was really helpful!
Name: dynamicsamurai
Subject: Weird encounter
Date: September 2015
just had the weirdest encounter with valgrind - it reported a semi-false-positive. It reported the error you mention above, albeit with a different stack track because I'm obviously using different socket lib. Anyhow, I removed tests until I found one that was causing the error. When I removed it, a different error (a memory leak) arose, which after fixing and putting all tests back in also resolved the initial error.
Name: Mike
Subject: Working on one of these right now
Date: January 2016
Thanks for publishing this. I'm working through one of these right now in sendto. Thanks for the insights.

Name: julino
Subject: Great!
Date: January 2016
Awesone! Great report. It helped me understand and resolve the problem. Thanks a lot :)

Add comment:
Not spam: (optional, check if message is not spam)
Name:
Email: (optional, will not be published)
Subject:
Comment:

Copyright © 1998-2017 Inferno Nettverk A/S