IPC performance: sockets vs files

Posted by Michael Shuldman, Inferno Nettverk A/S, Norway on Tue Jun 30 15:38:18 MEST 2015
This article describes the performance difference observed between using datagram sockets to send a large message directly from one process to another process on the same machine, and using the same type of sockets to instead send a minimal message with a reference to the file from which the large message can be read.
Note that this article does not describe a test between using shared memory and message passing. Rather, it's a test between sending a small message that references a file with the large message, and sending the large message directly.
The reason for this performance evaluation is a new problem detected during the testing of a future version of one of our products.
This product uses message passing for internal communication, but if the messages are too large, as they have started to become in the development version of the product, they can no longer be sent over datagram sockets; the write call will fail with the error EMSGSIZE.
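As a minimal sketch of the failure mode (our own illustration, not Barefoot code; the helper name and the 64MB size are ours, and the exact datagram limit is platform-specific), an AF_UNIX datagram socketpair accepts a small datagram, but, at least on Linux, rejects one far larger than the socket buffer with EMSGSIZE:

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Try to send a single datagram of 'len' bytes over an AF_UNIX
 * SOCK_DGRAM socketpair.  Returns 0 if the send succeeds, or the
 * errno from send(2) if it fails (EMSGSIZE when the datagram is
 * too large for the socket). */
int try_dgram_send(size_t len)
{
    int sv[2], ret = 0;
    char *msg;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) == -1)
        return errno;

    if ((msg = malloc(len)) == NULL) {
        close(sv[0]);
        close(sv[1]);
        return ENOMEM;
    }
    memset(msg, 0, len);

    if (send(sv[0], msg, len, 0) == -1)
        ret = errno;

    free(msg);
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

Raising SO_SNDBUF moves the limit, but only up to a system-imposed maximum, which is why adjusting the socket buffers is not a complete solution.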
If stream sockets were used instead of datagram sockets, no such error would be received, but in the architecture we use, communication is for various reasons done using datagram sockets. It is a design that has served us well so far, and one that we do not want to change.
We thus want to continue using datagram sockets for message passing, and message passing will continue to be required, as some of the objects passed in the messages, including file descriptors, cannot be passed between processes in other ways.
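The descriptor passing itself is done with sendmsg(2)/recvmsg(2) and SCM_RIGHTS ancillary data. The helpers below are an illustrative sketch of that mechanism, not Barefoot's actual functions:

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send the descriptor 'fd' over the UNIX socket 'sock' as
 * SCM_RIGHTS ancillary data, alongside a one-byte payload.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;      /* forces correct alignment */
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd().  Returns the new
 * descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg;
    int fd;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new descriptor in the receiving process that refers to the same open file description, which is what makes this mechanism irreplaceable for our purposes.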
The Barefoot port bouncer
The product in question, the Barefoot port bouncer, uses what is commonly referred to as a message-based architecture. Basically, the Barefoot server consists of many processes that all run on one machine and communicate with each other by exchanging messages.
When the Barefoot server accepts a network client request, it will connect to a predetermined remote target server, and then forward data between the network client and the remote target server. Before the forwarding of data can begin, various ACLs are checked, and data structures are set up and configured for the client (e.g., bandwidth limitation may or may not be enforced on a particular client, or Barefoot may have been configured to connect to the target server via another proxy, rather than directly).
All in all, five messages are sent between various Barefoot processes during session initialization, before data forwarding can begin (assuming neither the Barefoot ACLs nor anything else prevents forwarding).
Originally, when the code the Barefoot server is based on (Dante) was being developed, these messages were quite small. However, after many years of development and many customer requests for various new features, the size of these messages has grown, and can now in some cases be larger than 100kB.
This has recently, on the development branch for the next version, led us to receive the error "EMSGSIZE: Message too long" during testing when sending some messages on certain UNIX platforms (Solaris in this case), even though we attempt to set the size of the socket buffers appropriately.
While it might be possible to spend some time on an effort to reduce the size of the messages, perhaps even compress them, it is unlikely that this would serve as a long-term solution; new features and customer requests would likely necessitate further increases in the future, and the same problem would then surface again.
There also exist various high-level libraries and interfaces for sending messages or sharing data between processes. Possibly one of these libraries or interfaces could be used to avoid the problem, but for portability, performance, and perhaps most importantly, for customer support reasons, at Inferno Nettverk we prefer to use standard UNIX system calls and libraries as much as possible.
If a customer reports a problem, it may be hard enough as it is to track down and fix our own bug; we do not want to create additional complexity - and additional wait time for our customer - by using some externally developed library or interface, potentially used by a relatively small group of people, and possibly having to debug that library/interface too.
Non-standard third-party libraries are only used as a last resort for things it would be unrealistic to develop in-house.
Alternative to sending a large message
Considering the technical differences between Barefoot sending a (sometimes large) socket message with network client information between processes, and sending a small message with a reference to where the (sometimes large) object with the network client information can be read, no technical problems with such a change were obvious.
The code changes we would need to make to Barefoot are also not expected to be large, as Barefoot would still need to continue sending messages between processes due to having to transfer file descriptors between processes. The main difference would be that, unlike before the change, the messages would only contain file descriptors (related to the network client and target server), and not both file descriptors and client data.
The next question was then related to performance. If the performance would be significantly reduced by referencing files for the large message data, compared to sending the large message data via sockets, this would obviously need to be considered before a change were to be made to Barefoot.
Unfortunately, I was unable to find any paper offering an applicable analysis of the performance difference between these two methods.
We ended up writing our own small program for this. Having written the program, we thought we'd also share this article related to the results, as perhaps others might also find it useful.
Performance: files versus sockets
While the important metric is obviously the performance difference between using sockets and files to communicate between processes in Barefoot itself, as observed in a production setting, it is considerably easier to start with a standalone test program, incorporating a design as similar as possible to what would be used in Barefoot for the same message passing.
A test program would be easier to analyse, debug, and fine-tune, as well as provide some important initial performance measurements. After all, if we cannot get a small test program to run with good performance, it would be pointless to try to get similar code in the much larger Barefoot server to run with good performance.
An early, possibly still buggy, version of the test program is provided here for reference: ipc-comparisontest.c. Please note that this program will not compile directly. We compile it as part of our internal test framework, and it uses a few functions that are part of Barefoot, rather than the standard UNIX libraries. These functions are however a small part of the program, which itself isn't all that big, so it should not be much work to make it compile standalone, should that be desired. The program is mainly provided to serve as detailed documentation of what was tested.
The program can run in two modes. One mode uses sockets to send a large message (around 100kB), as well as the necessary file descriptors related to the network client. This is similar to what is used in Barefoot up to the current version (version 1.4.x at the time of this writing).
The other mode uses the same type of sockets to send a zero-byte message with a file descriptor that references a temporary file containing the large message, as well as the necessary file descriptors related to the network client.
In the latter case, the contents of the temporary file referenced by the file descriptor are written from a process-local object to the file by the sending process, and read from the file into a process-local object by the receiving process; similar to what would happen if the message were sent and received via a send/receive system call on a socket.
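The two halves of the file mode can be sketched as follows (illustrative helper names and error handling of our own; in the real program the descriptor returned by the first function would travel to the peer as ancillary data on the zero-byte message):

```c
#include <stdlib.h>
#include <unistd.h>

/* Sender side: store 'len' bytes from 'data' in an unlinked
 * temporary file and return a descriptor referencing it.  The
 * descriptor would then be passed to the peer as SCM_RIGHTS
 * ancillary data on the zero-byte message. */
int message_to_fd(const void *data, size_t len)
{
    char path[] = "/tmp/ipcmsg.XXXXXX";
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;

    unlink(path);   /* no name needed; deleted on last close() */

    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Receiver side: rewind the received descriptor, read the
 * message into 'buf', and close the descriptor.  Returns the
 * number of bytes read, or -1 on error. */
ssize_t message_from_fd(int fd, void *buf, size_t len)
{
    ssize_t rc;

    if (lseek(fd, 0, SEEK_SET) == -1) {
        close(fd);
        return -1;
    }
    rc = read(fd, buf, len);
    close(fd);
    return rc;
}
```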
Next an strace(1) of the two modes is shown. The strace shows the system calls executed during the send/receive loop, excluding the select(2) call common to both modes.
The strace was used to verify that neither mode contained any unexpected or not obviously necessary system calls, which it admittedly did in the first version.
The socket mode: sending a 112,192-byte message
Sending the client state as part of a 112,192-byte message sent over local sockets:
sendmsg (112192 bytes, + 32 bytes for the cmsg-part)
Receiving the client state as part of a 112,192-byte message sent over local sockets:
recvmsg (112192 bytes, + 32 bytes for the cmsg-part)
The file mode: sending a file descriptor referencing the message
Creating a temporary file, storing the client state in the temporary file, sending a fd referencing the file to another process as part of a zero bytes long message sent over sockets, and then closing the fd:
open
unlink
write
sendmsg (0 bytes, + 32 bytes for the cmsg-part)
close
The other process, receiving a fd referencing the temporary file, lseek(2)-ing to the start, reading the data from the file into a local object, and then closing the fd:
recvmsg (0 bytes, + 32 bytes for the cmsg-part)
lseek
read
close
This procedure is repeated a million times, with each iteration creating a new file to be written and read.
As can be seen from the straces, creating a temporary file and using that file to retrieve the message involves many extra system calls; four extra on the sending side, and three extra on the receiving side. It is natural to assume that this will increase the system time considerably, but let's see what the results show.
The tests were run on a laptop with a four-core 2.67GHz i5 CPU, while the laptop was in normal, light use. The laptop has an SSD and had plenty of free RAM available while testing.
3.19.8-100.fc20.x86_64 #1 SMP Tue May 12 17:08:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
The test program was run 10 times in each mode, executing the send/receive loop one million times for each test run, alternating between the two modes from one run to the next. The median CPU time of the runs in each mode was then computed. The results are listed in the table below:
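For reference, CPU time for a loop like this can be sampled with getrusage(2); the helper below is a sketch of such a measurement, not the test program's actual code:

```c
#include <sys/resource.h>
#include <sys/time.h>

/* Return this process's combined user + system CPU time in
 * microseconds, as reported by getrusage(2).  Sampling before
 * and after the million-iteration send/receive loop and
 * subtracting gives the CPU time the loop consumed. */
long cpu_time_us(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;

    return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000L +
            ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```

Unlike wall-clock time, this metric is largely unaffected by other light activity on the machine, which is why CPU time was used for the comparison.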
As can be seen, both user and system time are significantly longer using the file-based approach, with the system time in particular being three times as long.
Thus, if there is a choice, at least on this Linux version, sending a large message to a local process appears to be considerably more efficient than sending a small message referencing a file with the large message.
Some further analysis, based on the time spent in each system call, as reported by strace(1) during a single run in each mode, shows the following:
Based on this, sending a very large message via local sockets takes roughly twice as much time as sending a very small message.
Next we performed a similar analysis of how much time was used by the additional system calls required for reading in the large message data from the file descriptor received.
The system call close(2) is shown twice because both the sending and the receiving process need to close(2) the file descriptor. The first close(2) entry is for the sending process, and the second close(2) is for the receiving process. Note: the difference in time is probably related to the fact that the unlink(2)-ed file referenced by the file descriptor will only be deleted after the second close(2), as it is only at that point that nobody has an open filehandle to the file any longer.
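The lifetime rule behind this can be demonstrated directly: after unlink(2) the name is gone, but the data stays reachable through every open descriptor, and the kernel reclaims the file only at the last close. The self-contained check below is our own illustration (not from the test program), with minimal error handling:

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Demonstrate the lifetime of an unlink(2)-ed file: the name
 * disappears immediately, but the data stays reachable through
 * every open descriptor, and the kernel reclaims the file only
 * when the last descriptor is closed.  Returns 1 if the file
 * behaved as described, 0 otherwise. */
int unlinked_file_survives(void)
{
    char path[] = "/tmp/unlinktest.XXXXXX";
    struct stat sb;
    char c;
    int fd, fd2;

    if ((fd = mkstemp(path)) == -1)
        return 0;

    fd2 = dup(fd);           /* stand-in for the receiver's copy */
    unlink(path);

    if (stat(path, &sb) == 0)      /* the name must be gone ...   */
        return 0;

    if (write(fd, "x", 1) != 1)    /* ... but the data is not     */
        return 0;

    close(fd);                     /* first close: file lives on  */

    if (pread(fd2, &c, 1, 0) != 1 || c != 'x')
        return 0;

    close(fd2);                    /* second close: file deleted  */
    return 1;
}
```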
The results indicate that while sending and receiving a large message takes roughly twice as much time as sending and receiving a minimal message, around 40us extra in this case, the additional system calls required to read the large message from the file referenced by the small message add another 200us. Thus while we may save 40us by sending a small message rather than a large one, we later incur an extra cost of 200us to retrieve the missing data by other means, putting us in a deficit of around 160us per message exchange. The majority of this deficit is caused by the read and write system calls, accounting for almost 130us of the extra 200us cost.
This also corresponds well with the times observed while running our test program 10 times in each mode. In that case, we found that the total system plus user time was roughly three times larger for the small message/file based mode, compared to large message mode.
The conclusion to this problem will probably be internal, as it will most likely be specific to our own product, how it is designed, and how much an increase in the communication overhead between Barefoot's internal processes will affect the overall network performance of Barefoot.
Likewise, possible optimisations and removal of some of the extra system calls added in order to transfer the information via a filehandle, should this be implemented in a future version of Barefoot, will likely also be somewhat specific to the Barefoot server.
What can be said however, based on the measurements, is that simply changing things from sending a large message, to sending a small message with a filehandle referencing data in the large message, is something we cannot do without considering the performance impact further.
It is an operation that is only done during session initiation, and with five message exchanges done in order to completely set up a network session, the small-message/file-based mode should only add about 1ms to the total setup time. The total setup time would normally be several orders of magnitude larger than this, normally dominated by TCP session setup between Barefoot and the perhaps remote client, and then between Barefoot and the perhaps remote target server.
Still, it would be preferable to do this setup part optimally too, so more research and testing is required before a decision can be made.
Update: July 23, 2015
More research and testing led to an alternative prototype being created.
This alternative is based on creating a shared memory area with mmap(2) and using the area as a queue between the sending and the receiving processes.
With fcntl(2)-based locking (historically used in Barefoot for portability reasons, though different and more efficient mechanisms undoubtedly exist on various platforms today), the performance of this solution was roughly half that of the socket-based solution. So while still inferior, it was a considerable improvement over the file-based solution. Further optimisation of the locking (not locking every time) improved the performance to roughly 3/4 of the socket-based solution.

Out of curiosity, the socket-based solution was also tested in a mode where each message was gzip-ed upon sending and gunzip-ed upon receiving. The performance in this mode was roughly 1/40 of the uncompressed socket mode.
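A heavily simplified sketch of this alternative, with a single message slot standing in for the real queue and fcntl(2) record locks serialising access (all names are illustrative, and error handling is minimal):

```c
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* One message slot stands in for the real queue; head/tail
 * indices and blocking on a full queue are omitted. */
struct slot {
    size_t len;
    char   data[4096];
};

/* Take (F_WRLCK) or drop (F_UNLCK) an exclusive fcntl(2) record
 * lock covering the whole backing file. */
static int slot_lock(int fd, short type)
{
    struct flock fl = { .l_type = type, .l_whence = SEEK_SET };

    return fcntl(fd, F_SETLKW, &fl);
}

/* Create an unlinked backing file, size it, and map it shared,
 * so that processes forked afterwards share the slot.  Returns
 * the mapping, with the file descriptor (needed for locking)
 * stored in *fdp. */
struct slot *slot_create(int *fdp)
{
    char path[] = "/tmp/shmqueue.XXXXXX";
    int fd = mkstemp(path);

    if (fd == -1)
        return MAP_FAILED;

    unlink(path);
    if (ftruncate(fd, sizeof(struct slot)) == -1) {
        close(fd);
        return MAP_FAILED;
    }

    *fdp = fd;
    return mmap(NULL, sizeof(struct slot), PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
}

/* Copy a message into the slot under the lock. */
int slot_put(struct slot *s, int fd, const void *msg, size_t len)
{
    if (len > sizeof(s->data) || slot_lock(fd, F_WRLCK) == -1)
        return -1;

    memcpy(s->data, msg, len);
    s->len = len;

    return slot_lock(fd, F_UNLCK);
}
```

Since fcntl locks are attached to the open file, not to memory, the same descriptor used for the mmap(2) backing file can serve as the lock, which keeps the sketch portable.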