
Server capacity estimation and stress testing

This page provides guidelines for how to make practical estimates on the capacity and resource usage of a specific Dante server running on a specific machine.

The intent is to provide the Dante operator with the necessary documentation and tools to test and dimension the UNIX server hardware running Dante for the client load expected in production.

By stress testing the Dante server on hardware identical, or at least very similar, to what Dante will run on in production, the possibility of problems occurring in production should be considerably reduced. It also provides the operator with the opportunity to do extensive experimentation with a deterministic client load, making it easier to fine-tune the hardware and system settings outside of production.

While the tools described on this page have been developed by Inferno Nettverk A/S for testing the Dante SOCKS server specifically, there should be no dependencies on Dante in the tools themselves. You are welcome to also use them for testing other SOCKS servers. Comments and improvements can be sent to the public dante-misc mailing list. Please see the mailing list page for details.

Some of the tests rely on services such as echo and discard, which are typically run via inetd. Note that some inetd implementations fork a new process for each new client connection. If many clients connect, this might place a significant load on the target server, potentially having a negative impact on performance. It is recommended to monitor the load on the machine running these services during testing, to verify that it is not being overloaded, as this could reduce the accuracy of the capacity estimation.

If you need to run tests with, for example, many thousands of connections, and find that the machine running inetd is being overloaded, contact Inferno Nettverk and we will provide you with a copy of a more scalable implementation of the chargen/discard/echo services.

Quick start

This section provides a simple example extracted from the text below, without the more detailed context, and will likely be sufficient for most users. If more extensive capacity estimation is desired, consult the subsequent sections of this document.

The maxconn.pl program can be used to provide a quick estimate of how many simultaneous sessions a given Dante configuration can handle, and can be used as follows:

maxconn.pl -w 100 -b 10.0.0.1 -c 2000 -s 10.0.0.2:1080

This command will cause the program to bind to the address 10.0.0.1 (on the local machine), and then attempt to connect to the socks server expected to be running at 10.0.0.2 (port 1080) in order to open up to 2000 connections. It will maintain the successfully opened connections for 100 seconds before exiting.

Before running the command, verify, e.g. via ulimit, that enough descriptors are available for the requested number of connection attempts. It must also be possible for the Dante server to open connections to the address specified with -b (here 10.0.0.1).
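
For example, with a Bourne-type shell (option syntax and permitted values vary between shells and systems), the limit can be checked and raised before starting maxconn.pl:

#check and, if needed, raise the descriptor limit for this shell
ulimit -n        #show the current descriptor limit
ulimit -n 4200   #2000 connections need at least 2 x 2000 descriptors, plus slack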

The resulting output might vary a little depending on the Dante version in use, but the script should print output like the following before exiting:

...
maxconn.pl: opened 1952/2000 connection(s)
maxconn.pl: sleeping
maxconn.pl: terminating, 1958 connections accepted in total

In this case, the server was able to successfully forward 1952 connections, meaning the current Dante configuration should be able to handle 1952 concurrent client sessions. The actual number might vary depending on the client type (see the text below for details), but this value should give a fairly accurate estimate.

The maximum number of client sessions that can be accepted is determined by either the descriptor or the process limit for the user running the Dante server; if the number of sessions reported by maxconn.pl is lower than desired, increase these limits. If the number of accepted sessions is too high (e.g., because the machine runs out of physical memory before reaching this number of sessions), reduce the descriptor limit for the user running the Dante server.
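
Both limits can typically be inspected and raised with ulimit in the shell the Dante server is started from. A minimal sketch, using bash option letters (some shells use -p for the process limit, and raising hard limits may require root):

ulimit -n 8192   #max descriptors for processes started from this shell
ulimit -u 512    #max processes (bash; option letters vary between shells)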

Dante resource usage

The resources consumed by Dante are primarily determined by two types of factors:

  • Static factors, such as:
    • Build time constants (typically values set in Dante's config.h-file).
    • Run time flags (e.g., use of the -N option when starting Dante).
    • Server configuration file values set in sockd.conf (e.g., network socket buffer sizes).
    • OS settings (e.g., default network socket buffer sizes; see the example after this list).
  • Dynamic factors, primarily:
    • The number of client sessions.
    • The types of client sessions (packet payload sizes, packet frequency).
    • The current state of client sessions (SOCKS negotiation/data forwarding).
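
As an example of the OS settings mentioned above, the default socket buffer sizes, which affect the kernel memory used per session, can typically be inspected with sysctl. The names below are the Linux ones; BSD-type systems use names like net.inet.tcp.sendspace and net.inet.tcp.recvspace:

sysctl net.core.rmem_default net.core.wmem_default   #default receive/send buffer sizes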

Using the tools provided on this page, Dante can be stress tested in order to determine the capacity of the Dante server on a given machine. Stress testing also makes it easier to examine how Dante and the machine it runs on behave when the server is highly loaded or approaching maximum capacity.

Dante server capacity

In order to understand the client capacity of Dante, and why it is difficult to provide an exact number for, e.g., how much memory Dante will need to handle a given number of SOCKS clients, or how many clients Dante can handle on a machine with given hardware characteristics, it is beneficial to have some knowledge about the internals of the Dante server.

Dante handles clients by using several dedicated process types, with each process type handling different tasks. The processes are structured hierarchically, normally with one mother process accepting new SOCKS clients, and multiple dedicated child processes handling the SOCKS request processing and data transmission. There are three main types of child processes, and SOCKS clients that are not blocked will pass through each process type in this order:

  • negotiate processes: used for the first-level ACL (client-rules) and subsequent SOCKS protocol negotiation between Dante and SOCKS clients.
  • request processes: used for the second-level ACL (socks-rules) and finalising the steps required for the client request before active I/O can start (e.g., connecting to the requested target, or waiting for a bind reply).
  • i/o processes: used for transferring data between the SOCKS client and its target.

Normally only a little time will be spent in the negotiate and request processes, while the majority of the time and resource usage will be spent in the i/o process. In this document, we also refer to a SOCKS client as being in the corresponding phases, depending on what Dante process is currently involved in handling the SOCKS client. A SOCKS client can thus be in the negotiate phase, the request phase, or the i/o phase.

Each child process can handle a hard-coded number of clients only, which is set at compile time, but there is no hard-coded limit on the number of child processes a mother process can create. The number of child processes a Dante mother process can create is limited only by the resources available to the user the Dante server is running as. The Dante mother process will create and terminate child processes as needed, according to the number and type of SOCKS clients it is currently serving.

This in effect means that there is no hard-coded limit on the number of clients the Dante server can handle, even though there is a hard-coded limit on the number of clients each Dante child process can handle.
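
Since the number of child processes varies with the load, it can be useful to observe how many sockd processes are running at any given time. For example (process titles and ps flags vary between platforms and Dante versions):

ps axww | grep '[s]ockd' | wc -l   #count mother and child sockd processes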

Since each of the three process types has somewhat different resource requirements, the resource requirements for handling a given SOCKS client will vary according to what phase the SOCKS client is currently in, as well as which SOCKS request the client issues; the Dante resource requirements for handling TCP CONNECT, TCP BIND, and UDP ASSOCIATE SOCKS requests all differ.

Since different SOCKS clients/requests in different phases require different Dante resources, giving an exact number for how many clients a given Dante server can handle is difficult; it depends on the type of clients and what phase they are in at a given time. The tools described here provide a way to make a reasonable, but not exact, estimate of how many SOCKS clients Dante can handle when running on a specific machine.

If there are other heavy processes running on the machine, behavior naturally becomes less predictable, as it is possible that other processes will prevent Dante from acquiring the system resources it needs. For example, Dante might be unable to create the processes it needs at a given time because another process on the same machine has already created many processes. In this text, we ignore these factors and assume that Dante is the main application running on the tested machine, and that the resources allowed to the user running Dante will be available when they are required. In other words, we assume that the Dante server is limited by the per-user resource limits, not the global system-wide limits. Essentially, having other resource intensive applications running on the same machine as Dante will reduce the resources available to Dante, reducing the maximum number of clients Dante can handle.

To aid the operator in estimating how the resource limits in effect affect a Dante server, Dante, upon startup (or upon receiving a SIGUSR1/SIGINFO signal), calculates and logs an estimate of the number of clients that it can handle, and what it expects to be the first limiting factor. This is an example of such a log message:

 max limits:
   processes: 72 (53 free),
   files: 190 (135 free),
   negotiate-child-slots: 5088,
   request-child-slots: 53,
   i/o-child-slots: 1696
   (max clients limited by process limit)

Output like the above can be found in the Dante log of recent Dante versions. It first shows an estimate of the process and file limits and how much free capacity there currently is relative to these limits. In this case, the process and file limits were set to 72 and 190, respectively, for the shell the Dante server was started from. At the time this information was logged, the Dante server estimated that it would be able to create 53 additional processes and open 135 additional descriptors in the main mother process, in addition to the processes and file descriptors Dante had already created or opened.

These values are estimates based on the shell user limits set for Dante. If there are many non-Dante processes running on the machine, the global system limit for either processes or descriptors might be reached before the user limit.
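
The global limits can be inspected with standard OS tools. For example, on Linux (BSD-type systems expose similar values through sysctl names like kern.maxfiles and kern.maxproc):

cat /proc/sys/fs/file-max   #system-wide descriptor limit
sysctl fs.file-nr           #allocated, free, and maximum descriptor counts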

Using the above estimates for the maximum number of files and processes, Dante estimates that it can create either 5088 additional slots for SOCKS protocol negotiation, 53 additional slots for request processing or 1696 additional slots for i/o.

Each SOCKS client will occupy one slot, but which type of slot it occupies will vary according to the phase the client is in at any given time. It will initially occupy a negotiate slot, then proceed to using a request slot, and finally end up using an i/o slot until the client session is terminated.

Since all process types share the same machine resources, adding, e.g., new i/o slots will reduce the resources available for the other slot types, but Dante attempts to maintain a good balance between the number of available slots of the various types at any time.

As the negotiate and request processes are only used during initial processing of each SOCKS client, the limits in effect should allow the server to handle 1696 more clients, assuming that the current number of negotiate and request processes are mostly sufficient to handle the initial phases for each new client.

The log output above also specifies that the number of clients that can be handled is currently limited by the process limit for the user. To allow the server to handle additional clients, this limit is the first that will need to be increased.

When the resource limits for the Dante user are almost reached (e.g., if there are fewer than 10 descriptors left), the estimated values for the remaining slots might become inaccurate, as it might be difficult to actually make use of the remaining descriptors.

Practical server capacity

The process and descriptor limits represent hard limits on the resources available to the Dante server. When these limits are reached, the OS will return an error if Dante attempts to create additional processes or file descriptors. There are, however, also other limits that might reduce the number of clients that can be handled without performance being significantly affected.

To achieve good performance, it is necessary to also consider the CPU and the amount of physical memory on the machine Dante is running on.

Memory consumption

Each process and socket created by Dante to handle client requests consumes memory on the machine Dante is running on. If the amount of memory used starts exceeding the amount of physical memory available on the machine, the machine will start using on-disk swap, which might result in a significant reduction in performance.

Dante allocates very little memory dynamically. This is both to avoid memory leaks and to make Dante's memory consumption predictable. As noted above, in the section on server capacity, there are some variations in resource consumption depending on the types of clients that are currently active and on the phase these clients are currently in, and this will also have an effect on memory consumption.

However, unless most client sessions are very short, the majority of clients will have completed SOCKS negotiation during normal usage and will be in the i/o phase, which makes it simpler to estimate resource requirements.

Observing the memory usage of Dante while gradually increasing the number of active sessions until the hard descriptor or process limit is reached should give a good estimate of the maximum memory requirements of Dante, and of how many active clients it can handle.
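
How memory usage is best observed varies by platform. A minimal example using common Linux tools (use the equivalents provided by your platform):

vmstat 5    #non-zero si/so columns indicate active swapping
free -m     #physical memory and swap usage, in MB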

We provide a tool called maxconn.pl to be used for this purpose. It will run in a loop, and for each pass through the loop, bind a TCP port and then attempt to connect to itself on this port via the Dante server, like a regular SOCKS client. All successfully established connections will be maintained by maxconn.pl until it terminates.

After creating the specified number of SOCKS sessions, maxconn.pl will delay for a specified number of seconds before terminating. During this delay, the Dante operator may examine the state of the machine Dante is running on to see if the amount of free physical memory is adequate, or alternatively, consider adding more RAM to the machine before it is put in production.

The maxconn.pl program can be used as follows:

#INTIP="..."       # IP address of host maxconn.pl is run on.
#SOCKSSERVER="..." # IP address of Dante SOCKS server.
#SOCKSPORT="..."   # port number of Dante SOCKS server.
CONNCNT=2000       # number of SOCKS sessions to create.
WAITDUR=100        # number of seconds to wait before terminating.

SOCKS="-s $SOCKSSERVER:$SOCKSPORT"
maxconn.pl -w $WAITDUR -b $INTIP -c $CONNCNT $SOCKS

Note that the Dante server must be able to connect to INTIP, which is the address that maxconn.pl will bind to, and which needs to be an address on the machine maxconn.pl is started on. The descriptor limit for the user running maxconn.pl might also need to be increased if CONNCNT is set to a high value. E.g., if you want maxconn.pl to create 10,000 SOCKS sessions, maxconn.pl will need to open at least 20,000 file descriptors (one for connecting to Dante, and one for accepting the connection from Dante), plus an additional handful of file descriptors for other usage.

To estimate the memory consumption of a fully loaded Dante server, follow these steps:

  • Estimate the maximum number of concurrent SOCKS client sessions expected in production.
  • Set the CONNCNT value to the expected capacity, or preferably slightly higher. E.g., 10 percent higher.
  • Run maxconn.pl as specified above.
  • Verify that the desired maximum number of client sessions is reached and that it is close to the expected value (see below for how to verify this).
  • On the machine Dante is running on, check how much memory is being used via the system tools provided by the vendor. If the machine has started swapping, or is close to starting to swap, consider either making appropriate changes to Dante's config.h-file in an attempt to reduce memory usage, possibly at the expense of performance, or, preferably, increasing the amount of RAM on the machine.

To estimate the likely highest possible memory usage of the Dante server with a specific descriptor/process resource limit, set CONNCNT to a value larger than the maximum i/o-child slots value logged by Dante when it starts (see above). Then follow the steps above.
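
To get a rough per-session estimate, the total resident memory of the sockd processes can be sampled while maxconn.pl holds the sessions open, and divided by the number of sessions. A sketch using Linux procps syntax (RSS is reported in KB; memory shared between the processes makes this an overestimate):

ps -C sockd -o rss= | awk '{ sum += $1 } END { print sum " KB total" }'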

The number of successfully forwarded connections is logged by maxconn.pl. Note that this number will not simply be the number of created SOCKS sessions, but the number of SOCKS sessions that have progressed through all phases and are now in the i/o phase, ready to start transferring data. Below is sample output from a maxconn.pl run:

maxconn.pl: connection open: 10/2000
maxconn.pl: got connection (10/2000)
maxconn.pl: connection open: 20/2000
maxconn.pl: got connection (20/2000)
maxconn.pl: connection open: 30/2000
maxconn.pl: got connection (30/2000)

...

maxconn.pl: connection open: 1940/2000
maxconn.pl: got connection (1940/2000)
maxconn.pl: connection open: 1950/2000
maxconn.pl: got connection (1950/2000)
maxconn.pl: SOCKS request timeout
maxconn.pl: SOCKS request timeout
maxconn.pl: SOCKS request timeout
maxconn.pl: SOCKS request timeout
maxconn.pl: SOCKS request timeout
maxconn.pl: SOCKS request timeout
maxconn.pl: too many failures (6), aborting
maxconn.pl: opened 1952/2000 connection(s)
maxconn.pl: sleeping
maxconn.pl: terminating, 1958 connections accepted in total

In this example, the SOCKS requests start failing before the requested number of connections is reached. The maxconn.pl program keeps each connection open until the program terminates, and initiates one connection at a time, which eventually results in all i/o child process slots being filled without any resources available to create new slots. The Dante server will then wait to see if any resources free up, but this never happens here because maxconn.pl does not terminate any connections. The requests are eventually aborted after a client-side timeout triggers.

The result is that 1952 connections are successfully forwarded, while an additional 6 connections are initially accepted by the Dante server, but have to be terminated before they can be used, as Dante does not have sufficient resources available.

In production, some variation can be expected depending on the types of SOCKS clients that can connect, but the Dante server in this example should be able to handle about 2000 active client sessions before reaching a hard resource limit.

The timeout lines indicate that maxconn.pl had problems initiating new connections to Dante, as can be expected when Dante is unable to accept more clients. In the example above, this was due to the descriptor limit being reached on the system Dante was running on, but this cannot be observed in the logs on the client side.

CPU usage

The primary task of the Dante server is to forward data between two endpoints. This involves potentially decoding and encoding data (if encryption is used), and moving data between sockets. This can be a CPU intensive operation when data is transmitted at high rates or when there are many active clients. Processing new clients and adapting to changes in the client load also requires some CPU involvement. As the CPU resources on any machine are limited, this means that the CPU might also become a bottleneck in some cases.

Ensuring that the machine is sufficiently powerful to be able to handle the client load expected in a given usage scenario typically requires stress testing. There are many different ways to run a stress test, so it is useful to have some knowledge of the types of traffic that the server will experience during production usage. The server should ideally be able to comfortably handle the expected load, as well as occasional spikes in traffic.

We provide tools that can be used for this type of stress testing. Ideally, the client load generated by the stress test should match the actual load expected in production, but emulating the type of traffic and the client behavior found in production will often be difficult. It will nevertheless be beneficial to generate a client load that is as similar as possible to what is expected in production.

Some of the factors that should be considered when creating a load for a stress test:

  • The number of clients expected.
  • The types of clients expected (outgoing TCP connections, accepting incoming TCP connections, or UDP traffic).
  • The traffic type expected (interactive, bulk data transfer, etc.).
  • The rate of client churn expected, i.e., typical client durations. Will most of the client sessions be short-lived or long-lived?

While the stress test is running, the operator can observe the following:

  • The effect of the stress test on the machine Dante is running on. This includes the effect on the Dante server itself, as well as kernel/system usage, and can be observed via programs like top(8), or more platform specific tools. Does the stress test overload the hardware and Dante, or is there still a reasonable amount of resources, like CPU and memory, available?
  • Are any relevant messages logged by Dante? Dante might report clients being dropped due to lack of resources, or that some operations take so long that Dante suspects the machine might be overloaded.
  • What is the effect on the performance seen by Dante's clients?
    • Transfer rate (for bulk data transfer clients).
    • Round Trip Time (RTT) (for interactive or request/response-type clients).
    • Session set-up time (how long does it take the client to establish a connection to the target server via Dante).

One way to measure the effect on the SOCKS clients is to perform the load generation for the stress test on one set of machines, and the client-side measurements on another set of machines. In other words:

  • One or more machines generate a high load, creating many client sessions on the Dante server.
  • One or more machines simultaneously create a small number of sessions on the Dante server for measuring latency, throughput, and similar.

This division is generally necessary because generating a high load on the Dante server might result in the client machines generating the load themselves becoming highly loaded. This will in turn make reliable measurement of latency or throughput on these client machines difficult. Simultaneously using a different set of client machines for latency or throughput measurements avoids this conflict.

To generate a high load on the Dante SOCKS server, we provide the multiclient.pl tool. It is typically used together with standard UNIX discard and echo target servers. Both discard and echo servers are typically available on most UNIX platforms via inetd, xinetd, or similar.
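
If the services are not already enabled, they can typically be activated through inetd. The following is a sketch of classic inetd.conf entries for the built-in implementations (the exact syntax varies slightly between platforms, and xinetd or systemd-based systems use their own configuration formats); send inetd a SIGHUP after editing to reload the configuration:

echo     stream  tcp  nowait  root  internal
discard  stream  tcp  nowait  root  internal
chargen  stream  tcp  nowait  root  internal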

The possible arguments to multiclient.pl are as follows:

 -b <file>          : read addresses to bind to from filename <file>
 -L <host>:<port>   : bind local end to <host>:<port> before connecting
 -l <sec>           : test duration in seconds
 -O iplist          : set list of option28 socket options before connect
 -f <format>        : if <format> is "csv", log in csv format
 -r <rate>          : in rtt mode, enables rate limiting of packets per second
 -s <host>:<port>   : connect through socks server at <host>:<port>
 -t <file>          : read target address list from filename <file>

Typically, using only the -l option to specify the program execution time, and the -s option to specify the Dante server address, will be sufficient.

The main purpose of the tool is to emulate different types of client behavior in a scalable way. A traffic specification is used to control various behavior parameters. The specification contains a series of values separated by ":" characters that specify how the clients should behave, in the following compact format:

<type>:<clientcnt>:<proto>:<clients/sec>:<host>:<port>[:clients/process][:id]

These parameters specify the basic client behavior, the number of clients, the target host they should connect to, the rate at which they should be created, etc.

Understanding all the details of the format is not necessary to deploy the tools, as the examples below show specific command lines which can be used for the basic tests. For the interested reader, the following is however a description of the traffic specification format, with a concrete example after the list:

  • type: client type: send|recv|sendrecv|rtt|connect
    • send: data sending client (host:port should be a discard server)
    • recv: data receiving client (host:port should be a chargen server)
    • sendrecv: data send and receive (host:port should be an echo server)
    • rtt: data latency measuring client (host:port should be an echo server)
    • connect: connect latency measuring client (host:port can be any server)
    • connectbyte: as 'connect', but include rtt time with a single byte
    • connectsec: as 'connect', but sleep for one second before closing
  • clientcnt: number of clients to start.
  • proto: protocol type, only tcp supported currently.
  • clients/sec: rate at which new client connections are opened.
  • host:port: hostname/IP address and port number clients attempt to connect to. The host field can include usernames and passwords in the following format: user;password@host
  • clients/process: optional value, specifies the number of connections/process. Will default to using only a single process if not given.
  • id: optional value, can be used to name client when logging.
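
As a concrete illustration of the format, using hypothetical addresses (a SOCKS server at 10.0.0.2 and an echo server at 10.0.0.3), the following would create 300 rtt-type clients over TCP, opened at a rate of 50 per second, with 50 clients per process:

SPEC="rtt:300:tcp:50:10.0.0.3:echo:50"
multiclient.pl -v -s 10.0.0.2:1080 -l 60 $SPEC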

Some combinations of arguments might not yet be supported, but some examples of how the tool can be used with currently supported argument combinations are described below.

Note that the multiclient.pl program is used in the description below to both generate a client load (on one set of machines) and to observe the effect of the client load (on another set of machines). In other words, multiclient.pl should be run independently on at least two machines, unless another tool is used to measure latency or throughput.

The examples below are based on shell scripts that allow control of the clients by changing variables. There are some variations, but some common elements are found in each script.

  • The SOCKSSERVER and SOCKSPORT variables should be changed to contain the address of the SOCKS server to be tested. They are combined in the SOCKS variable, which can be commented out to have the clients connect directly to the target server without going through the SOCKS server.
  • The DUR variable controls the combined runtime of the clients. All connections will be terminated once this limit is reached.
  • The number of clients is specified via the *CLIENTS variable (with the variable name changing depending on the client type, RTTCLIENTS, etc.). Related variables are *RATE and *CLIENTSPERPROC. The *RATE value specifies how many new connections should be opened per second; the client attempts to reach this number, but there is no guarantee that the value will be reached if it is set higher than the machine can handle. The *CLIENTSPERPROC value specifies how many clients should be handled in a single process, and can be used to improve performance on machines with multiple CPUs/CPU cores. Having many clients in a single process will likely reduce the performance of each individual client, but it also reduces the total number of processes when there are many clients, as having many processes will likely also reduce performance. The optimal value will likely vary between hardware configurations; splitting the load between available CPU cores should give better performance than running all clients in a single process.

    A configuration with 125 clients, 25 clients per process and a connect rate of 5 will result in five processes with 25 clients each. The connect rate will be spread among the processes; the aggregated connect rate will be 5 connections per second, with each process having a connect rate of 1 connection per second. The client will need to be started with a duration of at least 25 seconds for all the requested connections to be established.

    Some client types (such as the connectbyte type) support a connect rate of zero, which will result in the rate being automatically set based on the *CLIENTS, *CLIENTSPERPROC and DUR values, such that connections are established continuously during the entire runtime of the program.

  • The *TARGETSERVER and *TARGETPORT values specify the target server that the client should connect to (via the SOCKS server).

Multiple client specifications can be given to the program, which will cause it to create extra processes as needed to handle each specification.

Various data files will be created in the current directory when multiclient.pl is executed, so it should be run from a directory where files can be created by the user running the program.

Generating traffic: clients with small payloads

Traffic received by Dante requires processing by both the machine Dante is running on and by Dante itself. Even when the payload is small, there will be overhead from tasks such as scheduling, interrupt handling, etc., and the example below shows how clients that produce only this type of load can be generated with multiclient.pl.

The rtt client-type creates clients that each send a one byte payload, wait until a reply is received, and then send another byte, etc. If the target server is an echo server, the result will be single byte payloads transmitted at the max speed/lowest latency possible. The resulting load on the Dante machine should come primarily from client handling, and not from copying large amounts of data. By default, there will be no delay between a byte being received and the next byte being sent; the packet rate will only be limited by the capacity of the system.

The following configuration will result in attempts to open 50 new rtt client sessions each second, until 300 connections have been established. The clients should terminate 60 seconds after the program started.

#start many clients sending single byte payloads in a loop to an echo server
SOCKSSERVER="..."               #address of socks server to test
SOCKSPORT="1080"                #port number of socks server
DUR="60"                        #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

RTTCLIENTS="300"                #number of client connections to generate
RTTRATE="50"                    #number of connections to open each second
RTTTARGETSERVER="..."           #address of target server
RTTTARGETPORT="echo"            #target server port
RTTCLIENTSPERPROC="50"          #number of client connections per process

RTTSPEC="rtt:$RTTCLIENTS:tcp:$RTTRATE:$RTTTARGETSERVER:$RTTTARGETPORT:$RTTCLIENTSPERPROC"
multiclient.pl -v $SOCKS -l $DUR $RTTSPEC 2>mc.stderr

Upon termination, the program will print some status information for each process, like the following:

[proc 1]: rtt type client (0), runtime 60.07s, 50/50 connection(s) attempted opened
[proc 1]: (requested connect rate of 8.33 conns/sec)
[proc 1]: connections made using SOCKS server 10.0.0.1:1080
[proc 1]: 50 connect requests, 50 to target succeeded (100.00%), 0 failures (0.00%)
[proc 1]: 50 requests to socks server succeeded (100.00%)

In the example above, all connection attempts to the SOCKS server succeeded, and the SOCKS server was able to successfully forward all connections to the target server.

In the case of errors, a summary with a symbolic code for the error cause will be printed. In the example below, a connection refused error is reported by the SOCKS server, resulting in all forwarding attempts to the target server failing:

[proc 1]: rtt type client (0), runtime 60.52s, 50/50 connection(s) attempted opened
[proc 1]: (requested connect rate of 8.33 conns/sec)
[proc 1]: connections made using SOCKS server 10.0.0.1:1080
[proc 1]: 50 connect requests, 0 to target succeeded (0.00%), 50 failures (100.00%)
[proc 1]: 50 requests to socks server succeeded (100.00%)
[proc 1]: failure overview:
[proc 1]:  SOCKS5REQCONNREFUSED: 50

A more detailed trace of the operations performed by the client, possibly along with more detailed information in the case of errors, can be found in the mc.stderr file.

Generating traffic: clients with small payloads (rate limited)

In some situations, it might be desirable to scale the capacity of a machine based on an expected rate of traffic. For example, when using a virtual machine, it might be beneficial to only allocate as many resources as are necessary to handle an expected load. By running the rtt type client load with a specified packet send rate, the number of CPU cores available to the virtual machine can be adjusted to ensure that the machine can handle the expected packet rate for the expected number of simultaneous clients. Note that traffic consisting mainly of large data payloads will place an additional load on the system.

To do this type of testing, it is necessary to know the expected number of simultaneous clients, and the total number of packets with a data payload that these clients are expected to generate. For example, 10,000 packets per second, spread among 500 clients. By simulating the expected load, it can be verified that the machine running Dante is able to handle the load from process switching and I/O processing. It is important that both the number of clients and the packet rate used in the test correspond to the expected load, as the number of simultaneous clients affects the number of processes used by Dante, and thereby the overhead from switching processes.

The -r multiclient.pl option can be used to place an upper limit on the number of send operations that should be initiated per second. When communicating with a target echo server, this will give a packet rate for the Dante server corresponding to roughly twice the specified send rate. This assumes that the system is able to send and forward the data at the requested rate; if the specified rate is too high, the actual rate will be lower.

The following configuration will result in 100 SOCKS clients that run for 60 seconds and during that time generate a best-effort combined send rate of 1000 packets per second, each with a single-byte payload. This should result in around 2000 I/O operations per second for the Dante server, when the data returning from the target echo server is included. In other words, the packet rate given to the multiclient.pl tool should be half of the expected packet rate that is being tested.

The send rate is specified in the RTTSENDRATE variable:

#start many clients sending single byte payloads in a loop to an echo server
SOCKSSERVER="..."               #address of socks server to test
SOCKSPORT="1080"                #port number of socks server
DUR="60"                        #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

RTTCLIENTS="100"                #number of client connections to generate
RTTSENDRATE="1000"              #rtt mode outgoing bytes/second
RTTRATE="$RTTCLIENTS"           #number of connections to open each second
RTTTARGETSERVER="$TESTSERVIP"   #address of target server
RTTTARGETPORT="echo"            #target server port
RTTCLIENTSPERPROC="150"         #number of client connections per process

RTTSPEC="rtt:$RTTCLIENTS:tcp:$RTTRATE:$RTTTARGETSERVER:$RTTTARGETPORT:$RTTCLIENTSPERPROC"
multiclient.pl -r $RTTSENDRATE -v $SOCKS -l $DUR $RTTSPEC 2>mc.stderr

Upon termination, the program will print some status information for each process, like the following:

[proc 0]: rtt type client (0), runtime 60.07s, 100/100 connection(s) attempted opened
[proc 0]: (requested connect rate of 100.00 conns/sec)
[proc 0]: connections made using SOCKS server 172.30.0.34:13884
[proc 0]: 100 connect requests, 100 to target succeeded (100.00%), 0 failures (0.00%)
[proc 0]: 100 requests to socks server succeeded (100.00%)
[proc 0]: average send rate: 974.133/sends per second
[proc 0]: (requested rate: 1000.000/sends per second for process)
[proc 0]: (estimated total send/receive rate: 1948.267/packets per second for process)

In the example above, only one process was used, and an average of 974.133 single byte data transfers were initiated per second. This is close to the requested rate of 1000, and should have resulted in a packet rate of around 1948 single-byte payload packets per second for the machine with the Dante server, spread among the 100 clients. There is some startup time involved, so the achieved rate being slightly lower than the requested value is to be expected. Note that if the traffic is divided between multiple processes, the rate for each process will be shown, and the total should then correspond to the requested rate.

If the achieved rate is significantly lower than the requested rate, this indicates that there is a bottleneck in the system, either at the client side, the Dante server machine, or the target server. If the client and target servers are run on different machines than the Dante server, and are not the source of the bottleneck, the machine running the Dante server will likely need additional resources to handle traffic at the requested rate. For a virtual machine, assign additional CPU cores. For a physical machine, better hardware (server-type NIC/faster CPU/additional CPU cores) might be necessary.

Generating traffic: short-lived clients

For long-lived connections, the overhead from SOCKS protocol processing, connection setup, etc., will likely be trivial, but it might make up a significant part of the lifetime of short-lived connections. Despite being short-lived, these connections still increase the load on the SOCKS (and target) servers.

The connectbyte client-type creates short-lived clients that open a connection, send a single byte, wait for it to be received back, and then close the connection. The target server is expected to be an echo server. The resulting load on the machine running the Dante server should come primarily from operations involved in SOCKS processing and connection setup.

The following configuration will result in a total of 3000 connectbyte client sessions being created over a period of 60 seconds. The load will be divided among 6 processes (500 clients handled per process). The connection rate value being set to zero ensures that the connection rate is set automatically based on the max number of connections and the total runtime.

#start many short-lived clients sending single byte payload to echo server
SOCKSSERVER="..."                 #address of socks server to test
SOCKSPORT="1080"                  #port number of socks server
DUR="60"                          #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

SHORTCLIENTS="3000"               #number of client connections to generate
SHORTRATE="0"                     #rate set to zero; calculated automatically
SHORTTARGETSERVER="$TESTSERVIP"   #address of target server
SHORTTARGETPORT="echo"            #target server port
SHORTCLIENTSPERPROC="500"         #number of client connections per process

SHORTSPEC="connectbyte:$SHORTCLIENTS:tcp:$SHORTRATE:$SHORTTARGETSERVER:$SHORTTARGETPORT:$SHORTCLIENTSPERPROC"
multiclient.pl -v $SOCKS -l $DUR $SHORTSPEC 2>mc.stderr

Note that some echo servers (e.g., ones run via inetd) might fork a new process for each new client. If many clients are requested, this might place a significant load on the target server, potentially having a negative impact on performance.

Upon termination, the program will print some status information for each process, like the following:

[proc 1]: connectbyte type client (0), runtime 60.75s, 500/500 connection(s) attempted opened
[proc 1]: (requested connect rate of 8.33 conns/sec)
[proc 1]: connections made using SOCKS server 195.139.68.35:9806
[proc 1]: 500 connect requests, 500 to target succeeded (100.00%), 0 failures (0.00%)
[proc 1]: 500 requests to socks server succeeded (100.00%)

A more detailed trace of the operations performed by the client, possibly along with more detailed information in the case of errors, can be found in the mc.stderr file.

Generating traffic: clients doing bulk data transfer

A more CPU intensive client load can be generated using clients that transmit a continuous stream of data. This will create additional overhead on the Dante server from data copy operations. The example below shows how this type of load can be generated with multiclient.pl.

The send client-type creates clients that write data as fast as possible. Data is only written, never read, and it is expected that the target server will be a discard server. The resulting load on the machine running the Dante server should come both from interrupts due to packets being received, etc., and from the size of the payload in the packets.

The following configuration will result in 20 new send client sessions being opened each second, until 100 connections have been established. The clients should terminate 60 seconds after the program started.

#start many clients sending steady streams of data to a discard server
SOCKSSERVER="..."                #address of socks server to test
SOCKSPORT="1080"                 #port number of socks server
DUR="60"                         #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

SENDTARGETSERVER="..."           #address of target server
SENDTARGETPORT="discard"         #target server port
SENDCLIENTSPERPROC="25"          #number of client connections per process
SENDCLIENTS="100"                #number of client connections to generate
SENDRATE="20"                    #number of connections to open each second

SENDSPEC="send:$SENDCLIENTS:tcp:$SENDRATE:$SENDTARGETSERVER:$SENDTARGETPORT:$SENDCLIENTSPERPROC"
multiclient.pl -v $SOCKS -l $DUR $SENDSPEC 2>mc.stderr

Note that some discard servers (e.g., ones run via inetd) might fork a new process for each connection to the discard server. If many clients are requested, this might place a significant load on the target server, potentially having a negative impact on performance. There are however also discard servers that do not require a new process to be forked.

Upon termination, the program will print some status information for each process, like the following:

[proc 1]: send type client (0), runtime 60.62s, 25/25 connection(s) attempted opened
[proc 1]: (requested connect rate of 5.00 conns/sec)
[proc 1]: connections made using SOCKS server 10.0.0.1:1080
[proc 1]: 25 connect requests, 25 to target succeeded (100.00%), 0 failures (0.00%)
[proc 1]: 25 requests to socks server succeeded (100.00%)

A more detailed trace of the operations performed by the client, possibly along with more detailed information in the case of errors, can be found in the mc.stderr file.

Generating traffic: combined load with multiple client types

To emulate the behavior on a server with more varied types of traffic, it is possible to combine multiple traffic specifications. The example below will construct a client load that combines send, rtt and connectbyte clients to create a combined load on a SOCKS server.

The duration and SOCKS server values are shared, the rest can be set for each client type as desired. In the example below, the total client duration is 60 seconds, with 50 bulk data send clients, 200 small payload rtt clients, and 1000 short-lived connectbyte clients.

SOCKSSERVER="..."                #address of socks server to test
SOCKSPORT="1080"                 #port number of socks server
DUR="60"                         #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

#open many clients sending a steady stream of data to a discard server
SENDCLIENTS="50"                 #number of client connections to generate
SENDRATE="25"                    #number of connections to open each second
SENDTARGETSERVER="..."           #address of target server
SENDTARGETPORT="discard"         #target server port
SENDCLIENTSPERPROC="50"          #number of client connections per process

SENDSPEC="send:$SENDCLIENTS:tcp:$SENDRATE:$SENDTARGETSERVER:$SENDTARGETPORT:$SENDCLIENTSPERPROC"

#open many clients sending single byte payload to echo server in loop
RTTCLIENTS="200"                 #number of client connections to generate
RTTRATE="25"                     #number of connections to open each second
RTTTARGETSERVER="..."            #address of target server
RTTTARGETPORT="echo"             #target server port
RTTCLIENTSPERPROC="50"           #number of client connections per process

RTTSPEC="rtt:$RTTCLIENTS:tcp:$RTTRATE:$RTTTARGETSERVER:$RTTTARGETPORT:$RTTCLIENTSPERPROC"

#open many short-lived clients sending single byte payload to echo server
SHORTCLIENTS="1000"              #number of short lived clients to create
SHORTRATE="0"                    #RATE set to zero; calculated automatically
SHORTTARGETSERVER="..."          #address of target server
SHORTTARGETPORT="echo"           #target server port
SHORTCLIENTSPERPROC="250"        #number of client connections per process

SHORTSPEC="connectbyte:$SHORTCLIENTS:tcp:$SHORTRATE:$SHORTTARGETSERVER:$SHORTTARGETPORT:$SHORTCLIENTSPERPROC"

multiclient.pl -v $SOCKS -l $DUR $SENDSPEC $RTTSPEC $SHORTSPEC 2>mc.stderr

Measuring the effect on clients: latency

The rtt client type can also be used to measure latency. The target server should be an echo server. The latency, as measured by the application, is the time elapsed from when a byte is sent by the application to the echo server, via Dante, until the same byte is received back from the echo server, again via Dante.

The measured latency includes the time spent on scheduling, kernel processing, and other similar tasks on the local machine the latency is being measured on. This means the measured latency will be somewhat less accurate, and probably slightly higher, than what it would be if the network traffic on the link had been measured directly. It should however still provide a useful indication of latency, especially if the client is run on a lightly loaded machine.

The following configuration will result in a single client that will continuously measure the latency of a single byte going to the target echo server, for a duration of 60 seconds.

#start single client sending single byte payload in a loop to an echo server
SOCKSSERVER="..."            #address of socks server to test
SOCKSPORT="1080"             #port number of socks server
DUR=60                       #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

CLIENTS="1"                  #number of client connections to generate
RATE="1"                     #number of connections to open each second
TARGETSERVER="..."           #address of target server
TARGETPORT="echo"            #target server port
CLIENTSPERPROC="500"         #number of client connections per process

SPEC="rtt:$CLIENTS:tcp:$RATE:$TARGETSERVER:$TARGETPORT:$CLIENTSPERPROC"
multiclient.pl -v $SOCKS -l $DUR $SPEC 2>mc.stderr

The latency information is logged once each second, summarising the RTT values measured during the last second. The data is logged to the file log-0-rtt-rtt-0.dat, generated in the directory from which multiclient.pl is being run, in a space separated format with the following fields:

  • time: Timestamp with time when data was logged.
  • id: Identifier for connection (can be ignored).
  • fileno: File descriptor value (can be ignored).
  • median: Median RTT, in seconds.
  • average: Average RTT, in seconds.
  • stddev: RTT standard deviation.
  • minimum: Lowest measured RTT value, in seconds.
  • maximum: Highest measured RTT value, in seconds.

The following is an example of possible log output, taken from a LAN:

1423020577.64567 grande.inet.no:rtt:0 10 0.000288 0.000294 0.000164 0.000188 0.001806
1423020578.64586 grande.inet.no:rtt:0 10 0.000288 0.000312 0.000226 0.000171 0.004561
1423020579.64588 grande.inet.no:rtt:0 10 0.000287 0.000295 0.000191 0.000216 0.009559

For the first second, the median RTT was 0.000288 seconds (or 0.288 ms). The average RTT was 0.294 ms, the stddev was 0.164 ms, etc.

Data for a simple histogram of median RTT values (from a different data set than above) can for example be generated with the following command:

$ cat log-0-rtt-rtt-0.dat | awk '{ print $4 }' | sort -n | uniq -c | sort -n -k2
      7 0.000191
     13 0.000192
      3 0.000193
      6 0.000194
      1 0.000195
      2 0.000196
      7 0.000197
      2 0.000198
      5 0.000199
      6 0.000200
      4 0.000201
      4 0.000202

Note again that the RTT value measured includes not only the forwarding time of Dante, but also the transmit time and scheduling/application response time of the client and target server. To isolate the delay added by Dante, a more complicated configuration would be needed (which is not documented here).

Measuring the effect on clients: transfer rate

The send client type can also be used to measure transfer rates. The target server should be a discard server. The throughput, as measured by the application, is based on the rate at which it is able to transmit data. This value will initially be affected by buffering at each TCP endpoint, and for this reason, the first few measurement seconds should normally be ignored. After the first few seconds have passed, the buffers should be full and the measurements should be based on the rate at which data can be transmitted through the Dante server. The client can be affected by scheduling and similar load related factors and should ideally be run on a lightly loaded machine.

The following configuration will result in a single client that will continuously measure the amount of transmitted data going to the target discard server, for a duration of 60 seconds.

#start single client sending steady stream of data to a discard server
SOCKSSERVER="..."            #address of socks server to test
SOCKSPORT="1080"             #port number of socks server
DUR="60"                     #number of seconds to run clients
SOCKS="-s $SOCKSSERVER:$SOCKSPORT" #comment out to connect directly to target

CLIENTS="1"                  #number of client connections to generate
RATE="1"                     #number of connections to open each second
TARGETSERVER="$TESTSERVIP"   #address of target server
TARGETPORT="discard"         #target server port
CLIENTSPERPROC="1"           #number of client connections per process

SPEC="send:$CLIENTS:tcp:$RATE:$TARGETSERVER:$TARGETPORT:$CLIENTSPERPROC"
multiclient.pl -v $SOCKS -l $DUR $SPEC 2>mc.stderr

Information on the number of bytes written is logged once each second. The data is logged to the file log-0-send-bw-0.dat, in a space separated format with the following fields:

  • time: Timestamp with time when data was logged.
  • id: Identifier for connection (can be ignored).
  • fileno: File descriptor value (can be ignored).
  • readbytes: Total number of bytes read during last second.
  • wrotebytes: Total number of bytes written during last second.

The following is an example of possible log output, taken from a LAN:

1423021036.90436 grande.inet.no:send:0 9 0 7864320
1423021037.90442 grande.inet.no:send:0 9 0 104857600
1423021038.90468 grande.inet.no:send:0 9 0 117112832
1423021039.90535 grande.inet.no:send:0 9 0 118030336
1423021040.9062 grande.inet.no:send:0 9 0 117309440

During the first second, 7,864,320 bytes were transmitted, during the second second, 104,857,600 bytes were transmitted, etc.
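
A simple awk one-liner can, for example, convert the per-second write counts to Mbit/s, skipping the first few ramp-up samples as suggested above:

$ awk 'NR > 5 { printf "%s %.1f\n", $1, $5 * 8 / 1000000 }' log-0-send-bw-0.dat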

Stress test tool deployment

To properly design a stress test scenario, the production environment that the Dante server will be used in should be taken into consideration, but some general factors should always be considered.

Ideally, the stress test will include the following elements, all simultaneously running on separate machines:

  • A machine running the Dante server (Dante server). Measuring the performance and behavior of this machine and the Dante server will typically be the main purpose of the stress testing.
  • One or more machines generating a large load to stress the Dante server (stress test clients), for example, by running the multiclient.pl script as described above. Having these machines on the same LAN as the Dante server machine will generally be unproblematic and simplifies generating large amounts of traffic (at the cost of not having traffic behavior matching that of WAN traffic).
  • A target server for the stress test clients (stress test target server), running echo, discard, etc. servers, as necessary, to which connections can be forwarded using the Dante server.
  • One or more machines running multiclient.pl, or a similar tool, to measure latency or throughput for a small number of clients when the Dante server is placed under load (performance measurement clients). These machines should ideally be lightly loaded and connect to the Dante server in a way that matches, as closely as possible, how the clients that will use the Dante server in production connect.
  • A target server for the latency/throughput measurement clients (performance measurement target server), again running echo/discard servers, or the type of application server that will be used in production.

The examples above show how two types of stress test clients can be created: clients frequently sending small payloads and clients sending bulk data. These client types can be deployed as appropriate based on the types of load expected in production; using either only one type, or by choosing a ratio between the two. For example, a small number of bulk data transfer clients and a larger number of small payload clients.

There is much that could be said about running this type of test. One important challenge is ensuring that the stress test traffic that is generated is meaningful. A stress test will typically attempt to emulate the traffic of many clients, most running on different machines, with a much smaller number of stress test machines. Creating many connections and sending large amounts of traffic is relatively simple, especially if the stress test clients are located on the same LAN as the Dante server. This will however likely create network behavior that is entirely unrealistic, with both the network and all the machines being overloaded, and not representative of what would be the case in production, even under high load.

One way to avoid this problem is to start with a small number of clients on the stress test client machines, and then gradually increase the number of clients. Ideally, both the stress test client machines and the target server machines should be monitored with regards to memory usage, CPU usage, etc., and not only the Dante server machine. The stress test target server machines, especially, can easily become overloaded and produce unwanted behavior unless care is taken. However, as long as no machines are too overloaded, it should be unproblematic to gradually increase the stress test load and get meaningful measurements.

Running a stress test essentially involves repeating a sequence of choosing a stress test load level, monitoring all involved machines and networks while measuring the performance of clients and the Dante server machine, and then repeating the process with slightly different parameters until the capacity of the Dante server has been reached or exceeded:

  • Determine the range of acceptable performance for the clients.
  • First run the performance measurement clients without any stress test load. This should give baseline performance values that represent the best possible performance with the current software and hardware combination. If the performance is outside the acceptable range without any load, it will likely be impossible to achieve acceptable performance when load is added.
  • Test the stress test clients/servers without the Dante server by connecting directly to the stress test servers, in order to observe the effect the stress client load will have on the stress test machines, and to check if the stress test servers are able to handle the load.
  • Start the stress test clients with a small number of clients and run the performance tests again.
  • Increase the stress test load and run the performance tests again, while observing the changes in client performance and the CPU usage of, at least, the machine running Dante.
  • Repeat the above step until either the client capacity is reached, the performance clients report performance outside the accepted range, or the load from the stress test clients overloads the network.

The result should ideally be an indication that the Dante server configuration, and the machine Dante is running on, is able to handle the load expected in production with a comfortable margin and acceptable performance. If this is not the case, first consider whether the stress test clients are imposing a greater load on the system than actual clients would in production. Frequent packet loss and retransmission of TCP packets is likely an indication that the load from the stress test clients is overloading some part of the network topology.

Moving the stress test clients or stress test target servers further away from the Dante server, or adding some type of bandwidth shaping or artificial latency increase between the stress test machines and the Dante server, might give behavior that is more realistic with the client numbers that are used. One should keep in mind that in production, it will rarely be the case that both the SOCKS clients, the target servers, and Dante, are all on the same LAN. Having all three network elements on the same LAN during testing will make it much simpler to accidentally overload parts of the topology, compared to what may be the case in production, where at least part of the network topology is likely to be outside the local network.

However, if the conclusion is that the system is unable to handle the desired number of clients with acceptable performance, either the configuration of the Dante server or the hardware of the machine it runs on will need to be improved. In general, our experience is that the Dante server will perform better on a machine with separate network interface cards (NICs) for Dante's internal and external interfaces. Especially when there are large amounts of bulk data transfers, the NIC might become a bottleneck. High quality server NICs generally give better performance than vanilla desktop/motherboard integrated NICs. If the CPU appears to be a bottleneck, adding a faster CPU with a larger number of CPU cores might also help; the Dante server, with its multi-process architecture, should scale very well with many CPU cores.

If behavior is observed that points to a bug or problem in Dante, please report this to the dante-bugs mailing list (please see the Dante mailing list page for details).

Good luck stress testing!


Copyright © 1998-2024 Inferno Nettverk A/S