
Traffic monitoring

The Dante server forwards network traffic between different sets of machines. As a result, some types of network problems that affect the forwarded traffic will result in observable changes in network behavior. This includes behavior such as a large number of TCP connections being terminated due to a server being rebooted, or traffic no longer being received or sent. The monitoring functionality in the Dante server allows different alarms to be set, which will result in warnings being logged if these types of situations are detected.

For reverse proxy topologies, or topologies where the SOCKS clients are used to only access a limited set of target servers, which are always expected to be available, please also see the error logging functionality documentation for how to enable logging of routing and related errors returned by the kernel.

Traffic monitoring

The alarms are specified in so-called monitors. These objects have the same general format as the rules Dante uses for access control. However, monitors are completely independent from the access control rules, and only perform passive monitoring of network traffic, or the lack of network traffic.

The following example shows the general monitor syntax, using a monitor that does not perform any actual monitoring operations:

monitor {
   from: 0.0.0.0/0 to: www.example.org port = 80
   protocol: tcp
}

The example has a from address that matches all IPv4-addresses, a to address that matches only the host www.example.org, port 80, and a protocol keyword that limits monitoring to TCP traffic. This monitor can be used to monitor TCP traffic from all connecting clients (via Dante) to the host www.example.org on port 80. Any other traffic passing through Dante will be ignored by this monitor.

A monitor can include many of the same keywords that are available in the Dante ACL rules. The following subset is currently supported:

  • from - Normally specifies what client addresses/networks to monitor.
  • to - Normally specifies what target addresses/networks to monitor.
  • protocol - Can be used to restrict monitoring to a certain protocol (TCP, UDP or both). Note: only TCP should be used for now.
  • hostid - Can be used to restrict monitoring to only clients with a specific hostid value set.
  • hostindex - Used along with the hostid keyword to control which of the two possible hostid values will be used when matching.

NOTE: It is currently recommended that the protocol keyword is always specified and set to tcp, because there is currently only limited support for monitoring of UDP traffic, and only limited testing of UDP traffic monitoring has been done.

A monitor can be mostly empty, as in the example above, in which case no actual monitoring will be performed. The main function of monitors is to provide a container for one or more alarms, which are specified using a new set of keywords not available for other objects. These keywords can be used to make Dante log warnings for different types of unexpected network behaviour.

Network behaviour that can cause the currently supported alarms to trigger includes the following:

  • Periods of no or little data being transmitted.
  • Many TCP connections disconnecting during a short period of time.

The keywords for the different alarms are described further down.

Active TCP sessions will at most match one monitor, but multiple alarms can be specified in a single monitor. This makes it possible to specify multiple sets of conditions for the same TCP sessions, depending on what network interface the traffic is transferred on and whether the traffic is being received or transmitted.

Data idleness detection

For machines or networks that are expected to continuously send or receive traffic, a period of no or little traffic being transmitted or received might be an indication of a network problem. To allow these types of situations to be detected it is now possible to enable data alarms in the Dante monitors using the alarm.data keyword.

Adding an alarm.data keyword to a monitor will result in warnings being logged if there are periods with too little network traffic.

Dante has four network paths and data alarms can be configured independently for each of them:

  • internal.alarm.data.recv - Data received on Dante internal interface. In practice this means data sent from the SOCKS clients to Dante.
  • internal.alarm.data.send - Data sent out on Dante's internal interface. In practice this means data sent from Dante to the SOCKS clients.
  • external.alarm.data.recv - Data received on Dante external interface. In practice this means data sent from the target servers to Dante.
  • external.alarm.data.send - Data sent out on Dante's external interface. In practice this means data sent from Dante to the target servers.

The alarm.data keyword takes two parameters: a byte count and a duration in seconds. The alarm will trigger if the specified number of seconds pass with only the specified number of bytes (or fewer) being transmitted.

The syntax is as follows:

internal.alarm.data.recv: DATALIMIT in INTERVAL

  • The DATALIMIT is a number that specifies the byte limit.
  • The INTERVAL is a number that specifies the duration in seconds.

If only DATALIMIT bytes (or less) have been transferred during a period of INTERVAL seconds, an alarm will trigger in Dante.
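The trigger condition above can be sketched in Python as follows. This is only an illustrative model of the rule as described, not Dante's implementation; the function names parse_data_alarm and data_alarm_triggers are hypothetical.

```python
def parse_data_alarm(value):
    """Split a 'DATALIMIT in INTERVAL' value into two integers."""
    datalimit, keyword, interval = value.split()
    if keyword != "in":
        raise ValueError("expected 'DATALIMIT in INTERVAL'")
    return int(datalimit), int(interval)


def data_alarm_triggers(nbytes, datalimit):
    """Return True if the bytes seen during one INTERVAL do not exceed DATALIMIT.

    nbytes is assumed to be the byte count observed during the last
    INTERVAL seconds on the monitored interface and direction.
    """
    return nbytes <= datalimit


datalimit, interval = parse_data_alarm("10240 in 20")
print(data_alarm_triggers(10240, datalimit))  # True: exactly at the limit
print(data_alarm_triggers(10241, datalimit))  # False: above the limit
```

Note that the comparison is inclusive: a limit of 0 triggers only when not a single byte was transferred during the interval.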

The following is an example of a configuration where Dante is expected to always receive at least some traffic on the internal network interface. At most there should be a 10 second pause without any data being received:

internal.alarm.data.recv: 0 in 10

If no data, not even one byte, is received by Dante on its internal interface during a period of 10 seconds, an alarm will trigger in Dante. In this example the alarm is for Dante's internal network interface, on which it would typically have connectivity to the SOCKS clients.

The following is an example with a data limit of 10240 bytes and a duration of 20 seconds:

external.alarm.data.recv: 10240 in 20

On this network, the operator expects that during normal operation Dante will always receive more than 10240 bytes on the external interface in any 20 second period. Should there be a 20 second period where Dante has received only 10240 bytes or less, an alarm will trigger.

In this example the alarm is for Dante's external network interface, on which it would typically have connectivity to the target servers.

Placed in a monitor, the full expression for this alarm can be expressed using this syntax:

monitor {
   from: 0.0.0.0/0 to: www.example.org port = 80 
   protocol: tcp

   # warn in case 20 seconds pass where only 10240 bytes have been 
   # received from the target server www.example.org port 80.
   external.alarm.data.recv: 10240 in 20
}

The above monitor will apply only to TCP traffic received from the server www.example.org on Dante's external network interface. It will not consider traffic sent to the server www.example.org, or the traffic received from the SOCKS clients.

Multiple alarms can be specified in more complicated monitors:

monitor {
   from: 0.0.0.0/0 to: www.example.org port = 80 
   protocol: tcp

   # warn if only 10240 bytes have been received from target server
   # www.example.org port 80 during a period of 20 seconds.
   external.alarm.data.recv: 10240 in 20

   # warn if only 1024 bytes have been sent to the clients during
   # a period of 20 seconds.
   internal.alarm.data.send: 1024 in 20
}

Data alarms trigger when a period of data idleness has been detected. Once a data alarm has triggered, it will remain active until it is cleared. A warning will be logged when the alarm triggers and then again when the alarm condition is cleared. In between these two points no warnings related to this alarm will be logged. This avoids repeating the same alarm/warning multiple times during network problems that last for an extended amount of time. When the alarm is cleared, Dante will also include information about how long the alarm condition lasted.

A data alarm can be cleared in two ways:

  • Automatically, once enough data has been transferred in a short enough amount of time.
  • Manually, by sending the Dante server a SIGHUP signal. A SIGHUP will cause all active alarms to be cleared. No log messages indicating that the alarms have cleared will be logged when alarms are cleared in this way.

Once an alarm has been cleared, it can trigger again if enough data is not being transferred.

Using the previous example:

external.alarm.data.recv: 10240 in 20

An alarm will trigger if only 10240 bytes or less have been received by Dante on the external network interface during the last 20 seconds.

If, after the alarm has triggered, more than 10240 bytes of data is received on the external interface during a period of 20 seconds, Dante will clear the alarm and log that the alarm has been cleared using the same log level at which it logged the alarm triggering.
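The trigger/clear cycle described above can be modelled as a small state machine. The following Python sketch is a simplified illustration under the assumption that the byte count for the last INTERVAL seconds is checked at regular points in time; the class DataAlarm and its methods are hypothetical names, not part of Dante.

```python
class DataAlarm:
    """Toy model of a data alarm that reports only state changes."""

    def __init__(self, datalimit):
        self.datalimit = datalimit
        self.active = False

    def observe(self, nbytes):
        """Feed the byte count for the last INTERVAL seconds.

        Returns "triggered" or "cleared" on a state change, and None
        while the state stays the same (no repeated warnings).
        """
        if not self.active and nbytes <= self.datalimit:
            self.active = True
            return "triggered"
        if self.active and nbytes > self.datalimit:
            self.active = False
            return "cleared"
        return None

    def sighup(self):
        """Clear the alarm silently, as described for SIGHUP."""
        self.active = False


alarm = DataAlarm(10240)
events = [alarm.observe(n) for n in (0, 0, 20000, 20000, 5000)]
print(events)  # ['triggered', None, 'cleared', None, 'triggered']
```

The second idle period in the example produces no event because the alarm is already active, which mirrors the single warning logged per alarm condition.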

Note that alarms will also trigger shortly after server startup if the Dante server does not receive sufficient traffic to prevent the alarms from triggering.

Data alarms will trigger regardless of whether there are active sessions matching the monitor or not; if enough data is not being transmitted or received, a data alarm will trigger.

The following format is used for the data alarm warnings:

warning: monitor(MONNUM): alarm/data STATE: MONSRC -> MONDST TYPE: NBYTES/DATALIMIT in INTERVALs. Session count: SESSIONS

The keywords have the following meaning:

  • MONNUM - Monitor number; the first monitor is numbered 1, the second 2, etc.
  • STATE - Whether the alarm has triggered or been cleared.
  • MONSRC - The value of the from keyword in the monitor.
  • MONDST - The value of the to keyword in the monitor.
  • TYPE - Identifier for the network interface triggering/clearing the alarm (internal.recv/internal.send/external.recv/external.send). Corresponds to the keyword used when the alarm was specified.
  • NBYTES - The number of bytes transmitted during the last INTERVAL seconds before the alarm triggered/cleared. Will be a value equal to or less than DATALIMIT if the alarm triggers, and a value higher than DATALIMIT if the alarm is being cleared.
  • DATALIMIT - The data limit specified in the alarm.
  • INTERVAL - The length of the alarm interval in seconds. Corresponds to the interval given when the alarm was specified.
  • SESSIONS - The number of TCP sessions matching the monitor at the time of the alarm.

The following is an example of a monitor and the corresponding warning that is produced when the second alarm triggers:

monitor {
   from: 0.0.0.0/0  to: 0.0.0.0/0 
   protocol: tcp

   internal.alarm.data.recv: 1 in 2
   external.alarm.data.recv: 1 in 2
}

warning: monitor(1): alarm/data [: 0.0.0.0/0 -> 0.0.0.0/0 external.recv: 0/1 in 2s. Session count: 0

From the warning fields it can be seen that the alarm triggered because no data was received during the last two seconds. No sessions were active when the alarm triggered.

When the monitor clears due to enough data having been transferred, the log message can look like this:

warning: monitor(1): alarm/data ]: 0.0.0.0/0 -> 0.0.0.0/0 external.recv: 2/1 in 2s. Session count: 1. Alarm duration: 4s

As can be seen, '[' is used to specify that a data alarm triggers, while ']' is used to specify that it has cleared; the former indicates that an error condition has occurred, the latter that the error condition has ended. The message logged when the alarm is cleared also specifies how long the alarm lasted. In this case the alarm was active for four seconds, at which point two bytes had been received during the last two seconds and there was one active session matching the monitor.

Note that the message indicating that an alarm has cleared is not logged if the alarm was cleared due to a SIGHUP signal being received.
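For log post-processing, the data alarm warnings can be picked apart with a regular expression. The following Python sketch is based only on the format and the two example messages shown above; the function name parse_data_alarm_warning and the dictionary keys are hypothetical, and the pattern is an assumption that may need adjusting for other log prefixes or Dante versions.

```python
import re

# Pattern derived from the documented warning format; the optional
# "Alarm duration" part is only present when an alarm is cleared.
DATA_ALARM_RE = re.compile(
    r"monitor\((?P<monnum>\d+)\): alarm/data (?P<state>[\[\]]): "
    r"(?P<monsrc>.+?) -> (?P<mondst>.+?) "
    r"(?P<type>(?:internal|external)\.(?:recv|send)): "
    r"(?P<nbytes>\d+)/(?P<datalimit>\d+) in (?P<interval>\d+)s\. "
    r"Session count: (?P<sessions>\d+)"
    r"(?:\. Alarm duration: (?P<duration>\d+)s)?"
)


def parse_data_alarm_warning(line):
    """Return a dict of fields for a data alarm warning, or None."""
    match = DATA_ALARM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # '[' marks a triggered alarm, ']' a cleared one.
    fields["state"] = "triggered" if fields["state"] == "[" else "cleared"
    for key in ("monnum", "nbytes", "datalimit", "interval", "sessions"):
        fields[key] = int(fields[key])
    if fields["duration"] is not None:
        fields["duration"] = int(fields["duration"])
    return fields


line = ("warning: monitor(1): alarm/data ]: 0.0.0.0/0 -> 0.0.0.0/0 "
        "external.recv: 2/1 in 2s. Session count: 1. Alarm duration: 4s")
print(parse_data_alarm_warning(line)["duration"])  # 4
```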

Abnormal rate of connection termination detection

If a large number of connections are terminated within a short period of time, this is also a possible indication of a connectivity or network problem, perhaps due to a remote network server/proxy crashing.

By using the alarm.disconnect keyword the Dante server can log a warning when this type of situation occurs.

There are two variants of the alarm keyword, one for the internal network interface, between the SOCKS clients and Dante, and one for the external interface, between the Dante server and the target servers:

  • internal.alarm.disconnect - Connections between clients and the Dante server.
  • external.alarm.disconnect - Connections between the Dante server and the target servers.

Each alarm keyword takes three parameters: a minimum count, a ratio value, and a time interval. The following format is used:

internal.alarm.disconnect: MINCOUNT/RATIO in INTERVAL

  • The MINCOUNT is the minimum number of connections that must be disconnected for the alarm to trigger.
  • The RATIO is used together with the MINCOUNT to express the number of connections, relative to the total number of connections that have existed in the time period, that must be disconnected for the alarm to trigger.

    For example, assume the operator wants an alarm to trigger if half of the connections active during a time interval are disconnected, but only if there are at least 200 such disconnects. In this case the MINCOUNT should be given as 200 and the RATIO should be given as 400, resulting in the value 200/400, corresponding to half, or 50%, of the active connections.

    Here the operator does not care about cases where there have been, for example, only two sessions and one of them disconnects, even though this one session would also constitute half of all the sessions; there must be at least 200 disconnects.

  • The INTERVAL is the time in seconds within which the disconnects must occur for the alarm to trigger.

The following is an example of an alarm that will trigger if one third of all connections between Dante and the target server are disconnected within 15 seconds, but only if the number of disconnected connections amount to at least 1000:

external.alarm.disconnect: 1000/3000 in 15

If there are fewer than 1000 disconnects, or less than 33 percent of all connections that existed during this period are disconnected, no alarm will trigger. An alarm will also not trigger if the 1000 disconnects occur over a period of time that is longer than 15 seconds.
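The two conditions, a minimum count and a minimum ratio, can be expressed as a short Python sketch. It is an illustrative model of the rule as described here, not Dante's code; the function name disconnect_alarm_triggers and the parameter connections (the total number of connections that existed during the interval) are hypothetical.

```python
def disconnect_alarm_triggers(disconnects, connections, mincount, ratio):
    """Model of a MINCOUNT/RATIO check for one INTERVAL.

    disconnects: connections disconnected on this side during the interval.
    connections: all connections that existed during the interval.
    """
    if disconnects < mincount:
        return False
    # disconnects / connections >= mincount / ratio, written with
    # integer multiplication to avoid floating point rounding.
    return disconnects * ratio >= mincount * connections


# external.alarm.disconnect: 1000/3000 in 15
print(disconnect_alarm_triggers(1000, 3000, 1000, 3000))  # True
print(disconnect_alarm_triggers(999, 1000, 1000, 3000))   # False: below MINCOUNT
print(disconnect_alarm_triggers(1000, 4000, 1000, 3000))  # False: only 25 percent
```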

The following should be noted:

  • To set values that are useful, some knowledge about the expected amount of network traffic and number of sessions is required.

    If the rate of disconnects, as a percentage, is lower than the ratio specified, an alarm will not trigger.

    Conversely, if the MINCOUNT is set too low, alarms might trigger too frequently because only a small number of disconnects might be sufficient to achieve the required number of disconnects and disconnect ratio at times when there are only a few active sessions.

    For example, having all connections to a target server suddenly disconnect during one second will usually indicate a serious problem, but specifying an alarm using the values below would likely result in many false alarms:

    external.alarm.disconnect: 1/1 in 1
    

    If there is only one active session at a given time, the above alarm will trigger if the target server related to that single session disconnects. Requiring more sessions to disconnect in order for the alarm to trigger can avoid this problem.

  • Only connections that are terminated on the specified interface are counted, i.e., an external.alarm.disconnect alarm will only trigger for connections that are terminated on the network interface between the Dante server and the target server, either by the target server closing the connection to Dante or by Dante receiving a fatal network error from that side of the connection (e.g., a TCP RST packet).

    Connections that are closed on the internal interface (by the SOCKS clients) will not count towards a disconnect alarm on the external side. Likewise, connections closed by target servers will not count towards a disconnect alarm on the internal side.

    A practical consequence of this is that if a large number of connections are simultaneously closed by both the client and the target server, each connection will only be counted as a disconnect on one of the sides; either the external side or the internal side, depending on which side closes the connection first.

A complete monitor with two disconnect alarms can look like this:

monitor {
   from: 0.0.0.0/0 to: www.example.org port = 80 
   protocol: tcp

   # warn if 1/3 or more sessions disconnect during a period of five seconds,
   # but require a minimum of 1000 disconnects on either side.
   internal.alarm.disconnect: 1000/3000 in 5
   external.alarm.disconnect: 1000/3000 in 5
}

The above monitor will apply to TCP connections to the server www.example.org. If at least 1000 sessions on either the internal or external network interface side disconnect during a period of five seconds, and these 1000 disconnects constitute at least 33% of the connections to www.example.org port 80 that existed during these five seconds, an alarm will trigger.

An alarm will trigger regardless of whether the disconnects occurred on the connections between the clients and Dante, or between Dante and the target servers, but each alarm counts only the disconnects on its own side, so at least 1000 disconnects are required on at least one of the sides. If there are 3000 sessions to www.example.org port 80, and 500 of these disconnect on the external interface (from www.example.org), while 500 disconnect on the internal interface, no alarm will trigger, because neither side reaches the minimum of 1000 disconnects.
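The per-side counting can be illustrated with a short Python sketch. It assumes that each disconnected connection is attributed to exactly one side, the side that closed first, as described earlier; the function name sides_triggering and its parameters are hypothetical and only model the behaviour described in this section.

```python
from collections import Counter


def sides_triggering(closed_sides, connections, mincount, ratio):
    """Return the sides whose disconnect alarm would trigger.

    closed_sides holds one "internal" or "external" entry per
    disconnected connection, naming the side that closed first.
    """
    counts = Counter(closed_sides)
    return [
        side
        for side in ("internal", "external")
        if counts[side] >= mincount
        and counts[side] * ratio >= mincount * connections
    ]


# 3000 sessions, 500 disconnects on each side: neither side reaches 1000.
print(sides_triggering(["external"] * 500 + ["internal"] * 500,
                       3000, 1000, 3000))  # []

# 3000 sessions, 1000 disconnects on the external side only.
print(sides_triggering(["external"] * 1000, 3000, 1000, 3000))  # ['external']
```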

Alarms trigger each time a sufficient number of disconnects occur. Each sufficiently large burst of disconnects will result in an alarm, but normally at most one warning per alarm will be logged during each time interval, though this might change in a later version of Dante.

Separate alarms are produced for each distinct alarm keyword.

The following format is used for the disconnect alarm warnings:

warning: monitor(MONNUM): alarm/disconnect ]: MONSRC -> MONDST TYPE: DISCONNECTS/SESSIONS disconnects during last INTERVALs. Session count: SESSIONS

The keywords have the following meaning:

  • MONNUM - Monitor number; the first monitor is numbered 1, the second 2, etc.
  • MONSRC - The value of the from keyword in the monitor.
  • MONDST - The value of the to keyword in the monitor.
  • TYPE - Identifier for the network interface triggering the alarm (internal/external). Corresponds to the keyword used when the alarm was specified.
  • DISCONNECTS - The number of disconnected connections in the time interval. Will be a value not lower than MINCOUNT.
  • SESSIONS - The number of TCP sessions matching this monitor at the time of the alarm.
  • INTERVAL - The alarm interval in number of seconds. Corresponds to the keyword used when the alarm was specified.

The following is an example of a monitor and the corresponding alarm that is produced when it triggers:

monitor {
   from: 0.0.0.0/0  to: 0.0.0.0/0
   protocol: tcp

   external.alarm.disconnect: 1/2 in 5
}

warning: monitor(1): alarm/disconnect ]: 0.0.0.0/0 -> 0.0.0.0/0 external: 1/1 disconnects during last 5s. Session count: 0

From the warning fields it can be seen that the alarm triggered because one connection was disconnected by the target server. The ratio in the alarm is 1/2, meaning that at least one connection must be disconnected and at least 50 percent of the total number of connections must terminate for the alarm to trigger.

With only a single connection present, one disconnected connection corresponds to 100 percent of all connections, and thus 1/1 is the ratio given in the warning, meaning all sessions disconnected.

At the time the alarm triggered, there were zero active sessions matching the monitor.
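The disconnect warnings can be parsed in a similar way when processing logs. The following Python sketch is derived only from the format and the example message shown above; the function name parse_disconnect_warning and the dictionary keys are hypothetical. Because the second number of the ratio (1 in the example) differs from the trailing session count (0 in the example), it is kept under the separate, assumed name total.

```python
import re

# Pattern derived from the documented disconnect warning format.
DISCONNECT_ALARM_RE = re.compile(
    r"monitor\((?P<monnum>\d+)\): alarm/disconnect \]: "
    r"(?P<monsrc>.+?) -> (?P<mondst>.+?) "
    r"(?P<type>internal|external): "
    r"(?P<disconnects>\d+)/(?P<total>\d+) disconnects during last "
    r"(?P<interval>\d+)s\. Session count: (?P<sessions>\d+)"
)


def parse_disconnect_warning(line):
    """Return a dict of fields for a disconnect alarm warning, or None."""
    match = DISCONNECT_ALARM_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("monnum", "disconnects", "total", "interval", "sessions"):
        fields[key] = int(fields[key])
    return fields


line = ("warning: monitor(1): alarm/disconnect ]: 0.0.0.0/0 -> 0.0.0.0/0 "
        "external: 1/1 disconnects during last 5s. Session count: 0")
fields = parse_disconnect_warning(line)
print(fields["type"], fields["disconnects"], fields["total"])  # external 1 1
```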


Copyright © 1998-2016 Inferno Nettverk A/S