Active network performance monitoring

CESNET technical report number 8/2007
also available in PDF, PostScript, and XML formats.

Sven Ubik
15 October 2007

1   Abstract

In this report we describe our experience with active performance monitoring in the CESNET network. We describe which performance characteristics can be measured by active monitoring, which tools we recommend and what experience we have gained in setting up active monitoring. We use a novel pragmatic approach to presenting loss and reordering and to measuring UDP throughput.

Keywords: network performance, active monitoring

2   Active monitoring

In active monitoring we send test packets into the network, capture these packets after they have passed through the network and analyse how these test packets were affected in terms of volume, time and error characteristics. Active monitoring can be considered a probe into the network.

In passive monitoring, on the other hand, we do not send test packets; instead we observe and analyze real network traffic. Passive monitoring can be considered a watch on the network.

Characteristics obtained from active monitoring are in principle only truly applicable to the test packets. It is not certain whether comparable characteristics are experienced by real network traffic. Some characteristics, such as packet loss rate, are known to differ significantly as experienced by test packets when compared to real network traffic (we will explain the case of packet loss later). With active monitoring we also cannot say what real traffic is there in the network. With passive monitoring we can in principle obtain any information about real network traffic. Therefore, passive monitoring is by nature more powerful than active monitoring.

Nevertheless, active monitoring is less expensive than passive monitoring, because it does not require powerful hardware to process large volumes of traffic in real time, and it can provide certain useful network characteristics.

In this report we concentrate on performance monitoring rather than operational monitoring. That is, we are interested in characteristics affecting the quality of data transfers, such as delay, packet loss, reordering and throughput, rather than indicators of running status, such as reachability of network nodes.

The rest of this report is organized as follows. Section 3 describes what network performance characteristics can be obtained from active monitoring. Section 4 provides more details about delay, loss and reordering monitoring. Finally, Section 5 describes our experience with throughput monitoring.

3   Performance characteristics

Important network performance characteristics that can be obtained from active monitoring are delay, packet loss, reordering and throughput. These characteristics are useful for two purposes. First, to assess the expected quality of data transfers. Second, as indicators of good network health: in continuous monitoring we should look for unusual values and sudden changes. The following sections describe these characteristics in more detail.

3.1   Delay

Delay can be measured as round-trip or one-way. Round-trip delay is most commonly measured by the well-known ping tool, which sends ICMP or UDP packets to a remote host, which returns a response in ICMP packets. ICMP support is a standard part of any network equipment with IP connectivity, so we do not need to install any special software on remote hosts.

For performance debugging it is more useful to know separately the one-way delay from the source to the destination and the one-way delay in the other direction. One-way delay monitoring requires precise time synchronization between the source and destination hosts and cooperating software on the destination host.

A formal definition of one-way delay is provided in [rfc-ippm] as the time period between the moment the first bit of a packet is sent by the source host and the moment the last bit of the packet is received by the destination host. It is technically difficult to obtain exact timestamps referring to these moments. Therefore, practical measurements are approximations that use timestamps taken close to the sending or receiving of a packet on a given host.

Accuracy of one-way delay monitoring is affected mainly by the following two factors (their usual magnitude is indicated in parentheses):

Typical one-way delay over an inter-city link is 1-5 milliseconds. Therefore, GPS-based synchronisation is required; when it is used, the measurement accuracy is on the order of tens of microseconds.

However, we found that on a highly loaded machine, the second factor can be on the order of milliseconds. This can be tested by sending packets at precise time intervals using a hardware packet generator or a hardware monitoring card (such as DAG) and receiving these packets on a monitoring station. If the magnitude of this fluctuation reaches a significant percentage of the typical one-way delay that we want to measure, we must use hardware monitoring cards that assign timestamps in hardware. In our case, we checked with a hardware generator that regular Ethernet cards with timestamps assigned in the operating system were acceptable.
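The check described above can be sketched as follows. This is only an illustration, assuming that packets were sent at a fixed nominal interval and that the receiver logged one timestamp per packet; the function name and the integer microsecond units are assumptions, not part of any tool used in this report.

```python
def max_interarrival_deviation(recv_timestamps_us, nominal_interval_us):
    """Largest deviation of an inter-arrival gap from the nominal interval.

    recv_timestamps_us: receiver timestamps in microseconds, in arrival order.
    nominal_interval_us: interval at which the generator sent the packets.
    """
    gaps = [later - earlier
            for earlier, later in zip(recv_timestamps_us, recv_timestamps_us[1:])]
    return max(abs(gap - nominal_interval_us) for gap in gaps)


# Hypothetical receiver timestamps for packets sent every 1000 microseconds
timestamps = [0, 1000, 2010, 2995]
print(max_interarrival_deviation(timestamps, 1000))
```

If the reported deviation is a significant fraction of the delay to be measured, timestamps assigned in software are probably not accurate enough on that station.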

The difference in delay between two consecutive packets is called IP packet delay variation (IPDV) or jitter [rfc-ipdv]. It is an important statistical characteristic for certain network applications, such as audio and video transmissions, and it is often considered a separate network performance characteristic.

Delay is the performance characteristic that is most conveniently measured by active monitoring. The implementation is much easier than with passive monitoring, and values measured by test packets apply well to real traffic on a given network path.
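A minimal sketch of how one-way delay and IPDV could be derived from logged timestamps, assuming synchronized clocks and one (sender timestamp, receiver timestamp) pair per packet. The function names and the sample values are illustrative only.

```python
def one_way_delays(records):
    """Return the one-way delay for each (send_ts, recv_ts) pair."""
    return [recv_ts - send_ts for send_ts, recv_ts in records]


def ipdv(delays):
    """Return the delay differences between consecutive packets."""
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Hypothetical timestamps in microseconds: (sender, receiver)
records = [(0, 2100), (10_000_000, 10_002_400), (20_000_000, 20_002_000)]
delays = one_way_delays(records)
print(delays)        # one-way delays in microseconds
print(ipdv(delays))  # delay variation between consecutive packets
```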

3.2   Loss

Packet loss can be measured as a singleton metric - a packet is either delivered or lost. More conveniently, it is usually expressed statistically, as a percentage of packets lost from the total number of packets sent over a certain time period [rfc-loss]. It is also interesting to measure the number of consecutive packets lost (loss period) or the number of packets sent between two detected losses (loss distance) [rfc-loss-dynamics], which can affect some applications.
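The following sketch shows how these loss statistics could be computed from sequence numbers. It follows the informal description above rather than the exact definitions in [rfc-loss-dynamics], and it assumes that packets are numbered from 0 and that the set of received sequence numbers is known; all names are illustrative.

```python
def loss_runs(sent_count, received):
    """Return (start_seq, length) for each run of consecutive lost packets."""
    received = set(received)
    runs = []
    run_start = None
    for seq in range(sent_count):
        if seq not in received:
            if run_start is None:
                run_start = seq          # a new loss run begins here
        elif run_start is not None:
            runs.append((run_start, seq - run_start))
            run_start = None
    if run_start is not None:            # loss run reaching the last packet
        runs.append((run_start, sent_count - run_start))
    return runs


def delivered_between_runs(runs):
    """Return the number of delivered packets between consecutive loss runs."""
    return [next_start - (start + length)
            for (start, length), (next_start, _) in zip(runs, runs[1:])]


received = [0, 1, 4, 5, 6, 8, 9]         # packets 2, 3 and 7 were lost
runs = loss_runs(10, received)
loss_percent = 100.0 * sum(length for _, length in runs) / 10
print(runs, delivered_between_runs(runs), loss_percent)
```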

Active monitoring of packet loss by test packets is easy, but the values obtained in this way only apply to test packets and are known not to correspond to packet loss experienced by real network traffic.

A short summary of the problem follows. Suppose that we send 10 test packets per second between two end points. It would take almost 3 hours to detect a packet loss rate of 10^-5 and more than a day for 10^-6. Such packet loss rates still significantly affect TCP throughput over fast long-distance networks. These calculations hold for evenly distributed packet loss. When packet loss occurs in bursts, which is the common case, it can take even longer to measure the packet loss rate with reasonable accuracy. We can look at the problem from another perspective. If we send a burst of 10 test packets, manage to catch a loss period and lose 5 out of 10 packets, for what time period was this 50% packet loss rate valid? The consequence is that packet loss experienced by test packets, no matter how many we send and in what patterns, cannot be directly related to the actual behaviour experienced by real network traffic.

Therefore, active packet loss monitoring should be considered only as an indicator of good network health, rather than an accurate measurement of the packet loss to be expected by user traffic. Measurement with test packets should normally indicate near zero packet loss. Steady loss of test packets usually indicates a network problem.
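The detection times quoted above follow from simple arithmetic: with evenly distributed loss, roughly 1/p packets must be sent to see one loss at loss rate p. A small sketch, with an illustrative function name:

```python
def seconds_to_observe_one_loss(loss_rate, packets_per_second):
    """Time needed to send 1/loss_rate packets at the given packet rate."""
    return 1.0 / (loss_rate * packets_per_second)


for loss_rate in (1e-5, 1e-6):
    hours = seconds_to_observe_one_loss(loss_rate, 10) / 3600
    print(f"loss rate {loss_rate:g}: about {hours:.1f} hours")
```

At 10 packets per second this gives about 2.8 hours for a loss rate of 10^-5 and about 27.8 hours for 10^-6.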

3.3   Reordering

Similarly to packet loss, packet reordering can be measured as a singleton metric - a pair of packets is either reordered or not - and statistically, as the percentage of reordered packets out of the total number of packets sent. Additionally, there are several metrics that quantify the extent of reordering in time and space [rfc-reordering].

Packet reordering has similar properties to packet loss with regard to its monitoring. While packet loss experienced by real network traffic is often higher than what we can detect by test packets, due to temporary congestion caused by real network traffic, it may be even harder to apply reordering detected by test packets to real network traffic. The cause of packet reordering may be different from the cause of packet loss. Reordering generally happens as a consequence of different timing in parallel processing. Some routers are known to cause reordering in periods of high load. Therefore, reordering detected by active monitoring is also useful mainly as an indicator of a possible network problem.

3.4   Throughput

When observing network load, we can define several terms:

Installed capacity: the maximum volume of data that can theoretically be transferred over a network in a unit of time. It is a property of the physical network medium.

Used capacity: the currently occupied part of installed capacity. It can be expressed as a percentage of the installed capacity, in which case we call it utilization.

Available capacity: the currently unoccupied part of the installed capacity. It is the complement of used capacity to installed capacity.

Throughput: sometimes called bulk transfer capacity or goodput, it denotes the volume of additional data that can be transferred over a network that already carries some traffic.

The term bandwidth is sometimes used interchangeably with capacity, which is the preferred terminology according to [ietf-capacity]. Installed capacity is specified at the physical layer, including inter-packet gaps. Used capacity and available capacity are usually given at the network layer, that is, in bits of IP packets including IP headers. Throughput is usually given at the session layer, that is, in bits successfully transferred by the transport protocol.

Throughput is different from available capacity, not only because it is computed at a different layer, but mainly because it depends on the transport protocols used by existing traffic and by added traffic. Most traffic is currently carried by TCP, which is an elastic protocol that reacts to congestion by reducing the volume of data sent into the network. Throughput of a TCP connection added to a network whose current traffic consists mostly of TCP is usually higher than available capacity, because the added traffic stresses the existing traffic.

All these metrics can be considered either for individual links in a network or for the whole path through the network. In the latter case we are interested in the maximum value over all links for used capacity and minimum value for other metrics.
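The path-level rule can be sketched as follows, assuming each link is described by its installed and used capacity in the same unit. The dictionary keys and the sample values are illustrative.

```python
def path_metrics(links):
    """Combine per-link capacities (for example in Mb/s) into path values."""
    return {
        # The slowest link limits the installed capacity of the path.
        "installed": min(link["installed"] for link in links),
        # The most loaded link determines the used capacity of the path.
        "used": max(link["used"] for link in links),
        # The link with the least free capacity limits available capacity.
        "available": min(link["installed"] - link["used"] for link in links),
    }


# Hypothetical two-link path, values in Mb/s
links = [
    {"installed": 10000, "used": 2000},
    {"installed": 1000, "used": 300},
]
print(path_metrics(links))
```

Note that the available capacity of the path is the minimum over the links and is not obtained by subtracting the path values of used capacity from installed capacity.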

Available capacity cannot be monitored directly. We can monitor used capacity by reading router interface byte counters by SNMP. We can also obtain used capacity from packet capture on a network line. Throughput can only be measured by active monitoring.
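As an illustration of the counter-based approach, used capacity can be estimated from two readings of an interface octet counter (such as ifInOctets or ifHCInOctets in the standard interfaces MIB). The sketch below assumes the readings were already obtained by some SNMP client; the function names and sample values are hypothetical.

```python
def used_capacity_bps(octets_before, octets_after, interval_s, counter_bits=64):
    """Average bit rate between two readings of an interface octet counter.

    The modulo handles a single counter wrap between the two readings.
    """
    delta_octets = (octets_after - octets_before) % (2 ** counter_bits)
    return delta_octets * 8 / interval_s


def utilization_percent(used_bps, installed_bps):
    """Used capacity as a percentage of installed capacity."""
    return 100.0 * used_bps / installed_bps


# Hypothetical readings taken 300 seconds apart on a 1 Gb/s link
rate = used_capacity_bps(1_000_000_000, 8_500_000_000, 300)
print(rate, utilization_percent(rate, 1_000_000_000))
```

With 32-bit counters on fast links the counter can wrap more than once between readings, which this sketch cannot detect; 64-bit counters or short polling intervals avoid the problem.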

While monitoring of used capacity is non-intrusive, does not depend on network protocols and can run continuously, it is also useful to make throughput measurements as a practical verification that a certain volume of data can really be transferred over the network. Current backbone network lines often have an installed capacity of 10 Gb/s or more. When we measure throughput with monitoring stations equipped with Gigabit Ethernet network adapters, performance is usually limited by the monitoring stations themselves. The value of such measurements is in providing a baseline of good network health, similarly to active loss measurements. If measured throughput is very low or drops suddenly, it is usually an indicator of a network problem.

4   One-way delay, loss and reordering monitoring

We have decided to use the RUDE/CRUDE tool for active one-way delay, loss and reordering monitoring. rude is a packet generator that can send multiple UDP streams of packets of specified sizes and rates. Each packet carries a sender timestamp expressed in seconds and microseconds. crude is a packet receiver that can log a short record about each received packet to a log file, including sender and receiver timestamps.

We chose RUDE/CRUDE for its flexibility in stream configuration and its very low overhead, which allows packets to be sent with precise timing and at high rates (even though we currently use quite low rates).

The original RUDE/CRUDE only works for unicast IPv4 packets. We added support for IPv6 and multicast. We also added the ability to specify very low packet rates of less than one packet per second and the ability to send packets in bursts of specified sizes.

We monitor one-way delay and packet loss from the central monitoring station located in CESNET premises in Prague to monitoring stations located in CESNET PoPs in different cities and conversely from these remote monitoring stations back to the central monitoring station. One test UDP packet is sent every 10 seconds in each direction. Deployment of monitoring stations is illustrated in Figure 1. The same monitoring stations are also used for active throughput monitoring (see next section) and for passive monitoring [abw].

[Figure]

Figure 1: Deployment of monitoring stations

We have developed a set of scripts that compute one-way delay, loss and reordering, store these values in an RRD database and provide results in a graphical form based on user requests. The web-based user interface is illustrated in Figure 2.

[Figure]

Figure 2: User interface of delay, loss and reordering monitoring

An example of a delay graph is shown in Figure 3. Average delay computed over specified time steps during a specified time range is shown as positive values in red for one direction and as negative values in green for the other direction. The reason we use different colors is that with poor time synchronization, values may be shifted from the positive to the negative part of the graph or vice versa. The colors make it easy to notice such a problem.

The same graph also shows average and maximum delay over coarser specified time steps for comparison. These two additional values are depicted by yellow and blue lines respectively.

An example of a loss graph is shown in Figure 4. We also indicate packet reordering in the same type of graph. We adopted a pragmatic approach to storing and presenting packet loss and reordering. As we described in Section 3.2, active loss monitoring is useful only as an indicator of possible network problems, rather than as a precise characterisation of packet loss experienced by real traffic. The same is true for packet reordering. Therefore, we just store each detected loss event in the RRD database as the number of consecutive test packets lost and present this number in the graph. If reordering of test packets is detected, we store the difference between the received and the expected sequence number and present this reordering size in the graph. We instruct the rrdgraph utility to plot maximum values during a specified time range with a specified time step. Loss is shown in red and reordering is shown in green. Whatever time range and time step you choose, you will always see the same magnitude of loss or reordering (they are not averages), if these were detected during the time range covered by the graph. You can find more precisely the times when loss or reordering was detected by generating another graph with a different time range or time step. So far we have not detected any reordering of test packets in our network. It would indeed be unlikely given the low rate of test packets. We plan to investigate possibilities for detecting reordering in user traffic.
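The event detection described above could look roughly like the sketch below. It is not the code of our scripts; it assumes sequence numbers starting at 0 and processed in arrival order, and the function name is illustrative. For simplicity a gap is counted as a loss event immediately, so a packet that arrives late is reported both as a loss and as a reordering; a practical implementation would wait before declaring a packet lost.

```python
def classify_arrivals(arrival_seqs):
    """Scan sequence numbers in arrival order.

    Returns (loss_events, reorder_events), each a list of
    (arrival_index, size) pairs. A gap ahead of the expected number is
    recorded as a loss event, a number behind it as a reordering event.
    """
    loss_events, reorder_events = [], []
    expected = 0
    for index, seq in enumerate(arrival_seqs):
        if seq > expected:
            # Packets expected..seq-1 are missing: consecutive loss.
            loss_events.append((index, seq - expected))
        elif seq < expected:
            # Late packet: size is the distance from the expected number.
            reorder_events.append((index, expected - seq))
            continue  # a late packet does not advance the expected number
        expected = seq + 1
    return loss_events, reorder_events


print(classify_arrivals([0, 1, 4, 5]))   # two consecutive packets lost
print(classify_arrivals([0, 2, 1, 3]))   # packet 1 arrives late
```

Storing the event size and plotting the maximum per time step keeps a single event visible at its full magnitude regardless of the chosen time range.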

[Figure]

Figure 3: Example of delay monitoring

[Figure]

Figure 4: Example of loss monitoring

5   Throughput monitoring

For throughput measurements, we have decided to use the well-known and time-proven tool iperf. It is a classic stress-type throughput measurement tool, which tries to send as much data as possible over a network path between a sender and a receiver, both of which are implemented in the same tool.

We decided not to use any of the lightweight capacity estimation tools, such as pathrate or pathload, because they tend to be imprecise and unreliable in Gigabit-speed networks.

We wanted to measure throughput separately for IPv4 and IPv6. We have found that IPv6 support does not work properly in the iperf version distributed as a tarball. The latest version from CVS (version 2.0.2, as of the time of this writing) works correctly.

A few practical findings about iperf use:

We monitor throughput from the central monitoring station located in CESNET premises in Prague to monitoring stations located in CESNET PoPs in different cities and conversely from these remote monitoring stations back to the central monitoring station. We use the same monitoring stations as for delay monitoring and passive monitoring.

We have developed a set of scripts that perform regular throughput measurements, store results in an RRD database and provide results in a graphical form based on user requests. Throughput is monitored separately for IPv4 and IPv6 and separately for TCP and UDP. When monitoring UDP throughput, we send a stream of UDP packets at a stepwise increasing rate and note two values (separately for each direction): the maximum rate with zero packet loss and the maximum rate with packet loss lower than a specified limit. When the packet loss rate exceeds the specified limit, we stop the measurement. We currently use a limit of 5%. The measured values suggest the performance that can be expected from practical UDP-based applications without excessively stressing existing network traffic. Configuration of monitoring stations follows standard recommendations regarding socket buffers and other end-host tuning for high performance [end-host-tuning].

The web-based user interface is similar to that of delay monitoring and is illustrated in Figure 5.
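The stepwise UDP measurement can be sketched as a simple loop. This is an illustration only, not our actual scripts: measure_loss stands for a caller-supplied function that runs one UDP test at the given rate (for example by invoking iperf) and returns the observed loss fraction, and all names and parameters are assumptions.

```python
def udp_rate_search(measure_loss, start_mbps, step_mbps, max_mbps, loss_limit=0.05):
    """Increase the UDP rate step by step until loss exceeds loss_limit.

    measure_loss(rate_mbps) must return the loss fraction (0.0 to 1.0)
    observed at that sending rate.
    Returns (max_rate_with_zero_loss, max_rate_with_loss_within_limit).
    """
    zero_loss_rate = 0
    limited_loss_rate = 0
    rate = start_mbps
    while rate <= max_mbps:
        loss = measure_loss(rate)
        if loss > loss_limit:
            break  # stop the measurement once the limit is exceeded
        if loss == 0:
            zero_loss_rate = rate
        limited_loss_rate = rate
        rate += step_mbps
    return zero_loss_rate, limited_loss_rate


def fake_measure_loss(rate_mbps):
    """Stand-in for a real measurement, used only to exercise the loop."""
    if rate_mbps <= 600:
        return 0.0
    return 0.02 if rate_mbps <= 800 else 0.10


print(udp_rate_search(fake_measure_loss, 100, 100, 1000))
```

Stopping as soon as the limit is exceeded keeps the test from flooding the path with traffic that would mostly be dropped.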

[Figure]

Figure 5: User interface of throughput monitoring

An example graph of throughput monitoring is shown in Figure 6. Green (light grey) indicates UDP throughput with zero packet loss, red (dark grey) indicates additional UDP throughput with packet loss under 5% and blue lines indicate TCP throughput. The user can select the time range and the time step used to compute average values. Only one time step is used in one graph. Throughput is shown as measured by iperf, that is, in bits of TCP or UDP payload.

[Figure]

Figure 6: Example of throughput monitoring

We studied how stress-type throughput tests affect other types of monitoring done on the same monitoring stations. Each complete throughput test between two monitoring stations (including IPv4 and IPv6, UDP and TCP in both directions) transfers approximately 4 gigabytes of data. This volume of data is visible in SNMP link monitoring for links that are normally lightly loaded. It is invisible for links that are normally highly loaded (such as the Prague - Brno link), because it is below normal fluctuations of link load. We found that throughput tests do not cause visible effects on passive monitoring results obtained on the same monitoring stations. However, we found that throughput tests cause significant fluctuations in delay measured actively on the same monitoring stations. For instance, Figure 7 shows delay peaks around 8 AM, when a throughput test was performed. Another fluctuation on the left is unrelated to the throughput test. We will modify our scripts so that delay measurements are temporarily suspended during throughput measurements.

[Figure]

Figure 7: Effect of throughput test on delay measurements

6   Conclusion

We have described how active monitoring can be used for delay, loss, reordering and throughput measurements. We summarized our experience with certain monitoring tools and showed how the monitored characteristics can be conveniently presented.

References

[rfc-ippm] G. Almes, S. Kalidindi, M. Zekauskas. A One-way Delay Metric for IPPM, RFC-2679, IETF, September 1999.
[rfc-ipdv] C. Demichelis, P. Chimento. IP Packet Delay Variation Metric for IP Performance Metrics (IPPM), RFC-3393, IETF, November 2002.
[rfc-loss] G. Almes, S. Kalidindi, M. Zekauskas. A One-way Packet Loss Metric for IPPM, RFC-2680, IETF, September 1999.
[rfc-loss-dynamics] R. Koodli, R. Ravikanth. One-way Loss Pattern Sample Metrics, RFC-3357, IETF, August 2002.
[rfc-reordering] A. Morton, L. Ciavattone, G. Ramachandran, S. Shalunov, J. Perser. Packet Reordering Metrics, RFC-4737, IETF, November 2006.
[ietf-capacity] P. Chimento, J. Ishac. Defining Network Capacity, IETF Draft <draft-ietf-ippm-bw-capacity-05>, May 2007. Work in progress.
[abw] Sven Ubik, Demetres Antoniades, Arne Øslebø. ABW: Short-Timescale Passive Bandwidth Monitoring, CESNET Technical Report 3/2007.
[tbwtools] Sven Ubik, Václav Řehák, Lukáš Baxa. Tbwtools: Debugging TCP Performance, CESNET Technical Report 6/2006.
[end-host-tuning] Brian L. Tierney. TCP Tuning Techniques for High-Speed Wide-Area Networks, NFNN2, Edinburgh, June 2005.