FlowMon for Network Monitoring

CESNET technical report 20/2010

Martin Žádník, Libor Polčák, Ondřej Lengál, Martin Elich, Petr Kramoliš

Received 1. 12. 2010

Abstract

In today's networks we observe ever-increasing complexity and variability of services. The primary service, constant network connectivity, is accompanied by demands for cost-effectiveness, robustness and sufficient bandwidth provisioning. Specific applications may further require a network to be loss-free, to guarantee a required bandwidth, or to exhibit low delay with low jitter. To address these demands and requirements, measurements must be taken to give an account of the current network status. The measurement results determine the scope for further improvement, e.g., routing optimization, bandwidth over-provisioning, anomaly mitigation and others. We propose deploying the FlowMon probe to extract relevant information about network traffic from high-speed links. The FlowMon probe is based on the concept of flow measurement, which aggregates the traffic while keeping a sufficient level of detail.

Keywords: flow, measurement, FPGA

1  Introduction

Most modern communication services (world wide web, streaming, databases, e-mail, on-line shops etc.) now use the Internet infrastructure. Its reliable operation also depends on large-scale measurement capable of providing accurate data about traffic patterns, applications used, hostile activities etc. In this respect, flow measurement based on the NetFlow or IPFIX protocol is a widely deployed scheme. It can help network operators to manage their current networks or plan new network topologies. Other management tasks such as bandwidth provisioning, detection of DoS attacks, billing and accounting also require detailed measurement. Currently available measurement devices have limitations in terms of performance and flexibility. In particular, for security-related applications it is not acceptable to get information about only a random portion of the network traffic when the measurement device becomes overloaded, for example during a DoS attack. Moreover, some devices support only the NetFlow v5 protocol, which makes it impossible to modify or add statistics.

A standalone dedicated device, such as the FlowMon probe, has several benefits. For example, it is not necessary to replace routers that do not support flow measurement. The probe also has sufficient performance to cope with any network traffic mix and at the same time provides reasonable flexibility in terms of how and what is measured. In addition, experimental enhancements have been made to support innovative features, such as application identification.

The project was started six years ago and since then several flow measurement probes and other related tools (such as NETCONF remote configuration software or an IPFIX collector) have been developed. The first probe was a proof of concept that showed the feasibility of implementing IP packet flow measurement on a board with FPGA chips. Although its performance was poor, it outperformed standard software probes. The second implementation was improved in both functionality and performance. It was able to monitor up to 3 million packets per second (for the shortest packets) or up to five gigabits per second (for the longest packets). These probes are successfully used to monitor networks by our GEANT2 partners such as SURFnet, SWITCH, ISTF (monitoring the whole Bulgarian NREN), etc. With the ever-growing speed of links, it became necessary to redesign both the network acceleration cards and the structure of flow measurement on the card and the PC. The outcome was the basis of the current form of the FlowMon probe. During its evolution the probe was enhanced with a precise timestamp generation module to support delay measurements and correlation of data gathered from several probes. Another extension focused on detecting specific patterns in packet payloads to determine the application traffic carried.

The objective of this technical report is to describe two stable implementations of the FlowMon probe which are available as distribution packages for download at the Liberouter web site. The report describes the parameters of the probe and the way it is configured. The deployment and the experiments carried out are analyzed at the end.

2  Design

The FlowMon probe is a passive network monitoring device based on the COMBO card equipped with a field-programmable gate array (FPGA). The FPGA serves as an accelerator of packet processing. The FlowMon probe is able to collect dynamic data about IP flows and export them to external collectors in the NetFlow (version 5 and 9) and IPFIX formats.

Current FlowMon probes are based on the general architecture shown in Figure 1. First, each captured Ethernet frame is marked with a precise timestamp. Then, the L2-L4 layers are analyzed and selected header fields are extracted. Packets are assigned to flows according to the extracted data and flow statistics are aggregated. Finally, the exporter converts the flow information into an output format and forwards the data to the collector. This architecture was the starting point for deciding the hardware-software co-design of the probe implementation.

[Image]

Figure 1. Processing of the traffic for each port.

Packet preprocessing is distributed over several stages. First, incoming packets are analyzed and selected header fields are extracted according to the L2-L4 protocols found. The extracted fields are assembled into a specific but simple and short data structure (the Unified Header, 64 bytes in total). Thus, the length of the data processed in all following sub-blocks of the probe is reduced to 64 bytes per packet regardless of the packet length. Flexibility of the probe is achieved by collecting different header fields in several FlowMon variants. The variants focus on support of either standard NetFlow, temporal characteristics, or L7 application identification. A subset of the extracted fields is considered a flow identifier. Packets containing identical flow identifiers form a flow.

In the second stage, a Bob Jenkins hash is computed from the flow identifier. The computed hash is used as an address in the flow processing. Flow processing involves aggregation of Unified Headers into flow records, storage of flow records in the flow cache, and flow cache management. Flow cache management must deal with collisions, when several flows are mapped to the same hash value. It must also deal with inactivity of a flow (the flow is inactive for longer than the inactive timeout) and with its persistent activity (the flow is active for longer than the active timeout). Flow records are released from the flow cache based on cache occupancy or on these timeouts. The exporter assembles several released flow records into a packet in the NetFlow v5, v9 or IPFIX format and sends it to a collector.
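The timeout-driven part of flow cache management can be illustrated with a minimal Python sketch. It is not the probe's implementation: the timeout values, the FlowRecord fields and the FlowCache class are assumptions chosen for illustration, and the cache is a plain dictionary with no capacity limit and no collisions.

```python
from dataclasses import dataclass

# Hypothetical timeout values in seconds; the report does not state defaults.
INACTIVE_TIMEOUT = 30.0
ACTIVE_TIMEOUT = 300.0


@dataclass
class FlowRecord:
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


class FlowCache:
    """Toy flow cache keyed by the flow identifier (illustrative only)."""

    def __init__(self):
        self.records = {}

    def update(self, flow_id, timestamp, length):
        """Aggregate one packet (one Unified Header) into its flow record."""
        rec = self.records.get(flow_id)
        if rec is None:
            rec = FlowRecord(first_seen=timestamp, last_seen=timestamp)
            self.records[flow_id] = rec
        rec.last_seen = timestamp
        rec.packets += 1
        rec.octets += length

    def expire(self, now):
        """Release and return records that exceeded either timeout."""
        released = []
        for flow_id, rec in list(self.records.items()):
            idle = now - rec.last_seen > INACTIVE_TIMEOUT
            too_long = now - rec.first_seen > ACTIVE_TIMEOUT
            if idle or too_long:
                released.append((flow_id, self.records.pop(flow_id)))
        return released
```

In this sketch an exporter would call expire periodically and pack the returned records into NetFlow or IPFIX packets.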

A common feature of FlowMon probes is the assignment of a precise timestamp to every captured frame at the network interface. The timestamp module [1] uses the time signal provided by a GPS module if one is connected to the COMBOv2 card.

Another important feature of FlowMon probes is that they can stay invisible on the second and higher layers of the ISO/OSI network model, so an attacker is not able to find out that his or her activities are being monitored. The transparency is achieved by a cut-through repeater (Figure 2) implemented in the firmware. Once the firmware is correctly loaded into the COMBO card, the probe retransmits all traffic from port 0 to port 1 and vice versa, independently of the state of the rest of the design and the software. When needed, the repeater can be stopped in one or both directions.

[Image]

Figure 2. Repeater connection.

Figures 3, 4 and 5 show possible connections of the FlowMon probe to the network. The simplest option is to use a SPAN port of a router or a switch (Figure 3). The disadvantage of this mode is that the router is forced to send every packet not only to its correct output port but also to the SPAN port. In such a scenario, the queues inside the router may overflow, so some packets may be dropped and not monitored. Another option is to insert an optical splitter into the network line (Figure 4). In this case all traffic is monitored by the probe and the time characteristics of the passing traffic are preserved. The built-in repeater in the probe (Figure 5) offers all the advantages of the optical splitter solution and does not need another device. However, the repeater solution does not forward traffic when the card is not working (e.g., because of a power outage).

[Image]

Figure 3. FlowMon probe inserted at a mirror port.

[Image]

Figure 4. FlowMon Probe Connected via an Optical Splitter.

[Image]

Figure 5. FlowMon Probe Inserted in Line as a Repeater.

3  Architectures

We have implemented two mappings of the proposed architecture to hardware and software. Both architectures implement packet preprocessing in hardware. FlowMon Light (FlowMon LT) aggregates flow information entirely in software, whereas FlowMon Full divides flow information aggregation between hardware and software. In both implementations, export of the aggregated data is done in software by the external FlowMon exporter.

We use the INVEA-TECH FlowMon exporter as an aggregator and exporter of flow statistics. The exporter can generate flow statistics from raw packets, but it can alternatively be customized to a specific input by means of an input plug-in, whose task is to transform the input data into an exporter-readable format. This is the case for the FlowMon probe, which combines the FlowMon exporter with an input plug-in. The data from the card are transported to the software over the high-throughput sze2 interface in the form of Unified Headers (FlowMon Light) or flow records (FlowMon Full). The input plug-ins transform these structures into a flow structure readable by the FlowMon exporter, which may further aggregate the received data or export them to a collector. The exporter can export flow statistics in the NetFlow v5, NetFlow v9 or IPFIX format.

3.1  FlowMon Light (LT)

FlowMon LT implements the mapping of the general architecture to hardware and software modules as shown in Figure 6. Even though the flow cache is implemented in the software, hardware preprocessing of the input network traffic allows the probe to achieve high throughput.

In the case of FlowMon LT, the task of the hardware part is to process incoming packets and to extract and assemble a simple 64 B Unified Header that is transferred to the software part. Subsequently, this preprocessed data is aggregated and exported. Despite the hardware preprocessing, flow aggregation and flow cache management on 10 Gbps networks is still too complex for a single processor core to handle. To address this problem, the workload can be spread among up to eight instances of the exporter. In such a case, Unified Headers are distributed in the FPGA to eight DMA channels, four per network port. The distribution among DMA channels is flow-wise, which guarantees that all Unified Headers belonging to a single flow are transported through the same DMA channel. Hence, the exporters do not have to synchronize their flow records with each other when they access the preprocessed data, and no flow is broken into several parts. The rest of the flow processing is handled by the exporter itself. In summary, multiple FlowMon exporters serve as a distributed flow cache, which is the only flow cache in the whole metering process.
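A flow-wise distribution of this kind can be sketched in Python as follows. The report does not say which Bob Jenkins hash variant the firmware uses or how channels are numbered, so the one-at-a-time variant, the flow_key layout and the dma_channel numbering below are assumptions made only to illustrate the principle: the channel depends solely on the flow identifier.

```python
import socket
import struct

CHANNELS_PER_PORT = 4  # from the text: eight DMA channels, four per port


def jenkins_one_at_a_time(data: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash (32-bit); the variant is an assumption."""
    h = 0
    for byte in data:
        h = (h + byte) & 0xFFFFFFFF
        h = (h + (h << 10)) & 0xFFFFFFFF
        h ^= h >> 6
    h = (h + (h << 3)) & 0xFFFFFFFF
    h ^= h >> 11
    h = (h + (h << 15)) & 0xFFFFFFFF
    return h


def flow_key(src_ip, dst_ip, proto, src_port, dst_port) -> bytes:
    """Hypothetical IPv4 flow identifier: the classic 5-tuple packed to bytes."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BHH", proto, src_port, dst_port))


def dma_channel(port: int, key: bytes) -> int:
    """Map a flow to one of the four channels of its port (0-3 or 4-7)."""
    return port * CHANNELS_PER_PORT + jenkins_one_at_a_time(key) % CHANNELS_PER_PORT


key = flow_key("10.0.0.1", "10.0.0.2", 6, 1234, 80)
# Every packet of this flow lands on the same channel, hence the same exporter.
assert dma_channel(0, key) == dma_channel(0, key)
```

Because the mapping is a pure function of the flow identifier, each exporter sees complete flows and needs no shared state with the others.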

[Image]

Figure 6. FlowMon Light (LT) layers.

The FlowMon LT probe was also equipped with an application-identification engine. The engine identifies the application carried in the traffic based on a specific pattern occurring in the L7 protocol. The core of the engine is a pattern match unit [2] integrated in the packet processing block. Its task is to scan the L7 data of all captured packets for pre-defined signatures (regular expressions) of up to 32 L7 protocols. These signatures can be defined arbitrarily. By default, we provide a set of L7 signatures that detect protocols such as: ssh, telnet, sip, NBNS, http, dns, Bittorrent, Samba, Apple Juice, Counterstrike, Direct Connect, Doom, eDonkey2000, IMAP, IRC, Jabber, MSN, POP3, RTSP, Shoutcast and Icecast, SNMP, SOCKS, SSDP, TeamSpeak, Tor, VNC, X11. An example of the IRC signature:

# IRC - Internet Relay Chat - RFC 1459
/^(nick[\x09-\x0d -~]*user[\x09-\x0d -~]*:|user[\x09-\x0d -~]*:[\x02-\x0d -~]*nick[\x09-\x0d -~]*\x0d\x0a)/i

The information about the detected protocol is added to the Unified Header structure as a bitmap. The L7 protocol detected in the flow can be exported to the collector. The L7 classification and the advanced time statistics are not embedded in all FlowMon LT variants. Packages with various combinations of the L7 and time statistics features are available for download: basic NetFlow, NetFlow with L7 detection, Timestats, and Timestats with L7 detection.
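The signature matching and the resulting bitmap can be imitated in software with a short Python sketch. The IRC pattern is the one shown above, compiled for bytes with the /i flag mapped to re.IGNORECASE. The ssh pattern and the bit positions are illustrative assumptions, and the hardware pattern match unit works differently from a software regex engine.

```python
import re

# IRC signature from the text; bytes pattern, /i becomes re.IGNORECASE.
IRC = re.compile(
    rb"^(nick[\x09-\x0d -~]*user[\x09-\x0d -~]*:"
    rb"|user[\x09-\x0d -~]*:[\x02-\x0d -~]*nick[\x09-\x0d -~]*\x0d\x0a)",
    re.IGNORECASE,
)

# (bit position, name, pattern); bit positions are hypothetical, and the
# ssh pattern is an illustrative signature, not taken from the report.
SIGNATURES = [
    (0, "irc", IRC),
    (1, "ssh", re.compile(rb"^ssh-[12]\.[0-9]", re.IGNORECASE)),
]


def classify(payload: bytes) -> int:
    """Return a bitmap with one bit set per signature that matches."""
    bitmap = 0
    for bit, _name, pattern in SIGNATURES:
        if pattern.search(payload):
            bitmap |= 1 << bit
    return bitmap


irc_bits = classify(b"NICK alice\r\nUSER alice 0 * :Alice\r\n")
ssh_bits = classify(b"SSH-2.0-OpenSSH_5.3\r\n")
flow_bitmap = irc_bits | ssh_bits  # per-flow result: OR of per-packet bitmaps
```

A per-flow bitmap obtained by OR-ing the per-packet bitmaps is one plausible way to carry the detection result in a flow record; the report does not describe the aggregation rule.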

Figure 7 shows FlowMon LT throughput when only one port out of the two is utilized and measured. Measurement of throughput using both ports is presented in Section 5.

[Image]

Figure 7. Throughput of the FlowMon LT when only one port is monitored (RFC 2285).

We plan to implement an interface for custom configuration of the probe. The goal is to generate a hardware part capable of parsing and extracting any fields from L2-L4 protocol headers based on the current needs of the user. The configuration will affect the amount of data sent through PCI Express. When the probe is deployed on a network with lower throughput, it may be possible to collect more data from each packet and still not discard any packets prior to the exporting process.

3.2  FlowMon Full

FlowMon Full is a state-of-the-art network probe that performs flow aggregation of two 10 Gbps Ethernet interfaces directly in the FPGA. This yields superior performance when compared to other approaches, as the accelerated network card transports already aggregated flow records to the host memory for an export to a collector. The exporter is able to export aggregated data either in NetFlow (version 5 or 9) or in IPFIX format. The division of tasks between FPGA and software is depicted in Figure 8.

[Image]

Figure 8. FlowMon Full layers.

Most of the flow measuring process is implemented in the FPGA. Input packets are analyzed and selected header fields are extracted into a Unified Header. Then a hash value is computed from the flow identifier, i.e., all packets of a given flow are labeled with the same hash value. The hash is a direct address into the hardware flow cache. Hash collisions are resolved by replacing the current flow record with the new colliding flow record. Besides a collision, a flow may also be released because of its inactivity, its long activity, or because some counter in the flow record is about to overflow.

The hardware flow cache is implemented partly in QDR memory on the card (as a second-level cache) and partly in on-chip memory blocks (as a first-level cache). The two-level cache hierarchy enables processing of both a large number of flows and heavy-hitter flows, i.e., flows that account for a significant amount of network bandwidth. When the flow record for an input packet is loaded, it is updated according to the values in the Unified Header and then stored back to both levels of the cache.
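The replace-on-collision policy of a directly addressed cache can be sketched in a few lines of Python. This is a software analogy only: the slot count is hypothetical, zlib.crc32 stands in for the hash used by the firmware, and the record layout is invented for the example.

```python
import zlib

CACHE_SLOTS = 8  # hypothetical; the real hardware cache is far larger


class DirectMappedFlowCache:
    """Each hash value addresses exactly one slot; a collision evicts it."""

    def __init__(self, slots=CACHE_SLOTS):
        self.slots = [None] * slots
        self.released = []  # records handed over for export

    def update(self, flow_id: bytes, length: int):
        # zlib.crc32 is a stand-in for the firmware's hash function.
        index = zlib.crc32(flow_id) % len(self.slots)
        entry = self.slots[index]
        if entry is not None and entry["flow_id"] != flow_id:
            # Collision: release the stored record, the new flow takes the slot.
            self.released.append(entry)
            entry = None
        if entry is None:
            entry = {"flow_id": flow_id, "packets": 0, "octets": 0}
            self.slots[index] = entry
        entry["packets"] += 1
        entry["octets"] += length


cache = DirectMappedFlowCache(slots=1)  # one slot forces a collision
cache.update(b"flow-A", 100)
cache.update(b"flow-A", 50)
cache.update(b"flow-B", 60)  # evicts flow-A with 2 packets, 150 octets
```

An evicted record is not lost. It is exported early, which is why a later aggregation stage in software can merge the pieces of a flow again.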

Flow records which have been released are transported through a single DMA channel to the CPU for further processing. The provided exporter performs secondary aggregation of flow records. The secondary flow cache in software manages the flow records in the same way as the flow cache in firmware, and its expiration is based on the same types of timeouts. The secondary aggregation can be performed at a much lower speed and can use the available RAM of the host computer. Therefore the aggregation ratio between incoming packets at the network interface and outgoing flows is very high.

We provide two variants of FlowMon Full. One is specialized in NetFlow measurement and is capable of exporting more header fields, but its timestamps are only 32 bits long. The Timestats design, on the other hand, provides 64-bit timestamp precision at the cost of omitting some header fields such as ICMP type and code, Class of Service, or Time to Live. The goal of the Timestats design is to compute additional temporal statistics for each flow. All figures are provided in nanoseconds. For each flow, its start, end, and duration are supplied together with the minimal and maximal time difference between two consecutive packets. In addition, the dispersion and arithmetic average of all inter-packet time differences in the flow are computed.

Figure 9 shows the throughput of the design measuring flow temporal statistics. Regardless of packet length, the design is able to process two fully utilized ports, even with the shortest packets.
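The temporal statistics listed above can be written down as a short Python sketch. It assumes that "dispersion" means the population variance of the inter-packet gaps, which the report does not state explicitly. The function name and the dictionary keys are illustrative.

```python
def time_stats(timestamps_ns):
    """Per-flow temporal statistics from packet timestamps in nanoseconds."""
    if len(timestamps_ns) < 2:
        raise ValueError("need at least two packets to compute gaps")
    # Time differences between consecutive packets of the flow.
    gaps = [b - a for a, b in zip(timestamps_ns, timestamps_ns[1:])]
    mean = sum(gaps) / len(gaps)
    # Assumption: dispersion is taken as the population variance of the gaps.
    variance = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return {
        "start": timestamps_ns[0],
        "end": timestamps_ns[-1],
        "duration": timestamps_ns[-1] - timestamps_ns[0],
        "min_gap": min(gaps),
        "max_gap": max(gaps),
        "mean_gap": mean,
        "dispersion": variance,
    }


stats = time_stats([0, 100, 300])  # gaps of 100 ns and 200 ns
```

A hardware implementation would more likely keep running sums per flow record rather than a list of timestamps; the list is used here only for readability.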

[Image]

Figure 9. Throughput of the time statistics design of FlowMon Full when both ports are monitored (RFC 2544).

4  Configuration

4.1  Hardware configuration

The behavior of the COMBO card depends heavily on the variant of the loaded firmware. After one of the FlowMon implementations is loaded, the software recognizes the variant and provides an interface for configuring the available parameters. The configuration affects the sampling rate, the repeater settings and the card exporting timeouts. There are two ways to modify the configuration. One way is to use the fflowmonctl script, which modifies only the current run of the probe; all settings return to their defaults when the probe is restarted. This is useful when the user needs to change settings temporarily. The second way is to change the default values in the probe configuration file. The file configuration is persistent and is applied after each probe restart. We recommend using the file configuration for long-term measurements, because after an accidental probe restart all settings from the configuration file are loaded automatically.

4.2  Exporter configuration

The exporter itself has several parameters which allow the user to customize the exporting process to her needs. It can be started manually, by a FlowMon script, or automatically by the FlowMon start-up scripts upon system boot. A manually started exporter is locked to the first CPU core by default. This is sufficient when only one exporter is started, but when more than one exporter runs on a system, the exporters should be locked (by means of a parameter) to different CPU cores. Any exporter can subscribe to multiple virtual interfaces (outputs of DMA channels). If two or more exporters subscribe to the same interface, the incoming data is shared among all of them and is not removed until all exporters have finished processing it. As described previously, FlowMon LT provides 8 virtual interfaces, 4 per network port. All incoming packets on a network interface are distributed among 4 virtual interfaces flow-wise, which means that an entire flow is always delivered through the same DMA channel to the same CPU core. The exporter can subscribe either to all four interfaces to see the complete traffic on the network interface, or to just some of them to see only a portion. In the case of a heavily loaded link, we recommend running several exporters, one per virtual interface and CPU core, in order to efficiently utilize all computation resources available in the host computer. The exporter also allows parametrization of its flow measurement and exporting using a volatile or persistent configuration.

4.3  FlowMon scripts

In order to start the FlowMon probe manually, the COMBO drivers must be loaded. This can be done with the modprobe tool, but we recommend using the fflowmonlkm script, which loads all required modules at once. FlowMon Full and LT probes can be started by the scripts fflowmon and fflowmon-lt respectively. These scripts allow the user to select the variant of the design for the COMBO card and start the exporter with the proper input plug-in. The scripts load parameters from the hardware and exporter configuration files and also allow manual configuration.

4.4  Start-up script

The start-up script starts the FlowMon probe automatically upon each system boot according to the configuration file. This includes loading the COMBO drivers, booting the selected variant of the design and starting all exporters specified in the configuration file. The start-up script helps the probe recover from unscheduled reboots such as those caused by power failures. The configuration file is a list of definitions of environment variables used by the scripts. For easier configuration, each variable is described in detail directly in the configuration file. The start-up script and the configuration file are very similar to other scripts and configuration files commonly used in UNIX systems.

4.5  Description of some parameters

Sampling - The user can modify the sampling rate and the sampling type of the probe. The sampling rate is the rate at which packets are sampled on the interface prior to any processing. Two types of sampling are supported, constant and random. Constant sampling means that every n-th packet is processed, whereas random sampling means that each packet is sampled with probability 1/n.

MTU - The user can set the maximum transmission unit (MTU) of a received frame, i.e., the maximal frame length that is accepted for processing. Currently, jumbo frames longer than 4096 B are not supported. We plan to add support for jumbo frames of up to 16384 B to the FlowMon LT designs in the very near future.

Repeater - The FlowMon probe supports four repeater settings. Traffic from port 0 can be forwarded to port 1, or traffic from port 1 can be forwarded to port 0. The user can also enable repeating in both directions or disable the repeater altogether. The repeater is not affected by other settings such as sampling or MTU.

Default design - The default variant of the available designs. This setting affects the FlowMon start-up scripts. After each reboot for which the start-up script is enabled, the specified variant of the design is booted.

Default exporters - Same as the default design, but this setting affects the start-up of the default exporters. Each default exporter has its whole configuration specified in the configuration file. The configuration consists of the NetFlow protocol version (or IPFIX), the transport protocol for flow records, the CPU core lock, the collector host name and port, etc.
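The two sampling types described under Sampling can be illustrated with a small Python sketch. It is a software analogy of the selection rule only; the function names are hypothetical and the probe performs sampling in hardware.

```python
import random


def constant_sampler(n):
    """Return a predicate that accepts exactly every n-th packet."""
    counter = 0

    def keep(_packet=None):
        nonlocal counter
        counter += 1
        if counter == n:
            counter = 0
            return True
        return False

    return keep


def random_sampler(n, rng=None):
    """Return a predicate that accepts each packet with probability 1/n."""
    rng = rng or random.Random()

    def keep(_packet=None):
        return rng.random() < 1.0 / n

    return keep


keep = constant_sampler(4)
decisions = [keep() for _ in range(8)]
# decisions: packets 4 and 8 are kept, all others are skipped
```

Both types keep one packet in n on average. Constant sampling is deterministic, while random sampling avoids synchronizing with periodic traffic patterns.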

4.6  Examples of configuration

Loading of COMBO drivers:

$ fflowmlkm -l

Booting netflow design and starting exporters specified in configuration file:

$ fflowmon-lt -F netflow -e

Booting netflow design and starting exporter which will use NetFlow version 9 for exporting. Records are sent to collector.liberouter.org at port 60000:

$ fflowmon -F netflow -e -p NF9 -c collector.liberouter.org:60000

Manually started exporters (this requires a booted design). The exporters are locked to 4 different CPU cores and use different FlowMon interfaces, so the whole traffic from port 0 is divided flow-wise among four exporters (CPU cores). The proper input plug-in is loaded and the collector is set. It is important to give each exporter a different identification number, because otherwise the collector may not recognize the reported data correctly:

$ flowmonexp -m 1 -d -E fflowmon_input_plugin.so,0:1 \
  -l 1 collector.liberouter.org:7776
$ flowmonexp -m 2 -d -E fflowmon_input_plugin.so,0:2 \
  -l 2 collector.liberouter.org:7776
$ flowmonexp -m 4 -d -E fflowmon_input_plugin.so,0:4 \
  -l 3 collector.liberouter.org:7776
$ flowmonexp -m 8 -d -E fflowmon_input_plugin.so,0:8 \
  -l 4 collector.liberouter.org:7776

5  Results

The performance of the probe is the primary factor when the reported data should be unbiased and usable for applications such as anomaly detection, accounting and others. In order to measure the performance, we built a simple testbed which consists of a Spirent traffic generator capable of generating 10 Gbps traffic on 2 network links and a server (8x Xeon(R) CPU E5410 @ 2.33GHz, 10GiB RAM) with either Myricom or COMBOv2 cards. The two devices are directly connected through two 10 Gbps lines, first via a Myricom card and subsequently via a COMBOv2 card.

Figure 10 shows the performance of four different implementations of a flow probe. We compare two variants of software flow measurement (using the INVEA-TECH exporter), one of which is accelerated by bypassing the standard Linux stack, and two implementations of the FlowMon probe, FlowMon LT and Full. The four tested configurations are described below the figure.

[Image]

Figure 10. Throughput of four flow measurement implementations under two fully loaded 10 Gbps network interfaces.

Exporter with Dual Myricom: This is the performance of a solution that does not implement any optimization. The Myricom card is used as it comes (no upgrades) and is treated as an ordinary network card, i.e., there is no optimization of transfers between the card and host memory and all incoming frames are passed to user space through the standard Linux stack.

Exporter with Dual Myricom SNF: The performance of a flow measurement system is significantly improved if special firmware is uploaded into the Myricom card and an optimized driver is installed to support PCI Express transfers with interrupt reduction. In addition, a proprietary interface is used to access incoming frames, which allows the Linux stack to be bypassed.

FlowMon LT: When very short packets are received on both input ports, the probe places very high demands on the PCI Express bandwidth (each packet generates one Unified Header, and the headers are transferred in bulk). Unfortunately, it is not possible to transport all data through the PCI Express x8 slot when the probe is under a heavy load of very short packets. However, such situations are rare in a real network environment. FlowMon LT handles full throughput on both ports if the monitored frames are 106 B or longer on average.

FlowMon Full: Thanks to an on-card flow cache, the design is able to process even the shortest packets at maximum speed. However, due to the limited size of the on-card cache, the performance may decrease when the number of flows exceeds a certain level. If the traffic exhibits no locality, that is, every packet belongs to a different flow, then the cache limit is reached precisely when the number of flows exceeds its capacity. In such a situation the performance would decrease as shown in Figure 11. Under normal circumstances, network traffic exhibits a certain locality, documented as packet trains (bursts) and the heavy-hitter paradigm. Therefore the performance would drop only if the number of flows exceeded the cache capacity multiple times.
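A back-of-the-envelope estimate makes the FlowMon LT limit plausible. The Python sketch below assumes standard Ethernet framing overhead of 20 B per frame on the wire (8 B preamble and start delimiter plus a 12 B inter-frame gap), frame lengths that include the FCS, and one 64 B Unified Header per packet; DMA descriptors and PCI Express protocol overhead are ignored, so the numbers are a lower bound on the real bus load, not measured values.

```python
ETH_OVERHEAD = 20     # bytes on the wire per frame: preamble/SFD + inter-frame gap
LINE_RATE = 10e9      # bits per second per port
UNIFIED_HEADER = 64   # bytes per packet, from the text
PORTS = 2


def packets_per_second(frame_len):
    """Maximum frame rate of one 10 Gbps port for a given frame length."""
    return LINE_RATE / ((frame_len + ETH_OVERHEAD) * 8)


def header_load_gbps(frame_len):
    """Unified Header traffic towards the host with both ports saturated."""
    return PORTS * packets_per_second(frame_len) * UNIFIED_HEADER * 8 / 1e9


shortest = header_load_gbps(64)    # about 15.2 Gbps of Unified Headers
threshold = header_load_gbps(106)  # about 10.2 Gbps of Unified Headers
```

Under these assumptions, 64 B frames on both ports produce roughly 15.2 Gbps of Unified Header data, while at the reported 106 B average the load drops to roughly 10.2 Gbps. This suggests that the sustainable transfer rate of the setup lies near the latter figure; that conclusion is an inference from the arithmetic rather than a number stated in the report.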

[Image]

Figure 11. Throughput of FlowMon Full in dependence on the number of flows.

The FlowMon LT and Full implementations differ not only in performance but also in consumed resources. The overall resources consumed on the Virtex-5 LXT155 are listed in Table 1.

Implementations and variants    Slices    BlockRAMs
FlowMon LT without L7           78 %      33 %
FlowMon LT with L7              91 %      67 %
FlowMon Full                    87 %      89 %
FlowMon Full + TimeStats        85 %      88 %

Table 1. Virtex-5 LXT155 resource utilization of FlowMon variants.

Further, the FlowMon LT probe was deployed in the CESNET network infrastructure. The intention was twofold. The primary goal was to measure network traffic. The focus was on IPv6 traffic, as the probes are IPv6 ready. Despite the forthcoming IPv6 era, only a minor fraction (less than one per thousand) of all flows belongs to IPv6; compare the scale of Figure 12, which shows IPv6 flows, with that of Figure 13, which shows all flows.

[Image]

Figure 12. Number of IPv6 flows per 5 minutes interval as reported by FlowMon probe.

The second goal was to show the functionality and reliability of the probe under real network conditions. The probe ran without a stall or misbehavior for more than a month. During that period there were a few very short intervals when the collector did not receive any data. However, we can rule out the probe as the cause, because during those intervals the collector did not receive data from other independent NetFlow exporters either; we attribute this to a loss of network connectivity on the collector side. In any case, both the probe and the collector recovered from this situation correctly, without the need for a restart. Figure 13 shows the number of flows per 5-minute interval measured on the CESNET link Ostrava-Pionier from October 9th till November 14th 2010.

[Image]

Figure 13. Number of flows per 5 minutes interval as reported by FlowMon probe.

6  Conclusion

This report presented a general description of the flow measurement pipeline and its two implementations using a hardware-accelerated network card and a host computer. The standard flow measurement was further extended with additional capabilities such as an application identification engine and precise time statistics. The probes were deployed on a real network where they showed their ability to correctly measure real traffic. Before and during deployment the probes were heavily tested and many bugs were revealed and fixed. The probes were also tested in a testbed against a Spirent traffic generator. Both probes provide superior performance, which is limited by the PCI Express bandwidth in the case of FlowMon LT or by the capacity of the on-card memory in the case of FlowMon Full. In both cases, the hardware support and the software flow cache allow monitoring of a large number of concurrent flows at high speed. Both probes are available after registration in the form of RPM packages from the Liberouter web pages.

A challenging task is to finish tools that would allow the probe to be fully configurable. The intention is to provide nearly any information about network traffic and assign it to a flow which can be arbitrarily defined by user.

6.1  Acknowledgment

The authors of this technical report would like to acknowledge the fine work of all members of the Liberouter team who participated in the development of hardware-accelerated flow measurement.

References

[1] MARTÍNEK, T.; ŽÁDNÍK, M. Precise Timestamp Generation Module and its Applications in Flow Monitoring. Technical Report 13/2009, Praha: CESNET, 2009.
[2] ŽÁDNÍK, M. Flow Measurement Extension for Application Identification. Technical Report 14/2009, Praha: CESNET, 2009.