Flow-Based Traffic Analysis System - Architecture Overview

CESNET technical report number 15/2004

Tom Kosnar
15.12. 2004

1   Flow-Based Traffic Analysis System - Architecture Overview

1.1   Abstract

The Flow-Based Traffic Analysis System (FTAS) is an experimental traffic analysis system that aims to give both a short-term and a longer-term view of network traffic. It is designed for use in high-speed network environments and for collecting flows from multiple sources. The target user communities are network and service administrators. In the short term, it should support network and service tuning, incident handling, and debugging of particular traffic. These operations should rely on primary flow processing and on flow-based data that are aggregated as little as possible and stored in a format suitable for efficient retrieval according to requested conditions. In the longer term, it should help to discover significant traffic information and trends in network usage and user behavior, and provide support for strategic planning. Post-processed and aggregated flow-based data are the primary data sources in this case. This report gives a brief description of the system's architecture.

1.2   General Architecture

From the top-level point of view, the system consists of one or more collector-hosts whose behavior and processing depend on a central master configuration. A collector-host resembles a traditional collector with extended functionality. Collector-hosts process either flows from primary flow-sources (routers, probes) or flows from other collector-hosts that are generated by filtering or classification mechanisms. The idea is to distribute and process intra-system flows in a similar way to flows coming from primary flow-sources. There are only two differences. The first is in flow format: intra-system flows use a proprietary format that extends the usual flow formats with extra classification fields. The second is in flow processing: no secondary filtering or classification is allowed on previously filtered or classified flows. The system itself imposes no limit here, but it has only limited means of detecting wanted or unwanted multiple or never-ending loops created by the configuration, especially in multi-host installations.

[Figure]

Figure 1: FTAS - Top Level Architecture

While flows or flow-based data are being processed, they can be distributed among collector-hosts in any way. Flow processing ends with the data being either dropped or stored. Stored data can be retrieved and post-processed. Post-processing is performed only inside the appropriate collector-hosts, without any data being exported outside. Collector-hosts may be distributed. In any case, they rely on a centralized configuration, the FTAS master configuration.

1.3   Collector-Host Definition

Each collector-host consists of a multi-purpose collector and its related database, as given by the appropriate record in the central configuration. In general, all configured databases can be controlled by a single database engine and/or distributed among several hosts. This means that the database part of a collector-host may be located on a different physical host than the multi-purpose collector. We use the term collector-host for this collector-database relation. The collector-host is the basic unit of the FTAS system architecture. The number of collector-hosts can be easily configured within an FTAS installation instance, so the capacity and processing power of the system can follow actual needs.

[Figure]

Figure 2: FTAS - Collector-host architectures

1.4   Collector-Host Functions

A collector-host should provide several tasks. Which of them run on a particular collector-host depends on its configuration. In general, a collector-host must support the following groups of functions:

  1. Flow Collecting
  2. Flow Classification
  3. Flow Filtering
  4. Data Post-Processing

The system design expects flow filtering and classification to be performed on the fly while incoming flows are being collected. Data post-processing is performed off-line in the background, with a delay given by the configuration. All these functions are incorporated into the multi-purpose collector.

1.5   Multi-purpose Collector

As described above, this module is expected to be multi-functional. From the flow point of view, it can be both a receiver and (in some cases) a sender; from the time point of view, it supports real-time flow processing as well as delayed background computing. The main components of the multi-purpose collector are the primary flow-source receiver, the filter and accounting receivers, and the post-processor.

[Figure]

Figure 3: FTAS - Multi-purpose collector

1.5.1   Primary Flow-source Receiver

This module processes incoming flows from primary sources (routers, probes). It listens on a configured UDP port. The processing scenario starts with an incoming UDP datagram and is split into several steps:

Datagram validity check
The system requires the exact hosts and the list of acceptable flow export formats to be configured for each flow-source. A flow-source is a logical source: it can consist of several real sources. Therefore, every incoming UDP datagram is checked against the configured source IP addresses and the configured flow format versions.
Datagram replication
This option can compensate for real flow sources that do not support multiple export destinations. More generally, it is useful whenever flows need to be redistributed to another independent system. Multiple destinations in the form host/UDP-port are allowed. Replicated datagrams are sent with the collector's source address; no spoofing is allowed.
Datagram parsing
The parser is built to be as independent of the flow format version as possible. The starting point was the idea of an open, flexible flow structure similar to Cisco NetFlow version 9. The parser is based on a separate flow data structure definition and a generic parsing engine, so new format versions can be added "easily". Currently all common flow formats (versions 1-9 in Cisco terminology) are supported.
Flow sampling
Original incoming flows may be the result of sampling at the real flow sources; we call this input-sampling. For FTAS purposes, sampling at the real flow sources (when used) should be configured with a constant rate or a constant probability. We added a secondary sampling option, internal-sampling. It is flow-based (NOT datagram-based) and is implemented as an emergency option for cases of insufficient hardware or extreme flow rates. Internal-sampling is currently performed with a constant step, whose value may be reconfigured to reach the optimal load of the particular collector. Both sampling values are required when a primary flow-source is configured in FTAS and are taken into account when estimated values (packets, octets) are computed.
Flow classification
This mechanism can be used for accounting purposes and/or for precise filtering. Additional fields are added to the parsed flow structure. They relate to the source as well as the destination part of the flow. These fields are filled with appropriate values according to configured conditions. Conditions may be defined as input and/or output traffic identifiers. A priority (weight) value is assigned to each condition, so overlapping conditions are accepted and the classification result is given by the best (weight-based) match. Classification is hierarchical and currently consists of 4 layers for each side (source, destination) of the flow. In the case of a valid classification (validity depends on the selected accounting model), an extended flow in the extended internal format is prepared to be sent to the accounting receiver.
Flow filtering
Filtering is based on a filter definition and its related filtering conditions. Unlike classification conditions, filtering conditions can be of any type (not only input or output). All matching flows are prepared to be sent to the appropriate filter receivers, where they can be stored in the matching filter data sets. The classification and filtering mechanisms share the same code, so the condition syntax is the same. Filtering conditions may also use fields (and their values) created and filled by the classification mechanism.
Flow on-fly preprocessing
It is not always necessary to store the original flows. Sometimes it is useful to store only some flow record fields, possibly in aggregated form within a specified time range. This mechanism makes it possible to define the flow field-set of interest and to perform on-the-fly aggregations.
Primary data storage
Flows from primary flow-sources are stored in separate data sets assigned to the appropriate flow-source, with a granularity given by the configured time range (the flow end-time is used). The data-set time range as well as the data expiration may be configured separately for each flow-source.
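
The weight-based best match used in the flow classification step can be illustrated with a minimal Python sketch. This is not the FTAS implementation: the report does not describe the condition syntax, so the `Condition` structure, the prefix-based matching, the field names `src_class` and `dst_class`, and the sample prefixes are all hypothetical. The sketch also uses a single classification layer instead of the 4 hierarchical layers mentioned above.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import List, Optional


@dataclass(frozen=True)
class Condition:
    network: str  # hypothetical condition type: address must fall into this prefix
    weight: int   # priority; the highest weight wins when conditions overlap
    label: str    # value written into the extra classification field


def classify(address: str, conditions: List[Condition]) -> Optional[str]:
    """Return the label of the highest-weight matching condition, or None."""
    addr = ip_address(address)
    matches = [c for c in conditions if addr in ip_network(c.network)]
    if not matches:
        return None
    return max(matches, key=lambda c: c.weight).label


# Hypothetical, overlapping conditions: the more specific one has more weight.
CONDITIONS = [
    Condition("10.0.0.0/8", weight=1, label="campus"),
    Condition("10.1.0.0/16", weight=5, label="campus/faculty-a"),
]


def classify_flow(flow: dict) -> dict:
    """Return a copy of the parsed flow extended with classification fields."""
    extended = dict(flow)
    extended["src_class"] = classify(flow["src_addr"], CONDITIONS)
    extended["dst_class"] = classify(flow["dst_addr"], CONDITIONS)
    return extended
```

With these sample conditions, an address in 10.1.0.0/16 matches both prefixes and receives the label of the condition with the higher weight, while an address outside 10.0.0.0/8 stays unclassified.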
[Figure]

Figure 4: FTAS - Primary flow-source receiver
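
The two sampling levels from the flow sampling step can be sketched as follows. The report states only that internal-sampling uses a constant step and that both sampling values enter the estimates; the assumptions here are that input-sampling is a 1-in-N rate, that the two factors are simply multiplied, and that flows carry `pkts` and `octets` counters. The function names are hypothetical.

```python
from typing import Iterable, Iterator


def internal_sample(flows: Iterable[dict], step: int) -> Iterator[dict]:
    """Keep every step-th flow (constant-step, flow-based sampling)."""
    if step < 1:
        raise ValueError("step must be >= 1")
    for index, flow in enumerate(flows):
        if index % step == 0:
            yield flow


def estimate(flow: dict, input_rate: int, internal_step: int) -> dict:
    """Scale the counters by both sampling factors.

    Assumption: the factors are multiplicative (1-in-input_rate sampling at
    the source, then 1-in-internal_step sampling in the collector).
    """
    factor = input_rate * internal_step
    return {
        **flow,
        "est_pkts": flow["pkts"] * factor,
        "est_octets": flow["octets"] * factor,
    }
```

For example, with an input-sampling rate of 100 and an internal step of 4, each kept flow would stand for roughly 400 times its recorded packet and octet counts under these assumptions.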

1.5.2   Filter and Accounting Receivers

These modules process incoming flows that were exported from primary flow-source receivers. They listen on a configured UDP port. The processing scenario is similar to that of the primary flow-source receiver, but without the classification and filtering steps, so we describe only the differences:

Datagram validity check
Valid IP addresses are given by the configured primary flow-source receivers and the filtering/accounting configuration. Flow format versions are limited to the internal ones designed for filter or accounting purposes.
Datagram replication
Datagram parsing
Flow sampling
Flow on-fly preprocessing
Primary data storage
Flows are stored in separate data sets assigned to the appropriate filter definitions (in the filter receiver) or to the appropriate accounting outputs (in the accounting receiver). The time granularity of the data sets is given by the configured time ranges (the flow end-time is used as the master value). The data-set time range as well as the data expiration may be configured separately for each filter definition.
[Figure]

Figure 5: FTAS - Filter and accounting receivers
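
The way a flow is assigned to a data set by its end-time and a configured time range can be sketched as below. The report does not say how data sets are named or aligned, so the alignment to the Unix epoch, the naming scheme, and the function names `data_set_start` and `data_set_name` are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def data_set_start(flow_end: datetime, time_range: timedelta) -> datetime:
    """Return the start of the time_range-wide slot that contains flow_end."""
    slots = (flow_end - EPOCH) // time_range  # whole slots since the epoch
    return EPOCH + slots * time_range


def data_set_name(owner: str, flow_end: datetime, time_range: timedelta) -> str:
    """Build a hypothetical data-set name from its owner and slot start."""
    start = data_set_start(flow_end, time_range)
    return f"{owner}_{start:%Y%m%d%H%M}"
```

Here `owner` stands for a flow-source, filter definition, or accounting output. A flow that ends at 13:47 with a one-hour time range would land in the data set that starts at 13:00.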

1.5.3   Post-processor

This module processes primary data collected and stored by any receiver. Its main purpose is to reduce the amount of data while keeping the most significant traffic information (case by case) for a long time. Post-processing is performed separately for each configured object. A configured object is one of the following:

The existence of primary data stored by the appropriate receiver is, of course, a necessary condition. Post-processing consists of three basic steps:

Request construction
This step prepares the commands for primary data retrieval. Data in the requested record structure (the configured flow field-set) are selected according to the configured condition, ordered in the configured direction, and always pre-aggregated.
Results processing
Results may be processed globally and/or per group (groups may be defined by a flow field-set). Step-by-step flow field masking decreases the data granularity so that as much as possible fits into the final results.
Post-processed data storage
Storage is done in a similar way to primary data storage. Separate data sets are prepared in advance for each processed object, with time ranges given by the configuration.

[Figure]

Figure 6: FTAS - Post-processor
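
One possible reading of the step-by-step field masking in results processing is sketched below: records are aggregated over a field-set, and fields are dropped one at a time until the result fits a row limit. The report does not define the masking procedure, so the row limit `max_rows`, the choice to mask the last field first, and the record fields used are assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def aggregate(records: List[dict], fields: Sequence[str]) -> Dict[tuple, int]:
    """Sum octets per distinct combination of the given fields."""
    totals: Counter = Counter()
    for rec in records:
        totals[tuple(rec[f] for f in fields)] += rec["octets"]
    return dict(totals)


def reduce_granularity(
    records: List[dict], fields: Sequence[str], max_rows: int
) -> Tuple[List[str], Dict[tuple, int]]:
    """Mask trailing fields one by one until the result fits max_rows."""
    remaining = list(fields)
    result = aggregate(records, remaining)
    while len(result) > max_rows and len(remaining) > 1:
        remaining.pop()  # mask the last (assumed least significant) field
        result = aggregate(records, remaining)
    return remaining, result
```

Ordering the field-set from most to least significant lets the sketch keep the most important dimension (for instance the source address) when detail has to be given up.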

1.6   Master Configuration

The master configuration is the only authoritative source of FTAS directives and parameters in an FTAS installation instance and is shared by all collector-hosts as well as the presentation layer. The site configuration of a multi-purpose collector consists only of the access parameters to the master configuration. This means that the FTAS master configuration can reside anywhere. Its content is periodically checked by the collector-hosts, and changes result in the appropriate collector processes being reloaded. The configuration data model consists of relatively independent blocks of configuration object classes. Child-parent relations between configuration object classes, as well as the optional relation for setting user access rights, are shown in the following diagram.

[Figure]

Figure 7:
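
The periodic check of the master configuration described in this section can be sketched as a small change detector. How FTAS actually stores, fetches, or compares the configuration is not stated in the report; the `ConfigWatcher` class, the digest comparison, and the `fetch` and `reload` callbacks are hypothetical.

```python
import hashlib
from typing import Callable, Optional


class ConfigWatcher:
    """Reload collector processes only when the fetched configuration changes."""

    def __init__(self, fetch: Callable[[], bytes], reload: Callable[[bytes], None]):
        self._fetch = fetch      # hypothetical: reads the master configuration
        self._reload = reload    # hypothetical: reloads the affected processes
        self._digest: Optional[str] = None

    def check(self) -> bool:
        """Fetch the configuration once; return True if a reload was triggered."""
        content = self._fetch()
        digest = hashlib.sha256(content).hexdigest()
        if digest == self._digest:
            return False
        self._digest = digest
        self._reload(content)
        return True
```

A collector-host would call `check()` at its configured interval; the first call always triggers a load because no digest has been recorded yet.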
