FPGA-based Packet Header Anonymization
CESNET
technical report number 16/2006
also available in PDF,
PostScript, and
XML formats.
Sven Ubik, Petr Žejdl, Jiří Halák
27.12.2006
1 Abstract
Packet header anonymization is very important for passive network monitoring, which observes characteristics of user traffic and for sharing captured tracefiles, which are valuable resource for networking research.
In this paper we present design and implementation of our FPGA-based packet header anonymization operating in real time and preventing sensitive information from getting to the monitoring PC and beyond.
2 Purpose of packet header anonymization
In passive network monitoring we process real user traffic. This approach has important advantages over active network monitoring which uses artificially injected test traffic. Particularly, test traffic can affect user traffic and it cannot be used to detect phenomena present only in user traffic, such as security attacks and dynamic characteristics. Captured packet traces are useful resource for networking research in workload characterization, traffic engineering and security application design.
When processing user traffic, we have to secure user privacy. For the purposes mentioned above, we can remove most of the packet payload, but we need to keep packet headers and certain patterns from the payload, such as security attacks indicators. But packet headers still include sensitive information, particularly in the form of source and destination IP addresses.
The purpose of packet header anonymization is to remove all sensitive information to protect user privacy, while keeping as much of the original traffic characteristics as possible.
3 Related work
Anonymization can be done on a reconstructed stream, which is then again split into packets [anon-stream]. This type of anonymization can do advanced processing of higher-layer protocols, such HTTP. However, this approach is compute-intensive and it looses many of the original traffic dynamics (packet spaces, bad CRCs, IP options, etc.).
A table-based approach to IP address anonymization was implemented in TCPdpriv. A cryptography-based scheme to provide consistency of anonymization across different traces using the same cryptography key was described in [anon-crypt]. However, software implementation is slow for gigabit speeds and it requires that sensitive information is temporarily stored in a monitoring PC.
Hardware implementation on a network processor using precomputed mapping trees [np-anon] can remove sensitive information in the monitoring hardware, but it still requires a lot of instructions and several memory accesses per packet. The measured throughput for 40-byte UDP datagrams was 65000 packets per second, which is approx. 50 Mb/s at a wire level.
4 Requirements
We defined the following set of requirements to be fulfilled by our implementation:
-
Anonymization must provide a wide range of easily configurable possibilities of removal of sensitive information.
-
Anonymization up to TCP and UDP headers must be implemented in a hardware monitoring adapter for high speed and to remove sensitive information before it gets to the host PC.
-
IP address mapping must be consistent among traces - the same real IP address in two traces must be mapped into the same anonymized IP address. Multicast addresses must be mapped into anonymized multicast addresses.
-
IP address mapping must be also prefix-preserving. Two real IP addresses that share a common prefix must be mapped to two anonymized IP addresses that also share a common prefix of the same length.
-
The architecture must allow follow-up software anonymization of higher-layer protocols.
5 Implementation
We decided to use a COMBO6 card developed by the Liberouter project with SCAMPI firmware as a platform to implement hardware anonymization. The COMBO6 card is a PCI card for PC equipped with Gigabit Ethernet ports and the Virtex II FPGA circuit to process packets arriving from the network. The SCAMPI firmware implements advanced packet processing including packet classification, sampling and statistics. An important advantage of COMBO6 cards and SCAMPI firmware for our implementation was that we had access to VHDL source code of the SCAMPI firmware.
We added hardware anonymization to the SCAMPI firmware as part of the LOBSTER project. The goal of LOBSTER is to enhance the SCAMPI architecture and to deploy it for passive monitoring in a European scale. In the following section we describe the architecture and properties of our implementation.
We divided anonymization tasks into two layers (or tiers) first-tier anonymization implemented in hardware of the monitoring adapter and second-tier anonymization implemented in software.
First-tier hardware anonymization works at packet level. It can do various types of anonymization of any fields in IP, ICMP, TCP and UDP headers. Packets with anonymized headers are retrieved from the monitoring adapter to the PC. Second-tier software anonymization works at flow level. Individual flows are reconstructed from packets. This allows anonymization of information in higher-layer protocols, such as replacing HTTP header fields or URLs in payload. In this report we concentrate on the first-tier hardware anonymization.
5.1 Position of anonymization in packet processing
Anonymization is a kind of data transformation. Therefore, we designed a universal Transformation Unit (TU) for general packet transformations, which can be used to anonymize selected information inside packets. The TU unit can be used as a standalone IP core when we provide required data to its input signals.
The SCAMPI firmware consists of several units. The position of the TU unit in the SCAMPI firmware is illustrated in Figure, which includes only units important to our discussion.
Input packets go to the Header Field Extractor (HFE), which parses packet headers, stores them in internal data structures and prepares packet parameters, which are pointers to important sections of the packet, such as positions of selected header fields. A table of these packet parameters is added in front of each packet. HFE operation is programmable and we modified it such that the packet parameters include pointers to all important sections of the packet that we want to anonymize.
Parsed headers come to the Lookup Processor (LUP), which sorts packets into classes based on header fields and assigns each packet a 32-bit control word, which consists of three parts:
-
SAMASK (16 bit) - assigns a packet to be passed to any combination of 16 sampling units, each bit corresponds to one sampling unit
-
STATID (8 bit) - assigns a packet to be processes by one of 256 statistical units identified by this field
-
PCMASK (8 bit) - PCK mask, defines a set of checked patterns in the payload checker
Due to limitations in hardware design we could not use a wider control word. Therefore, we reused an 8-bit STATID part of the control word indicating the STU number to also indicate a class of packets for anonymization purposes. Different anonymization can be applied to packets in different classes. Packets which have been assigned some non-zero STU number are passed to the TU unit. Packets which have been assigned zero STU number do not go to the TU unit and are not transformed.
5.2 Transformation Unit (TU)
The TU unit is designed as a small processor interpreting programs in a simple instruction set. These programs describe what anonymization should be performed for each header field for packets in each class.
The TU unit can do the following types of header field transformations:
-
set to a specified constant
-
set to a pseudorandom number
-
xor with a specified constant
-
table-based hashing
-
prefix-preserving mapping
-
any combination of the above (e.g., first half of IP address can be set to a constant and second half randomized)
Each of the above types of anonymization can be applied to any 16-bit header field in the packet. Two flags in the TU instruction (see below) can be used to mask out left or right half of the 16-bit header field and request that the anonymization function applies to an 8-bit header field only. Each header field can be identified by an offset (in 16-bit words) from one of the following packet parameters:
-
PP_ETHER - Ethernet header
-
PP_ARP - ARP header
-
PP_ICMP - ICMP header
-
PP_ICMPv6 - ICMPv6 header
-
PP_IPv4 - IPv4 header
-
PP_IPv6 - IPv6 header
-
PP_UDP - UDP header
-
PP_TCP - TCP header
For example, you can identify source IPv4 address by using PP_IPv4 packet parameter and offsets 6 and 7.
5.2.1 TU instructions
The anonymization is specified as a sequence of instructions. Each instruction is 32 bits wide and it is structured in the following format:
| | | | | |1 0|9 8 7 6 5 4 3 2 1 0|9 8 7 6 5 4 3 2 1 0|9 8 7 6 5 4 3 2 1 0| |J| IC | | HL| PP | OFFS | IMM16 |
From left to right, the instruction includes the following fields:
-
1-bit flag to signal a jump instruction (0) or a normal instruction (1)
-
3-bit instruction code:
-
000 set to constant
-
001 randomize
-
010 xor
-
011 hash
-
100 prefix-preserving mapping
-
other reset to zero
-
-
2-bit flags signaling whether higher or lower byte of the 16-bit header field should be unaffected by the instruction (allowing to anonymize 8-bit fields)
-
4-bit code of packet parameter (see the list of packet parameters above)
-
4-bit offset from the packet parameter (in 16-bit words)
-
16-bit immediate constant or jump address (for hashing, bits 0-2 and bits 8-10 select one of eight hashing tables to be applied to lower and higher byte of a 16-bit packet field, respectively)
Programming of the TU unit is started by writing 0 to address 0x130010, which disables the TU unit operation. The TU program is then stored beginning at address 0x131000 and its end is marked by 0 value stored after the last instruction. The TU unit operation is then started again by writing 1 to address 0x130010.
If we want to use table-based hashing, we also need to upload content of hashing tables. There are eight tables, each including 256 rows of 16-bit patterns. The tables are beginning at address 0x132000.
An example of the TU program is shown in section.
5.2.2 TU operation
The TU unit structure is shown in Figure. Anonymization programs are stored in the instruction memory (1). A packet class assigned by LUP is used to select a corresponding anonymization program (2). Each instruction is then decoded (3), the required packet parameter (offset to some key header field) is retrieved (4), internally added with direct offset specified in the instruction and compared with the current packet position (5). The packet itself arrives through the input register in 16-bit chunks. These chunks are associated with the transformation description (6) and passed to the data pipeline. Different anonymization functions (7) require different number of clock cycles to complete. Therefore, the input data is passed to each function from the different stage of the pipeline such that all results arrive at the right time to the multiplexor (8), which selects the function that should be applied to the current header field.
5.2.3 C language API
A high-level programming API as a set of C language functions was developed for easy configuration of anonymization. Users do not have to configure the TU unit directly. The following functions are provided:
-
scampi_reset_classification(int mode)- initializes data structures, onlyDIRECTmode is currently implemented -
scampi_add_transformation(int filter_id, tu_transformation_type_t transformation_type, tu_proto_type_t protocol, u_int8_t offset, u_int16_t parameters)- adds anonymization of one header field, can be called multiple times -
scampi_set_transformation_hash_table(int filter_id, int table_id, tu_hash_table_t *hash_table)- uploads content of a hash table -
scampi_set_transformation_prefp_table(int filter_id, u_int32_t *table, u_int32_t count)- uploads content of a prefix-preserving table -
scampi_print_filters()- prints all requested anonymizations -
scampi_set_filters()- downloads all requested anonymizations in the card
Predefined constants are available to refer to common protocols and
their header fields for the scampi_add_transformation()
function.
5.3 Prefix-preserving IP address mapping
The Prefix-Preserving Mapping Unit was designed for IP addresses, but it can operate on any 32-bit chunks of data. The advantage of prefix-preserving mapping is that is retains the full extent of the original traffic dynamics in IP address space. On the other hand, this property also implies vulnerability to attacks that can reveal original IP addresses.
The mapping between the original and anonymized IP address is based on swapping subtrees of a tree representation of IP addresses [anon-crypt]. The method is illustrated in Figure for 3-bit numbers. The set of all 3-bit numbers can be represented by leaf nodes or by paths going from the root node to the leaf nodes in a 3-level binary tree. When we take bits of a binary number from left to right, then the path from the root node goes left for 0 bit and goes right for 1 bit.
Now we can mark some nodes in our example by black color. When we swap bits in original IP addresses corresponding to the marked nodes, we get anonymized IP addresses and the mapping between original and anonymized IP addresses is prefix-preserving. We can mark any nodes in a tree to get prefix-preserving mapping, but certainly some markings produce low-quality anonymization, such as when very few nodes are marked or not marked. A common method is to mark nodes pseudorandomly or as a result of cryptographic function applied to a key.
The flip or non flip action can be implemented in hardware by XOR operation of the original bit with 1 or 0 bit, respectively. The mapping and XOR patterns for our example are shown in Table. Note that the property of this method is that the first bit of an IP address is always either swapped or not swapped (it does not depend on the IP address) and that the last bit of an IP address is not used to select mapping.
| Original address | XOR pattern | Anonymized address |
|---|---|---|
| 000 | 110 | 110 |
| 001 | 110 | 111 |
| 010 | 110 | 100 |
| 011 | 110 | 101 |
| 100 | 100 | 000 |
| 101 | 100 | 001 |
| 110 | 101 | 011 |
| 111 | 101 | 010 |
Table 1: Prefix-preserving mapping by XOR operation
5.3.1 Memory organisation
To store marks of the whole tree, we need at least one bit for each non-leaf node. That is for 32-bit IP addresses we need at least 223 - 1 bits = approx. 512 MB of memory. For practical implementation it is more convenient to store each path from the root to a leaf as a separate 32-bit word which can be directly XORed with the original IP address to get the anonymized address. In this representation we would need 16 GB of memory. There is only 512 MB of memory on the COMBO card and we need some memory to store packets.
The volume of required memory can be reduced by storing only a part of the mapping tree and replicating it. This method was first proposed in [np-anon]. We can divide the 32-bit tree into four levels of 8-bit subtrees. There is one 8-bit subtree in the first level, 256 of 8-bit subtrees in the second level, 256*256 of 8-bit subtrees in the third level and 256*256*256 of 8-bit subtrees in the fourth level. We can anonymize the first byte (the most significant) of an IP address using the first level of 8-bit subtrees, the second byte by the second level of 8-bit subtrees and so forth.
In real packet traces we normally do not have all 232 IP addresses. The number of different IP addresses present is much smaller. Therefore, if we store only a subset of 8-bit subtrees and replicate them to the positions of remaining subtrees, chances are that we do not use many duplicates. In an extreme case, when we store only one 8-bit subtree in each of the four levels, the anonymization of each byte in an IP address is done independently of the values of previous bytes, which would result in weak anonymization. Therefore we should store more subtrees.
One 8-bit subtree includes 256 paths from the root to its leaves. But marking is the same for each two adjacent paths that differ only in the last step to the leaves. Therefore, one 8-bit subtree can be represented by 128 of 8-bit words.
In order to read a mapping path for the whole IP address as fast as possible and to allow configuration of the number of stored 8-bit subtrees, we store the mapping tree in BRAMs inside FPGA as illustrated in Figure. We use two dual-port BRAMs, which can be used to read mapping paths in all four levels in one clock cycle. The data width of each BRAM is 8 bits, the address width is configurable and it is at least 11 bits.
Figure a) shows the case when both BRAMs have address width of 11-bits. Each BRAM can store 16 8-bit subtrees. Only seven bits from each byte are used to direct anonymization (see the description of the mapping tree above). Therefore, there are some spare bits in the BRAM s address space. These bits are connected to some of the bits of the previous byte in the IP address. In this way anonymization of one byte is influenced at least by a part of the value of the previous byte. Spare bits of the first BRAM are connected to fixed values. Each additional available address bit doubles the number of stored 8-bit subtrees.
Figure b) shows the case when the address space of the BRAM for the two lower levels is larger than for the two upper levels. This is desirable because the lower levels need more 8-bit subtrees if we do not want to increase replication. More bits from the second and third bytes are used to influence anonymization of next bytes. In practical implementation we use at least 2 BRAMs for the upper levels and 6 BRAMs for the lower layers.
A consequence of using separate BRAMs is that the part of the first BRAM is not used and subtrees for each level can be replicated from only one BRAM. An alternative solution shown in Figure c) is to use one larger BRAM. We can use the whole address space of this BRAM to select subtrees for all four levels, but we need to read them in four clock cycles one after another. As speed was our primary goal we use two separate dual-port BRAMs at the cost of replicating subtrees only within each two levels (option b).
The mapping tree must be loaded into BRAMs before enabling the TU unit. We connected BRAMs to the local bus in the SCAMPI design. The data width of the local bus is 16 bits. Lower 8 bits are sent to the first BRAM and the upper 8 bits are sent to the second BRAM.
A pseudo-random mapping tree can be generated by the
build_tree.pl script (no input is required), for example:
$ build_tree.pl -g tree
The script generates the tree.hex file, which can be used
to load BRAMs and the tree.sim file, which can be used to
simulate IP address anonymization. For example:
$ build_tree.pl -s tree 10 20 30 40 10.20.30.40 (0x281E140A) -> 63.198.30.128
The mapping tree is stored in 32-bit words in the tree.hex
file. For example, in order to load the following paths (hex numbers):
First BRAM: 11, 12, 13, 14 Second BRAM: 21, 22, 23, 24, 25, 26,\\ 27, 28
the content of the tree.hex file will be as follows:
22122111 24142313 26002500 28002700
6 Examples of use
We can first use the scampi_load.sh script to load
firmware into the FPGA for the specified card type:
$ scampi_load.sh tu mtx xcv2000
We can then use the scampi_init utility to initialize all
firmware units. In the following example we set header filtering to
forward all packets to the sampling unit SAU0 (which is by default
configured to pass every packet in the deterministic mode):
$ scampi_init -a 0 src net 0.0.0.0/0
Finally, we can use the init-tu script to configure the TU
unit to do the required anonymization (which is specified inside the
init-tu script) and to load the prefix-preserving mapping
tree from the tree.hex file:
$ tu-init tree
Suppose that we want to do the following anonymization:
-
Hash the source IP address
-
Keep the first part of the destination IP address
-
Randomize the second part of the destination IP address
-
Set the source port to constant 9876
-
XOR the destination port with constant 0x1234
The required anonymization can be configured by the
init-tu script consisting of the following set of TU
instructions:
csbus 130010 0 # Disable TU, allow programming
csbus 131000 B0560000 # Hash source IP (first part)
csbus 131004 B0570000 # Hash source IP (second part)
# Keep destination IP (first part)
csbus 131008 90590000 # Randomize destination IP (second part)
csbus 13100C 80709426 # Set source port to 9876 (0x2694)
csbus 131010 A0711234 # XOR destination port with 0x1234
csbus 131014 0 # End of program
csbus 130014 12345678 # Init LFSR (random number generator)
csbus 132000 40302010 # Load hash table
csbus 132004 80706050 # dtto
csbus 130010 1 # Enable TU
The csbus utility reads and writes values from and to
specified locations in memory space of COMBO card
firmware. Alternatively, we can use the C language API to program the
TU unit:
scampi_reset_classification(DIRECT); scampi_add_transformation(0, TU_TRANSFORMATION_HASH, TU_PROTO_IP, 6, 0); scampi_add_transformation(0, TU_TRANSFORMATION_HASH, TU_PROTO_IP, 7, 0); scampi_add_transformation(0, TU_TRANSFORMATION_RANDOM, TU_PROTO_IP, 9, 0); scampi_add_transformation(0, TU_TRANSFORMATION_MAP, TU_PROTO_UDP, 0, 0x2694); scampi_add_transformation(0, TU_TRANSFORMATION_XOR, TU_PROTO_UDP, 1, 0x1234); scampi_set_filters();
We used the setup shown in Figure to test the configuration. PC1 sends UDP packets to PC2. These packets are captured by PC3. Before we enabled anonymization, the packet capture looked as follows:
01:00:00.000000 IP 10.0.1.2.2000 > 10.0.1.1.2000: UDP, length: 1000 01:00:00.000000 IP 10.0.1.2.2000 > 10.0.1.1.2000: UDP, length: 1000 01:00:00.000000 IP 10.0.1.2.2000 > 10.0.1.1.2000: UDP, length: 1000 01:00:00.000000 IP 10.0.1.2.2000 > 10.0.1.1.2000: UDP, length: 1000 01:00:00.000000 IP 10.0.1.2.2000 > 10.0.1.1.2000: UDP, length: 1000
After we enabled anonymization, the capture of the same packets looked as follows:
01:00:00.000000 IP 0.32.48.96.9876 > 10.0.68.143.13250: UDP, length: 1000 01:00:00.000000 IP 0.32.48.96.9876 > 10.0.200.43.13250: UDP, length: 1000 01:00:00.000000 IP 0.32.48.96.9876 > 10.0.42.77.13250: UDP, length: 1000 01:00:00.000000 IP 0.32.48.96.9876 > 10.0.131.131.13250: UDP, length: 1000 01:00:00.000000 IP 0.32.48.96.9876 > 10.0.95.134.13250: UDP, length: 1000
7 Performance and resources
Packets are passed through the TU unit at the rate of 16 bits per clock cycle. We need additional 20 clock cycles per packet (9 cycles to retrieve packet parameters, 9 cycles to fill the data pipeline and 1 cycle to initialise the instruction pipeline). Throughput depends on the packets size and it can be computed by the following formula:
throughput = (length + gap) * 16 * clockrate / (length + gap + 2*delay)
where:
-
throughput in bits/s measured at the wire level
-
length of packet in bytes including Ethernet header and CRC
-
gap includes the interframe gap and the preambule and it is equivalent to 20 bytes
-
delay corresponds to the 20 clock cycles of overhead per packet; as 2 bytes are normally processed in one clock cycle, we multiple it by two and use the result of 40 bytes to represent the delay in the formula as the equivalent number of bytes
The TU unit on its own was synthesized including all constraints at 100 MHz clock rate. Computed throughput depending on the packet size for this clock rate is shown in Figure. Throughput for the worst case of 64-byte packets is 1.08 Gb/s, which is sufficient to process Gigabit Ethernet traffic at line rate.
We tested the TU unit with the COMBO card consisting of the COMBO6 mainboard and the COMBO4MTX interface card. We used SCAMPI phase 1 design version 1_02_07 as the basis for our modifications. We changed HFE program to produce more packet parameters and we integrated our TU unit. However, the SCAMPI design can run at 50 MHz only and the TU unit integrated in the SCAMPI design must also run at 60 MHz. Some units also add more per-packet overhead. Therefore, throughput of the whole modified SCAMPI design, which we measured by a hardware packet generator was lower than the computed throughput of the TU unit alone and is also indicated in Figure.
The consumption of FPGA resources for the TU unit alone and for the whole SCAMPI design with the integrated TU unit is shown in Table.
| TU unit | Modified SCAMPI design | |||
|---|---|---|---|---|
| Used | Percentage | Used | Percentage | |
| Slices | 956 | 7% | 6238 | 44% |
| Flip-flops | 1116 | 4% | 6496 | 23% |
| 4-input LUTs | 987 | 4% | 8001 | 28% |
| BRAMs | 13 | 14% | 46 | 48% |
Table 2: Consumption of FPGA resources
8 Conclusion
We implemented easily configurable FPGA-based packet header anonymization that removes sensitive information from packet headers in real time on the monitoring adapter before it can get to the host PC. Anonymization functions can be different for various classes of packets and can include prefix-preserving IP address mapping, which preserves original dynamics in IP address space. The implemented TU unit to perform anonymization can run at full Gigabit Ethernet speed, although the whole modified SCAMPI design currently runs at lower speed. We plan to integrate our TU unit in the newer version of design for the COMBO card, which will permit to operate at full Gigabit Ethernet speed.
References
| [anon-stream] | Pang R., Paxson V.: A High-Level Programming Environment for Packet Trace Anonymization and Transformation, SIGCOMM 2003, August 25-29, 2003, Karlsruhe, Germany. |
| [anon-crypt] | Fan J., Xu J., Ammar M. H.: Crypto-PAn: Cryptography-based Prefix-preserving Anonymization. Software, available online. |
| [np-anon] | Ramaswamy R., Weng N., Wolf T. An IXA-Based Network Measurement Node. Proc. of Intel IXA University Summit, 2004. Available online. |