Hardware Packet Filter with IPv6 Support

CESNET technical report 26/2010

Tomáš Dedek, Jan Kořenek

Received 2. 12. 2010

Other formats: PDF, EPUB

Abstract

This technical report describes an extension of the classification algorithm in the NIFIC packet filtering engine that allows for filtering IPv6 traffic. The classification algorithm implemented in the NIFIC device uses the longest prefix match operation on the source and destination IP addresses. The existing LPM algorithms and architectures are able to achieve high throughput for the IPv4 protocol but they are not well suited for the IPv6 protocol because of their time complexity or high memory requirements. Therefore, we proposed a new algorithm for the longest prefix match based on a fast look-up operation using a hash function and the Tree Bitmap algorithm. The proposed algorithm was mapped to the hardware architecture and implemented as a pipeline of processing elements. The resulting hardware architecture is able to achieve the wire-speed throughput for 100 Gbps network. Moreover, the results show that our implementation of Hash LPM is able to achieve constant time complexity and use much less hardware resources than TCAM or the Tree Bitmap algorithm alone.

Keywords: Packet Filter, Hash, Tree Bitmap, FPGA, NIFIC

1  Introduction

The growth of computer networks provides more opportunities for suspicious activities. The amount of worms, viruses and network attacks is steadily increasing. Moreover, with the increasing speed of data networks, network security devices need more computation power to be able to process traffic on current multi-gigabit network links. Standard software tools cannot be used even for currently common 10 Gbps networks.

One of the common and widely used network security devices is a packet filter which performs filtration of incoming traffic according to a configuration given by a set of rules. High-performance hardware-accelerated packet filter NIFIC [2] was developed by the Liberouter project of CESNET. The packet filter uses COMBOv2 [4] card to achieve wire-speed throughput for two 10 Gbps lines and supports up to 1000 filtration rules. The core of the device is a classification algorithm that is based on problem decomposition [FPGA09]. The NIFIC packet filter has been deployed as a hardware firewall for the Liberouter intranet network.

The NIFIC packet filter provides high throughput and supports a large number of filtering rules. However, so far it has been limited only to the IPv4 protocol. This technical report describes an extension of the NIFIC packet filter that enables support for the IPv6 protocol. The implementation of IPv6 support requires to extend the longest prefix match (LPM) part of the classification algorithm in order to be able to perform IPv6 prefix look-up. We analyzed possibilities of existing algorithms and architectures for LPM with regard to the speed of IPv6 address processing and the amount of resources consumed in the FPGA chip on the COMBOv2 acceleration card. Finally, we decided to develop a new algorithm which is based on the Tree Bitmap algorithm and uses hash functions to increase processing speed. The proposed algorithm is well scalable and is able to balance between time and space complexity.

The technical report is divided into five sections. The introduction is followed by an overview of the classification algorithm used in the NIFIC packet filter. Section 3 describes existing algorithms for LPM look-up and Section 4 introduces the new Hash LPM algorithm. Section 5 contains experimental results showing the performance of the algorithm and Section 6 concludes the report.

2  The Classification Algorithm

The whole algorithm of packet classification is described in detail in [4] and [2], we therefore only give a short summary here. As mentioned in the introduction, the classification algorithm is based on the problem decomposition, where the classification of a packet is divided into several steps. The crucial parts of the classification are the longest prefix match algorithm, which finds the most specific prefix value for IP addresses, and the perfect hash [5] function with purposeful collisions, which is used for the rule number search.

The classification algorithm follows:

  1. Header extraction. Relevant header fields are extracted from the packet header.
  2. Longest prefix match (LPM). The extracted IP addresses are in parallel processed by the LPM algorithm to find the best matching value from a given set of prefixes. Then each prefix is mapped to a unique number, the bit width of which is guaranteed to be smaller than the original one. We take advantage of this bit-width reduction feature also for the rest of header fields, therefore all header fields are processed by some kind of the LPM algorithm. We use different approaches like the Tree Bitmap [Tree], TCAMs or simple table for different header fields. The result of the LPM algorithm run on all header fields is an output word - the concatenation of all single LPM results.
  3. Perfect hash. The result of the LPM algorithm serves as an input to a specially constructed perfect hash function which uses the random acyclic graph search. This function uses purposeful collisions for mapping all pseudorules (combination of LPM outputs, which should match to single rule) to one rule number.
  4. Final comparison. The result of perfect hash may be a false positive (a packet always matches to some rule), therefore the final comparison of header fields with the rule is needed. If the comparison is not successful the packet did not match any rule.

The algorithm was implemented using the NetCOPE platform, a simplified hardware implementation of the classification algorithm is shown in Figure 1.

[Image]

Figure 1. Implementation of the classification algorithm

An incoming packet is parsed by the Header Field Extractor and the header fields are sent to the LPM stage to be mapped to a unique number. Different implementations of LPM are used:

The results of LPM engines (prefix numbers) are concatenated into a single word and sent to the perfect hash block. In this block, the perfect hash algorithm is performed and the rule number is given as the output.

The rule number is sent to the final comparison block where this number is used as the address to the rule table. The rule saved in the rule table contains all header fields from the original rule and an action which should be performed with the matching packet (pass, drop). The packet header fields parsed by HFE are then compared to header fields saved in the rule. If the match is positive, the valid action is given as the output.

The most difficult part of extending the classification with IPv6 support is the implementation of LPM for IPv6 addresses. As was mentioned in the previous section, the IPv4 prefix number look-up is performed by Tree Bitmap engines and MAC addresses are processed by TCAMs. Both approaches could be also used for the IPv6 prefix look-up.

Using TCAMs for prefix look-up is a straightforward and widely used approach that can be easily implemented. The TCAM LPM engine works as shown in Figure 2. The TCAM memory could return more than one positive match but because the data are ordered by the prefix length, the most specific prefix can be found and the unique number of this prefix is returned as the result.

[Image]

Figure 2. Implementation of the IPv6 prefix look-up by TCAM engine.

As there is no external TCAM on the COMBOv2 card the TCAM is implemented inside the FPGA. The implementation has a significant drawback: it consumes large number of FPGA logic resources (LUTs - Lookup Tables) which grows linearly with the number of supported IPv6 prefixes. Therefore the TCAM in the FPGA is usually small and can support only limited number of IPv6 prefixes.

The Tree Bitmap algorithm is described in detail in [3]. Our implementation of the algorithm is based on the prefix tree, which is divided to sub-tree nodes where every node stores information for four bits of input data. In fact, the whole structure is a tree of trees. In each node, the following information is stored:

The Tree Bitmap engine goes through each sub-tree node until there is no successive node. When a valid prefix is found its number is saved. If more valid prefixes are found during the traversal, the old one is always re-written by the one that has just been found, therefore the number corresponding to the most specific prefix is returned as the result.

Figure 3 illustrates the principle of the algorithm.

[Image]

Figure 3. Implementation of the IPv6 prefix look-up by the Tree Bitmap engine.

The implementation of the Tree Bitmap engine has two significant disadvantages. Firstly, it needs one clock cycle to process a single node. As the IPv6 address is 128 bytes long, is takes 32 clock cycles to process the whole address, therefore the only way for achieving the full throughput is parallel usage of several Tree Bitmap engines. Secondly, the representation of the whole address needs up to 31 different nodes (4 bits of input data per node) in the worst case scenario, which results in high memory requirements.

The matching speed of the Tree Bitmap algorithm could be increased by processing more bits within one clock cycle, but only at the cost of even higher memory requirements. The size of one Tree Bitmap node grows exponentially with the amount of bits processed within one step.

4  Hash-based LPM Algorithm

The TCAM implementation is not a satisfactory solution for the IPv6 prefix look-up due to limited number of IPv6 prefixes that may be supported. The Tree Bitmap implementation is not suitable either because of high memory requirements for representing a single IPv6 prefix, which also limits the number of supported IPv6 prefixes.

Other algorithmic approaches like Patricia Trie, Luela Trie, Hexa and other currently used longest prefix match algorithms are also not suitable for this task because of the same reason as the Tree Bitmap: they need many successive steps to process a single IPv6 address.

Therefore, we designed a new algorithm with a good balance between the processing speed and memory requirements. Our goal was to achieve the wire-speed throughput for two 10Gbps ports and reduce as much as possible the size of the memory which is used to store IPv6 prefixes.

[Image]

Figure 4. Stages of the Hash LPM algorithm.

The proposed algorithm consists of three stages which are shown in Figure 4. At first, multiple hash functions are used to match prefixes of different lengths in parallel (Prefix hash of length 1 to k). For every length PH1, PH2, ... PHk, one prefix can be found. Because of the use of hash functions, the prefix is matched in constant time. In the second step, the longest prefix is selected and sent to subsequent matching using the Tree Bitmap algorithm. The Tree Bitmap algorithm starts the processing at the address given by the hash value calculated in the first stage and compares only bits which were not processed by the hash function.

As hash functions are used to match a part of the IP address, the traversal through the tree of IPv6 prefixes is significantly faster than the Tree Bitmap algorithm alone. Using a hash function, the algorithm can jump directly to any level of the tree within a single step, so sequential processing by the Tree Bitmap algorithm is significantly reduced and higher throughput is achieved. Moreover, throughput is simply scalable by the number of parallel hash functions, because the maximal amount of the Tree Bitmap steps is given by the distribution of hash lengths in the tree of prefixes.

A pass through the prefix tree for parallel hash functions h1, h2, h3 and h4 is shown in Figure 5. The lengths of prefixes for h1, h2, h3 and h4 are labeled PHL_1, PHL_2, PHL_3 and PHL_4. We can see in the figure that hash functions can go directly to the level of the tree marked with the dashed line. Then the child nodes can be reached by the Tree Bitmap algorithm. For the IP address in Figure 5, hash function h2 matches the longest prefix and goes directly to level PHL_2. After that only two Tree Bitmap iterations are used in order to find the longest prefix in the tree.

[Image]

Figure 5. IP address look-up in the tree of IPv6 prefixes.

The proposed algorithm was mapped to the hardware architecture in the Figure 6 and supports longest prefix match for both IPv4 and IPv6 addresses. The processing of IP addresses starts in the Hash Prefix Units (HPU), which are used for fast look-up of IP prefix in the memory. Every HPU works with a different prefix length. As it can be seen in Figure 6, the number of HPU units is different for IPv4 and IPv6 addresses because of different depths of corresponding prefix trees.

[Image]

Figure 6. The hardware architecture for the Hash LPM algorithm.

The result of the HPU is the address of the Tree Bitmap node which is the starting point for procession in Tree Bitmap Units (TBU). The address is sent to the TBU through multiplexers. One multiplexer is used to switch between IPv4 and IPv6 parts of the architecture and the second multiplexer is used to select the longest prefixes matched by HPUs. The lengths of prefixes in HPU are organized in the same order as multiple xor inputs. Therefore the control logic of multiple xors can be simple and do not increase the latency of the system.

TBU units use an address from the input interface to read the Tree Bitmap node from the memory and perform one iteration of the Tree Bitmap algorithm. Both TBU units are pipelined and share one dual-port memory. The result of every TBU is a matched prefix or address to the next Tree Bitmap node. As every node describes four bits of IP address and two TBU units are available in the architecture, up to eight bits can be matched by the Tree Bitmap algorithm just after the hash match in the HPU.

The proposed architecture is fully pipelined and is able to return one result of longest prefix match in every clock cycle. Moreover, the implementation of the architecture is able to operate at 156 MHz in the Virtex-5 LX155T FPGA. This means that more than 150 million prefix matches can be performed within one second and the architecture can be used for wire-speed processing of a 100 Gbps network. For the NIFIC device, the performance requirements are much lower, because NIFIC has only two 10Gbps input ports. Therefore source and destination IP addresses can be matched by the same hardware unit without performance loss.

5  Results

To evaluate the proposed algorithm and architecture, we made speed and area comparison of the Hash LPM algorithm with already implemented LPM engines.

We compared the speed of the longest prefix match look-up using TCAM LPM engine, single computation unit of the Tree Bitmap algorithm and the Hash LPM algorithm.

The results of the comparison are in Table 1. The TCAM engine needs only one clock cycle per single look-up, therefore the IPv6 address could be processed in every clock and the throughput for 100 MHz (almost maximum frequency of TCAM and Tree Bitmap engines) is 67 Gbps.

On the other hand, the Tree Bitmap algorithm needs 32 clock cycles for IPv6 processing, therefore the throughput of a single computation Tree Bitmap unit is only 2.1 Gbps for the implementation running at 100 MHz.

The Hash LPM algorithm is implemented as a pipeline of computation elements. The pipelining makes it possible to reach high frequency and also to give the result of a prefix look-up every clock cycle. The implementation running at 100 MHz therefore has the throughput of 67 Gbps. However, the throughput of 100Gbps can be achieved for implementation running at a frequency higher than 156 MHz.

IPv4

IPv6

TCAM

f [MHz]

100

100

Cycles

1

1

T [Gbps]

67

67

Tree bitmap

f [MHz]

100

100

Cycles

8

32

T [Gbps]

8.4

2.1

Hash LPM

f [MHz]

100

100

Cycles

1

1

T [Gbps]

67

67

Table 1. Comparison of processing speed for TCAM, Tree Bitmap and Hash LPM

Apart from the speed comparison we also made a comparison of consumed resources for all three LPM engines, where the look-up engines were implemented to process two 10 Gbps lines at wire speed. The Tree Bitmap engine is therefore composed of 10 Tree Bitmap computation units to achieve 20 Gbps throughput. We synthesized the implementations for different numbers of minimum supported IPv6 prefixes to Virtex-5 LXT155T technology (97280 LUTs, 97280 flip-flops, 212 BlockRAMs). The results are summarized in Table 2.

TCAM

Tree Bitmap

Hash LPM

IPv6 addr.

LUT

FF

BRAM

LUT

FF

BRAM

LUT

FF

BRAM

64

3207

402

0

4669

4158

20

1820

1697

8

128

5883

467

0

4839

4168

35

1820

1697

8

256

11225

596

0

4859

4178

70

2740

2407

10

512

21847

853

0

4889

4178

130

3632

3117

13

1024

43310

1366

0

4859

4178

255

3632

3117

13

2048

86277

2391

0

4859

4178

510

4568

3827

15

4096

--

--

--

--

--

--

6296

5247

20

8172

--

--

--

--

--

--

12620

10217

38

Table 2. The amount of resources for different numbers of minimum supported IPv6 prefixes

The results show that TCAM and Tree Bitmap engines could be used only for maximum of hundreds of IPv6 prefixes, for a larger number both engines consume large amount of resources – LUTs in the case of TCAM and BRAMs in the case of the Tree Bitmap engine.

The Hash LPM algorithm is suitable even for thousands of IPv6 prefixes. It consumes 13 % of LUTS, 11 % of flip-flops and 18 % of BRAMs for 8172 supported IPv6 prefixes.

6  Implementation of IPv6 Filtering in NIFIC

In the first phase, we decided to re-use already implemented TCAM look-up engines to make the implementation of the IPv6 classification as fast as possible. As was shown in the previous section, the implementation of the IPv6 prefix look-up by TCAMs consumes large amount of FPGA resources, therefore we made the following two changes. Firstly, the classification by MAC addresses is omitted to save resources. Secondly, only the higher 64b of IPv6 address could be used for classification, so the TCAM memory size is halved. This two restrictions make it possible to support up to 128 unique 64b long IPv6 prefixes for classification. This first phase implementation is already integrated to NIFIC packet filter and will be released as a stable package at the end of 2010.

However, the support of only 128 64b IPv6 prefixes is not suitable for a larger IPv6 rule set. Therefore in the second phase the Hash LPM engine was integrated to the NIFIC packet filter architecture. The software support for Hash LPM engine is in the phase of development.

7  Conclusion

This technical report describes an extension of the NIFIC packet filter with the IPv6 protocol support. Since existing LPM algorithms are not suitable for IPv6 classification, we proposed a new Hash LPM algorithm which is based on fast look-up by hash function and several sequential Tree Bitmap steps. The proposed algorithm addresses time-space tradeoff and is able to balance among the amount of hardware resources, the size of the memory and the processing speed.

For the proposed Hash LPM algorithm, pipelined hardware architecture was designed and implemented in Virtex-5 LXT155T FPGA which is present on the COMBOv2 acceleration card. The implementation allows for wire-speed throughput on 100 Gbps links and supports thousands of IPv6 addresses. Moreover, as can be seen in experimental results, the proposed architecture consumes significantly less hardware resources in comparison to the TCAM or Tree Bitmap implementations.

On the basis of these findings, the NIFIC packet filter extension with IPv6 protocol was divided into two phases. In the first phase, the LPM for IPv6 was implemented by TCAM and only higher 64b of IPv6 could be used for classification. In the second phase, the TCAM engine was replaced by the new Hash LPM algorithm

The first phase is available as a package in the testing repository, the second phase is still under development because the configuration software is not finished yet.

8  Acknowledgments

This work is supported by the Research Intent of the Czech Ministry of Education MSM6383917201.

References

[1] NOVOTNÝ, J.; ŽÁDNÍK, M. COMBOv2 – Hardware Accelerators for High-Speed Networking. XILINX Academic Forum, San Jose, USA, 2008. Available online.
[2] PUŠ, V.; DEDEK, T. Designing a Hardware-Accelerated Firewall with Two 10 Gbps Ports. Technical Report 8/2008, Praha: CESNET, 2008.
[3] EATHERON, W.; VARGHESE, G.; DITTIA, Z. Tree bitmap: hardware/software IP lookups with incremental updates. SIGCOMM Computer Communication Review, vol. 34, no. 2, p. 97-122, 2004.
[4] PUŠ, V.; KOŘENEK, J. Fast and scalable packet classification using perfect hash functions. In Proceeding of the ACM/SIGDA international symposium on Field programmable gate arrays (FPGA '09). New York: ACM, 2009.
[5] CZECH, Z. J.; HAVAS, G.; MAJEWSKI, B. S. An optimal algorithm for generating minimal perfect hash functions. Information Processing Letters, vol. 43, no. 5, p. 257–264, 1992.
další weby:fond rozvojemetacentrumCzechLightpřenosyvideoservereduroameduID.cz