Study of 40/100GE card implementation
CESNET technical report 22/2010
Štěpán Friedl, Jiří Novotný, Ladislav Lhotka
CESNET, z.s.p.o.
Received 2. 12. 2010
Abstract
This technical report describes the evolution of the Ethernet standard from 10 Mb/s up to 100 Gb/s. The basic principles are explained and the architectures of different versions are compared. The newest standard of 40/100 Gb/s Ethernet is described in some detail. Finally, a proposal for a 40 Gb/s Ethernet implementation based on Xilinx FPGA is presented and discussed.
Keywords: 40GE, 100GE, 100G Ethernet, IEEE 802.3ba, FPGA
1 Introduction
CESNET, in cooperation with Masaryk University and Brno University of Technology, has a relatively long tradition of designing hardware-accelerated network cards. The very first COMBO6 card was developed in 2002, and an interface card for 10GE (COMBO-2XFP) was developed within the SCAMPI project in spring 2004. In 2008, a new family of cards – COMBOv2 – was designed with a projected link rate limit of up to 100 Gb/s. COMBOv2 is a development platform consisting of a high-performance main card and various interface cards. Currently available are the main card COMBO-LXT and three interface cards: COMBOI-1G4 (4×1GE), COMBOI-10G2 (2×10GE) and COMBOI-10G4TXT (4×10GE). The last of these uses the Virtex XC5VTX150T chip and AEL 2005 10GE interface circuits. This card may be used as a “commodity” hardware-accelerated card with four 10GE ports, but it has also been used for preliminary experiments with 40GE. The 40GE architecture is based on four independent 10GE channels carrying data synchronization marks. Even though the 10GE interface circuits on the COMBOI-10G4TXT card do not support synchronization marks, it is still possible to perform basic tests over short distances. A full-fledged solution for 40GE depends on the availability of suitable interface circuits and is planned to be realized in 2011.
This technical report describes the evolution of the Ethernet standard from 10 Mb/s up to 100 Gb/s. After explaining the basic principles and comparing architectures of the different versions, we describe the newest standard of 40/100 Gb/s Ethernet in some detail, using mainly the recent standard [4]. Finally, we present and discuss our plans for building a 40 Gb/s Ethernet implementation based on Xilinx field-programmable gate arrays (FPGA).
1.1 Ethernet Evolution
Ethernet [3] is currently the most widely used networking technology. While it was originally intended for deployment in local area networks (LAN), later it started being used in metropolitan area networks (MAN). Nowadays it is used even in wide area networks (WAN), gradually replacing the SDH/SONET technology that was once dominant. The design of 1GE and higher versions enables its use for communication of individual blocks in computer and communication systems.
The first version of Ethernet was created by the Xerox company in 1973–1975. The first draft standard was published in 1980 and the final standard was ratified as IEEE 802.3 in 1982. The original Ethernet utilized a coaxial cable and a collision detection method known as CSMA/CD at the link rate of 10 Mb/s. In 1983, 3COM presented the first commercial 10MbE card.
In the mid-1980s, a new Ethernet standard utilizing a twisted pair cable (4 pairs) and the RJ45 connector was introduced. In this case, hosts are not connected to the same cable but rather have point-to-point connections to a hub device.
In 1995, IEEE published the 802.3u standard defining the 100 Mb/s Ethernet with autonegotiation, where both interconnected parties negotiate the best possible mode of communication (10 Mb/s versus 100 Mb/s and half-duplex versus full-duplex). 1 Gb/s Ethernet was standardized as IEEE 802.3z (optical fibre, 1998) and IEEE 802.3ab (1000BASE-T, 1999), and 10GE as IEEE 802.3ae in 2002.
In December of 2007, the development of the 40/100 GE standard – IEEE 802.3ba – was started and this standard was finally ratified in June 2010. For the first time in history, the standard includes two link rates – 40 and 100 Gb/s. The discussions about which of the two rates is more appropriate for the new Ethernet standard thus remained unresolved and the final version contains both.
Robert Metcalfe, one of the original Ethernet designers, expects the availability of a terabit Ethernet around 2015 [6].
Despite the fact that the Ethernet speed has grown by five orders of magnitude and different transmission media are used, the format of the Ethernet frame remains essentially the same. Currently, the two most widely used Ethernet variants are the twisted pair with the RJ45 connector (in LANs) and optical fibres for longer distances.
2 Ethernet Architecture
In the OSI model, Ethernet spans the first (physical) and second (data link) layers. The main goal of this technical report is a detailed description of the physical layer.
2.1 Data Layer
Data from the data layer are passed to the MAC block, which formats them into an Ethernet frame and controls the physical layer. The format of the most common frame type is shown in Table 1.
| byte count | function |
|---|---|
| 7 | preamble |
| 1 | start symbol |
| 6 | destination address |
| 6 | source address |
| 4 | optional VLAN tag |
| 2 | length/type |
| 42-1500 | data |
| 4 | checksum |
Table 1. Ethernet frame format.
Data frames are separated from each other by an inter-packet gap.
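The field layout in Table 1 can be illustrated with a minimal Python sketch that assembles an untagged frame from the destination address through the checksum. The helper name `build_frame` and the example addresses are hypothetical. The sketch assumes the checksum is the standard CRC-32 as computed by `zlib.crc32`, appended least significant byte first, and it omits the preamble, the start symbol and the optional VLAN tag (without the tag, the minimum data length is 46 bytes rather than 42).

```python
import struct
import zlib

MIN_PAYLOAD = 46    # bytes of data in an untagged frame (42 with a VLAN tag)
MAX_PAYLOAD = 1500  # upper limit of the data field from Table 1


def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Return the bytes from destination address to checksum (no preamble)."""
    if len(dst) != 6 or len(src) != 6:
        raise ValueError("MAC addresses must be 6 bytes long")
    if len(payload) > MAX_PAYLOAD:
        raise ValueError("payload exceeds 1500 bytes")
    # Short payloads are padded so that the frame reaches the minimum length.
    payload = payload.ljust(MIN_PAYLOAD, b"\x00")
    body = dst + src + struct.pack("!H", ethertype) + payload
    # Assumption: CRC-32 as implemented by zlib, stored least significant byte first.
    fcs = struct.pack("<I", zlib.crc32(body))
    return body + fcs


# Hypothetical example: broadcast destination, locally administered source.
frame = build_frame(b"\xff" * 6, bytes.fromhex("020000000001"), 0x0800, b"hello")
print(len(frame))  # 6 + 6 + 2 + 46 + 4 = 64 bytes
```

The 64-byte result is the minimum frame size that follows from the field widths in Table 1.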
2.2 Physical Layer
In the following part, we will concentrate on the most important aspects of previous Ethernet versions. Properties of 10MbE, 100MbE and 1GE will be mentioned only briefly but more emphasis will be placed on the 10GE version as its properties are important for describing the 40/100GE version.
The physical layer is divided into sublayers. The structure is identical for all speeds from 100 Mb/s up to 100 Gb/s (the original 10 Mb/s Ethernet was different). The functions of the individual sublayers (100 Mb/s – 100 Gb/s) are as follows. The Reconciliation sublayer (REC) provides a logical interconnection between the data layer (Table 1) and the physical layer (Table 2). It is connected to the following sublayers via the Media Independent Interface (MII), which is not intended to be a physical interface but rather to establish a logical connection between sublayers in a device. The Physical Coding Sublayer (PCS) encodes and decodes the data to and from the MII and passes the encoded data to the Physical Medium Attachment (PMA) sublayer, which makes it possible to support different physical media. The Physical Medium Dependent (PMD) sublayer is responsible for interfacing to the transmission medium, which is connected through the Medium Dependent Interface (MDI).
![[Image]](eth_lay_osi.png)
Figure 1. Ethernet physical layer
The basic division of the physical layer for link rates ranging from 10 Mb/s to 100 Gb/s is shown in Table 2.
| 10 Mb/s | 100 Mb/s | 1 Gb/s | 10 Gb/s | 40 Gb/s | 100 Gb/s |
|---|---|---|---|---|---|
| REC | REC | REC | REC | REC | REC |
| MII | MII | GMII | XGMII | XLGMII | CGMII |
| PLS | PCS | PCS | PCS | PCS | PCS |
| AUI | PMA | PMA | PMA | PMA | PMA |
| PMA | PMD | PMD | PMD | PMD | PMD |
| MDI | MDI | MDI | MDI | MDI | MDI |
| MEDIUM | MEDIUM | MEDIUM | MEDIUM | MEDIUM | MEDIUM |
Table 2. Physical layer structure of 10MbE through 100GE.
The meaning of the acronyms in Table 2 is as follows:
- MAC – Media Access Control,
- REC – Reconciliation,
- MII – Media-Independent Interface,
- GMII – Gigabit MII,
- XGMII – 10-Gigabit MII,
- XLGMII – 40-Gigabit MII,
- CGMII – 100-Gigabit MII,
- AUI – Attachment Unit Interface,
- PLS – Physical Layer Signalling,
- PCS – Physical Coding Sublayer,
- PMA – Physical Medium Attachment Sublayer,
- PMD – Physical Medium-Dependent Sublayer,
- MDI – Medium-Dependent Interface.
2.2.1 10MbE
The original Ethernet version was designed for link rate of 10 Mb/s. Its physical layer structure is simpler compared to the later versions. An encoding of the Manchester type is used which utilizes 50 % of the frequency band. The following physical media are used: coaxial cable (several types), 10BASE-T with the RJ45 connector and multi-mode optical fibre 10BASE-F. Connectors for these media types are either attached to the interface card (in later versions) or the AUI connector with a converter for the given interface is used.
2.2.2 100MbE
The physical layer model for the 100MbE version is somewhat more complicated than for 10MbE but it remains essentially the same for all subsequent versions (with a few exceptions that will be described together with the corresponding versions). Character encoding is performed by the PCS sublayer, conversion to the given medium by the PMA sublayer and actual connection to the medium by the PMD. The most common version 100BASE-X uses the 4B/5B encoding with the transmission band utilization of 80 %. In practice, the most widely used have been the versions with the RJ45 connector (100BASE-TX) and optical fibre (100BASE-FX).
2.2.3 1GE
For 1GE, the complex 8B/10B encoding was chosen for transmission over optical fibres, with a bandwidth utilization of 80 %. This encoding ensures a signal rich in edges, which is important for clock recovery, and with the same number of zeroes and ones, which is necessary for retaining the average signal level. The most common media are 1000BASE-SX (multi-mode fibre) and 1000BASE-LX (single-mode fibre). For transmission over a twisted pair cable (1000BASE-T), the 4D-PAM5 encoding is used and data is transmitted over all four pairs in both directions with echo cancellation. This system is based on subtracting the known level of the transmitted signal from the level that is actually received. Modern cards allow for link rates of 10 Mb/s, 100 Mb/s and 1 Gb/s with the same RJ45 connector. Autonegotiation is recommended for 10MbE and 100MbE and mandatory for 1GE. The transceiver is either a part of the interface card or a cage (GBIC, SFP) is used into which a transceiver (optical or RJ45) may be inserted. The cage is connected to the 1000BASE-X interface. In the RJ45 transceiver, a relatively complicated circuit performs the conversion from 1000BASE-X to 1000BASE-T.
2.3 10GE
For 10GE, the mapping of data from the data layer to the physical layer was changed. The previous versions use 8 data bits, clock and control signals in the MII or GMII interface. The clock rate for 1GE is 125 MHz, a frequency that today's circuits handle without problems. In the case of 10GE, however, keeping the same interface would require increasing the clock rate to 1.25 GHz, which is too high for currently available circuits. Therefore, the data width was increased to 32 bits with four control signals (one for every byte, where “0” means data and “1” means a control character). The clock rate is 156.25 MHz DDR (Double Data Rate – data is transferred on both the rising and falling edge of the clock signal). Due to the number of wires needed for interconnecting integrated circuits with the XGMII interface, the 802.3 standard defines an optional XAUI sublayer, see Figure 2.
![[Image]](xaui_lay.png)
Figure 2. Extension of the physical layer with XAUI interface.
The new symbols in Figure 2 are:
- XGXS – XGMII Extender Sublayer,
- XAUI – extension interface (10 Gigabit Attachment Unit Interface).
The extension interface XAUI (see Table 3) consists of four differential pairs in each direction. For data transfer, serial transmission at 3.125 Gb/s per lane with 8B/10B encoding is used (the encoding mechanism is the same as for 1GE). The use of XAUI reduces the number of wires (16 versus 74 for XGMII), simplifies the printed circuit board considerably and reduces crosstalk. On the other hand, it is necessary to guarantee the alignment of all four channels and use a specialized high-speed circuit (which can be embedded inside a larger circuit).
| XGMII | D7..D0/C0 | D15..D8/C1 | D23..D16/C2 | D31..D24/C3 | CLK |
|---|---|---|---|---|---|
| XAUI | Lane 0 | Lane 1 | Lane 2 | Lane 3 | — |
Table 3. Comparison of XGMII and XAUI interfaces.
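The XAUI figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration: the function name `payload_rate_gbps` is hypothetical, and the wire counts assume one clock wire per direction for XGMII and two wires per differential pair for XAUI, which reproduces the 74 and 16 wires mentioned in the text.

```python
from fractions import Fraction


def payload_rate_gbps(lanes: int, line_rate_gbps: Fraction,
                      data_bits: int, coded_bits: int) -> Fraction:
    """Useful data rate of a multi-lane serial link with block coding."""
    return lanes * line_rate_gbps * Fraction(data_bits, coded_bits)


# XAUI: four lanes at 3.125 Gb/s, 8B/10B coding (8 data bits per 10 line bits).
xaui_rate = payload_rate_gbps(4, Fraction("3.125"), 8, 10)

# Wire counts (assumed breakdown, see the lead-in):
# XGMII: (32 data + 4 control + 1 clock) wires in each of two directions.
xgmii_wires = (32 + 4 + 1) * 2
# XAUI: 4 lanes x 2 directions x 2 wires per differential pair.
xaui_wires = 4 * 2 * 2

print(float(xaui_rate), xgmii_wires, xaui_wires)
```

Exact fractions are used so that the 10 Gb/s result is not affected by floating-point rounding.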
Besides the most common variant 10GBASE-R that will be described later, the standard defines the following variants:
- 10GBASE-LX4 – four wavelengths, 8B/10B encoding,
- 10GBASE-W – WAN sublayer is inserted between PCS and PMA.
10GBASE-R uses the 64B/66B encoding which is substantially simpler than 8B/10B, utilizes more than 96 % of bandwidth and performs the subsequent scrambling with the following polynomial:
$G(x) = 1 + x^{39} + x^{58}$.
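To make the role of the polynomial concrete, the following Python sketch shows a bit-serial self-synchronizing scrambler and descrambler with taps at delays 39 and 58. It is a simplified illustration, not the parallel hardware form: the function names are hypothetical and the all-zero initial state is an assumption chosen for reproducibility.

```python
from collections import deque
import random

STATE_BITS = 58  # highest power of G(x); the second tap is at delay 39


def scramble(bits, state=None):
    """Each output bit is the input XORed with the outputs 39 and 58 bits earlier."""
    hist = deque(state or [0] * STATE_BITS, maxlen=STATE_BITS)
    out = []
    for b in bits:
        s = b ^ hist[-39] ^ hist[-58]
        out.append(s)
        hist.append(s)  # the shift register holds previously scrambled bits
    return out


def descramble(bits, state=None):
    """Inverse operation; the register is fed with the received bits."""
    hist = deque(state or [0] * STATE_BITS, maxlen=STATE_BITS)
    out = []
    for s in bits:
        out.append(s ^ hist[-39] ^ hist[-58])
        hist.append(s)
    return out


rng = random.Random(1)
data = [rng.getrandbits(1) for _ in range(500)]
tx = scramble(data)
print(descramble(tx) == data)
# A descrambler starting from a wrong state recovers after 58 received bits.
print(descramble(tx, state=[1] * STATE_BITS)[STATE_BITS:] == data[STATE_BITS:])
```

Because the descrambler state is built only from received bits, no explicit synchronization between the two ends is needed, which is why this scrambler type is called self-synchronizing.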
If XAUI is used, the individual groups (data + control signal) are mapped to the corresponding serial channel.
The beginning of every frame must be aligned to Lane 0. It contains the Start symbol (0xFB/1) and the preamble, which consists of 6 characters 0xAA/0 and one character 0xAB/0. The initial part of the frame thus has eight bytes transmitted within one clock tick (DDR transmission is used). Data starts again at Lane 0.
| Lane 0 | Lane 1 | Lane 2 | Lane 3 |
|---|---|---|---|
| 1111 1011 /1 | 1010 1010 /0 | 1010 1010 /0 | 1010 1010 /0 |
| 1010 1010 /0 | 1010 1010 /0 | 1010 1010 /0 | 1010 1011 /0 |
Table 4. Structure of the preamble.
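The two columns of Table 4 can be generated with a short Python sketch that places the Start symbol and the preamble characters on the four lanes. The function name `preamble_columns` is hypothetical; the character values are those given in the text, each paired with its control bit.

```python
START = (0xFB, 1)          # Start symbol, control bit set
PREAMBLE = (0xAA, 0)       # preamble character, data
PREAMBLE_LAST = (0xAB, 0)  # final preamble character, data
LANES = 4


def preamble_columns():
    """Return the frame start as two 4-lane columns of (value, control) pairs."""
    chars = [START] + [PREAMBLE] * 6 + [PREAMBLE_LAST]
    return [chars[i:i + LANES] for i in range(0, len(chars), LANES)]


for column in preamble_columns():
    # One column is transferred per clock edge, Lane 0 first.
    print(" | ".join(f"{value:08b} /{ctrl}" for value, ctrl in column))
```

The Start symbol always lands on Lane 0, and the eighth character ends on Lane 3, so the first data byte is again on Lane 0.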
The frame is terminated by the End character (0xFD/1) followed by the inter-packet gap. As with 1GE, the transceivers can either be an integral part of the interface board or a cage can be used. In contrast to 1GE, there is a wider choice of transceivers. Among the most frequently used are XENPAK, XFP and SFP+.
2.4 40/100GE
By the end of the year 2007, the standardization process for the latest (so far) version of Ethernet started. The standard was finally ratified in June 2010. Unlike the previous versions, two link rates of 40 Gb/s and 100 Gb/s are defined. While a construction of a 40GE network card is fully within the reach of the current technology (one possible design will be described in the next section), 100GE will be much more difficult to build.
The mapping of the data layer to the physical layer was changed compared to the previous 10GE version. The interface to the reconciliation sublayer is now 8 bytes wide with the corresponding control signals and clock. For the 40GE version (XLGMII), a clock frequency of 312.5 MHz DDR is needed or, alternatively, a wider data bus may be used (128 bits at 156.25 MHz DDR), which can be realized with the existing FPGA chips. For the 100GE version (CGMII), the data width is also 64 bits and the clock rate has to be 781.25 MHz DDR. While it is also possible to use a wider data bus, both solutions hit the limits of currently available circuits.
| D7..D0/C0 | D15..D8/C1 | ... | D63..D56/C7 | CLK |
|---|---|---|---|---|
Table 5. XLGMII and CGMII interfaces.
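The clock rates of the media-independent interfaces follow directly from the bus width and the link rate. The Python sketch below recomputes them; the function name `bus_rate_gbps` and the dictionary are hypothetical, and the 128-bit XLGMII variant is assumed to use DDR as well, since only then does it reach 40 Gb/s at 156.25 MHz.

```python
def bus_rate_gbps(width_bits: int, clock_mhz: float, ddr: bool = False) -> float:
    """Raw throughput of a parallel bus; DDR transfers data on both clock edges."""
    transfers_per_cycle = 2 if ddr else 1
    return width_bits * clock_mhz * transfers_per_cycle / 1000.0


# (width in bits, clock in MHz, DDR?) as given in the text.
interfaces = {
    "GMII": (8, 125.0, False),
    "XGMII": (32, 156.25, True),
    "XLGMII, 64 bits": (64, 312.5, True),
    "XLGMII, 128 bits": (128, 156.25, True),  # assumption: DDR
    "CGMII, 64 bits": (64, 781.25, True),
}

rates = {name: bus_rate_gbps(*params) for name, params in interfaces.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:g} Gb/s")
```

The same function shows that a 256-bit bus clocked at 156.25 MHz without DDR also carries 40 Gb/s, which is a convenient configuration inside an FPGA.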
The 64B/66B character encoding and scrambling in the PCS layer, which have been reused from 10GE, are followed by dividing data into lanes (4 lanes for 40GE and 20 lanes for 100GE), see Figure 3. Alignment markers are used for the synchronization of individual lanes (Figure 4).
![[Image]](block-distribution.png)
Figure 3. Division into lanes [4].
![[Image]](line-aligment.png)
Figure 4. Alignment markers [4].
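The distribution of blocks into lanes and the role of the alignment markers can be sketched in Python as follows. This is a simplified model rather than the standard's exact procedure: blocks are represented by integers, a marker by a hypothetical `("AM", lane_id)` tuple, and the marker period is a parameter set to a small value so that the example stays readable.

```python
AM = "AM"  # tag of the hypothetical alignment-marker placeholder


def distribute(blocks, n_lanes):
    """Round-robin distribution of encoded blocks to PCS lanes."""
    return [list(blocks[i::n_lanes]) for i in range(n_lanes)]


def insert_markers(lane, lane_id, period):
    """Put a lane-specific marker in front of every `period` blocks."""
    out = []
    for i, block in enumerate(lane):
        if i % period == 0:
            out.append((AM, lane_id))
        out.append(block)
    return out


def remove_markers(lane):
    return [b for b in lane if not (isinstance(b, tuple) and b[0] == AM)]


def merge(lanes):
    """Inverse of distribute(); assumes the lanes are already deskewed."""
    out = []
    for group in zip(*lanes):
        out.extend(group)
    return out


blocks = list(range(16))
lanes = distribute(blocks, 4)  # 4 lanes for 40GE, 20 for 100GE
marked = [insert_markers(lane, i, period=2) for i, lane in enumerate(lanes)]
restored = merge([remove_markers(lane) for lane in marked])
print(lanes[0], marked[0], restored == blocks)
```

Since every marker carries its lane identifier, a receiver can both measure the skew between lanes and restore their original order before merging.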
After the block alignment, the links from the PCS layer are remapped in the PMA layer to the links of the output medium. To simplify the interconnection of circuits, an Attachment Unit Interface is defined (XLAUI for 40GE and CAUI for 100GE). The physical implementation of the 40GE interface (PMD layer) is defined in the standard using 4 physical links, while 10 or 4 physical links are recommended for 100GE. The interface between the PMA and the PMD intended for physical implementation is XLPPI/CPPI (the more general option), but XLAUI/CAUI is often used instead, as the PMA physical multiplexer is a part of the transceiver. Apart from these, it is also possible to use other mappings (for instance, mapping onto a single physical channel), although their implementation may be beyond real technical feasibility. Multiple optical fibres may be used for shorter distances and wavelength-division multiplexing (multiple wavelengths on the same optical fibre) for longer distances.
![[Image]](eth_layers.png)
Figure 5. Detailed description of the 40/100GE physical layer.
The new symbols in Figure 5 are:
- XLPPI
- 40 Gb/s Parallel Physical Interface,
- CPPI
- 100 Gb/s Parallel Physical Interface.
| | 40GE | 100GE |
|---|---|---|
| At least 1 m backplane | 40GBASE-KR4 | - |
| At least 10 m copper cable | 40GBASE-CR4 | 100GBASE-CR10 |
| At least 100 m MMF | 40GBASE-SR4 | 100GBASE-SR10 |
| At least 10 km SMF | 40GBASE-LR4 | 100GBASE-LR4 |
| At least 40 km SMF | - | 100GBASE-ER4 |
Table 6. Physical media for 40/100GE.
3 40/100GE Components
In this section, we discuss key components necessary to build the 40/100GE card:
- transceivers, representing the PMD and a portion of the PMA layer,
- PHY devices implementing the PMA and PCS.
In this report, we do not consider the MAC and higher layers, which have not changed compared to 10GE with the exception of a higher-speed interface.
3.1 Transceivers
Even though IEEE 802.3ba defines physical layers for copper cabling (40GBASE-CR4 and 100GBASE-CR10) and such cabling and transceivers are indeed commercially available, we will only deal with optical units because they are more common. The following modules are considered to be widely used for the 40/100GE:
- CFP [1] is the only form factor for pluggable transceivers supporting both 100G and 40G Ethernet and telecommunication applications. The optical interfaces are SC, LC or MPO connectors; the electrical interface is CAUI or XLAUI. The size of CFP modules is 145mm×82mm×14mm. Transceivers for 40GBASE-SR4, 40GBASE-LR4, 100GBASE-LR10 and 100GBASE-LR4 are currently available from Finisar, Sumitomo Electric, Opnext, Santur and others.
- CXP [5] module is optimized for multi-mode fibres and includes 12 transmit and 12 receive channels in a compact package – the size is 45mm×27mm, making it much smaller than CFP. The typical application will be 100GBASE-SR10. 24-fibre MPO-style connectors are used in the case of separable optical modules, the other option being active cables (cable with the CXP interface at each end). The electrical interface has 84 pins suitable for CAUI. CXP transceivers are also commercially available from Finisar and others.
- QSFP [7] contains four transmit and receive channels of single-mode or multi-mode fibres. Its low power consumption, size (52.4mm×18.35mm×8.5mm) and low cost make QSFP a suitable module for 40G applications such as 40GBASE-SR4. As with CXP, both active cable assemblies and separate transceivers are available. 40GBASE-SR4 modules are currently available from several vendors; 40GBASE-LR4 modules should be available in the near future.
3.2 PHY Devices
The PHY device implements most functions of the PHY layer: 40/100GBASE-R block encoding and decoding including insertion and deletion of the alignment markers, and the PMA functions – bit multiplexing of individual PCS lanes into PMA lanes, their serialization and deserialization, clock recovery, the XLAUI/CAUI interface drivers and management functions. The host interface for MAC connection is XLGMII/CGMII and transceivers are connected via XLAUI/CAUI or XLPPI/CPPI. The PHY devices may be implemented in two ways – as a specialized ASIC chip or in an FPGA.
3.2.1 ASIC Chips
For 10G and slower rates, the ASIC (application-specific integrated circuit) implementation of the PHY layer was common, as shown in Figure 6. The situation is different for 40/100GE as it is designed to integrate the PCS and MAC into one physical chip, e.g. in the FPGA as described in the next section. The main reason for this design is width and speed of the reconciliation sublayer (XLGMII/CGMII). In contrast to 10GE, no extension sublayer is defined for high-speed serial connection.
![[Image]](ceth_10g.png)
Figure 6. Ethernet implementation on an FPGA-based 10GE card.
3.2.2 FPGA Implementation
FPGA (field-programmable gate array) is a programmable logic structure – array of programmable logic elements with a flexible interconnect matrix. The basic logic element – a LUT (look-up table) – can perform any function based on up to 6 inputs. The array of LUTs is supplemented by flip-flops and other more complex structures such as RAM blocks, DSP cells, clock management blocks and fast serial transceivers. The main advantage of FPGA chips is their high flexibility while the most significant drawback is a lower working frequency range and higher power consumption.
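The statement that a LUT can perform any function of its inputs can be illustrated with a small Python model in which the LUT is simply a stored truth table addressed by the input bits. The names `make_lut` and `lut_eval` are hypothetical and the model ignores timing and routing entirely.

```python
def make_lut(func, n_inputs=6):
    """Precompute the truth table of `func` as an integer with 2**n_inputs bits."""
    table = 0
    for addr in range(1 << n_inputs):
        inputs = [(addr >> i) & 1 for i in range(n_inputs)]
        if func(*inputs):
            table |= 1 << addr
    return table


def lut_eval(table, *inputs):
    """Evaluate the LUT: the inputs form the address of one truth-table bit."""
    addr = sum(bit << i for i, bit in enumerate(inputs))
    return (table >> addr) & 1


# A 3-input XOR occupies only part of a 6-input LUT.
xor3 = make_lut(lambda a, b, c: a ^ b ^ c, n_inputs=3)
print(hex(xor3), lut_eval(xor3, 1, 1, 1), lut_eval(xor3, 1, 1, 0))

# A full 6-input function needs a 64-bit truth table.
parity6 = make_lut(lambda *bits: sum(bits) % 2, n_inputs=6)
print(parity6.bit_length() <= 64)
```

Configuring an FPGA amounts, in part, to loading such truth tables into the LUTs and connecting them through the interconnect matrix.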
All recent FPGAs with 10 Gb/s capable transceivers directly support 10GE: they implement the PMA and most of the 10GBASE-R PCS as sub-components of the serial transceiver blocks (64B/66B encoder/decoder, FIFOs etc.). However, due to the architectural differences described in Section 3.2.3, these blocks cannot be used for the 40GBASE-R PCS; they have to be bypassed and the PCS must be implemented in the FPGA fabric. The leading-edge FPGAs available today have over 500000 LUTs, which is enough for this purpose, and their working frequencies are also sufficient. Fast serial transceivers may still be used for the PMA layer with the XLAUI and CAUI interfaces to external transceiver modules.
The following list summarizes currently available FPGAs together with their transceiver configurations.
- Xilinx Virtex-5 – used on all cards of the COMBOv2 family. Virtex-5 contains up to 48 GTP or GTX transceivers, each capable of transmitting at up to 6.5 Gb/s.
- Xilinx Virtex-6 HXT – up to 24 GTH transceivers, each capable of transmitting at up to 11.18 Gb/s.
- Altera Stratix IV GT – up to 24 transceivers with 11.3 Gb/s.
- Altera Stratix V GX/GS – up to 66 transceivers with 12.5 Gb/s.
- Altera Stratix V GT – includes 4 transceivers, each capable of 28 Gb/s and 32 transceivers for 12.5 Gb/s.
Several 40/100GE PCS implementations suitable for FPGAs are commercially available as IP cores, for instance from Sarance Technologies (suitable for both Xilinx Virtex-5/6 and Altera Stratix IV), MoreThanIP (Altera Stratix IV only) or Avalon Microelectronics. The cores consume around 22000 LUTs for 40GE and 50000 LUTs for 100GE in the FPGA fabric. Our own PCS implementation is under development (see Section 5.2).
To decrease the implementation complexity and save resources in the FPGA fabric, Altera announced an embedded hard copy block (ASIC block inside the FPGA) for 40/100GE to be available for the Stratix V GS/GX/GT FPGA. However, these devices are not available at this time.
![[Image]](ceth_40g.png)
Figure 7. FPGA based 40GE card implementation.
3.2.3 Reuse of Existing 10GE Devices
40GBASE-SR4/LR4 is similar to four parallel 10GBASE-SR/LR lines, so our first approach was to use the COMBOI-10G4TXT card with four 10GE interfaces and join them into one 40GE channel. Each interface on COMBOI-10G4TXT has an independent PHY device (AEL2005) implementing the 10GBASE-R PCS and PMA layers. MAC and packet processing is done in the FPGA, which is connected to the PHY devices via four XAUI channels. Although the architecture of 40/100GE is similar to 10GE, there are several major differences that make the reuse of current 10GE components difficult.
From the final IEEE 802.3ba specification we found out that this simple approach cannot be used for implementing 40GBASE-R. The main reasons are:
- The PHY device cannot handle alignment markers which must be inserted or deleted after the encoding and scrambling performed by the PHY device. It is impossible to implement them inside the FPGA.
- Scrambling and descrambling must be done over all transmitted/received data bits so that it is not possible to scramble data independently in the four parallel channels.
- It should be possible to bypass the 10GBASE-R logic in the PHY devices and perform encoding and scrambling in the FPGA, but the XAUI interface between the FPGA and the PHY device is problematic because it does not allow encoded and scrambled data to be passed.
4 Card Architecture Proposal
The card will be compatible with the COMBOv2 card family which means that all existing applications for COMBOv2 (flow monitoring, packet filtering) will also work at the 40GE or 100GE link rates. The requirements are as follows:
- one or more optical interfaces with a pluggable optical transceiver,
- 40/100GE PMA and PCS implementation compliant to the IEEE 802.3ba,
- the ability to run in 1x100GE, 3x40GE and 12x10GE mode,
- host interface for interconnection with the COMBOv2 family cards,
- size of the standard PCIe card,
- reasonable power consumption.
4.1 Optical Interface and Transceivers
The most versatile interface for 40/100GE applications is undoubtedly the CFP. However, the CFP module cannot be used due to card size limitations – it is too big to fit the PCIe size requirement. The best option is the CXP for 100GE and QSFP for 40GE. The size and power consumption are acceptable and, moreover, two or more interfaces can be placed on the card. With the use of the breakout “octopus” cable, the CXP should serve well also for the 40GE and 10GE.
4.2 40/100GE PMA and PCS
For architectural reasons, the PMA and PCS layers are best implemented in the FPGA. An extra benefit of doing so is that additional packet preprocessing such as sampling can be done in the same FPGA. Moreover, the card may operate in one of 1x100GE, 3x40GE and 12x10GE modes.
The PCS implementation described in Section 5.2 could be used for the 40GE mode, whereas the 100GE mode needs certain improvements. For the PMA, the integrated 10G transceivers will be used. While the PCS may be implemented in all FPGA families mentioned in Section 3.2.2, the PMA requirement for 10 Gb/s transceivers limits the current selection to Xilinx Virtex-6 HXT or Altera Stratix IV GT. Because of our previous experience with the platform, we chose the Virtex-6 chip.
5 Current Progress
5.1 40GE on COMBOI-10G4TXT
In order to be able to experiment with reusing the current COMBOI-10G4TXT, we have implemented an experimental design to verify the ability to transfer data at 40 Gb/s over four independent 10GE interfaces. Due to the reasons mentioned in Section 3.2.3, the implementation is not IEEE 802.3ba compliant – the individual lanes don't contain alignment markers, so the receiver is not able to properly reassemble the incoming packet when the lanes are skewed with respect to each other. Therefore, our experiment was primarily focused on investigating the impact of the skew among the lanes.
On the transmitting side, the experimental design contains a fixed packet generator, a splitter for distributing the data into four lanes and four XAUI interfaces for transmitting the data to an external PHY. On the receiving side, the incoming packet was reassembled and the consistency was checked using the Xilinx Chipscope analyzer.
![[Image]](10g4_exp.png)
Figure 8. 40 Gb/s on COMBOI-10G4 – experimental setup.
The results show that a lane skew of 6.4–12.8 ns, which corresponds to one to two periods of a 156.25 MHz clock, occurs even in ideal conditions (equal length of all cables). The skew does not vary during operation, but it differs each time the card is reset. We suppose that this behaviour is caused by the elastic FIFOs inside the PHY, whose alignment after the reset is unpredictable.
Consequently, it would be necessary to implement a method for inserting some kind of alignment markers to allow the receiver to align the lanes. However, this solution would still violate the IEEE 802.3ba standard. For a compliant mode, a new card with the architecture proposed in Section 4 and the regular 40GBASE-R PCS will be developed.
5.2 40GE PCS Implementation
The implementation of key blocks of the 40GBASE-R standard is aimed at the Xilinx Virtex-6 FPGA with 10 Gb/s capable transceivers. As soon as such a card is available, we should be able to build a 40GBASE-SR4 or -LR4 compliant device. Until then, the implementation should be tested on a lower-speed card such as COMBOI-1G4, which is capable of running in a (non-compliant) mode at up to 26 Gb/s.
The necessary firmware will be implemented in VHDL and verified by a ModelSim simulation. Because of limited resources available on the current COMBO cards, we had to limit our experimental design to the basic blocks necessary for correct transmit and receive operations; the implementation of the missing blocks (BER monitor, pattern and PRBS generator, management functions) is planned for the final version. The block structure of the transmit and receive paths is shown in Figures 9 and 10.
5.2.1 Transmit Path
The TX path block realizes the complete 40GBASE-R transmit operation as defined in 802.3ba [4], Clause 82. The XLGMII interface with a width of 256 bits and a clock frequency of 156.25 MHz is used for the MAC layer connection. The PMA (realized by four GTH transceivers) is connected via four 64-bit buses running at about 161.13 MHz (10.3125 Gb/s per lane divided by 64 bits). Total resources utilized by the implementation in the Virtex-6 FPGA are 3381 flip-flops, 8298 LUTs and 8 BlockRAMs. The main building blocks of the TX path are:
- 64/66 encoder – performs the 10GBASE-R transmission encoding. Each encoded 66-bit block consists of the synchronization header and the payload – data or control blocks. Data blocks contain eight data characters as received via the XLGMII. Control blocks begin with the 8-bit block type field that indicates the format of the remainder of the block, described in detail in IEEE 802.3ba, Section 82.2.3.3. The implementation of this block is taken from XAPP775 [8] with some modifications because of the 256-bit wide data bus. The resources utilized in the Virtex-6 FPGA are 2192 LUTs and 1396 flip-flops.
- FIFO – the asynchronous FIFO does the clock rate compensation between the XLGMII and the PMA interface. It also compensates for the data rate difference caused by the alignment marker insertion by deleting the idle control characters. The FIFO, which is 32 items deep, is implemented in embedded BlockRAMs and consumes 1178 LUTs, 565 flip-flops and 8 BlockRAMs.
- Scrambler – the payload of each 66-bit block is processed by a self-synchronizing scrambler as described in Section 2.3. A parallel form with a data width of 256 bits is implemented, which leads to a 7-level cascade of 3-input XOR gates and 58 D flip-flops. In the Virtex-6 FPGA, the whole scrambler logic occupies 448 LUTs and 58 registers.
- Alignment marker inserter periodically inserts an alignment marker after every 16383 blocks, simultaneously into all four lanes. Markers are not scrambled in order to allow the receiver to find them and deskew the PCS lanes. The marker consists of a 6-byte fixed value, which is unique for each PCS lane, and a 2-byte BIP field – the result of a bit-interleaved parity calculation over all previous bits of the given PCS lane. The implementation occupies 347 LUTs and 312 flip-flops.
- Gearbox adapts between the 66-bit width of the blocks and the 64-bit PMA interface. The implementation from XAPP775 is used. 1095 LUTs and 330 flip-flops are consumed in the FPGA by each instance of the gearbox.
![[Image]](tx_path.png)
Figure 9. 40GBASE-R PCS implementation – transmit path block structure.
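The width adaptation performed by the gearbox can be modelled in a few lines of Python. The sketch below is not derived from XAPP775; it is a hypothetical software model that treats blocks and words as integers, assumes least-significant-bit-first packing, and uses an accumulator in place of the hardware shift logic.

```python
MASK64 = (1 << 64) - 1
MASK66 = (1 << 66) - 1


def gearbox_66_to_64(blocks):
    """Repack 66-bit blocks into 64-bit words for the PMA interface."""
    acc, nbits, words = 0, 0, []
    for block in blocks:
        acc |= (block & MASK66) << nbits  # append the block above the leftover bits
        nbits += 66
        while nbits >= 64:
            words.append(acc & MASK64)
            acc >>= 64
            nbits -= 64
    return words, nbits  # nbits = bits still waiting in the accumulator


def gearbox_64_to_66(words):
    """Receive-side counterpart: repack 64-bit words into 66-bit blocks."""
    acc, nbits, blocks = 0, 0, []
    for word in words:
        acc |= (word & MASK64) << nbits
        nbits += 64
        while nbits >= 66:
            blocks.append(acc & MASK66)
            acc >>= 66
            nbits -= 66
    return blocks


# 32 blocks of 66 bits are exactly 33 words of 64 bits (2112 bits).
blocks = [(i * 0x123456789ABCDEF + i) & MASK66 for i in range(32)]
words, leftover = gearbox_66_to_64(blocks)
print(len(words), leftover, gearbox_64_to_66(words) == blocks)
```

The sequence repeats every 32 blocks, which is why a hardware gearbox can be built around a small counter and a multiplexer rather than a general-purpose buffer.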
5.2.2 Receive Path
The full architecture of the receive path of the 40GBASE-R PCS has been designed, although the implementation has not been completed yet. The main building blocks of the RX path are:
- Block synchronizers look for synchronization headers in the incoming bitstream and try to find the correct position of 66-bit blocks. The basis for the synchronizer is a 66-bit wide barrel shifter controlled by a state machine, which periodically tests the presence of the synchronization header. The block lock is obtained if 64 consecutive valid synchronization headers are found. The implementation is taken from XAPP775 and consumes 4×1462 LUTs and 4×347 flip-flops in the Virtex-6 FPGA.
- Lane alignment and reorder looks for alignment marker blocks on all four lanes. After the marker lock is obtained, it deskews the lanes to be mutually aligned – a FIFO with “skip” capability is used for this purpose. The alignment markers are then removed from the data stream. The implementation is still incomplete.
- Descrambler realizes the reverse operation to the scrambler and its implementation is also very similar to the scrambler. The resources consumed in the FPGA are 315 LUTs and 514 flip-flops.
- FIFO – the asynchronous FIFO performs the clock rate compensation between the PMA and XLGMII clock domains. It also compensates for the data rate difference caused by the alignment marker deletion by inserting idle control characters. The FIFO, which is 32 items deep, is implemented in embedded BlockRAMs and consumes 1214 LUTs, 565 flip-flops and 8 BlockRAMs.
- Decoder performs the inverse operation of the encoder: it converts the 40GBASE-R transmission encoding to XLGMII. A slightly modified implementation from XAPP775 [8] is used; it consumes 3100 LUTs and 1748 flip-flops.
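The search performed by the block synchronizer in the list above can be sketched in software as follows. This is a simplified model, not the XAPP775 state machine: it works on a list of bits instead of a barrel shifter, it treats a header as valid when its two bits differ (01 or 10), it slips by one bit after every invalid header, and it does not model loss of lock. The function name `find_block_lock` is hypothetical.

```python
BLOCK_BITS = 66         # 2-bit synchronization header + 64-bit payload
HEADER_LOCK_COUNT = 64  # consecutive valid headers needed for block lock (from the text)


def find_block_lock(bits, lock_count=HEADER_LOCK_COUNT):
    """Return the block alignment (bit offset modulo 66) once lock is reached, else None.

    `bits` is a sequence of 0/1 values as received on one lane.
    """
    pos = 0   # start of the candidate block currently being tested
    good = 0  # number of consecutive valid headers seen at this alignment
    while pos + 2 <= len(bits):
        if bits[pos] != bits[pos + 1]:  # valid header: 01 or 10
            good += 1
            if good == lock_count:
                return pos % BLOCK_BITS
            pos += BLOCK_BITS           # test the header of the next block
        else:
            good = 0
            pos += 1                    # slip by one bit and start counting again
    return None
```

With unscrambled constant payloads a wrong alignment can look valid for a long time; in the real data path the payload is scrambled, so a false lock over 64 consecutive blocks is very unlikely.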
![[Image]](rx_path.png)
Figure 10. 40GBASE-R PCS implementation – receive path block structure.
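The close relation between the descrambler and the scrambler can be seen in a bit-serial model. The sketch below assumes the self-synchronous scrambler polynomial G(x) = 1 + x^39 + x^58 of the 64B/66B encoding in IEEE 802.3 [3], [4]; it processes one bit per step, whereas an FPGA implementation handles a whole block per clock cycle, and the function names are illustrative.

```python
MASK58 = (1 << 58) - 1  # the shift register keeps the last 58 line bits


def _taps(state):
    """XOR of the line bits sent 39 and 58 bit times ago."""
    return ((state >> 38) & 1) ^ ((state >> 57) & 1)


def scramble(bits, state=0):
    """Scramble payload bits; `state` holds the previous 58 scrambled bits."""
    out = []
    for d in bits:
        s = d ^ _taps(state)
        out.append(s)
        state = ((state << 1) | s) & MASK58  # feed back the scrambled bit
    return out


def descramble(bits, state=0):
    """Recover payload bits; `state` holds the previous 58 received bits."""
    out = []
    for s in bits:
        out.append(s ^ _taps(state))
        state = ((state << 1) | s) & MASK58  # shift in the received bit
    return out
```

Both functions shift the line bits into the register, so the descrambler synchronizes itself: regardless of its initial state, its output is correct after the first 58 received bits.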
The implementation of both the receive and transmit paths is modular, so that they can be extended to the 100GE version in the future. The main differences in the 100G version will be 20 PCS lanes instead of 4 and the 20:10 multiplexer for the PMA interface. The number of PCS lanes can be changed easily by changing a numeric constant in the source code. However, this will lead to the inference of a much more complicated logic structure that may not meet the timing criteria (working frequency), so it may be necessary to add pipeline stages or optimize some components. On the other hand, the 20:10 PMA multiplexer is a simple generic component that will not be difficult to implement.
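To illustrate how the lane alignment and reorder step of the receive path scales with the number of PCS lanes, the following Python sketch takes the lane count from its input, so the same code covers 4 or 20 lanes. It is a simplified model under stated assumptions: each lane is a list of blocks, the hypothetical callback `marker_lane` returns the PCS lane number encoded in an alignment marker (or `None` for other blocks), and only the first marker on each lane is used and removed.

```python
def align_and_reorder(lanes, marker_lane):
    """Deskew lanes on their first alignment marker and put them in PCS lane order.

    Returns one list per PCS lane, all of equal length, starting with the
    first block after the marker.
    """
    after_marker = {}
    for lane in lanes:
        for i, block in enumerate(lane):
            number = marker_lane(block)
            if number is not None:
                # Drop the skewed prefix and the marker itself.
                after_marker[number] = lane[i + 1:]
                break
        else:
            raise ValueError("no alignment marker found on a lane")
    if sorted(after_marker) != list(range(len(lanes))):
        raise ValueError("marker lane numbers are not a permutation of the lanes")
    depth = min(len(blocks) for blocks in after_marker.values())
    return [after_marker[n][:depth] for n in range(len(lanes))]
```

In hardware, dropping the skewed prefix corresponds to the FIFO with "skip" capability; the timing concern raised above comes from the much wider comparison and multiplexing logic that 20 lanes require, which a software model does not capture.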
6 Conclusions
The design of the family of COMBO cards allows for receiving, sending and preprocessing data at link rates up to 1 Tb/s, provided that appropriate transceivers and FPGA chips are available. Currently, the main bottleneck for rates over 10 Gb/s is the throughput of the PCIe interface, and this limitation is likely to remain for some time. The role of data preprocessing in hardware is thus even more important than before.
An interesting twist may be observed in the history of Ethernet: the physical layer is purely serial for link rates up to 1 Gb/s (except for twisted pair interfaces) and the width of the data layer interface is one byte. For higher versions, the width of the physical layer interface was increased to 4 or 8 bytes for 10GE and 40/100GE, respectively. At the same time, the use of multiple optical transmission channels was standardized for 10GE (10GBASE-LX4), but it was not widely used in practice. The 40/100GE standard recommends variants with 4 or 10 optical channels. Transmission over a single optical channel is considered in the standard, but its physical realization will be rather difficult.
One can see that the ever-increasing link rate motivates a transition to wider data buses and, in particular, to serial-parallel transmission even on optical fibres. This evolution parallels that of processors, whose clock rate has remained the same for the last few years. In general, further increase of computing performance now depends on parallel processing, which is somewhat problematic given the established methods of program development. Since the speed of computing elements now approaches physical limits while the density of transistors in integrated circuits continues to grow, developers will be forced to concentrate more on parallel approaches, which may in turn lead to significant changes in programming paradigms and algorithms.
![[Image]](ether-band.png)
Figure 11. Growth of network bandwidth compared to the growth of processing power [2].
The situation is even more complicated in the area of data communications: the demand for transmission capacity increases at a higher rate (Gilder's Law) than the available processing power (Moore's Law), see Figure 11.
In the following years, the development of Ethernet will bring new problems and their solutions, particularly in the area of optical systems and their deployment in computing and communication systems. It will also be interesting to see whether Robert Metcalfe's prediction about the arrival of 1TbE in 2015 will be fulfilled or not.
References
| [1] | AVAGO TECHNOLOGIES; FINISAR Corp.; OPNEXT, Inc.; SUMITOMO Electric Industries, Ltd. CFP MSA Hardware Specification. 2010 [cit. 2010-11-01]. Available online. |
| [2] | D’AMBROSIA, J.; LAW, D.; NOWELL, M. 40 Gigabit Ethernet and 100 Gigabit Ethernet Technology Overview. [cit. 2010-11-01]. Available online. |
| [3] | IEEE 802.3-2008. 26 December 2008 [cit. 2010-11-01]. ISBN 978-0-7381-5796-2. Available online. |
| [4] | IEEE 802.3ba-2010. 22 June 2010 [cit. 2010-11-01]. ISBN 978-0-7381-6322-2. |
| [5] | INFINIBAND TRADE ASSOCIATION. 120 Gb/s 12x Small Form-factor Pluggable (CXP). 2009 [cit. 2010-11-01]. Available online. |
| [6] | METCALFE, R. M. Toward Terabit Ethernet. In OFC/NFOEC, San Diego, February 2008. Available online. |
| [7] | SFF COMMITTEE. QSFP (Quad Small Form-factor Pluggable) Transceiver. 2006 [cit. 2010-11-01]. Available online. |
| [8] | Xilinx Inc. 10 Gigabit Ethernet/Fibre Channel PCS Reference Design. 2004 [cit. 2010-11-01]. Available online. |