<?xml version="1.0"  encoding="ISO-8859-2"?> 
<!DOCTYPE zprava SYSTEM "techrep.dtd"> 
<zprava cislo="34/2006" jazyk="en"> 
<nazev>Interconnection System for the NetCOPE Platform</nazev> 
<autor>Tomáš Martínek and Jiří Tobola</autor>
<datum>12.12.2006</datum> 

<h1>Abstract</h1>

<p> 
   This technical report describes the interconnection system for the general
   platform for rapid development of network applications (NetCOPE).  The
   main objective of this interconnection system is to provide efficient
   packet data transfers between components placed in the FPGA and the host RAM
   memory. The basic parts of the interconnection system are the internal
   bus, the local bus, the control bus and a programmable DMA controller.  The report
   describes in detail the architecture of these buses and shows
   how the programmable DMA controller works during packet
   reception and transmission.
</p>

<p>
<b>Keywords:</b>
   Platform for Network Applications, Internal Bus Architecture,
   Programmable DMA Controller, FPGA, PCI-X, PCI-Express, PowerPC
</p>

<!-- ********************************************************************* -->
<!--                       Introduction                                    -->
<!-- ********************************************************************* -->
<h1>Introduction</h1>
<p>
   Within the Liberouter project, several network applications have
   already been built on the family of COMBO cards, such as the
   network interface card (NIC), NIC with hardware filtering and
   forwarding, the FlowMon probe <cite href="Zl05"/>, Scampi
   monitoring adaptor, Intrusion Detection System <cite href="Kkh06"
   /> etc. The development of these applications was usually carried out separately 
   and the projects shared only a few common components. With the rising
   number of projects under development, we realized that it would be
   useful to share not only the common components, but the entire
   development framework.  For these reasons, we embarked on designing a
   high performance general platform for rapid development of network
   applications called NetCOPE (Network Combo Pipe) with the following features:
</p>
   <ul>
      <li>Network interface blocks for packet reception and transmission
      compliant with the IEEE 802.3 standard</li>
      <li>High throughput internal bus dedicated to packet transfers
      between FPGA and host RAM</li>
      <li>Access to the PCI-X and PCI Express buses</li>
      <li>Programmable bus Master DMA controller implemented using the PowerPC
      processor</li>
      <li>Efficient generic inter-component protocol with variable data
      width</li>
      <li>Set of generic IP Cores for packet analysis, classification,
      modification, timestamp processing etc.</li>
   </ul>
<p>
   With the NetCOPE platform, a designer implements only the application logic,
   such as an IDS or a FlowMon probe, and does not need to deal with the network input and output
   blocks, data transfers to the host PC etc. The designer can also use the available IP
   cores (Header-Field-Extractor processor, Look-Up processor, Output
   Packet Editor, etc.) to further shorten the development cycle. The
   block diagram of the NetCOPE platform is shown in <a href="#fig1">Figure</a>.
</p>

<obr id="fig1" src="netcope">
   NetCOPE architecture
</obr>

<p>
   One of the most important parts of the NetCOPE platform is the
   <i>internal bus system</i>. The basic requirements for such a bus
   system include:
</p>

   <ol>
      <li> High throughput internal bus for transferring packet data between
      internal NetCOPE components and host RAM memory. This requirement is
      important especially for multi-gigabit network applications.</li>

      <li> Easy connectivity of the system to the PCI host interface (PCI, PCI-X
      and PCI-Express)</li>

      <li> Programmable DMA controller for effective control of DMA
      operations between the FPGA adaptor and host RAM. Programmability is very
      important as various network applications have different
      demands for DMA transfers.</li>
   </ol>

<p>
   This technical report describes the architecture of the interconnection
   infrastructure in some detail. The system consists of three types of buses: (1) a high throughput <i>internal bus</i>, (2) a low speed <i>local bus</i> and
   (3) a <i>control bus</i> dedicated to control messages between
   NetCOPE components and the <i>programmable DMA controller</i>, which is used for
   efficient control of DMA operations.
</p>

<p>
   This report is organized as follows: <a href="#oview">Section</a> gives a basic overview of the
   NetCOPE interconnection system. The following sections then
   deal with the individual components: the internal, local and control buses and the programmable DMA controller. Conclusions are
   drawn in <a href="#conc">Section</a>.
</p>

<!-- ********************************************************************* -->
<!--                 Overview of Internal Bus System                       -->
<!-- ********************************************************************* -->
<h1 id="oview">Overview of Internal Bus System</h1>
<p>
   The NetCOPE platform covers three different types of buses (see <a
   href="#nbs1">Figure</a>).  The most important one is the <i>internal
   bus</i>. It is a high throughput bus designed for transferring huge
   amounts of data between the FPGA and the low level software driver.
   In other words, the internal bus connects components that require high
   throughput (e.g., Software Receive/Transmit Buffers, DRAM controller,
   potentially PowerPC processors etc.) with the host PCI interface (PCI, PCI-X or
   PCI-Express).
</p>
<p>
   Components that do not require high bandwidth can be connected using the
   <i>local bus</i>. Typically, the local bus is used for transferring
   configuration data, programs for microcontrollers, debug/status
   information, etc. The local bus is connected to the internal bus using
   the <i>LB bridge</i> component.
</p>
<p>
   The last one is the <i>control bus</i>, which is reserved for
   control data transfers between the <i>programmable DMA controller (PDMA)</i> and
   the components that need to send or receive data to or from the host RAM memory.
   The PDMA controller is a key component that controls all DMA operations. It
   typically downloads <i>scatter-gather (SG)</i> lists from host RAM,
   performs necessary DMA transfers and uploads modified SG lists back to the
   software driver. As the programmability of DMA operations is an important
   feature of the NetCOPE platform, we utilize the embedded PowerPC <cite
   href="Xrg03" /> processor as the main component of the PDMA controller.
</p>

<obr id="nbs1" src="bus_system">
   NetCOPE internal bus system architecture
</obr>

<p>
   The following sections contain detailed information about each bus type.
</p>

<!-- ********************************************************************* -->
<!--                        Internal Bus                                   -->
<!-- ********************************************************************* -->
<h1>Internal Bus</h1>
<p>
   The <i>internal bus</i> is a high throughput bus dedicated to transferring
   data between FPGA components as well as between FPGA
   components and the PCI interface. As shown in <a href="#ib1">Figure</a>,
   the internal bus utilizes a tree structure composed of
   <i>root</i>, <i>switches</i> and <i>endpoints</i>. The root component is
   a part of the <i>PCI bridge</i> that provides communication with the host PCI
   system.  The switches control communication inside the internal bus and
   endpoints usually connect components that require huge
   data transfers to or from the software driver. Typical examples of
   components connected to endpoints are: software receive/transmit buffers,
   DRAM controller, internal to local bus bridge, potentially the PowerPC
   processor farm <cite href="Xpbg05" /> etc.
</p>

<obr id="ib1" src="ib_archtree">
   Internal Bus architecture
</obr>

<p>
   Depending on the configuration, the width of an internal bus link is either 64 or 128 bits and its clock rate is
   125 MHz. Each link is full duplex so that data can
   be transferred simultaneously in both directions. This feature is very
   useful for point-to-point network application cores and allows for an
   easy connection to the PCI-Express host interface <cite href="Abs03" />.
</p>
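<p>
   As a rough worked example, the raw bandwidth of one link direction follows
   directly from the link width and the clock rate given above. The sketch
   below ignores packet headers and cycles lost to flow control, so the usable
   payload throughput will be lower. The function name link_bandwidth_gbps is
   purely illustrative and not part of NetCOPE.
</p>

```python
# Raw bandwidth of one internal bus link direction: width (bits) x clock (Hz).
CLOCK_HZ = 125_000_000  # 125 MHz, as stated in the text


def link_bandwidth_gbps(width_bits, clock_hz=CLOCK_HZ):
    """Return the raw bandwidth of one link direction in Gb/s."""
    return width_bits * clock_hz / 1e9


for width in (64, 128):
    per_direction = link_bandwidth_gbps(width)
    # Full duplex: both directions can carry data in the same clock cycle.
    print(f"{width}-bit link: {per_direction:g} Gb/s per direction, "
          f"{2 * per_direction:g} Gb/s aggregate")
```

<p>
   Under these assumptions, a 64-bit link carries 8 Gb/s and a 128-bit link
   16 Gb/s in each direction.
</p>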
<p>
   The tree topology enables the endpoints to communicate with each other in
   separate branches without wasting the bandwidth of the global internal bus. As
   shown in <a href="#ib1">Figure</a>, the tree architecture can also be
   redrawn as a bus architecture with pipeline stages in the form of switch
   components. This kind of pipelining is very important in the FPGA context,
   because such wide buses are usually sensitive to distance due to limited
   FPGA wire routing resources.
</p>
<p>
   The internal bus uses a packet-based communication protocol. Each packet
   consists of a header with the necessary control information and packet
   data. The timing diagram of the communication protocol is depicted in <a
   href="#ib2">Figure</a>.  The packets are marked by the Start-Of-Packet (SOP)
   and End-Of-Packet (EOP) signals. Packet data transfer is controlled by means of the 
   Source Ready (SRC_RDY) and Destination Ready (DST_RDY) signals. Using
   these two signals, the communication can be easily stopped by either the
   receiver or the transmitter. One of the main advantages of the internal bus
   communication protocol is that in the basic mode this protocol is compatible
   with the FrameLink and Xilinx LocalLink protocols. Due to this compatibility,
   it is possible to connect some Xilinx IP Cores directly to the
   internal bus infrastructure. The Aurora IP Core that provides
   communication via multigigabit transceivers is an example of
   such a component.
</p>
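<p>
   The handshake described above can be illustrated by a small cycle-based
   model. It is only a sketch: signal polarity and header contents are
   abstracted away, and the function name transfer and its arguments are
   hypothetical. A word is accepted only in a cycle where both SRC_RDY and
   DST_RDY are asserted; SOP marks the first accepted word of the packet and
   EOP the last one.
</p>

```python
def transfer(words, src_rdy, dst_rdy):
    """Simulate one link cycle by cycle.

    words   -- data words of one packet, in order
    src_rdy -- per-cycle SRC_RDY values (True = transmitter offers a word)
    dst_rdy -- per-cycle DST_RDY values (True = receiver can accept a word)
    Returns a list of (cycle, word, sop, eop) tuples for accepted words.
    """
    accepted = []
    index = 0
    for cycle, (src, dst) in enumerate(zip(src_rdy, dst_rdy)):
        if index == len(words):
            break  # whole packet already transferred
        if src and dst:  # a word moves only when both sides are ready
            sop = index == 0
            eop = index == len(words) - 1
            accepted.append((cycle, words[index], sop, eop))
            index += 1
    return accepted


trace = transfer(["hdr", "d0", "d1"],
                 src_rdy=[True, True, False, True, True],
                 dst_rdy=[True, False, True, True, True])
for entry in trace:
    print(entry)
```

<p>
   In this example the words are accepted in cycles 0, 3 and 4. Cycle 1 is a
   stall caused by the receiver and cycle 2 a stall caused by the transmitter.
</p>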

<obr id="ib2" src="ib_timing_diagram">
   Internal bus communication protocol
</obr>

<!-- ********************************************************************* -->
<!--                           Local Bus                                   -->
<!-- ********************************************************************* -->
<h1>Local Bus</h1>
<p>
   The components that send or receive moderate amounts of data can be
   connected using the <i>local bus</i>. Typically, the local bus is used for
   transferring configuration data, programs for microcontrollers,
   debug/status information etc. As shown in <a href="#lb1">Figure</a>, the
   local bus again has a tree structure whose main components are the <i>root</i>,
   <i>switch</i> and <i>endpoint</i>. In comparison to the internal bus, the
   local bus is much simpler.  Communication can be initiated only by the
   root, all endpoints work as slave nodes and the switch component simply
   forwards transactions from the upstream port to all downstream ports without
   any routing mechanism. As in the case of the internal bus switch component, the
   local bus switch components represent pipeline stages that reduce the
   sensitivity to distance.
</p>
<obr id="lb1" src="lb_architecture">
   Local bus architecture
</obr>
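<p>
   The forwarding behaviour of the local bus can be sketched as follows. The
   report does not say how an endpoint recognizes its own transactions, so the
   per-endpoint address range used here is an assumption, and the class names
   Endpoint and Switch are illustrative only. The caller plays the role of the
   root, which is the only node that initiates transactions.
</p>

```python
class Endpoint:
    """Slave node: reacts only to addresses inside its own (assumed) range."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.regs = {}

    def owns(self, addr):
        return self.base + self.size > addr >= self.base

    def write(self, addr, data):
        if self.owns(addr):
            self.regs[addr] = data

    def read(self, addr):
        # None means "this endpoint does not answer the transaction".
        return self.regs.get(addr, 0) if self.owns(addr) else None


class Switch:
    """Forwards every transaction to all downstream ports (no routing)."""

    def __init__(self, downstream):
        self.downstream = downstream

    def write(self, addr, data):
        for port in self.downstream:
            port.write(addr, data)

    def read(self, addr):
        for port in self.downstream:
            value = port.read(addr)
            if value is not None:
                return value
        return None


ep_a = Endpoint(0x000, 0x100)
ep_b = Endpoint(0x100, 0x100)
tree = Switch([ep_a, Switch([ep_b])])  # switches can be cascaded
tree.write(0x104, 0xBEEF)
print(hex(tree.read(0x104)))
```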
<p>
   A local bus link is 16 bits wide and operates at 125 MHz. Each link
   is bidirectional, but the communication is usually performed only in one
   direction. (Bidirectional links are needed to avoid tri-state
   buses.) The communication protocol is shown in <a
   href="#lb2">Figure</a>. It is a simple protocol that distinguishes read
   and write transactions. Each transaction starts with an address followed by
   the data to be read or written with appropriate control signals. One of
   the main advantages of this communication protocol is that it can be easily
   transferred between multiple FPGA chips using just auxiliary registers -- no
   dedicated bridge component is needed. The inter-FPGA communication only
   affects the latency of read transactions.
</p>

<obr id="lb2" src="lb_timing_diagram">
   Local bus communication protocol
</obr>

<!-- ********************************************************************* -->
<!--                          Control Bus                                  -->
<!-- ********************************************************************* -->
<h1 id="sec-cb">Control Bus</h1>
<p>
   The <i>control bus</i> is reserved for control data transfers
   between the <i>programmable DMA controller (PDMA)</i> and components that
   need to send or receive data to/from the host RAM memory. Typical examples
   of such components are software receive/transmit buffers, DRAM controller,
   etc.  Like the internal and local buses, the control bus
   is based on a tree structure (see <a href="#cb1">Figure</a>). The root component
   is a part of the PDMA controller, the switches simply forward transactions
   from upstream port to all downstream ports and the endpoints connect the
   components that require DMA operations.
</p>

<obr id="cb1" src="cb_architecture">
   Control bus architecture
</obr>

<p>
   A control bus link is full duplex, 16 bits wide and operates at 125 MHz
   synchronously with other parts of the NetCOPE bus system. Transactions are
   transferred as packets containing headers and data. The
   communication protocol is compatible with FrameLink. As shown in <a
   href="#cb2">Figure</a> a packet is marked by the Start-Of-Packet (SOP) and
   End-Of-Packet (EOP) signals. Data are transferred using the Source Ready (SRC_RDY)
   and Destination Ready (DST_RDY) signals.  The root component can send a packet
   to any of the endpoints and any endpoint can send a packet to the root. However, endpoints cannot exchange packets with each other directly.
</p>

<obr id="cb2" src="ib_timing_diagram">
   Control bus communication protocol
</obr>

<p>
   The root is a key component of the control bus system. As shown in <a
   href="#cb3">Figure</a>, the root architecture consists of sixteen
   receive and transmit queues, control registers and a small controller. All
   incoming messages from endpoints are split into sixteen receive (RX) queues
   based on the endpoint identification. (Note: the control bus supports
   only sixteen endpoints.)
</p>

<obr id="cb3" src="cb_queues">
   Root component architecture
</obr>

<p>
   All RX queues are implemented inside two-port embedded BlockRAM memory
   blocks.  The first port is reserved for packets coming from the 
   control bus.  The second port is accessible from the user component that
   utilizes the root of the control bus. (Note: In our case the root component is
   utilized by the PDMA controller or the PowerPC processor.) The start and end pointers of the individual RX queues are stored in the root's
   status registers. The contents of the appropriate RX queue are available via
   a common memory interface. Finally, the end pointer can be moved by writing to the
   root's control registers.
</p>

<p>
   The packet reception process operates in the following steps:
</p>
   <ol>
      <li>The packet coming from the control bus interface is stored into the
      appropriate RX queue based on the source address (endpoint
      identification).</li>

      <li>The number of items in the RX queue as well as the RX queue start
      pointer are updated in the array of root's status registers.</li>

      <li>The user component reads the content of the packet via direct RX queue
      memory access.</li>

      <li>As the data are read, the user component writes the number of read
      items into the dedicated root's control register and the RX queue end
      pointer is moved to the appropriate position.</li>
   </ol>
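<p>
   The reception steps above can be modelled in software as a set of circular
   queues. This is a behavioural sketch only: the queue depth, the circular
   addressing, the overflow handling and all class and method names are
   assumptions. The pointer naming follows the text, so the start pointer
   advances when a packet is stored and the end pointer advances when the user
   component releases items.
</p>

```python
NUM_ENDPOINTS = 16   # stated in the text
QUEUE_DEPTH = 64     # hypothetical depth, not given in the report


class RxQueue:
    def __init__(self):
        self.mem = [0] * QUEUE_DEPTH   # stands in for the BlockRAM
        self.start = 0                 # advanced when a packet is stored
        self.end = 0                   # advanced when the user releases items
        self.items = 0                 # status register: unread items


class CbRootRx:
    def __init__(self):
        self.queues = [RxQueue() for _ in range(NUM_ENDPOINTS)]

    def receive(self, endpoint_id, words):
        """Steps 1-2: store the packet, update start pointer and item count."""
        q = self.queues[endpoint_id]
        if len(words) > QUEUE_DEPTH - q.items:
            raise OverflowError("RX queue full")
        for word in words:
            q.mem[q.start] = word
            q.start = (q.start + 1) % QUEUE_DEPTH
        q.items += len(words)

    def read(self, endpoint_id, count):
        """Step 3: direct memory access to unread items (non-destructive)."""
        q = self.queues[endpoint_id]
        count = min(count, q.items)
        return [q.mem[(q.end + i) % QUEUE_DEPTH] for i in range(count)]

    def release(self, endpoint_id, count):
        """Step 4: report the number of read items; the end pointer moves."""
        q = self.queues[endpoint_id]
        count = min(count, q.items)
        q.end = (q.end + count) % QUEUE_DEPTH
        q.items -= count


root = CbRootRx()
root.receive(3, [10, 20, 30])      # packet from endpoint 3
print(root.read(3, 3))
root.release(3, 2)                 # two items consumed
print(root.read(3, 3))
```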

<p>
   TX queues are realized in the same manner as RX queues. The root component
   contains sixteen TX queues, each reserved for one
   endpoint. If the user component needs to send a packet from the root to
   an endpoint, it simply writes the content of the packet into the TX queue
   memory and uses the root's status/control registers to trigger packet
   transmission.
</p>

<p>
   The packet transmission process operates in the following steps:
</p>
   <ol>
      <li>The user component reads root's status registers to obtain the TX queue
      start pointer.</li>

      <li>The user component writes the packet content into the appropriate
      memory position.</li>

      <li>The user component writes the number of written items into the
      dedicated root's control register.</li>

      <li>The root controller reads the required number of items from the TX queue and
      sends them as a packet to the specified endpoint.</li>

      <li>After the packet transmission, the TX queue end pointer is updated.</li>
   </ol>
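<p>
   The transmission steps can be sketched in the same style. Again, this is
   an illustrative model under assumptions: the queue depth, the send callback
   that stands in for the control bus link, and all names are hypothetical.
   In hardware the root controller sends the packet asynchronously; here
   steps 3 to 5 are collapsed into one call for brevity.
</p>

```python
QUEUE_DEPTH = 64  # hypothetical depth, not given in the report


class TxQueue:
    def __init__(self, endpoint_id, send):
        self.endpoint_id = endpoint_id
        self.send = send                 # stands in for the control bus link
        self.mem = [0] * QUEUE_DEPTH     # stands in for the BlockRAM
        self.start = 0                   # where the user writes next
        self.end = 0                     # everything before this was sent

    def status_start(self):
        """Step 1: read the TX queue start pointer from the status register."""
        return self.start

    def write_mem(self, offset, words):
        """Step 2: write the packet content into queue memory."""
        for i, word in enumerate(words):
            self.mem[(offset + i) % QUEUE_DEPTH] = word

    def control_write(self, count):
        """Step 3 triggers steps 4 and 5 of the list above."""
        packet = [self.mem[(self.start + i) % QUEUE_DEPTH]
                  for i in range(count)]
        self.send(self.endpoint_id, packet)                # step 4
        self.start = (self.start + count) % QUEUE_DEPTH
        self.end = self.start                              # step 5


sent = []
queue = TxQueue(5, lambda endpoint, packet: sent.append((endpoint, packet)))
offset = queue.status_start()
queue.write_mem(offset, [1, 2, 3])
queue.control_write(3)
print(sent)
```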


<!-- ********************************************************************* -->
<!--                     Programmable DMA Controller                       -->
<!-- ********************************************************************* -->
<h1>Programmable DMA Controller</h1>
<p>
   The main goal of the programmable DMA controller (PDMA) is to manage all DMA
   operations between the FPGA adaptor and host RAM memory. Typically, PDMA
   operates in the following steps:
</p>
   <ol>
      <li>Download scatter-gather lists (SG) into its local memory</li>

      <li>Process all items in the lists, where each item represents a DMA
      operation</li>

      <li>Modify appropriate parameters of SG lists</li>

      <li>Finally, upload modified SG lists back to host RAM and
      possibly generate an interrupt.</li>
   </ol>
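<p>
   A minimal software model of this cycle is shown below. The fields of an SG
   list item are not specified in this report, so host_addr, length and done
   are assumptions, as are the run_dma and raise_irq callbacks that stand in
   for the DMA engine and the interrupt line.
</p>

```python
import copy
from dataclasses import dataclass


@dataclass
class SgItem:
    host_addr: int      # hypothetical field: buffer address in host RAM
    length: int         # hypothetical field: buffer length in bytes
    done: bool = False  # hypothetical flag set by the PDMA


def pdma_cycle(host_sg_list, run_dma, raise_irq=None):
    """Run one download / process / modify / upload cycle."""
    local = copy.deepcopy(host_sg_list)      # 1. download into local memory
    for item in local:                       # 2. one DMA operation per item
        run_dma(item.host_addr, item.length)
        item.done = True                     # 3. modify SG list parameters
    host_sg_list[:] = local                  # 4. upload the modified list
    if raise_irq is not None:
        raise_irq()                          #    and possibly interrupt


operations = []
interrupts = []
sg_list = [SgItem(0x1000, 64), SgItem(0x2000, 128)]
pdma_cycle(sg_list,
           run_dma=lambda addr, length: operations.append((addr, length)),
           raise_irq=lambda: interrupts.append("irq"))
print(operations, sg_list)
```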
<p>
   The programmability of the DMA controller is a very important feature,
   because each application has different requirements and parameters
   relevant to its DMA operations. As <a href="#pdma1">Figure</a> shows, the
   architecture of the PDMA is composed of a PowerPC processor (PPC), a control bus
   root component and local memories for the PPC program and data. The CB root's TX
   and RX queues are mapped to the PPC prefetchable PLB (Processor Local Bus) address space. Using this
   connection, all TX and RX messages can be processed very quickly inside the
   internal PPC cache. In addition, the CB root's status and control registers are
   mapped to the OCM (On-Chip Memory) interface.
</p>

<obr id="pdma1" src="cb_ppcarch">
   PDMA architecture
</obr>

<p>
   As mentioned in <a href="#sec-cb">Section</a>, all communication between the PDMA
   and control bus endpoint components is based on sending messages. For
   example, when a new packet arrives in the system, a message is sent to the
   PDMA. Similarly, once a packet has been processed, an acknowledgement message is sent
   from the PDMA to the endpoint.
</p>
<p>
   The previous implementation of the PDMA was based on direct reading instead of
   sending messages. However, direct reading suffers from very high
   latencies for read operations, which is not acceptable for high speed
   network applications based on 10 Gigabit Ethernet. In the next part, we
   show examples of how the PDMA works during the packet reception and transmission
   processes.
</p>
<p>
   The packet transmission process operates in the following steps:
</p>
   <ol>
      <li>PPC sends a request to a DMA controller, such as the controller placed inside
      the PCI bridge, to transfer the SG list from host RAM into the local PPC data
      memory.</li>

      <li>As soon as the transfer is finished, the DMA controller sends
      an acknowledgement to PPC. (Note: PPC polls the RX queue status register for information about new incoming messages.)</li>

      <li>PPC processes each SG list item in the following way:
      <ol>
         <li>PPC sends a request to the DMA controller to transfer the packet from host
         RAM into the internal buffer of the software transmit buffer (SW_TXBUF)
         component.</li>

         <li>As soon as the transfer is finished, PPC sends a message to SW_TXBUF
         containing information about packet offset (inside internal
         SW_TXBUF buffer), packet length and potential flags.</li>

         <li>SW_TXBUF forwards the required packet to the network interface and sends an
         acknowledgement message back to PPC.</li>
      </ol></li>

      <li>As soon as all items of the SG list are processed, PPC sends a request
      for downloading a new SG list.  </li> 
   </ol>
  
<obr id="pdma2" src="cb_swtxbuf"> 
   Example packet transmission process 
</obr>
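<p>
   The inner loop of the transmission process (step 3) can be sketched as
   follows. The message fields offset, length and flags come from the text;
   everything else is an assumption made for illustration, including the
   buffer size, the class and function names, the "ACK" return value and the
   simplistic offset allocation. The download of the SG list (steps 1 and 2)
   is assumed to have happened already.
</p>

```python
from collections import namedtuple

TxMsg = namedtuple("TxMsg", "offset length flags")  # fields named in the text


class SwTxBuf:
    def __init__(self):
        self.buffer = bytearray(2048)   # hypothetical internal buffer size
        self.wire = []                  # packets forwarded to the network

    def dma_write(self, offset, data):
        """Target of the DMA transfer from host RAM (step 3.1)."""
        self.buffer[offset:offset + len(data)] = data

    def handle(self, msg):
        """Step 3.3: forward the packet, then acknowledge."""
        end = msg.offset + msg.length
        self.wire.append(bytes(self.buffer[msg.offset:end]))
        return "ACK"


def transmit_sg_list(host_packets, txbuf):
    """PPC side; host_packets stands in for an already downloaded SG list."""
    offset = 0
    for data in host_packets:
        txbuf.dma_write(offset, data)                     # step 3.1
        ack = txbuf.handle(TxMsg(offset, len(data), 0))   # steps 3.2 and 3.3
        if ack != "ACK":
            raise RuntimeError("SW_TXBUF did not acknowledge the packet")
        offset += len(data)   # simplistic: freed space is never reused


txbuf = SwTxBuf()
transmit_sg_list([b"abc", b"defgh"], txbuf)
print(txbuf.wire)
```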

<p>
   The packet reception process operates in the following steps:
</p>
   <ol>
      <li>PPC sends a request to a DMA controller, such as the controller placed inside
      the PCI bridge, to transfer the SG list from host RAM into the local PPC data
      memory.</li>

      <li>As soon as the transfer is finished, the DMA controller sends
      an acknowledgement message to PPC.</li>

      <li>PPC processes each SG list item in the following way:
      <ol>
         <li>If a new packet arrives into the system, the software receive buffer
         (SW_RXBUF) generates a message to PPC containing information about the
         packet offset, packet length and potential flags.</li>

         <li>PPC sends a request to the DMA controller to transfer the packet from
         SW_RXBUF component to host RAM.</li>

         <li>As soon as the transfer is finished, the DMA controller sends
         an acknowledgement message back to the PPC processor.  </li>

         <li>PPC sends an acknowledgement message to the SW_RXBUF component and the
         incoming packet can be released from the SW_RXBUF internal buffer.
         Concurrently, PPC modifies the appropriate parameters or flags (e.g., interface number, packet error status etc.) in the SG list item.</li>
      </ol></li>

      <li>As soon as all items of the SG list have been processed, PPC sends a request to
      the DMA controller for uploading the modified SG list back to host RAM and
      downloading a new SG list to the PPC local data memory.</li>
   </ol>

<obr id="pdma3" src="cb_swrxbuf">
   Example packet reception process
</obr>
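<p>
   The reception process can be modelled in a similar way. The sketch assumes
   that each SG list item is a dictionary with the hypothetical keys addr,
   length and flags, that host RAM is a dictionary indexed by address, and
   that the loop stops when no packet is pending, whereas the real PDMA would
   wait for the next message. All class and function names are illustrative.
</p>

```python
from collections import namedtuple

RxMsg = namedtuple("RxMsg", "offset length flags")  # fields named in the text


class SwRxBuf:
    def __init__(self):
        self.buffer = bytearray(2048)   # hypothetical internal buffer size
        self.used = 0
        self.pending = []               # messages waiting for the PPC
        self.released = []              # packets acknowledged by the PPC

    def packet_in(self, data, flags=0):
        """Step 3.1: store the packet and generate a message for the PPC."""
        offset = self.used
        self.buffer[offset:offset + len(data)] = data
        self.used += len(data)
        self.pending.append(RxMsg(offset, len(data), flags))

    def release(self, msg):
        """Step 3.4: the acknowledged packet may leave the buffer."""
        self.released.append(msg)


def receive_into(sg_list, rxbuf, host_ram):
    """PPC side of step 3; sg_list is assumed to be downloaded already."""
    for item in sg_list:
        if not rxbuf.pending:
            break                                   # model only: no waiting
        msg = rxbuf.pending.pop(0)
        end = msg.offset + msg.length
        host_ram[item["addr"]] = bytes(rxbuf.buffer[msg.offset:end])  # 3.2, 3.3
        rxbuf.release(msg)                          # 3.4 acknowledgement
        item["length"] = msg.length                 # 3.4 update the SG item
        item["flags"] = msg.flags
    return sg_list                                  # step 4 uploads this list


rxbuf = SwRxBuf()
rxbuf.packet_in(b"hello", flags=1)
rxbuf.packet_in(b"xy")
sg_list = [{"addr": 0x1000, "length": 0, "flags": 0},
           {"addr": 0x2000, "length": 0, "flags": 0},
           {"addr": 0x3000, "length": 0, "flags": 0}]
host_ram = {}
receive_into(sg_list, rxbuf, host_ram)
print(host_ram)
```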

<p>
   Finally, an example of the physical connection between the SW_RXBUF and SW_TXBUF
   components is shown in <a href="#pdma4">Figure</a>. PPC uses two DMA controllers. The
   first one is placed inside a PDMA internal bus endpoint and controls all DMA
   operations between the PDMA local memory and host RAM. The second one is placed
   inside the SW_RXBUF and SW_TXBUF components and controls all DMA operations
   between the SW_TXBUF and SW_RXBUF components and host RAM. Both DMA controllers
   are connected using control bus endpoints and can be controlled via two TX
   and RX queues of the PDMA. Two other control bus endpoints are used for
   communication with the SW_RXBUF and SW_TXBUF components, so the next two RX and
   TX queues are reserved for them.
</p>

<obr id="pdma4" src="nic_arch">
   Example of PDMA connection
</obr>

<!-- ********************************************************************* -->
<!--                                Conclusions                            -->
<!-- ********************************************************************* -->
<h1 id="conc">Conclusions</h1>
<p>
   This report proposes the architecture of the NetCOPE interconnection
   system. We described the internal, local and control buses together with
   their communication protocols. Moreover, the programmable DMA
   controller (PDMA) was presented and
   its functionality illustrated by examples of packet reception
   and transmission.
</p>

<!-- ********************************************************************* -->
<!--                                References                             -->
<!-- ********************************************************************* -->
   <seznamknih>

   <kniha id="Zl05">
      Žádník M., Lhotka L.:
      <i>Hardware-Accelerated NetFlow Probe</i>,
      Technical Report 32/2005, CESNET, Praha, 2005.
   </kniha>

   <kniha id="Kkh06">
      Kobierský P., Kořenek J., Hank A.:
      <i>Traffic Scanner</i>,
      Technical Report <a href="http://www.cesnet.cz/doc/techzpravy/2006/ids/">33/2006</a>, CESNET, Praha, 2006.
   </kniha>

   <kniha id="Tk06">
      Tobola J., Košek M.:
      <i>Frame Link Tools</i>,
      Technical Report, in preparation, CESNET, Praha, 2006.
   </kniha>

   <kniha id="Abs03">
      Anderson D., Budruk R., Shanley T.:
      <i>PCI Express System Architecture</i>, 
      MindShare, Inc., September 4, 2003
   </kniha>

   <kniha id="Xrg03">
      Xilinx, Inc.:
      <i>PowerPC Processor Reference Guide</i>, 
      September, 2003
   </kniha>

   <kniha id="Xpbg05">
      Xilinx, Inc.:
      <i>PowerPC 405 Processor Block Reference Guide</i>, 
      July, 2005
   </kniha>

   </seznamknih>
</zprava>

