<?xml version="1.0"  encoding="ISO-8859-2"?> 
<!DOCTYPE zprava SYSTEM "techrep.dtd"> 
<zprava cislo="8/2004" jazyk="en"> 
<nazev>Overview of NetFlow Monitoring Adapter </nazev> 
<autor>Martin Žádník</autor> 
<datum>3.11.2004</datum> 

<h1>Abstract</h1>

<p> The speed of communication among computers and other network
   devices is growing so fast that computers are no longer able to process all of the data.
   The limiting factor is the bus between hardware and software, so new approaches must be
   introduced to deal with it.  One of them is to implement the desired
   functionality directly in the network card.  The COMBO6 card offers a flexible platform
   for many applications and for later changes and added features.  One such application
   is NetFlow (network monitoring), which otherwise requires a lot of data to be passed
   to software. A potentially better solution is to aggregate the data in hardware and then
   send only the processed data to software via the PCI bus.  </p>

<h1>Introduction</h1>



<p>One of the many applications suitable for hardware acceleration is
   network traffic measurement. Accurate traffic information is essential for planning new
   networks, guaranteeing bandwidth, detecting DoS attacks, billing and
   accounting.  </p>

<p> This document presents the main idea of an architecture that implements network
   traffic measurement, also known as NetFlow. The architecture is suitable
   for description in VHDL and is synthesizable in an FPGA.  </p>

<p>The main blocks are:</p>
<ul compact="1">
<li>IBUF -- Input Buffer,</li>
<li>TSU  -- Timestamp Unit, </li>
<li>HFE  -- Header Field Extractor,</li>
<li>HASH -- Hash Unit,</li>
<li>CAM  -- unit which controls the TCAM (ternary content-addressable memory),</li>
<li>MAN  -- management unit for CAM and SRAM,</li>
<li>SRAM -- unit which controls SSRAM(synchronous SRAM),</li>
<li>FIFO -- FIFO which stores unified header information before it is passed to SRAM,</li>
<li>SW_FIFO -- short-term FIFO holding expired records from SRAM. It allows software to read those records via the PCI bus.</li>
</ul>


<obr src="netflowarch" id="mc">Main idea of the architecture</obr>

<p>Optional:</p>
<ul compact="1">
<li>TSUP -- Precise Timestamp Unit on COMBO PTM card, </li>
<li>FILT -- Filtering Unit, </li>
<li>SAU  -- Sampling Unit. </li>
</ul>




<h1>Content of Cards</h1>

<p> Several different types of cards have been introduced during the
   Liberouter project.  </p>

<p> The basic mother card, COMBO6, must always be used.  It contains
   one FPGA (Virtex II v1000) surrounded by various chips such as TCAM, 3x SSRAM, PLX and SDRAM.  Its
   purpose is to provide enough computing power for the interface card and a good
   connection to the PC via the PCI bus.  </p>

<p>There are several add-on interface cards for the COMBO6 card, which provide
	a flexible connection between various types of network interfaces and
	the mother card. One of them is COMBO4-MTX with four copper Gbps
	interfaces. This card is equipped with an FPGA (Virtex II v1000), 2x SSRAM
	(in a future design), TCAM (in a future design), etc. </p>

<p>This card architecture supports our design of the NetFlow monitoring adapter.</p>

<h1>Block Description</h1>
 
<h2>IBUF</h2>

<p>The Input Buffer (IBUF) stores incoming packets. Packets
	are received from the GMII interface, and only those with a correct CRC
	are saved, together with their assigned timestamp, into the internal
	IBUF memory.</p>

<h2>TSU</h2>

<p>The Timestamp Unit is implemented as a 37-bit counter running at 100 MHz (T = 10 ns).
	Only the 32 most significant bits are used for further processing, so the output of this
	unit is a 32-bit timestamp with a resolution of 320 ns (2^5*10 ns). This
	resolution is sufficient to distinguish
	two consecutive packets, and the timestamp is unique for approximately 1374 s (2^37*10 ns). With this
	approach, software has to read the TSU register and interpolate
	the final timestamp values against UNIX system time.   </p>

<p>Example:</p>

<tab sloupce="llll">
<tr>
   <th> </th> 
   <th>  Value in HW </th>
   <th>  Interpolation  </th>
   <th> Value in SW </th>
</tr>
<tr>
   <td> TSU register </td>
   <td> 0x0000000A  </td>
   <td> none -- just remember </td>
   <td> 12:30:00,000 </td>
</tr>
<tr>
   <td>TSU register </td>
   <td> 0x00008000  </td>
   <td> none -- just remember </td>
   <td> 12:30:00,021 </td>
</tr>
<tr>
   <td> Start timestamp </td>
   <td> 0x00000800  </td>
   <td> 0x0000000A + 0x00000800 = 0x0000080A </td>
   <td> 12:30:00,001 </td>
</tr>
<nazev>Example of interpolation</nazev>
</tab>
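
<p>The example can be sketched in Python as follows. This is only an illustrative sketch, not the project's software: the function name interpolate_ms and the choice of linear interpolation between two remembered register readings are assumptions, and the numbers are taken from the table above.</p>

```python
# Sketch of timestamp interpolation; all names here are hypothetical.
TICK_NS = 10                 # 100 MHz counter period from the text
DROPPED_BITS = 5             # 37-bit counter, 32 most significant bits kept
RESOLUTION_NS = TICK_NS * 2 ** DROPPED_BITS      # 320 ns per timestamp unit
WRAP_SECONDS = 2 ** 37 * TICK_NS / 1e9           # about 1374.39 s


def interpolate_ms(hw_value, ref_a, ref_b):
    """Map a hardware timestamp to software time in milliseconds.

    ref_a and ref_b are (hw_value, sw_time_ms) pairs remembered by software
    when it read the TSU register. Assumes no counter wrap between them.
    """
    hw_a, sw_a = ref_a
    hw_b, sw_b = ref_b
    fraction = (hw_value - hw_a) / (hw_b - hw_a)
    return sw_a + fraction * (sw_b - sw_a)


# Reference points and start timestamp taken from the example table,
# with software time expressed as milliseconds after 12:30:00.
start_ms = interpolate_ms(0x00000800, (0x0000000A, 0.0), (0x00008000, 21.0))
print(f"resolution: {RESOLUTION_NS} ns, wrap: {WRAP_SECONDS:.2f} s")
print(f"start timestamp is about {start_ms:.2f} ms after 12:30:00")
```

<p>Under these assumptions the start timestamp falls a little over one millisecond after the first reading, which agrees with the value 12:30:00,001 in the table.</p>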

<h2>HFE</h2>

<p> The Header Field Extractor analyzes incoming packets. It
	is a RISC-based processor controlled by a specific
	instruction set. HFE reads packet data from the Input Buffer, analyzes
	the control information in the packet headers, extracts the required fields from the IP
	and TCP/UDP headers and assembles the unique key that identifies each
	flow.  This key consists of the IP source address, IP destination address,
	source port, destination port, transport layer protocol and type of
	service (ToS).  After processing each datagram, HFE also
	provides the packet information for the FIFO.  </p>
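
<p>The flow key can be modelled in software as shown below. This is a sketch only: the class FlowKey, its field names and the byte layout (addresses padded to 16 bytes, network byte order) are assumptions made for illustration and do not describe the actual HFE output format.</p>

```python
import ipaddress
import struct
from typing import NamedTuple


class FlowKey(NamedTuple):
    """The six key fields named in the text; the class itself is hypothetical."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    tos: int

    def to_bytes(self) -> bytes:
        # Assumed layout: addresses left-padded with zeros to 16 bytes so that
        # IPv4 and IPv6 keys have the same length, then ports, protocol, ToS.
        src = ipaddress.ip_address(self.src_ip).packed.rjust(16, b"\x00")
        dst = ipaddress.ip_address(self.dst_ip).packed.rjust(16, b"\x00")
        return struct.pack("!16s16sHHBB", src, dst,
                           self.src_port, self.dst_port,
                           self.protocol, self.tos)


# Two packets of the same TCP connection yield the same key.
key = FlowKey("192.0.2.1", "198.51.100.7", 1234, 80, 6, 0)
print(key.to_bytes().hex())
```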

<h2>FIFO</h2>

<p> The packet FIFO stores information about incoming packets
	produced by the HFE processor. Records are held in this block until the control
	path (HASH->TCAM) has been passed through. For every incoming packet the FIFO
	contains the following items: number of bytes, timestamp, flags and key
	(NetFlow packet identification).  These items are supplied when
	a record in SSRAM is updated or created.  </p>

<h2>HASH</h2>

<p> The Hash block implements a hash function (for example CRC). Its input is the six main
	fields of the datagram: IP destination and source address, destination and
	source port, protocol and ToS. These fields need to be hashed because
	the TCAM is configured as 32 768 entries, each 68 (64) bits wide. Assuming 200 000
	flows in one second, the probability that a new flow maps to the same
	value as another flow is about 200000/2^64 = 1E-14. The hash value is therefore a
	practically unique identifier for every flow over a reasonable time period (1 s).
</p>
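
<p>A 64-bit CRC of the key could look like the sketch below. The text only says that the hash may be a CRC; the particular polynomial (the one used by CRC-64/ECMA-182), the zero initial value and the function name flow_hash64 are assumptions chosen for illustration.</p>

```python
POLY = 0x42F0E1EBA9EA3693   # assumed polynomial (CRC-64/ECMA-182)
MODULUS = 2 ** 64


def flow_hash64(key: bytes) -> int:
    """Bitwise, most-significant-bit-first CRC-64 of a flow key."""
    crc = 0
    for byte in key:
        crc ^= byte * 2 ** 56            # feed the byte into the top 8 bits
        for _ in range(8):
            top_bit = crc >> 63
            crc = (crc * 2) % MODULUS    # shift left, keep 64 bits
            if top_bit:
                crc ^= POLY
    return crc


# Chance that one new flow hits any of 200 000 existing 64-bit values.
collision_probability = 200_000 / 2 ** 64    # roughly 1.08e-14
print(hex(flow_hash64(b"example key")), collision_probability)
```

<p>A hardware implementation would typically process many bits per clock cycle; the bit-serial loop here is used only because it is easy to read.</p>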
 
<h2>CAM</h2>

<p> The CAM block is built around the TCAM (Ternary Content Addressable Memory), which is configured as 32 768 records of
    64 bits each. The CAM driver tries to match the hash value in the TCAM.  If
    a record for this flow exists, it returns a pointer to SSRAM. If there is no
    record for this flow, it creates a new one and returns a pointer to the allocated
    memory.  </p>

<p>The CAM also carries out instructions given by MAN: it frees expired records
   and then sends an acknowledgement back to MAN.  </p>
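
<p>The match-or-allocate behaviour of the CAM block can be modelled in software roughly as follows. The class CamModel and its methods are hypothetical; a dictionary stands in for the TCAM and the returned row number stands in for the SSRAM pointer.</p>

```python
class CamModel:
    """Software stand-in for the TCAM lookup; not the hardware interface."""

    def __init__(self, rows=32768):
        self.row_of = {}                                 # hash value -> SSRAM row
        self.free_rows = list(range(rows - 1, -1, -1))   # row 0 is handed out first

    def lookup(self, flow_hash):
        """Return (row, created) for a flow hash, allocating a row on a miss."""
        if flow_hash in self.row_of:
            return self.row_of[flow_hash], False
        if not self.free_rows:
            raise MemoryError("no free TCAM row")
        row = self.free_rows.pop()
        self.row_of[flow_hash] = row
        return row, True

    def free(self, flow_hash):
        """Release the row of an expired flow, as requested by MAN."""
        row = self.row_of.pop(flow_hash)
        self.free_rows.append(row)
        return row


cam = CamModel(rows=4)
print(cam.lookup(0xABC))   # first packet of a flow: a new row is allocated
print(cam.lookup(0xABC))   # next packet of the same flow: the row is matched
```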

<h2>MAN</h2>

<p> MAN manages the cooperation between TCAM and SSRAM. Its basic function is
   to keep track of the flows stored in TCAM and SSRAM and to
   add or free flows so that there is always enough free space for incoming flows.</p>

<p> Disposal of inactive flows is implemented with a 3-bit field per row. A pointer cycles
	through the rows and decrements the value in each of them. If it reads the value 010b,
	the record is inactive and has to be disposed of (2^3-2 = 6, that is, a
	precision of 1/6 of the set value).  The speed at which the pointer cycles is
	set by software. When a flow is created or matched in TCAM, the value in the corresponding
        row is set to 111b.</p> 
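
<p>The countdown described above can be simulated with a few lines of Python. This is a sketch under stated assumptions: the class AgingTable is hypothetical, the value 000b is used here to mark a free row, and the inactive value is checked before the decrement.</p>

```python
FRESH = 0b111      # value written when a flow is created or matched
INACTIVE = 0b010   # value at which the record is disposed of
FREE = 0b000       # assumption: marks an unused row in this model


class AgingTable:
    def __init__(self, rows):
        self.state = [FREE] * rows

    def touch(self, row):
        """Flow created or matched in TCAM: restart its countdown."""
        self.state[row] = FRESH

    def sweep(self):
        """One full pass of the pointer; returns the rows that expired."""
        expired = []
        for row, value in enumerate(self.state):
            if value == INACTIVE:
                expired.append(row)
                self.state[row] = FREE
            elif value > INACTIVE:
                self.state[row] = value - 1
        return expired


table = AgingTable(rows=4)
table.touch(1)
for sweep_number in range(1, 7):
    print(sweep_number, table.sweep())
```

<p>In this model a flow that receives no further packets is reported on the sixth pass of the pointer, so the inactive timeout is resolved in steps of one sixth of its length.</p>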


<p> MAN also keeps an occupancy map of SSRAM that tells
	which rows are empty.  This
	allows MAN to tell SRAM whether the current operation is an update,
	a new record or a delete. When SRAM requests deletion of a record that
has resided there too long, MAN manages this operation.</p>

<p>Because processing is pipelined, MAN reserves one value of the
       3-bit field, called waiting for
   delete. It records whether the flow entry has already been deleted in TCAM or not.</p>

<p> A register holds the number of records stored in TCAM or SSRAM,
	which prevents overflow caused by too many records in either memory.
	This register also influences the mode of record disposal: if
	too many records are stored in SSRAM, MAN disposes of
	records more aggressively.
</p>

<p> To sum up, MAN issues commands to CAM and SRAM (for example delete record,
   update) and also responds to their requests. MAN should bind CAM and SRAM together
   so that they appear as one unit.  </p>

<h2>SRAM</h2>

<p> The purpose of this unit is to load and update data in SSRAM.  </p>

<p> For every flow the following record must be stored:</p>

<tab sloupce="lll">
      <tr> 
	 <th>Name</th>
	 <th>Size</th>
	 <th>Description</th>
      </tr>
      <tr> 
	 <td>Start timestamp </td>
	 <td> 4 Bytes </td>
	 <td> When the flow has begun </td>
      </tr>
      <tr>
	 <td>End timestamp </td>
	 <td> 4 Bytes </td>
	 <td> When the flow ended </td>
      </tr>
      <tr>
	 <td>Number of bytes </td>
	 <td> 8 Bytes </td>
	 <td> Total number of bytes for this flow </td>
      </tr>
      <tr>
	 <td>Number of datagrams </td>
	 <td>  4 Bytes </td>
	 <td> Total number of datagrams for this flow </td>
      </tr>
      <tr>
	 <td>Flags </td>
	 <td> 1 Byte </td>
	 <td> Aggregated flags from packets </td>
      </tr>
      <tr>
	 <td>Source IP address </td>
	 <td> 16 Bytes </td>
	 <td> Either IPv4 or IPv6 source address </td>
      </tr>
      <tr>
	 <td>Destination IP address </td>
	 <td> 16 Bytes </td>
	 <td> Either IPv4 or IPv6 destination address </td>
      </tr>
      <tr>
	 <td>Source port </td>
	 <td> 2 Bytes </td>
	 <td> Source port of transport protocol </td>
      </tr>
      <tr>
	 <td>Destination port </td>
	 <td> 2 Bytes </td>
	 <td> Destination port of transport protocol </td>
      </tr>
      <tr>
	 <td> Transport layer protocol </td>
	 <td> 1 Byte </td>
	 <td> Transport layer protocol </td>
      </tr>
      <tr>
	 <td>Type of Service </td>
	 <td>  1 Byte </td>
	 <td> Type of Service field in IP packet header</td>
      </tr>
      <tr>
	 <td>Total  </td>
	 <td>  59 Bytes </td>
	 <td>  This row is not part of the record</td>
      </tr>
<nazev>Structure of record</nazev>
</tab>
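
<p>The record size can be checked with Python's struct module. The field order follows the table above; the use of network byte order without padding and the helper name pack_record are assumptions made for this sketch, not a description of the actual SSRAM layout.</p>

```python
import struct

# Field order and sizes follow the table; byte order is an assumption.
# I = 4 bytes, Q = 8 bytes, B = 1 byte, 16s = 16 bytes, H = 2 bytes.
RECORD_FORMAT = "!IIQIB16s16sHHBB"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)   # 59 bytes, as in the table


def pack_record(start_ts, end_ts, n_bytes, n_datagrams, flags,
                src_ip, dst_ip, src_port, dst_port, protocol, tos):
    """Pack one flow record; src_ip and dst_ip are 16-byte strings."""
    return struct.pack(RECORD_FORMAT, start_ts, end_ts, n_bytes, n_datagrams,
                       flags, src_ip, dst_ip, src_port, dst_port, protocol, tos)


record = pack_record(0x80A, 0x900, 1500, 1, 0x02,
                     bytes(16), bytes(16), 1234, 80, 6, 0)
print(RECORD_SIZE, len(record))
```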

<p> The record can be extended up to 64 B (for example with a wider
   timestamp).  </p>

<p> During an update the start timestamp is checked against the active time register, and
   the record is either disposed of or stored again, depending on the expiration interval for
   active flows.  </p>

<p> When a record is stored, the end timestamp, number of packets, number of bytes and
   flags are updated. In the worst case this data format allows information to be gathered for
   approximately 1536 seconds (T/(G/B/S) = 2^32/(2^30/8/48)).  The worst case is a single
   flow at a speed of 1 Gbps made up of the shortest packets, that is, about one 48-byte packet
   every 360 ns.  Because the TSU provides only a 32-bit timestamp with a range of about 1374 s,
   there is no point in detecting overflow of the number of packets or number of bytes; the
   records must be read before the counters overflow.  Another bad case occurs when every packet creates a new
   flow. This is a problem especially for reading the data into the computer via the PCI bus,
   because the data is not compressed. Possible solutions are to discard
   data that cannot be read or to implement some aggregation in hardware (see the
   section below).</p>

<tab sloupce="ll">
      <tr> 
	 <th>Symbol</th>
	 <th>Description</th>
      </tr>
      <tr> 
	 <td>T </td>
	 <td> Timestamp </td>
      </tr>
      <tr>
	 <td>G </td>
	 <td> Link speed of 1 Gbps (taken as 2^30 bits per second) </td>
      </tr>
      <tr>
	 <td>B</td>
	 <td>Number of bits in a byte (8)</td>
      </tr>
      <tr>
	 <td>S</td>
	 <td>Shortest packet length in bytes (48)</td>
 </tr>
 <nazev>Description of used symbols</nazev>
</tab>
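
<p>The worst-case estimate can be reproduced numerically. The sketch below uses the values from the formula T/(G/B/S) and the TSU parameters given earlier; the function name counter_lifetime_s is hypothetical.</p>

```python
def counter_lifetime_s(counter_bits, link_bps, packet_bytes):
    """Seconds until a packet counter overflows at full line rate."""
    packets_per_second = link_bps / (8 * packet_bytes)
    return 2 ** counter_bits / packets_per_second


packet_counter_s = counter_lifetime_s(32, 2 ** 30, 48)   # T/(G/B/S)
tsu_range_s = 2 ** 37 * 10e-9                            # 37-bit counter, 10 ns tick

print(f"packet counter overflows after {packet_counter_s:.0f} s")
print(f"timestamp wraps after {tsu_range_s:.0f} s")
```

<p>Since the timestamp wraps before the packet counter can overflow, a record has to be read out within the timestamp range anyway, which is why separate overflow detection for the counters is not needed.</p>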
 

<h2>SW_FIFO</h2>

<p> In a future design it will be possible to introduce another type of aggregation,
   for example by placing another unit between SRAM and the PCI bus instead of
   SW_FIFO, which serves only as a short-term buffer. This unit would be able to
   process data according to various aggregation schemes, which could be
   changed as required by software.</p>

<h1>Conclusion</h1>

<p>This document presents the main idea of a hardware
   architecture that could be implemented on COMBO6 cards. The result
   should be a hardware acceleration card for NetFlow monitoring.  </p> 

</zprava>

