<?xml version="1.0" encoding="utf-8"?>
<!-- $Id: combo6-header-lookups.xml,v 1.13 2003/01/10 13:48:01 xantos Exp $ -->
<!DOCTYPE zprava SYSTEM "techrep.dtd">

<zprava cislo="1/2003" jazyk="en">
  <nazev>Packet header matching in Combo6 IPv6 router</nazev>
  <autor>David Antoš, Jan Kořenek, Kateřina Minaříková,
    Vojtěch Řehák<footnote>This work has been partially supported by
      the GACR grant No. 201/03/0509.</footnote>
  </autor>
  <datum>9/1/2003</datum>

  <h1>Abstract</h1>

  <p>
    The report describes current proposal and arising implementation of
    header matching engine in the <i>Combo6 IPv6 router</i> accelerator card.
  </p>

  <h1>Introduction</h1>

  <p>
    Combo6 is a PCI acceleration card for IPv6 (and IPv4) packet filtering
    and forwarding. This report covers <i>header matching</i> performed by the
    card. We divide the problem into two separate parts; the hardware
    processor that performs fast header matching, and the software support,
    preparing contents of processor's tables. From another point of
    view, the software's task is to compute the look-up program performed
    by the hardware processor.
  </p>

  <p>
    The paper also shows the main principle we decided to follow,
    interconnecting the accelerator into the Un*x environment in the most
    standard way as possible. It means that the card should behave as usual
    network adapters, it only increases the performance in case of packets
    that can be forwarded by it. It allows incremental development and
    adding features in the future. It also does not force us to change
    standard tools for routing and network configuration, making us
    independent on the development of those tools.
  </p>

  <p>
    The report is organized as follows.
    <a href="#hardware">Section</a> describes hardware aspects of
    the look-up machine.
    <a href="#hardware-io">Part</a> shows the blocks surrounding the look-up
    machine and <a href="#hardware-look-up">part</a> describes the
    processor itself.
    Special interest should be given to
    <a href="#verification">part</a> where the group of formal
    verifications colaborating with the hardware
    team has shown their power.
    Finally, <a href="#hardware-simulation">section</a> describes the
    reasons and methods why we chose to simulate the machine in software
    first.
  </p>

  <p>
    <a href="#software">Section</a> describes interconnection of the
    accelerator card into the operating system environment. We start
    describing system interfaces.
    <a href="#software-rf">Part</a> shows the concept of
    routing/firewalling table that should allow routing and packet
    filtering in the only look-up structure traversal.
    <a href="#software-look-up-computation">Part</a> discusses problems of
    the computation of effective version of routing/firewalling tables for
    the hardware look-up processor.
  </p>
  
  <h1 id="hardware">Hardware</h1>

  <p>
    Look-up operations (together with nearly all processes in the router)
    are performed by programmable hardware,
    <i>Field Programmable Gate Arrays (FPGA)</i>.
    We use two types of memories to store the look-up program. We have a
    Content Addressable Memory<footnote>CAM gets a (wide) word of bits and
      answers with the address, where the input data is kept in it.</footnote>
    (CAM from now on). We use Micron MT75W16Y136HBB. This type allows
    configuring as 4K words having 272 bits (it is one of the widest
    available CAMs). Moreover, it lets us specify <uv>don't care</uv> bits.
    Typical access time is about 80 ns.
    The other type of memory is an ordinary static RAM (SRAM)
    with typical access time about 10 ns.
  </p>

  <h2 id="hardware-io">Inputs and outputs</h2>

  <p>
    The input of the header matching engine comes in a structure called
    Unified-header, physically stored as a set of registers. The
    Unified-header is a fixed structure keeping the relevant information
    extracted from the packet's header.
    The content of Unified-header was proposed by Ivo Hažmuk.
    For detailed description, see the
    report of his group (to appear soon).
    For illustration, we give a short summary of main fields:
    <ul>
      <li>L2 and L3 status flags</li>
      <li>Source and Destination MAC</li>
      <li>VLAN id</li>
      <li>Source and Destination IP address and port</li>
      <li>&ldots;</li>
    </ul>
    The structure comes over 440 bits in length.
  </p>

  <p>
    There are four header extracting machines, one for each interface. Each
    machine has (at least) four sets of registers to store Unified-headers.
    Whole packets are stored into the main dynamic RAM, therefore the only
    part of the packet the header extractor must pass through is an
    identification of the packet (memory allocation block number).
    Each engine reads the Unified-header and computes a pointer to an
    editing program. The editing program denotes how the packet should be
    changed and which interfaces the packet should be sent out. Note that
    there may be more interfaces the packet must be replicated to. The
    Unified-header is destroyed after processing.
  </p> 

  <p>
    One of the output interfaces is software. This interface behaves like a
    normal network adapter from the operation system's point of view. This
    way we can
    <ol>
      <li>answer to erroneous packets,</li>
      <li>receive packets sent to the host computer,</li>
      <li>trace the network traffic using usual system tools,</li>
      <li>process unusual packets we do not understand.</li>
    </ol>
    Of course, packets coming through the operating system are slowed down,
    therefore we plan to move some error handling to the hardware in future
    versions.
  </p>

  <h3>Hardware realization</h3>

  <p>
    Unified-header is physically stored in a buffer. The buffer is filled
    by the header field extractor and read by the look-up processor. The read
    and write operations have to be fully independent. Therefore the buffer
    is stored in dual port Block SelectRAM in the FPGA chip.
  </p>

  <p>
    Each buffer consists of four sets of registers, each containing a
    Unified-header. The buffer is controlled by a status register; one bit
    for one set of registers. The bit signalizes whether the set of
    registers is free or contains a Unified-header. Header field extractor
    can write a Unified-header to the set only if the bit is zero. When the
    Unified-header is fully written to the buffer, the bit is set. The
    buffer is read by the look-up processor. It can start handling the
    Unified-header only if the bit is set. When it finishes the work it resets
    the bit. This way we guarantee that the header field extractor never
    rewrites any stored Unified-header and the look-up processor never
    processes the Unified-header that has not been fully stored.
  </p>

  <h2 id="hardware-look-up">Look-up processor</h2>

  <p>
    The processor performs a traversal through a tree structure (or finite
    automaton from other point of view), the
    program is started from the beginning for each incoming header.
    Instruction set was proposed by Pavel Zemčík.
  </p>

  <p>
    There are three kinds of instructions:
    <ul>
      <li><prikaz>CAM Step, List</prikaz> instruction matches a subset of
	registers against CAM memory. We can choose part of the
	Unified-header with a bit
	mask <prikaz>List</prikaz>. The resulting address returned by CAM
	serves as the pointer to the continuation of the look-up
	program.</li>
      <li>Comparison instructions contain a set of instructions for
	comparing registers from Unified-header (or even upper and lower
	parts of them) to constants. The tests include equality, bigger,
	lower and so on.</li>
      <li><prikaz>EXE Queue</prikaz> stops processing of the header and
	places the identification of the packet together with the
	<prikaz>Queue</prikaz> parameter into the output queue. The
	parameter is a pointer to the editing program.</li>
    </ul>
    Instructions of the first two sorts are relative
    conditional jumps. The length of the jump is driven by their parameter
    <prikaz>Step</prikaz>. For implementation reasons, we must restrict
    ourselves to a discrete set of possible steps, partly because of
    dearth <!-- dearth je nouze, spise nez uplny nedostatek -->
    of space in the binary
    encoding of instructions and also because of time complexity of general
    adder in hardware.
  </p>

  <p>
    Instructions are encoded into 36-bit words. This is not a typo, we
    need to keep up to 32 bits of arguments in the instruction code,
    therefore we shall use parity bits of SRAM, too.
  </p>

  <p>
    It is worth noting that <i>erroneous packets</i> are detected by the
    header extractor. It sets a flag in the L2 or L3 register. We must
    match the bit (most probably with <prikaz>CAM</prikaz> instruction) and
    mark the packet with the editing program that sends the packet to the
    operating system. In future versions, we plan generating ICMP
    messages for common errors in the editing engine.
    We also have a <uv>trash</uv> interface that only
    frees the dynamic memory, the memory is physically deallocated by the
    editing engine.
  </p>

  <p>
    Once more, the look-up program acts as a traversal of a tree (trie)
    structure. <prikaz>CAM</prikaz> represents first levels, other
    instructions the rest.
    Important observation is that
    various levels of the tree have <uv>distinct abilities,</uv>
    for example <prikaz>CAM</prikaz> is able to specify <uv>don't care</uv>
    bits but cannot do a less-or-equal comparison. The opposite is true for
    the rest of the tree, for other instructions.
  </p>

  <h3>Hardware level</h3>

  <p>
    Look-up processor is a hardware entity.
    <a href="#look-up-interface">Figure</a> shows supposed interfaces of the
    entity. The interface consists of signals
    connecting the look-up processor with neighboring components--buffer,
    CAM, SRAM, and replicator.
  </p> 

  <obr src="symbol" id="look-up-interface">Look-up processor interface</obr>

  <p>
    Execution of an instruction starts by loading it from SRAM. Look-up
    processor decodes the instruction and parameters from its
    operating code. This step is the same for all instructions. The rest of
    processing differs according to the classification of instructions
    given above.
  </p>
  
  <p>
    <prikaz>CAM</prikaz> instruction, the only instruction belonging to the
    first kind, is performed by the content addressable memory (CAM)
    connected to the FPGA chip. The look-up processor only has to load
    subset of registers from a buffer to CAM. Registers are specified by
    <prikaz>List</prikaz> parameter obtained from the instruction word. The
    parameter is organized as bit map and the look-up processor has to
    convert the bit map to a sequence of register addresses.
    The registers are loaded into CAM. After the look-up operation
    (consuming latency about 80 ns) CAM fetches the next instruction from
    SRAM.
  </p>
  
  <p>
    Branching instructions compare registers to constants. This operation is
    done fully by the look-up processor. In the first step the parameters from
    the instruction word we obtained are <prikaz>Address</prikaz>,
    <prikaz>Step</prikaz>, <prikaz>Mask</prikaz>, and
    <prikaz>Constant</prikaz>. The register specified by
    <prikaz>Address</prikaz> is loaded from the buffer, masked with
    <prikaz>Mask</prikaz> and compared to the <prikaz>Constant</prikaz>.
    The type of the comparison depends on the instruction code. When the
    instruction succeeds we jump relatively the current address incremented
    by <prikaz>Step</prikaz>. Otherwise we go to the following instruction.
    It is worth noting that both the addresses are computed concurrently
    with the comparison and the desired address is copied to program
    counter.
  </p>

  <p>
    Finally, <prikaz>EXE</prikaz> instruction finishes the processor's run
    and contains information for the packet replicator in its operation
    code. When processing reaches <prikaz>EXE</prikaz> the look-up machine
    copies the instruction's argument together with an identification of
    the packet to the output queue. Then it checks whether there is another
    Unified-header in the buffer and if so, it starts processing over.
  </p>
  
  <h3>Sharing CAM and SRAM</h3>

  <p>
    Let us recall that each interface has its own look-up processor. The
    processors share one CAM and one SRAM. Therefore it is necessary to
    ensure fair mutual exclusion of accesses to those resources among the
    processors. The mutual exclusion is an obvious must, moreover we want
    the accesses to be fair, protecting the interfaces from starving and
    blocking.
  </p>

  <p>
    Each look-up machine has its time-slot to access SRAM. Sharing CAM gets
    more difficult. <prikaz>CAM</prikaz> instruction processing ends with
    accessing SRAM, so the begins of <prikaz>CAM</prikaz> instructions are
    driven by the finishing SRAM time-slot.
  </p>
  
  <obr src="time_new" id="time-diag">Time diagram of sharing CAM
    and SRAM</obr>

  <p>
    <a href="#time-diag">Figure</a> shows an example of processor's
    work. Small horizontal lines divide the graph into 10 ns slots.
    LUP1 to LUP4 are look-up processors and horizontal bars show what
    they do. Load data section means that registers are loaded into CAM,
    this time CAM cannot do anything else. CAM performs the match in the
    <uv>latency time,</uv> in this phase it can be loaded another data to
    match. A small <uv>SR</uv> slot at the very end of CAM processing is
    getting the result of the match out of SRAM. This time-slot must get
    just to the SRAM time-slot for that machine. The machine gets an
    instruction to process, the instruction is decoded and executed during
    COMP phase. Following SR slots are SRAM accesses, fetching the
    following instruction.
  </p>

  <p>
    At the figure bottom, there is an overview of SRAM slots with numbers
    of machines that are allowed to access SRAM. Finally, <uv>CAM slot</uv>
    is a place where particular machine may access CAM <i>if and
      only if</i> CAM is not in use by another machine. More preciselly, if
    no machine uses CAM in the beginning of the CAM slot, the machine
    owning the slot is allowed to start load data phase in the beginning of
    the slot.
    For example, when
    LUP1 accesses CAM in the beginning of the figure, it blocks LUP2, LUP3,
    and LUP4 from accessing CAM for the first four time-slots. Even if LUP4
    asked for CAM at the very beginning of the picture, it would get it in
    the eighth time-slot (exactly as shown).
  </p>

  <p>
    It is not obvious that this mechanism protects from blocking and
    provides fair scheduling. It can be shown that there is an upper bound
    of waiting time. <a href="#verification">Section</a> deeply discusses
    this topic.
  </p>

  <!-- Vojta -->
  <h2 id="verification">Verification of Sharing CAM and SRAM</h2>

  <p>
    Decision on correctness of the arranged schedule is a feasible
    challenge to verification section of our team. Demanded properties were
    verified by the symbolic model checker NuSMV. Model
    checking<footnote>For more information
      see <cite href="Bar02"/>.</footnote> is
    a formal method which allows one to automatically prove whether a model
    of the system at the suitable level of abstraction satisfies given
    specification. 
  </p>

  <p>
    Our model consists of five synchronous modules, <tt>timer</tt> and four
    look-up processors (lups from now on).
    The <tt>timer</tt> counts 10 ns slots modulo 4 in
    variable <tt>time</tt>. Lups <tt>lup0</tt>, <tt>lup1</tt>,
    <tt>lup2</tt>, and <tt>lup3</tt> simulate four look-up processors
    sharing CAM and SRAM. Each lup can be in one of six states, changing
    only when the variable <tt>time</tt> is equal to its rank.
    Meaning and behaviour of each state is as follows:
  </p>

  <p>
    The <tt>sleep</tt> state simulates behaviour of processor that has empty input
    buffer. Next state depends on whether any packet comes and whether CAM
    is not in use by another machine. If no packet comes then the look-up
    processor remains in <tt>sleep</tt> else look-up processor changes to
    <tt>wait</tt> or <tt>load_data</tt> state.
    The choice of <tt>wait</tt> or <tt>load_data</tt> depends
    on whether CAM is in use by another machine or not.
  </p> 

  <p>
    The processor is in <tt>wait</tt> state if it wants to start processing of a packet 
    but CAM is busy. Remaining in <tt>wait</tt> depends on 
    availability of CAM. If CAM is free, the next state is
    <tt>load_data</tt>.
  </p>

  <p>
    The <tt>load_data</tt> state represents loading data into CAM--the
    critical section of CAM sharing. The next state is <tt>latency1</tt>.
  </p>

  <p>
    States <tt>latency1</tt> and <tt>latency2</tt> represent
    only waiting for result.
    In addition, <tt>latency2</tt> includes the finishing SRAM time-slot.
    <tt>Latency2</tt> is followed by a sequence of <tt>comp</tt> states.
  </p>

  <p>
    The <tt>comp</tt> states simulate computation using SRAM. A
    <tt>comp</tt> state corresponds to performing one instruction; in the
    end of the processing SRAM is accessed. The number of instructions is
    not limited. It is obvious that the next states are <tt>comp</tt>,
    <tt>sleep</tt>, <tt>wait</tt>, and <tt>load_data</tt>.
  </p>

  <p>
    We abstract away from the emptiness of the input buffer and the number of
    instructions in computation using SRAM. It means that each decision
    based on that features is replaced by a non-deterministic choice.
    The code below shows the real implementation of described
    CAM and SRAM sharing algorithm.
  </p>

  <pre>
MODULE timer_type(time)
ASSIGN
  next(time) := (time+1) mod 4;


MODULE lup (me, time, CAM_busy)
VAR
  state : {sleep, wait, load_data, latency1, latency2, comp};
ASSIGN
  init(state) := sleep;
  next(state) :=
    case
      !(time=me)               : state;
      state=sleep &amp; ! CAM_busy : {sleep, load_data};
      state=sleep              : {sleep, wait}; 
      state=wait &amp; ! CAM_busy  : load_data;
      state=wait               : wait;
      state=load_data          : latency1;
      state=latency1           : latency2;
      state=latency2           : comp;
      state=comp &amp; ! CAM_busy  : {comp, sleep, load_data};
      state=comp               : {comp, sleep, wait};
    esac;
DEFINE
  SRAM_used := time=me &amp; (state=latency2 | state=comp);

MODULE main
VAR
  time : 0..3;
  lup0 : lup(0,time,CAM_busy);
  lup1 : lup(1,time,CAM_busy);
  lup2 : lup(2,time,CAM_busy);
  lup3 : lup(3,time,CAM_busy);
  timer: timer_type(time);
ASSIGN
  init(time) := 0;
DEFINE
  CAM_busy := lup0.state = load_data | 
              lup1.state = load_data | 
              lup2.state = load_data |
              lup3.state = load_data;
  </pre>

  <p>
    We checked all interesting properties of this model such as mutual
    exclusion of access to SRAM, mutual exclusion of access to CAM, and
    fairness of using CAM (no starving, no blocking). Additionally we
    computed that the maximal length of waiting for CAM access is 120 ns.
    The verified formulas and results of verification are presented below.
  </p>

  <pre>
-- Mutual exclusion of accesses to SRAM 
-- specification
AG (!(lup0.SRAM_used &amp; lup1.SRAM_used | 
      lup0.SRAM_used &amp; lup2.SRAM_used | 
      lup0.SRAM_used &amp; lup3.SRAM_used | 
      lup1.SRAM_used &amp; lup2.SRAM_used | 
      lup1.SRAM_used &amp; lup3.SRAM_used | 
      lup2.SRAM_used &amp; lup3.SRAM_used))
is true

-- Mutual exclusion of accesses to CAM 
-- specification
AG (!(lup0.state = load_data &amp; lup1.state = load_data | 
      lup0.state = load_data &amp; lup2.state = load_data | 
      lup0.state = load_data &amp; lup3.state = load_data | 
      lup1.state = load_data &amp; lup2.state = load_data | 
      lup1.state = load_data &amp; lup3.state = load_data | 
      lup2.state = load_data &amp; lup3.state = load_data))
is true

-- Fairness of using CAM (no starving, no blocking)
-- LTL specification
G ((lup0.state = wait ->  F lup0.state = load_data) &amp; 
   (lup1.state = wait ->  F lup1.state = load_data) &amp; 
   (lup2.state = wait ->  F lup2.state = load_data) &amp; 
   (lup3.state = wait ->  F lup3.state = load_data))
is true

-- The maximal length of waiting for the CAM access
-- MAX(lup0.state = wait, lup0.state = load_data) is 12
-- MAX(lup1.state = wait, lup1.state = load_data) is 12
-- MAX(lup2.state = wait, lup2.state = load_data) is 12
-- MAX(lup3.state = wait, lup3.state = load_data) is 12
  </pre>
  <!-- Vojta -->

  <!-- Start of K.M. -->
  <h2 id="hardware-simulation">Software simulation of the look-up engine</h2>

  <p>
    We have decided to do a prototype of the header matching engine in C
    before we start coding in VHDL<footnote>VHDL is a high level language
      for hardware programming; its syntax is similar to Ada.</footnote>.
    VHDL brings completely different approach in programming compared to
    usual languages. VHDL is designed to describe hardware, therefore it
    covers parallelism of processes and signals.
  </p>

  <p>
    The main reason for the software simulation is
    that we want to test behavior of the look-up machine and related
    software before we fix the structure of hardware. Changes in the
    simulator are much cheaper and easier, moreover even a small
    change in the instruction set can cause a complete hardware redesign.
  </p>

  <p>
    We will simulate real behavior of
    the header matching engine by passing real packet headers to its input.
    Collecting experience from those experiments, we will be ready to fix
    the structure of the hardware.
  </p>

  <h3>Look-up processor simulator</h3>
  
  <p>
    The simulator simply interprets the look-up program (described above).
    First, it reads contents of CAM and SRAM memories, parses them, and
    checks it syntactically. We decided to store the memories as text
    files, it is less efficient (it does not matter in the simulator),
    but it allows us to check (and even create) them by hand.
  </p>
  
  <p>
    The simulator is especially simple, it reads the content of the
    Unified-header on the input and starts interpreting the look-up
    program. It processes the instructions until it reaches <prikaz>EXE
      Queue</prikaz>. Then it writes its value to the output queue.
  </p>

  <h3>Interfaces</h3>

  <p>
    The input of the simulator is generated by L2 striper and header field
    extractor; those blocks prepare the Unified-header out of the real
    packet. There are particular simulators of the blocks, too. Look-up
    processor receives Unified-header stored in 32 16-bit registers from
    those engines. We also want to test the behavior of the interfaces
    among the blocks. We have been working on so called <uv>parallel
      simulator</uv> that performs simultaneous run of all blocks. The
    simulators communicate using shared memory. The parallel simulator
    executes four instances of header parsers and one look-up processor.
    (It is to be changed according to the last changes in the proposal.)
    The main result of the simulation is an analysis of time dependencies in
    the engine.
  </p>

  <h3>Future plans</h3>
  
  <p>
    We are about to finish interconnection of the header parser block and
    the look-up processor. Simulation is very limited at this first
    stage--it is because of the fact that today's header parser simulator
    works just at the L2 level. This means that Unified-header is not 
    filled properly. L2 level related registers are the Source and the
    Destination MAC addresses. But later on--when L3 level will be
    ready--simulation may be very useful. We will be able to observe used
    instructions according to the type of  packet on the input by means of
    instruction counters. 
  </p>   
  <!-- End of K.M. -->
  
  <h1 id="software">Software</h1>

  <p>
    Let us recall that the final goal of the software proposal is to make
    it independent on the tools used for routing and network setting in the
    operating system. Therefore we will use standard operating system
    interfaces as widely as possible and all exceptions must be
    discussed and judged well.
  </p>

  <obr src="blok-struct">Block structure of header matching engine and
    supporting software</obr>

  <p>
    We describe current proposal of software support. Let us start with
    interfaces of the software support module.
  </p>
    
  <h2>Interfaces</h2>

  <p>
    In the lowest level, the
    module should produce programs for the look-up machine. It feeds the
    program to the memories through a driver. The driver exports a usual
    Un*x device (<soubor>/dev/combo</soubor>) and has an API on that. The
    API covers low level operations altering the contents of the memories.
    Typical operations of the API are reading/writing 36 bit words
    from/into SRAM, the same for blocks (for SRAM burst mode),
    reading/writing a line of CAM (it is 272 bits), the same for blocks of
    lines. The driver serves to the needs of all supporting software;
    it provides uploading firmware, reading status information,
    transferring packets, etc. This kind of communication is beyond scope of
    this report. Probably, the <soubor>/dev/combo</soubor> device will be
    split into several devices, according to the structure of inner
    memories of the card.
  </p>

  <p>
    Before we turn to the opposite interface, let us sum up what the
    operating system does to enable packet forwarding and filtering.
    Kernel keeps its routing table. The table is either configured
    statically or is maintained by a routing daemon (this will be surely
    our case). Firewall rules are configured using some kind of
    firewall description language, such as ipchains. When a packet arrives,
    kernel compares its headers with the tables and decides what to do with
    that packet.
  </p>

  <p>
    Simply, we want to override this mechanism and let the hardware
    accelerator make the decision in case it knows how to treat the packet;
    for the most common packets first, adding other types (in accordance
    with a priority list) in later stages of development. Therefore we have
    to know the configuration of the interfaces, and current routing and
    firewalling tables.
  </p>

  <p>
    There are two ways to obtain the tables. We can either modify the
    routing daemon (that tells the kernel what to do normally) to tell us
    the routing table, too. This approach has the main disadvantage that it
    requires modifying code out of our control, and maintaining changes
    in the process of development of the daemon. It also restricts the
    users of the <i>Combo6</i> router to the set of daemons we support. The
    other way (we decided to follow) is to create a daemon that listens the
    changes in the kernel's routing table and lets us know about them.
  </p>

  <p>
    The separate problem arises with packet filtering. Tools for
    firewalling provide many different features and many mechanisms. There
    is no common standard in one operating system kernel, and the situation
    gets much worse when thinking about portability. This problem is being
    discussed and we plan to make a study about abilities of
    firewalling mechanisms compared to the possibilities and capabilities
    of the look-up program described above.
  </p>

  <p>
    As we have said above, we have a daemon that watches changes in the
    kernel routing table and announces the differences. The hardware table
    computation mechanism is connected to it through so called <i>RT-callback
      interface</i>. The daemon itself should be divided into a system
    independent part and a small system dependent layer converting the data
    to a format suitable for our processing. Something similar should be
    done with firewalling rules.
  </p>

  <p>
    To enable network diagnostics, tools like <prikaz>tcpdump</prikaz>
    modify kernel tables to insert filters that check for the desired
    information. The modification of this kind must be propagated to the
    RT-callback interface. We suppose that packets that should be checked
    by <prikaz>tcpdump</prikaz> will be routed by software. In order not to
    degrade the performance too much, we nevertheless should use operating
    system to treat as small superset of the affected packets as possible.
    This is left to thorough discussion in future versions.
  </p>
  
  <h2 id="software-rf">Routing/firewalling table</h2>

  <p>
    The RT-callback interface announces the changes of routing and
    firewalling tables together with the settings of network diagnostics
    tools. Today we have a very first draft of the interface suggested by
    Zdeněk Salvet. We shall probably keep our internal copies of the tables
    for efficiency reasons.
  </p>

  <p>
    As we have the only machine for routing and firewalling in hardware, we
    must combine the information from the tables and we must perform the
    decision where to send the packet and whether to send it in the only
    look-up operation.
  </p>

  <p>
    We created a concept of <i>routing/firewalling</i> table (r/f table
    from now on). The r/f table is the routing table with firewalling rules
    applied on each of its lines. In general, a routing rule may be spread
    into a number of r/f table entries.
    This structure must be analysed well, especially compared to the
    expression power of other <uv>firewall describing</uv> languages like
    ipchains.
  </p>

  <p>
    When a change of input tables occurs, in the first development
    iteration we plan only recomputing the whole r/f table. Later we want
    to study how to make local changes only. Nevertheless, as we will see
    later, recomputation on this stage is not the critical part of the
    system.
  </p>

  <p>
    Note that the r/f table also provides a connection to the editing
    program<footnote>Editing program describes what should be done with the
      packet, where to send it and how to change it.</footnote>. We expect the
    set of editing programs to be fixed, therefore the actions coming from
    the tables together with type of the packet denote the editing program.
  </p>

  <p>
    It is worth saying that the r/f table is hardware independent. We also
    want to keep it as system independent as possible, but we may sacrifice
    some small part of this quality for efficiency reasons.
  </p>

  <h2 id="software-look-up-computation">Look-up program computation</h2>

  <p>
    Look-up program is physically stored as the contents of CAM and SRAM.
    It is computed out of the r/f table. The reader should keep in mind
    that the hardware form of the look-up structure does not have to allow
    reconstruction of the original r/f table for various reasons,
    e.g., it may contain expanded
    strings. We suppose that the hardware look-up program will be strongly
    optimised.
  </p>

  <p>
    In the first development iteration, we plan that the update of CAM
    and SRAM is possible for the whole structure at once. The generator
    takes the whole r/f table, <uv>thinks up</uv> the optimization, and
    computes the look-up program in terms of the instructions described
    above. Then it asks the driver to upload the table into the memories.
    An easy way to update CAM is to divide it into halves, one is
    used for production work, the other serves for uploading new strings.
    When the upload is finished, the parts are atomically switched. It
    reduces the problem of updates only to timing of the CAM accesses. Note
    that SRAM is <uv>magically</uv> switched too, the uploadable part is
    not accessible from the production part of CAM and vice versa.
    The software driver must also keep an allocation information about the
    memories.
  </p>

  <p>
    For future versions we want to enable updating only parts of the
    structure. Then, the look-up program generator would get an instruction
    <uv>changed from-to, upload it.</uv> To demonstrate problems of
    this approach, we show the following. CAM, when more rows match the
    query, returns the very first matching row<footnote>It is possible not
      only because of duplicated strings, but due to don't care
      bits.</footnote>.
    It means that more special cases must be located closer to the address
    zero in CAM. How do we add a more special rule, when CAM happens to be
    full in the location we have to use?
  </p>

  <p>
    Overwriting of data brings also problems with validity of records. A
    change of look-up program in the memories must be atomic in the sense
    that we cannot use new data for production work until it is fully
    uploaded. (It is better to use obsolete routing table for a second or
    two than to make a complete mess.)
  </p>

  <p>
    We suppose that a good response time for propagating changes through
    the whole structure should be in order of seconds. The time complexity
    will probably depend on the optimization technique used. In fact, the
    problem could be formulated as minimization of a special finite
    automaton with strong bounding conditions. This problem is one of the
    (several tens of) biggest challenges in the whole project.
  </p>

  <h1 id="conclusion">Conclusion</h1>

  <p>
    As the reader surely noticed, <uv>level of abstraction</uv>
    increases with the chapter number in this report. This tracks the state
    of the project well.
  </p>
  
  <p>
    We have created an implementation plan. Now (December 2002)
    we have a simulation of the
    look-up processor. Hardware part is specified in blocks and the blocks
    are being refined. Currently, the most of the effort is routed to the
    specification of interfaces. Then we plan to take care of r/f table
    and look-up program computation. In this state it can be tested on the
    card prototype where the headers are completely treated by the software
    simulator. Concurrently, hardware programming will take place.
  </p>

  <p>
    The stages described above run more or less concurrently. The main point
    is that we want to hold up the hardware programming until we are really
    sure about the detailed structure of the instructions.
  </p>

  <p>
    We have described proposed structure of the <i>Combo6</i> IPv6 card. We
    covered both hardware and software aspects of the design. We have shown
    the power of model-checking methods to make hardware design easier. We
    have found common language with the group of formal verification,
    reaching the desired point where theory meets practice.
  </p>

  <!-- References -->
  <seznamknih>
    <kniha id="Ant02">
      Antoš D.:
      <i>Overview of Data Structures in IP Lookups</i><br/>
      CESNET technical report 9/2002
    </kniha>
    
    <kniha id="Bar02">
      Barnat J., Brázdil T., Krčál P., Řehák V., and Šafránek D.:
      <i>Model checking in IPv6 Hardware Router Design</i><br/>
      CESNET technical report 8/2002
    </kniha>

    <kniha id="Nov02">
      Novotný J., Fučík O., Kokotek R.:
      <i>Schematics and PCB of COMBO6 card</i><br/>
      CESNET technical report 14/2002
    </kniha>

    <kniha id="WWW">
      WWW pages <tt>http://www.openrouter.net/</tt>
    </kniha>
  </seznamknih>
  
</zprava>


