<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE zprava SYSTEM "techrep.dtd">
<zprava cislo="20/2003" jazyk="en">
  <nazev>IBP Deployment Tests and Integration with DiDaS
  Project</nazev>

  <autor>Lukáš Hejtmánek, Petr Holub</autor>

  <datum>2003-11-23</datum>

  <h1>Abstract</h1>

  <p>In this report we describe a test deployment of the
  Internet Backplane Protocol (IBP) and its integration with
  the Distributed Data Storage project (DiDaS). We give a
  short overview of IBP, followed by a description of the
  developer interfaces for IBP and of the end-user tools for
  manipulating files stored in the IBP infrastructure. We
  have created a library that offers developers an easy
  interface to IBP resembling the traditional UN*X file
  operation calls, and we have used this library to add IBP
  support to several end-user software tools. RAID array
  benchmarks and preliminary IBP benchmarks are presented as
  well.</p>

  <h1>Introduction</h1>

  <p>In late 2002 the CESNET Development Fund decided to
  support a project for building distributed data storage
  called DiDaS as an extension of the Grid activities of the
  MetaCenter project <cite href="Meta" />. The DiDaS project
  aims to build a large distributed storage capacity based
  on what is probably the most extensive project in this
  field: Global Distributed Network Storage, developed at
  the University of Tennessee, Knoxville <cite href="SIG02" />,
  <cite href="IBP" />. Furthermore, the DiDaS project
  supports several pilot user communities that will use the
  storage infrastructure.</p>

  <p>One of these pilot groups is a project developing the
  Distributed Encoding Environment <cite href="DEE" /> for
  multimedia processing, which will use both the DiDaS
  storage capacity and the MetaCenter computing capacity
  (especially the Linux PC clusters that are now widely
  available). Contemporary video processing tools suffer
  from the low processing power and low disk capacity
  available on a single computer. Building large
  multiprocessor systems with shared memory and large
  locally attached disk capacity is fairly expensive, while
  video transcoding runs very well in a more cost-efficient
  PC cluster environment. PC clusters can be distributed
  across various sites (which holds true for the MetaCenter
  PC clusters), and thus distributed storage is a very
  natural solution for storing the video that is to be
  transcoded, the intermediate temporary files produced by
  the transcoding process, and the final encoded video.
  Coordination between the distributed storage and the
  scheduling system is very promising, as it allows either
  the location of files to be optimized with respect to the
  available computing capacity or, vice versa, the location
  of computation to be optimized with respect to where the
  data are available in the distributed storage.</p>

  <p>As a part of the DiDaS project we have developed a
  library called <i>libxio</i> that allows easy
  incorporation of IBP into existing UN*X applications. As a
  result of the collaboration between the DiDaS and DEE
  projects we have extended a video processing tool and a
  video rendering tool with the ability to use IBP storage.
  In this report we also present an overview of IBP and the
  <i>libxio</i> library from both the developer&#39;s and
  the end user&#39;s point of view, to give other groups and
  projects an opportunity to extend their tools with IBP
  capabilities. We conclude with some important issues we
  have encountered while using these prototypes.</p>

  <h1>Network storage stack overview</h1>

  <p>The network storage stack that uses IBP consists of
  three layers: the IBP layer, the L-Bone and Ex-Node layer,
  and the LoRS layer. An overview of this stack is shown in
  <a href="#stack">table</a>. An application built on this
  stack thus interacts with the LoRS layer only.</p>

  <p><tab id="stack" sloupce="c"><tr><td>Application</td></tr><tr><td>LoRS</td></tr><tr><td>L-Bone
  + Ex-Node</td></tr><tr><td>IBP</td></tr><nazev>Network
  storage stack overview</nazev></tab></p>

  <h2>IBP</h2>

  <p>Similar to the IP protocol for network connections, IBP
  offers unreliable network block storage for data. IBP uses
  a soft consistency model with time-limited allocations and
  thus operates in <uv>best effort</uv> mode. The basic
  atomic unit of IBP is a <i>byte array</i>, an abstraction
  independent of the physical device the data are stored on.
  An <i>IBP depot</i> (server) is the basic building block
  of the IBP infrastructure and offers disc capacity. IBP
  defines three classes of requests, described below:
  allocation, read/write and management.</p>

  <h3>Allocation</h3>

  <p>When an IBP client needs to store data in a depot, it
  first allocates some amount of capacity. The client may
  specify several attributes for the allocation:</p>

  <p><ul><li><b>permanent vs. time-limited allocation:</b>
  The client can specify whether the storage is required to
  be persistent or whether the server is free to delete it
  after specified time period.</li><li><b>volatile vs.
  stable:</b> The client can specify whether the allocation
  may be revoked by the server at any time or whether it
  must be kept for the specified time period.</li><li><b>byte-array/pipe/circular-queue:</b>
  This attribute is used for specification of access mode.
  Available options are <i>append-only array</i>, <i>FIFO
  queue</i>, or <i>circular queue</i>.</li></ul></p>

  <p>The owner of an IBP depot may restrict the capabilities
  of the depot, so that it offers, for example, only
  volatile and time-limited allocations smaller than 10GB.</p>
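
  <p>A depot policy of this kind can be sketched in C as
  follows. This is only an illustrative model: the types
  <tt>alloc_request</tt> and <tt>depot_policy</tt> and the
  function <tt>depot_accepts</tt> are hypothetical names
  introduced here and are not part of the IBP API.</p>

  <p><pre><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an allocation request; not the real IBP API. */
enum reliability { ALLOC_VOLATILE, ALLOC_STABLE };

struct alloc_request {
    uint64_t size_bytes;   /* requested capacity */
    bool permanent;        /* true: no expiry, false: time-limited */
    enum reliability rel;  /* volatile or stable */
};

/* Hypothetical limits chosen by the depot owner. */
struct depot_policy {
    uint64_t max_size_bytes;  /* allocations must be strictly smaller */
    bool allow_permanent;
    bool allow_stable;
};

/* Returns true when the request fits within the depot policy. */
bool depot_accepts(const struct depot_policy *p,
                   const struct alloc_request *r)
{
    assert(p != NULL && r != NULL);
    if (r->size_bytes >= p->max_size_bytes)
        return false;
    if (r->permanent && !p->allow_permanent)
        return false;
    if (r->rel == ALLOC_STABLE && !p->allow_stable)
        return false;
    return true;
}
```
]]></pre></p>

  <p>With <tt>max_size_bytes</tt> set to 10GB and both
  <tt>allow_permanent</tt> and <tt>allow_stable</tt> set to
  false, the policy corresponds to the example depot
  described above.</p>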

  <h3>Read/Write</h3>

  <p>The IBP protocol has APIs for reading data from the
  server(s) and writing data to the server(s). Additionally,
  it offers APIs for requesting a data transfer directly
  between two IBP depots, without the data passing through
  the client that initiates the transfer. This avoids
  unnecessary data transfers through network end nodes and
  lets transfers use the fast backbone links that are more
  likely to be available between two IBP depots than between
  an IBP depot and a client.</p>

  <h3>Management</h3>

  <p>Allocation parameters and attributes may be changed
  during the lifetime of an allocation: the client can
  modify the time limit or increase the size of the
  allocation. However, such a request may be refused by the
  IBP servers when no depots with the requested parameters
  are available. The client can also query the current
  allocation status and both the time and space constraints.</p>

  <h2>L-Bone and Ex-Node</h2>

  <h3>L-Bone</h3>

  <p>The L-Bone layer aggregates the individual IBP depots
  and offers clients the depots that best match their
  requests, for example a depot that both meets the
  client&#39;s requirements and is geographically the
  closest one (or the closest one from the network point of
  view). L-Bone servers form a directory layer based on
  LDAP. Each IBP depot must be registered with an L-Bone
  server to be available for L-Bone brokering.</p>

  <h3>Ex-Node</h3>

  <p>The Ex-Node is a construct loosely resembling the UN*X
  I-node concept. It aggregates individual IBP allocations
  into a representation of a whole file. The allocations
  inside one Ex-Node need neither be of the same size nor
  form a contiguous block, and they may even overlap. The
  Ex-Node uses an XML representation when saved (serialized)
  to a local file. The Ex-Node also keeps track of
  end-to-end services: each block may carry an MD5 checksum
  and may be encrypted using the DES, AES, or XOR
  algorithms. The blocks may also be compressed using the
  compression algorithm provided by <i>libz</i>.</p>
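
  <p>Because the allocations may overlap, differ in size and
  leave holes, deciding whether a byte range of the file is
  fully backed by allocations takes a small sweep over the
  mappings. The following C fragment is a minimal sketch of
  that check; the <tt>mapping</tt> structure and the
  <tt>exnode_covers</tt> function are hypothetical and do
  not reflect the actual Ex-Node library data structures.</p>

  <p><pre><![CDATA[
```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one Ex-Node mapping (one IBP allocation). */
struct mapping {
    uint64_t offset;  /* position of the block within the logical file */
    uint64_t length;  /* block length in bytes */
};

/* Returns true if every byte of [start, start + len) lies in at least one
 * mapping. Mappings may overlap, differ in size, and come in any order. */
bool exnode_covers(const struct mapping *m, size_t n,
                   uint64_t start, uint64_t len)
{
    assert(m != NULL || n == 0);
    uint64_t pos = start;
    uint64_t end = start + len;
    while (pos < end) {
        uint64_t best = pos;  /* furthest byte reachable from pos */
        for (size_t i = 0; i < n; i++) {
            uint64_t m_end = m[i].offset + m[i].length;
            if (m[i].offset <= pos && m_end > best)
                best = m_end;
        }
        if (best == pos)
            return false;     /* hole: no mapping contains byte pos */
        pos = best;
    }
    return true;
}
```
]]></pre></p>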

  <h2>LoRS</h2>

  <p>The layer of logistical tools called LoRS is built on
  top of the L-Bone and Ex-Node layers. It offers interfaces
  for locating a set of IBP depots and APIs for reading,
  writing and overwriting data in the depots with a given
  number of copies and a given number of concurrent TCP
  streams. This layer also provides functions for reading
  and writing an Ex-Node from/to a local file using the XML
  representation mentioned above.</p>

  <h1>Prototype implementation of IBP for DiDaS project</h1>

  <p>We have set up an initial experimental IBP
  infrastructure for the Distributed Data Storage project
  (DiDaS). It is designed for preliminary tests and
  benchmarks of the IBP infrastructure as well as for
  experiments with incorporating IBP functionality into some
  pilot applications. The production setup will be optimized
  based on the experience gained with this prototype.</p>

  <p>For preliminary tests and benchmarks we have two
  servers with internal disc arrays. One server has a SCSI
  disc array and the other one has a parallel ATA disc
  array. We are planning to expand the IBP depots to seven
  servers distributed across the CESNET2 academic high-speed
  network. We want to use both servers with external RAID
  arrays and servers with internal RAID arrays in order to
  compare the two kinds of storage. We will offer about 6TB
  to 10TB of network storage interconnected with 1Gbps to
  2Gbps network links.</p>

  <p>In our experimental setup the server with the SCSI disc
  array hosts the LDAP server, the L-Bone server and an IBP
  depot and also provides a web interface to the L-Bone
  server, while the other one hosts an IBP depot only.</p>

  <h1>End-user how-to for IBP in DiDaS</h1>

  <p>So far we have modified two applications based on
  requests from the end-user community. The <prikaz>transcode</prikaz>
  program <cite href="transcode" /> has been patched for the
  DEE project <cite href="DEE" /> to work with IBP in a
  computational cluster infrastructure, so that the <tt>transcode</tt>
  program can load files from and store files to IBP depots.
  The second application is the media player <prikaz>Mplayer</prikaz>
  <cite href="Mplayer" />, which is able to play content
  directly from an IBP depot. Besides these two applications
  there are a number of command-line utilities available for
  manipulating files in the IBP depots.</p>

  <p>To access a file in an IBP depot, the user supplies a
  URI of the form <soubor>lors://host:port/local_path/file?bs=number&#38;duration=number&#38;copies=number&#38;threads=number&#38;timeout=number</soubor>.
  The short form <soubor>lors:///local_path/file</soubor>
  can be used as well. When reading an IBP file, <soubor>local_path/file</soubor>
  names the local file that holds the Ex-Node in its XML
  representation. When writing an IBP file, <soubor>local_path/file</soubor>
  names the local file in which to store the Ex-Node with
  the XML description of the IBP depots used. The
  parameters in the full form of the URI have the following
  meaning:</p>

  <p><ul><li><tt>host</tt> specification of L-Bone server
  location</li><li><tt>port</tt> specification of L-Bone
  server port (default 6767)</li><li><tt>bs</tt>
  specification of block-size for transfer in megabytes
  (default 10)</li><li><tt>duration</tt> specification of
  allocation duration in seconds (default 3600s)</li><li><tt>copies</tt>
  specification of number of copies (default 1)</li><li><tt>threads</tt>
  specification of number of threads (concurrent TCP
  streams) (default 1)</li><li><tt>timeout</tt>
  specification of timeout in seconds (default 100)</li></ul></p>

  <p>There are environment variables that can be used
  instead of URI parameters for the LoRS layer.</p>

  <p><ul><li><tt>LBONE_SERVER</tt> specifies L-Bone server
  location</li><li><tt>LBONE_PORT</tt> specifies L-Bone
  server port (default 6767)</li><li><tt>LORS_BLOCKSIZE</tt>
  specifies block-size for transfer in megabytes (default
  10)</li><li><tt>LORS_DURATION</tt> specifies duration of
  allocation in seconds (default 3600s)</li><li><tt>LORS_COPIES</tt>
  specifies number of copies (default 1)</li><li><tt>LORS_THREADS</tt>
  specifies number of threads (concurrent TCP streams)
  (default 1)</li><li><tt>LORS_TIMEOUT</tt> specifies
  timeout in seconds (default 100)</li></ul></p>

  <p>At least one of <tt>LBONE_SERVER</tt> and <tt>host</tt>
  must be set.</p>
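
  <p>The interplay of URI parameters, environment variables
  and defaults can be illustrated with the following C
  sketch. It is not taken from <i>libxio</i>: the helper
  names <tt>query_get</tt> and <tt>lors_setting</tt> are
  hypothetical, and the precedence shown (URI parameter
  first, then environment variable, then the documented
  default) is our assumption about a reasonable lookup
  order.</p>

  <p><pre><![CDATA[
```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Looks up "key=<number>" in a query string such as "bs=4&copies=2".
 * Returns 1 and stores the number in *out when found, 0 otherwise. */
int query_get(const char *query, const char *key, long *out)
{
    assert(key != NULL && out != NULL);
    size_t klen = strlen(key);
    const char *p = query;
    while (p != NULL && *p != '\0') {
        if (strncmp(p, key, klen) == 0 && p[klen] == '=') {
            *out = strtol(p + klen + 1, NULL, 10);
            return 1;
        }
        p = strchr(p, '&');  /* jump to the next key=value pair */
        if (p != NULL)
            p++;
    }
    return 0;
}

/* Assumed precedence: URI parameter, then environment variable, then the
 * documented default value. */
long lors_setting(const char *query, const char *key,
                  const char *env_name, long def)
{
    long value;
    if (query_get(query, key, &value))
        return value;
    const char *env = getenv(env_name);
    if (env != NULL && *env != '\0')
        return strtol(env, NULL, 10);
    return def;
}
```
]]></pre></p>

  <p>For example, <tt>lors_setting(query, "copies",
  "LORS_COPIES", 1)</tt> yields the <tt>copies</tt> value
  from the URI when it is present, otherwise the value of
  <tt>LORS_COPIES</tt>, otherwise the default of 1.</p>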

  <h2><prikaz>transcode</prikaz></h2>

  <p>The <prikaz>transcode</prikaz> program can be used in
  an almost traditional way. To store a file in an IBP
  depot, a URI of the form <soubor>lors:///local_path/file</soubor>
  is used instead of a local file name. The environment
  variables should be set unless the user specifies the URI
  in its full form. The same form of URI can be used for
  loading a file from an IBP depot. Transcode can read and
  write ordinary local files as well. The prefix <tt>lors://</tt>
  is required for accessing IBP; otherwise a local file is
  used.</p>

  <p>For example:</p>

  <p><prikaz>LBONE_SERVER=udomiel.ics.muni.cz transcode -i
  lors:///video.dv.xnd -P1 -N 0x1 -y raw -o
  lors:///temp1-remux.avi.xnd -E 44100,16,2 -J resample</prikaz></p>

  <p>Transcode will read the source file identified by the
  serialized description of the IBP Ex-Node stored in the <soubor>video.dv.xnd</soubor>
  file on the local disc in the current directory. The L-Bone
  server location is taken from the environment variable
  (referring to <tt>udomiel.ics.muni.cz</tt>) and the resulting
  serialized Ex-Node XML description will be stored in the
  current directory as <soubor>temp1-remux.avi.xnd</soubor>.</p>

  <h2><prikaz>Mplayer</prikaz></h2>

  <p><prikaz>Mplayer</prikaz> has been modified to use the
  same URI semantics as <prikaz>transcode</prikaz>. However,
  we have encountered a serious problem with IBP read
  latency when data are read and video is played in a single
  thread (which is how Mplayer works internally). A more
  detailed description together with benchmarks can be found
  in Appendix B.</p>

  <h2>Command line utilities</h2>

  <p>A number of general-purpose command-line utilities are
  available for manipulating files stored in the IBP
  infrastructure. Some of the parameters can be specified in
  the <soubor>~/.xndrc</soubor> file to avoid repeating the
  same information on every invocation of the command-line
  tools.</p>

  <h3><prikaz>lors_upload</prikaz></h3>

  <p>The <prikaz>lors_upload</prikaz> utility is used for
  storing files into the IBP depots. Basic usage is as
  follows:</p>

  <p><prikaz>lors_upload -f -H host -n my_file</prikaz></p>

  <p>The file <soubor>my_file</soubor> will be stored into
  the IBP depot returned by L-Bone server specified using
  <tt>host</tt> parameter. The <tt>-f</tt> option says that
  output Ex-Node file will be called <soubor>my_file.xnd</soubor>.
  The <tt>-n</tt> switch turns off any <i>end-to-end</i>
  services that can otherwise be DES (<tt>-e</tt>), AES (<tt>-a</tt>)
  or XOR (<tt>-x</tt>) encryption, compression (<tt>-z</tt>),
  or MD5 check sum (<tt>-l</tt>).</p>

  <p>There are other options available: <ul><li><tt>-o</tt>
  name for the output Ex-Node file</li><li><tt>-P</tt>
  L-Bone server port</li><li><tt>-l</tt> location hint (e.g.
  <tt>city=Brno</tt>)</li><li><tt>-d</tt> specification of
  expiration time</li><li><tt>-s</tt> volatile storage</li><li><tt>-h</tt>
  stable storage</li><li><tt>-m</tt> maximum number of
  depots returned by location hint</li><li><tt>-b</tt>
  specifies logical block size of input file and chunk size
  of resulting Ex-Node</li><li><tt>-c</tt> number of copies</li><li><tt>-F</tt>
  number of blocks, which input file should be divided into;
  an alternative to <tt>-b</tt> switch</li><li><tt>-t</tt>
  maximum number of threads used for transfer (meaning
  number of concurrent TCP streams)</li><li><tt>-T</tt>
  timeout</li></ul></p>

  <h3><prikaz>lors_download</prikaz></h3>

  <p>The <prikaz>lors_download</prikaz> utility is used for
  downloading files from the IBP depots. Basic usage is as
  follows:</p>

  <p><prikaz>lors_download -o my_file exnode.xnd</prikaz></p>

  <p>The data described by <tt>exnode.xnd</tt> will be
  stored to local file called <tt>my_file</tt>. If <tt>-o</tt>
  is omitted, the data is sent to <tt>stdout</tt>.</p>

  <p>Other options are available: <ul><li><tt>-r</tt>
  maximum number of threads used for downloading (meaning
  again number of concurrent TCP streams)</li><li><tt>-R</tt>
  download will resume if working on a partially downloaded
  file (auto-detection is employed)</li><li><tt>-q</tt>
  specification of the number of blocks to pre-buffer before
  storing into file</li><li><tt>-C</tt> specifying number of
  blocks to cache</li></ul></p>

  <p><tt>-b</tt> and <tt>-t</tt> options have the same
  meaning as with <tt>lors_upload</tt>.</p>

  <h3><prikaz>lors_trim</prikaz></h3>

  <p>The <prikaz>lors_trim</prikaz> utility is used to
  decrease reference count for specified Ex-Node. If
  reference count is decreased to zero, the allocation is
  removed from the IBP depot. Basic usage is as follows:</p>

  <p><prikaz>lors_trim -o new_exnode.xnd -d exnode.xnd</prikaz></p>

  <p>This command means that the original <soubor>exnode.xnd</soubor>
  can be deleted while <soubor>new_exnode.xnd</soubor> is
  created. If <tt>-f</tt> switch is used then the original
  Ex-Node file <soubor>exnode.xnd</soubor> will be
  overwritten by updated Ex-Node.</p>

  <h3><prikaz>lors_augment</prikaz></h3>

  <p>The <prikaz>lors_augment</prikaz> utility can be used
  for increasing reference count for any particular Ex-Node.
  <tt>-c</tt> switch can be employed to determine how many
  copies are requested. We can use the same options as with
  <tt>lors_upload</tt>. If <tt>-f</tt> switch is used then
  the original Ex-Node file <soubor>exnode.xnd</soubor> will
  be overwritten by updated Ex-Node.</p>

  <h3><prikaz>lors_modify</prikaz></h3>

  <p>The <prikaz>lors_modify</prikaz> utility may be used to
  remove the specified capabilities from an Ex-Node.
  Following capabilities can be removed: read (<tt>-r</tt>),
  write (<tt>-w</tt>), or manage (<tt>-m</tt>). Note: there
  is no utility that adds capabilities back to Ex-Node.</p>

  <h3><prikaz>lors_ls</prikaz></h3>

  <p>The <prikaz>lors_ls</prikaz> utility lists allocation
  blocks for given Ex-Node(s). The listing includes offsets,
  lengths, and locations. Basic usage is</p>

  <p><prikaz>lors_ls exnode.xnd</prikaz>.</p>

  <h1>Developer how-to for IBP in DiDaS</h1>

  <p>We have developed an abstraction library called <i>libxio</i>
  that is now available to developers. It provides a
  standard UN*X I/O interface for accessing local files as
  well as files represented by a URI of the form <tt>lors:///local_path/exnode</tt>.
  The functions <i>xio_open</i>, <i>xio_close</i>, <i>xio_read</i>,
  <i>xio_write</i>, <i>xio_ftruncate</i>, <i>xio_lseek</i>,
  <i>xio_stat</i>, <i>xio_fstat</i>, and <i>xio_lstat</i>
  are available. These functions have the same semantics as
  the standard UN*X I/O functions, except that IBP files
  cannot be opened in <tt>O_RDWR</tt> mode at the moment.</p>

  <p><pre>#include &#60;xio.h&#62;

int xio_open(const char *pathname, int flags);
int xio_close(int fd);
ssize_t xio_read(int fd, void *buf, size_t count);
ssize_t xio_write(int fd, const void *buf, size_t count);
int xio_ftruncate(int fd, off_t length);
off_t xio_lseek(int fildes, off_t offset, int whence);
int xio_stat(const char *file_name, struct stat *buf);
int xio_fstat(int filedes, struct stat *buf);
int xio_lstat(const char *file_name, struct stat *buf);
</pre></p>
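
  <p>A typical use of this interface is a plain copy loop,
  sketched below. The function <tt>xio_copy</tt> is a
  hypothetical helper, not part of <i>libxio</i>. So that
  the sketch compiles without the library, the <tt>xio_*</tt>
  calls are mapped onto their POSIX counterparts by
  stand-in macros; with the real library one would include
  <tt>xio.h</tt> instead. We also assume that <tt>xio_open</tt>
  accepts the usual <tt>O_CREAT</tt> and <tt>O_TRUNC</tt>
  flags.</p>

  <p><pre><![CDATA[
```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-ins so that the sketch compiles without libxio; with the real
 * library, include <xio.h> and delete these macros. */
#ifndef HAVE_XIO
#define xio_open(path, flags) open((path), (flags), 0644)
#define xio_close(fd)         close(fd)
#define xio_read(fd, b, n)    read((fd), (b), (n))
#define xio_write(fd, b, n)   write((fd), (b), (n))
#endif

/* Copies src to dst; either name may be a local path or, with libxio,
 * a lors:// URI. Returns 0 on success and -1 on error. */
int xio_copy(const char *src, const char *dst)
{
    assert(src != NULL && dst != NULL);
    char buf[65536];
    ssize_t n = -1;

    /* IBP files cannot be opened O_RDWR, so use one descriptor per
     * direction. */
    int in = xio_open(src, O_RDONLY);
    if (in < 0)
        return -1;
    int out = xio_open(dst, O_WRONLY | O_CREAT | O_TRUNC);
    if (out < 0) {
        xio_close(in);
        return -1;
    }

    while ((n = xio_read(in, buf, sizeof buf)) > 0) {
        ssize_t done = 0;
        while (done < n) {  /* a write may be partial */
            ssize_t w = xio_write(out, buf + done, (size_t)(n - done));
            if (w <= 0) {
                n = -1;     /* treat a failed or empty write as an error */
                break;
            }
            done += w;
        }
        if (n < 0)
            break;
    }
    xio_close(in);
    if (xio_close(out) < 0)
        n = -1;
    return n < 0 ? -1 : 0;
}
```
]]></pre></p>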

  <p>For accessing IBP infrastructure the <i>libxio</i>
  library uses <i>lors</i> library that provides interface
  described in subsequent subsections.</p>

  <h2>Creating a new file</h2>

  <p><pre>lorsGetDepotPool(LorsDepotPool *dp, char *host, int port, IBP_depot *dpt, 
                 int maxdepot, char *location_hint, unsigned long storage_size,
                 int storage_type, time_t duration, int max_threads,
                 int timeout, int opts)</pre></p>

  <p>This function returns a list of appropriate IBP depots
  (<tt>dp</tt>) meeting given criteria: <i>location_hint</i>
  [city=Brno], <i>storage_size</i> (in megabytes), <i>storage_type</i>
  [<tt>IBP_SOFT|IBP_HARD</tt>] (volatility), <i>duration</i>
  (expiry time in seconds).</p>

  <p><pre>lorsExnodeCreate(LorsExnode **ex)</pre></p>

  <p>This function returns a newly created Ex-Node.</p>

  <h2>Opening a file for reading</h2>

  <p><pre>lorsFileDeserialize(LorsExnode **ex, char *filename, char *schema)</pre></p>

  <p>This function opens an Ex-Node XML file called <i>filename</i>
  and creates a corresponding Ex-Node structure in memory.
  The <i>schema</i> parameter is currently unused.</p>

  <p><pre>lorsUpdateDepotPool(LorsExnode *ex, LorsDepotPool **dp, char *lbone_server, 
                    int lbone_server_port, char *location_hint, int nthreads, 
                    int timeout, int opts)</pre></p>

  <p>This function returns a list of appropriate IBP depots
  for given Ex-Node. L-Bone server may have <tt>NULL</tt>
  value in which case no location hinting will be used and
  IBP depots are selected randomly.</p>

  <h2>Writing to a file in IBP</h2>

  <p><pre>lorsQuery(LorsExnode *ex, LorsSet **set, longlong offset, longlong size, 
          int opt)</pre></p>

  <p>This function returns a set (<i>set</i>) of allocations
  containing the data block that begins at <i>offset</i> and
  has length <i>size</i>. The <i>opt</i> parameter is set to
  <tt>LORS_QUERY_REMOVE</tt>.</p>

  <p><pre>jrb_empty(set-&#62;mapping_map)</pre></p>

  <p>This function returns <i>true</i> if and only if the
  <i>set</i> does not contain any data block. This means
  that data will be written to empty (unallocated) block.</p>

  <h3>Writing to an empty block</h3>

  <p><pre>lorsSetInit(LorsSet **set, ulong_t data_blocksize, int copies, int opts)</pre></p>

  <p>This function returns a newly created empty set of
  blocks.</p>

  <p><pre>lorsSetStore(LorsSet *set, LorsDepotPool *dp, char *buffer, longlong offset,
             longlong length, LorsConditionStruct *lc, int nthreads, 
             int timeout, int opts)</pre></p>

  <p>This function stores <i>set</i> into the IBP depots.
  The <i>lc</i> parameter of type <i>LorsConditionStruct</i>
  is always <tt>NULL</tt> at this time. It is recommended to
  use the value <tt>LORS_RETRY_UNTIL_TIMEOUT</tt> for the
  <i>opts</i> parameter.</p>

  <h3>Overwriting an existing block</h3>

  <p>First we need to assign the number of copies and block
  size to the particular set:</p>

  <p><pre>set-&#62;copies = number
set-&#62;data_blocksize = size</pre></p>

  <p><pre>lorsSetUpdate(LorsSet *set, LorsDepotPool *dp, char *buffer, longlong offset,
              longlong length, int nthreads, int timeout, int opts)</pre></p>

  <p>This function overwrites existing blocks with new data.
  All parameters have already been described above. More
  precisely this function splits the existing allocation
  into pieces and decreases reference count of overwritten
  blocks.</p>

  <p><pre>lorsAppendSet(LorsExnode *ex, LorsSet *set)</pre></p>

  <p>This function adds <i>set</i> to Ex-Node that describes
  file we work with.</p>

  <p><pre>lorsSetFree(LorsSet *set, int opt)</pre></p>

  <p>This function frees <i>set</i> that is no longer
  necessary (e.g. when it has already been added to
  Ex-Node).</p>

  <p><pre>lorsFileSerialize(LorsExnode *ex, char *filename, int read_only, int opts)</pre></p>

  <p>This function saves the Ex-Node into a local file.
  Setting the <i>read_only</i> parameter to <tt>1</tt> means
  that only the read capability is saved, while the write
  and manage capabilities are not. The <i>opts</i> parameter
  must always be set to 0.</p>

  <h2>Reading from a file in IBP</h2>

  <p>We select a set from Ex-Node using <tt>lorsQuery()</tt>
  function. This set covers requested data blocks.</p>

  <p><pre>lorsSetLoad(LorsSet *set, char *buffer, longlong offset, long length, 
            ulong_t block_size, LorsConditionStruct *lc, int nthreads, 
            int timeout, int opts)</pre></p>

  <p>This function reads requested data up to <i>length</i>.
  On success, the number of read bytes is returned. Zero
  indicates end of file while negative value means error.</p>

  <h2>Closing a file</h2>

  <p>Ex-Node data should be saved using the <tt>lorsFileSerialize()</tt>
  function before we close a file. We can then free the
  structures used by calling the <tt>lorsExnodeFree(LorsExnode *ex)</tt>
  and <tt>lorsFreeDepotPool(LorsDepotPool *dp)</tt> functions.</p>

  <h2>Truncating a file</h2>

  <p>Again we start off by creating a set using <tt>lorsQuery()</tt>
  function. This set contains truncated blocks.</p>

  <p><pre>lorsSetTrim(LorsSet *set, longlong offset, longlong length, int nthreads, 
            int timeout, int opts)</pre></p>

  <p>This function decreases reference count for blocks
  beginning at <i>offset</i> having length of <i>length</i>.
  The parameter <i>opts</i> has to have the value <tt>LORS_TRIM_ALL</tt>.</p>

  <h2>Getting size of a file</h2>

  <p>The actual size of the data described by an Ex-Node is
  kept in the <tt>LorsExnode-&#62;logical_length</tt> variable.</p>

  <h1>Conclusions</h1>

  <p>Our first version of <i>libxio</i> stores data to the
  IBP depot with every single <i>xio_write</i> call. This
  causes a significant latency, which does little harm
  during video transcoding, but it also produces a very
  large Ex-Node XML description: each call adds
  approximately 400B to the Ex-Node. We are therefore using
  a write buffer of 10MB at the moment. The buffer reduces
  the latency and greatly reduces the size of the Ex-Node
  XML file at the cost of an extra <i>memcpy</i> call.
  However, we believe that an extra pass through extremely
  fast memory is a lot better than an extra trip over the
  relatively slow network.</p>
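
  <p>The effect of the buffer on the Ex-Node size can be
  estimated with simple arithmetic, sketched here in C. The
  figure of 400B per stored block is the approximate value
  quoted above; the function names are ours and the estimate
  ignores any fixed overhead of the XML file.</p>

  <p><pre><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

/* Approximate Ex-Node growth per stored block, as reported in the text. */
#define EXNODE_BYTES_PER_BLOCK 400u

/* Number of store operations needed to write file_size bytes when every
 * store carries at most chunk bytes (ceiling division). */
uint64_t store_calls(uint64_t file_size, uint64_t chunk)
{
    assert(chunk > 0);
    return (file_size + chunk - 1) / chunk;
}

/* Estimated size of the serialized Ex-Node in bytes. */
uint64_t exnode_size_estimate(uint64_t file_size, uint64_t chunk)
{
    return store_calls(file_size, chunk) * EXNODE_BYTES_PER_BLOCK;
}
```
]]></pre></p>

  <p>As an illustration, assume a 1GiB file written by an
  application in 4KiB pieces (a hypothetical write size).
  Without buffering this gives 262144 stores and an Ex-Node
  of roughly 100MiB; with a 10MiB buffer it gives 103 stores
  and an Ex-Node of roughly 41KB.</p>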

  <p>Our preliminary tests have shown IBP storage to be very
  suitable for the DEE project. We do not use any location
  hinting so far, but as we are planning to spread IBP
  depots over the Czech academic network, location hinting
  and some form of integration with the scheduling system
  will be necessary. Integration with the scheduling system
  currently in use, PBS <cite href="PBS" />, seems to be a
  rather tough task, and a new generation of scheduling
  systems with the required capabilities seems to be coming
  rather soon (e.g. from the DataGrid Project <cite href="EDG" />).
  We therefore plan to integrate PBS and IBP at the
  application layer, so that the application will give PBS
  hints about where the jobs should be scheduled.</p>

  <p>The experiments with <prikaz>Mplayer</prikaz> have been
  less successful, since the read latency has a very
  negative impact on <prikaz>Mplayer</prikaz>. We can use a
  read buffer as well, but this only helps on a fast network
  with low latency. The only remedy is to start another
  thread that reads blocks as large as possible from the IBP
  depots. The authors of the IBP technology promise a new
  generation of IBP that will meet real-time requirements
  better. Until the new version of IBP is available, we will
  continue modifying our library to use a read thread.</p>

  <h1>Appendix A Software and Hardware RAID arrays
  benchmarks</h1>

  <p>The server with the internal disc arrays was equipped
  with a single Intel Pentium 4 Xeon 2.4GHz processor and
  1GB RAM. It ran Linux with a vanilla 2.4.22 kernel patched
  for the XFS filesystem. The disc arrays were benchmarked
  with the <cite href="IOZone" /> program, which performed
  sequential read and sequential write tests on an XFS
  partition.</p>

  <p>We used parallel ATA Western Digital 250GB 7200RPM
  discs with 8MB cache, serial ATA Western Digital 250GB
  7200RPM discs with 8MB cache and SCSI Seagate Cheetah 73GB
  10000RPM discs.</p>

  <p>We ran performance tests on a single disc, four discs
  in a software RAID 0 array, four discs in a software RAID
  5 array, four discs in a hardware RAID 0 array and four
  discs in a hardware RAID 5 array. In the case of the
  hardware arrays we increased the shared memory size to
  1GB, <tt>vm.min-readahead</tt> to 128, and <tt>vm.max-readahead</tt>
  to 256 in the kernel.</p>

  <h2>IDE RAID arrays</h2>

  <p>In this case, each disc was switched to master mode.
  The Adaptec AAR 2400A card has four parallel ATA
  interfaces. The 3ware Escalade 8506-8 card has eight
  serial ATA interfaces. While benchmarking these arrays we
  found that the Linux driver for the Adaptec AAR 2400A is
  unstable as of kernel version 2.4.22 and should not be
  used in any production system. The following setups were
  used for the IDE RAID array benchmarks:</p>

  <p><ul><li>single disc attached to internal Intel ICH3
  controller</li><li>four discs attached to PCI32 Adaptec
  AAR 2400A card</li><li>four discs attached to PCI64 3ware
  Escalade 8506-8 card with SATA interface</li></ul></p>

  <p><tab id="array_pata" sloupce="rlllll"><tr><th>PATA</th><th>Single
  disc</th><th>HW RAID 0</th><th>HW RAID 5</th><th>SW RAID 0</th><th>SW
  RAID 5</th></tr><tr><th>read</th><td>51.823MB/sec</td><td>84.359MB/sec</td><td>47.902MB/sec</td><td>95.111MB/sec</td><td>51.902MB/sec</td></tr><tr><th>write</th><td>50.785MB/sec</td><td>78.330MB/sec</td><td>15.349MB/sec</td><td>42.989MB/sec</td><td>48.616MB/sec</td></tr><nazev>Parallel
  ATA disc array benchmark</nazev></tab></p>

  <p><obr id="array_pata_eps" src="pata">Parallel
  ATA disc array benchmark.</obr></p>

  <p><tab id="array_sata" sloupce="rlllll"><tr><th>SATA</th><th>Single
  disc</th><th>HW RAID 0</th><th>HW RAID 5</th><th>SW RAID 0</th><th>SW
  RAID 5</th></tr><tr><th>read</th><td>45.273MB/sec</td><td>146.437MB/sec</td><td>58.186MB/sec</td><td>182.129MB/sec</td><td>80.294MB/sec</td></tr><tr><th>write</th><td>52.663MB/sec</td><td>119.643MB/sec</td><td>24.719MB/sec</td><td>115.402MB/sec</td><td>67.601MB/sec</td></tr><nazev>Serial
  ATA disc array benchmark</nazev></tab></p>

  <p><obr id="array_sata_eps" src="sata">Serial ATA
  disc array benchmark.</obr></p>

  <h2>SCSI RAID arrays</h2>

  <p>In this case, four discs were attached to one bus. The
  Adaptec AIC card we used has only one internal SCSI
  interface. The following setups were used for the SCSI
  RAID array benchmarks:</p>

  <p><ul><li>single disc attached to PCI64 Adaptec AIC 7901A
  Ultra 320 Single channel card</li><li>four discs attached
  to PCI64 Adaptec AIC 7901A Ultra 320 Single channel card</li><li>four
  discs attached to PCI64 Adaptec ASR 2200S Dual Channel
  card</li></ul></p>

  <p><tab id="array_scsi" sloupce="rlllllll"><tr><th>SCSI</th><th>Single
  Disc</th><th>HW RAID 0 ASR</th><th>HW RAID 5 ASR</th><th>SW
  RAID 0 ASR</th><th>SW RAID 5 ASR</th><th>SW RAID 0 AIC</th><th>SW
  RAID 5 AIC</th></tr><tr><th>read</th><td>66.846MB/sec</td><td>125.564MB/sec</td><td>124.893MB/sec</td><td>102.670MB/sec</td><td>104.319MB/sec</td><td>258.230MB/sec</td><td>200.974MB/sec</td></tr><tr><th>write</th><td>61.191MB/sec</td><td>114.030MB/sec</td><td>42.864MB/sec</td><td>56.046MB/sec</td><td>36.956MB/sec</td><td>169.391MB/sec</td><td>89.744MB/sec</td></tr><nazev>SCSI
  disc array benchmark</nazev></tab></p>

  <p><obr id="array_scsi_eps" src="scsi">SCSI disc
  array benchmark.</obr></p>

  <h1>Appendix B IBP read latency benchmarks</h1>

  <p>We have done several experiments with playing content
  directly from IBP depots. LoCI laboratory <cite
  href="LoCI" /> uses command line utilities together with
  Mplayer for playing content from IBP depots. For example:</p>

  <p><prikaz>lors_download my_file.xnd | mplayer -</prikaz></p>

  <p>We patched <prikaz>Mplayer</prikaz> so that it can
  handle Ex-Nodes on its own. Because Mplayer runs in a
  single thread (and the author of <prikaz>Mplayer</prikaz>
  insists that this is the way to go), we must pay attention
  to the read latency of the IBP depot. We have set up an
  experiment to show how the download speed and latency
  depend on the chunk size. We ran this experiment both on a
  local network and on a network with a somewhat slower
  connection. We used single-stream transfers only, as
  multiple streams require multiple threads.</p>

  <p><tab id="lors_latency_fast" sloupce="rrr"><tr><th>Chunk
  size</th><th>Downloading speed</th><th>Single read call
  duration</th></tr><tr><td>1B</td><td>0.16KB/sec</td><td>0.006sec</td></tr><tr><td>2B</td><td>1.09KB/sec</td><td>0.002sec</td></tr><tr><td>4B</td><td>2.22KB/sec</td><td>0.002sec</td></tr><tr><td>8B</td><td>4.44KB/sec</td><td>0.002sec</td></tr><tr><td>16B</td><td>8.85KB/sec</td><td>0.002sec</td></tr><tr><td>32B</td><td>17.73KB/sec</td><td>0.002sec</td></tr><tr><td>64B</td><td>35.11KB/sec</td><td>0.002sec</td></tr><tr><td>128B</td><td>70.30KB/sec</td><td>0.002sec</td></tr><tr><td>256B</td><td>139.35KB/sec</td><td>0.002sec</td></tr><tr><td>512B</td><td>270.71KB/sec</td><td>0.002sec</td></tr><tr><td>1KB</td><td>486.38KB/sec</td><td>0.002sec</td></tr><tr><td>2KB</td><td>973.24KB/sec</td><td>0.002sec</td></tr><tr><td>4KB</td><td>1770.69KB/sec</td><td>0.002sec</td></tr><tr><td>8KB</td><td>3049.94KB/sec</td><td>0.003sec</td></tr><tr><td>16KB</td><td>4668.81KB/sec</td><td>0.003sec</td></tr><tr><td>32KB</td><td>6526.62KB/sec</td><td>0.005sec</td></tr><tr><td>64KB</td><td>8096.14KB/sec</td><td>0.008sec</td></tr><tr><td>128KB</td><td>9201.35KB/sec</td><td>0.014sec</td></tr><tr><td>256KB</td><td>10032.53KB/sec</td><td>0.026sec</td></tr><tr><td>512KB</td><td>10451.11KB/sec</td><td>0.049sec</td></tr><nazev>LoRS
  speeds on local 100Mbps (FE) network</nazev></tab></p>

  <p><obr id="lors_fast_kb_eps" src="lors-fast-kb">LoRS speed on local 100Mbps
  (FE) network.</obr></p>

  <p><obr id="lors_fast_sec_eps" src="lors-fast-sec">LoRS latency on local
  100Mbps (FE) network.</obr></p>

  <p><tab id="lors_latency_slow" sloupce="rrr"><tr><th>Chunk
  size</th><th>Downloading speed</th><th>Single read call
  duration</th></tr><tr><td>1B</td><td>0.05KB/sec</td><td>0.018sec</td></tr><tr><td>2B</td><td>0.11KB/sec</td><td>0.019sec</td></tr><tr><td>4B</td><td>0.21KB/sec</td><td>0.018sec</td></tr><tr><td>8B</td><td>0.41KB/sec</td><td>0.019sec</td></tr><tr><td>16B</td><td>0.80KB/sec</td><td>0.019sec</td></tr><tr><td>32B</td><td>1.70KB/sec</td><td>0.018sec</td></tr><tr><td>64B</td><td>3.35KB/sec</td><td>0.019sec</td></tr><tr><td>128B</td><td>6.65KB/sec</td><td>0.019sec</td></tr><tr><td>256B</td><td>12.78KB/sec</td><td>0.020sec</td></tr><tr><td>512B</td><td>26.50KB/sec</td><td>0.019sec</td></tr><tr><td>1KB</td><td>49.75KB/sec</td><td>0.020sec</td></tr><tr><td>2KB</td><td>95.08KB/sec</td><td>0.021sec</td></tr><tr><td>4KB</td><td>183.11KB/sec</td><td>0.022sec</td></tr><tr><td>8KB</td><td>317.41KB/sec</td><td>0.025sec</td></tr><tr><td>16KB</td><td>538.05KB/sec</td><td>0.030sec</td></tr><tr><td>32KB</td><td>812.89KB/sec</td><td>0.039sec</td></tr><tr><td>64KB</td><td>1145.00KB/sec</td><td>0.056sec</td></tr><tr><td>128KB</td><td>1436.87KB/sec</td><td>0.089sec</td></tr><tr><td>256KB</td><td>1591.58KB/sec</td><td>0.160sec</td></tr><tr><td>512KB</td><td>1572.75KB/sec</td><td>0.326sec</td></tr><nazev>LoRS
  speeds on foreign 100Mbps (FE) network</nazev></tab></p>

  <p><obr id="lors_slow_kb_eps" src="lors-slow-kb">LoRS speeds on foreign
  100Mbps (FE) network.</obr></p>

  <p><obr id="lors_slow_sec_eps" src="lors-slow-sec">LoRS latency on foreign
  100Mbps (FE) network.</obr></p>
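
  <p>The shape of the LoRS figures for the slower network
  can be illustrated with a simple model in which one read
  call costs a fixed overhead plus the time needed to
  transfer the chunk. The sketch below is only an
  illustrative assumption and not part of the original
  measurements: the function names are hypothetical, and
  the model is fitted through just two rows of the table
  (1KB and 256KB chunks).</p>

```python
def fit_latency_model(point_a, point_b):
    """Fit t(size) = overhead + size / bandwidth through two points.

    Each point is (chunk size in KB, duration of one read call in sec).
    Returns (overhead in sec, bandwidth in KB/sec).
    """
    (size_a, time_a), (size_b, time_b) = point_a, point_b
    bandwidth = (size_b - size_a) / (time_b - time_a)
    overhead = time_a - size_a / bandwidth
    return overhead, bandwidth


def predicted_speed(size_kb, overhead, bandwidth):
    """Speed in KB/sec that the model predicts for a given chunk size."""
    return size_kb / (overhead + size_kb / bandwidth)


# Two rows of the LoRS table for the slower network: 1KB and 256KB chunks.
overhead, bandwidth = fit_latency_model((1, 0.020), (256, 0.160))
print(f"overhead {overhead:.4f} sec, bandwidth {bandwidth:.0f} KB/sec")
for size_kb in (4, 64, 128):
    print(f"{size_kb:4d} KB: {predicted_speed(size_kb, overhead, bandwidth):8.1f} KB/sec")
```

  <p>With these two rows the fit gives an overhead of
  roughly 0.019 sec per call and a bandwidth of roughly
  1800 KB/sec. The prediction for 64KB chunks, about
  1170 KB/sec, is close to the measured 1145.00 KB/sec,
  which suggests that the fixed per-call overhead dominates
  for small chunks.</p>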

  <p>At first we suspected that the LoRS layer was
  responsible for this latency, so we repeated the same
  experiment using the IBP layer directly.</p>

  <p><tab id="ibp_latency_slow" sloupce="rrr"><tr><th>Chunk
  size</th><th>Downloading speed</th><th>Single read call
  duration</th></tr><tr><td>1B</td><td>0.16KB/sec</td><td>0.006sec</td></tr><tr><td>2B</td><td>0.29KB/sec</td><td>0.006sec</td></tr><tr><td>4B</td><td>0.62KB/sec</td><td>0.006sec</td></tr><tr><td>8B</td><td>1.22KB/sec</td><td>0.006sec</td></tr><tr><td>16B</td><td>2.43KB/sec</td><td>0.006sec</td></tr><tr><td>32B</td><td>5.04KB/sec</td><td>0.006sec</td></tr><tr><td>64B</td><td>9.82KB/sec</td><td>0.006sec</td></tr><tr><td>128B</td><td>19.32KB/sec</td><td>0.006sec</td></tr><tr><td>256B</td><td>38.15KB/sec</td><td>0.006sec</td></tr><tr><td>512B</td><td>69.12KB/sec</td><td>0.007sec</td></tr><tr><td>1KB</td><td>155.51KB/sec</td><td>0.006sec</td></tr><tr><td>2KB</td><td>278.38KB/sec</td><td>0.007sec</td></tr><tr><td>4KB</td><td>478.45KB/sec</td><td>0.008sec</td></tr><tr><td>8KB</td><td>642.54KB/sec</td><td>0.012sec</td></tr><tr><td>16KB</td><td>848.10KB/sec</td><td>0.018sec</td></tr><tr><td>32KB</td><td>1049.23KB/sec</td><td>0.030sec</td></tr><tr><td>64KB</td><td>1241.19KB/sec</td><td>0.052sec</td></tr><tr><td>128KB</td><td>1633.99KB/sec</td><td>0.078sec</td></tr><tr><td>256KB</td><td>1637.22KB/sec</td><td>0.156sec</td></tr><tr><td>512KB</td><td>1715.69KB/sec</td><td>0.298sec</td></tr><nazev>IBP
  speeds on foreign 100Mbps (FE) network</nazev></tab></p>

  <p><obr id="ibp_slow_kb_eps" src="ibp-slow-kb">IBP
  speeds on foreign 100Mbps (FE) network.</obr></p>

  <p><obr id="ibp_slow_sec_eps" src="ibp-slow-sec">IBP latency on foreign
  100Mbps (FE) network.</obr></p>

  <p>We also compared the IBP transfer speed with the
  single-stream FTP transfer speed obtained with bbftp <cite
  href="bbftp" /> on the slower network.</p>

  <p><tab id="bbftp_latency_slow" sloupce="rrr"><tr><th>Chunk
  size</th><th>Downloading speed</th><th>Single read call
  duration</th></tr><tr><td>1B</td><td>0KB/sec</td><td>0.005sec</td></tr><tr><td>2B</td><td>0KB/sec</td><td>0.008sec</td></tr><tr><td>4B</td><td>0KB/sec</td><td>0.008sec</td></tr><tr><td>8B</td><td>1KB/sec</td><td>0.011sec</td></tr><tr><td>16B</td><td>0KB/sec</td><td>0.047sec</td></tr><tr><td>32B</td><td>3KB/sec</td><td>0.008sec</td></tr><tr><td>64B</td><td>1KB/sec</td><td>0.048sec</td></tr><tr><td>128B</td><td>3KB/sec</td><td>0.037sec</td></tr><tr><td>256B</td><td>28KB/sec</td><td>0.009sec</td></tr><tr><td>512B</td><td>40KB/sec</td><td>0.012sec</td></tr><tr><td>1KB</td><td>84KB/sec</td><td>0.012sec</td></tr><tr><td>2KB</td><td>105KB/sec</td><td>0.019sec</td></tr><tr><td>4KB</td><td>500KB/sec</td><td>0.008sec</td></tr><tr><td>8KB</td><td>331KB/sec</td><td>0.024sec</td></tr><tr><td>16KB</td><td>553KB/sec</td><td>0.029sec</td></tr><tr><td>32KB</td><td>915KB/sec</td><td>0.035sec</td></tr><tr><td>64KB</td><td>945KB/sec</td><td>0.068sec</td></tr><tr><td>128KB</td><td>1330KB/sec</td><td>0.096sec</td></tr><tr><td>256KB</td><td>1460KB/sec</td><td>0.176sec</td></tr><tr><td>512KB</td><td>1610KB/sec</td><td>0.317sec</td></tr><nazev>bbftp
  speeds on foreign 100Mbps (FE) network</nazev></tab></p>

  <p><obr id="bbftp_slow_kb_eps" src="bbftp-slow-kb">bbftp speeds on foreign
  100Mbps (FE) network.</obr></p>

  <p><obr id="bbftp_slow_sec_eps" src="bbftp-slow-sec">bbftp latency on
  foreign 100Mbps (FE) network.</obr></p>

  <seznamknih>
    <kniha id="Meta">MetaCenter Project, <adresa>http://meta.cesnet.cz/</adresa></kniha>

    <kniha id="SIG02">Beck M., Moore T., Plank J. S.
    <uv>An End-to-End Approach to Globally Scalable Network
    Storage</uv>. SIGCOMM 2002. <adresa>http://loci.cs.utk.edu/ibp/files/pdf/SIGCOMM02p1783-beck.pdf</adresa></kniha>

    <kniha id="IBP">Internet Backplane Protocol,
    <adresa>http://loci.cs.utk.edu/ibp/</adresa></kniha>

    <kniha id="LoCI">Logistical Computing and
    Internetworking Lab, <adresa>http://loci.cs.utk.edu</adresa></kniha>

    <kniha id="transcode">Linux Video Stream Processing
    Tool. <adresa>http://www.theorie.physik.uni-goettingen.de/%7Eostreich/transcode/</adresa>
    and <adresa>http://zebra.fh-weingarten.de/%7Etranscode/</adresa></kniha>

    <kniha id="DEE">Distributed Encoding Environment
    project, <adresa>http://sitola.fi.muni.cz/projekty/strizna/strizna.html</adresa>.</kniha>

    <kniha id="Mplayer">Linux Media Player Tool.
    <adresa>http://www.mplayerhq.hu</adresa></kniha>

    <kniha id="PBS">Portable Batch System: OpenPBS (<adresa>http://www.openpbs.org/</adresa>)
    and PBSPro (<adresa>http://www.pbspro.com/</adresa>)</kniha>

    <kniha id="EDG">EU DataGrid Project (<adresa>http://www.eu-datagrid.org/</adresa>)</kniha>

    <kniha id="IOZone">Disc I/O Benchmark Tool.
    <adresa>http://www.iozone.org</adresa></kniha>

    <kniha id="bbftp">bbFTP -- Large files transfer
    protocol. <adresa>http://doc.in2p3.fr/bbftp/</adresa></kniha>
  </seznamknih>
</zprava>
