Psock: A Parallel Socket Library

CESNET technical report number 12/2006
also available in PDF, PostScript, and XML formats.

Sven Ubik, Martin Čížek
30.11.2006

1   Abstract

We present design of a new parallel socket library called psock, which allows easy introduction of parallel transfers into network applications. Psock can effectively utilize unequal and varying available bandwidth of individual connections. We discuss requirements on socket buffer sizes, which are different from singular transfers. Finally, we present performance measurements in various scenarios.

Keywords: parallel transfers, network performance, buffer sizing, TCP performance

2   Why parallel data transfers?

Congestion control in standard TCP cannot utilize all available bandwidth in fast long-distance networks because its AIMD(1, 0.5) algorithm for adjusting congestion window (cwnd) is slow for networks with high bandwidth×delay products. One option is to use faster cwnd adjusting algorithm. This approach is used in various "fast TCP" variants. A comparison of several most commonly known TCP variants has been presented in [fasttcp]. Another option is to use several TCP connections in parallel.

A parallel transfer has several advantages. It can use several physical paths from the sender to the receiver and it does not require root access to patch, fine tune and reboot the operating system. On the other hand, its drawback lies in the necessity to modify the application that is to use parallel transfers. The underlying transport protocol for individual transfers of a parallel transfer can be standard TCP, some of the "fast TCP" variants or another transport protocol.

3   Possibilities of parallelism

Parallel communication can be implemented in different layers of a computer network:

Link layer
Frames can be distributed into alternative network links by a link-layer load sharing mechanism, such as Cisco EtherChannel or Spanning Tree Protocol (STP). The latter is a protocol primarily designed to create a loop-free layer 2 topology, but it can also be used for load balancing when multiple VLANs are created.
Network layer
Packets can be distributed into alternative network paths by static or dynamic routing. Routing protocols such as BGP or OSPF provide for multiple paths to a single destination IP subnet. Packets can also be sent to alternative destination IP addresses (and they may consequently use alternative network paths) with the help of dynamic DNS resolution.
Transport layer
Data can be transfered over several transport connections opened by the transport layer on the sending and receiving hosts.
Application layer
Data can be transfered over several transport connections opened by the sending and receiving parts of the application.

A link layer or network layer solution must be configured by a network operator. It cannot be influenced by application users, but it is transparent to the application.

A transport layer solution requires support in operating system on the communicating end hosts. It can be configured by users and it can be transparent to the application.

An application layer solution requires support inside the application and can therefore be easily configured by the user.

4   Related work

PSockets [psockets] is a C++ library, which implements parallel data transfers at the transport layer using a round-robin method of distributing data into individual transfers. As a result, throughput of the parallel transfer is limited by n times throughput of the slowest individual transfer, where n is the number of individual transfers. The primary goal of PSockets was to provide faster congestion window adjustment in the case when we cannot modify congestion control of the underlying TCP implementation (such as when we do not have root access).

SCTP [sctp] implements several individual transfers (called streams) using a single pair of source and destination ports, which allows easy passing through firewalls and NATs. It can use multiple IP addresses on the sending or receiving host, possibly interconnected by physically separate network paths. SCTP is a different transport protocol than TCP, applications must be therefore programmed to use SCTP.

Some applications use HTTP to implement parallel transfers at the application layer by opening several HTTP connections and requesting different parts of the transfered file by the Range header line. This solution can be used only when transfering stored files, it cannot be used when transfering data passed over a pipe from another application.

Similarly, some applications (e.g., [gridftp]) use FTP to implement parallel transfers by requesting different parts of the transfered file in separate FTP data connections.

pTCP [ptcp] is a modified version of TCP that can stripe data into multiple connections and tries to utilize their possibly varying achievable throughput. A practical difficulty with pTCP is that it requires a replacement of the transport protocol implementation in operating systems on both end hosts.

pscp is a parallel version of a popular scp tool for secure data transfer. It can be used also for transfering data passed over a pipe from another application.

5   Requirements and design decisions

We defined the following set of requirements to be fulfilled by our implementation of parallel transfers:

Based on the above requirements we decided to implement parallel transfers at the transport layer by an alternative socket library, called psock. This parallel socket library is linked with the application code instead of the standard BSD socket library. The syntax of most calls to library functions remain unchanged, the application code requires only minimal modifications in order to use parallel transfers. Parameters of parallel transfers can be configured using standard setsocktopt() and getsockopt() calls.

We decided to use TCP for individual transfers. Using TCP has the following advantages:

At the same time, TCP has also certain limitations:

In principle, there are two ways that the application code and the parallel socket library code could communicate - synchronous and asynchronous.

In synchronous communication, the code of one part is called from the code of the other part. That is, when the application code calls a function from the parallel socket library, the code implementing parallel transfers is executed. Synchronous communication would most likely result in ineffient use of time. Individual transfers can only be attended when the application code calls some function from the parallel socket library. At that time, a lot of accumulated events could delay return to the application code. The problem could be alleviated by attaching code dealing with individual transfers to some other system calls, such as poll(), select() or sleep().

Therefore, we decided to use a more efficient asynchronous communication, where parts of the code that can wait for some event run in independent threads.

6   Psock implementation

In our implementation we use two kinds of threads. The first is a library thread, which runs the application code and the library API code. The second is a protocol thread, which runs code that distributes data into individual transfers and controls the protocol behaviour. There is one protocol thread for each parallel transfer on the given machine. The architecture is shown in Figure.

[Figure]

Figure 1: Psock architecture

The library thread and the protocol thread communicate over interface that we call PSCIF (Parallel Socket Control Interface). This interface can be implemented using shared memory, UNIX sockets, message queues or in a future version it can also use TCP or UDP sockets or MPI (Message Passing Interface) library. In the current implementation we use message queues.

The protocol thread uses a control socket (1) to exchange control messages with the psock peer. These control messages serve to negotiate the number of individual transfers to use, to announce data socket port numbers, to agree on the parallel transfer driver (how is the data distributed into individual transfers, see below) and to exchange other information about a parallel transfer. Each peer announces its recommended nrcmd and maximum nmax number of individual transfers to establish. The number of individual transfers that will be used is determined as

min(max(nrcmd1,nrcmd2),nmax1,nmax2).

The protocol thread listens for events on data sockets (2) and reads block headers from them. The protocol thread then uses events on data sockets to prepare a parallel transfer schedule table. The library thread uses this table to control multiplexing and demultiplexing (3) of data to and from data sockets. In this way, events on data sockets are processed as soon as possible by the protocol thread and when the application wants to read or write data, the library thread already knows what individual transfers it should use (4). The schedule table also guarantees that the library thread cannot use a data socket at the same time when it is used by the protocol thread.

PSCIF communication includes three types of messages. Dispatching Instructions (5) are used for filling the parallel transfer schedule table. PSCIF Control Messages (6) are used for parallel socket manipulation (assigning a name, setting listen operations, accepting connections, setting/quering options, etc.). PSCIF Data Transfer (5) is a rarely used operation with the purpose to pass chunks of payload data when psock is about to be reconfigured (for example if some socket hangs).

6.1   Parallel transfer schedule table

The structure of the parallel transfer schedule table is shown in Table. It is formed by one circular buffer for data to be sent and one circular buffer for data to be received. The buffers do not store the data itself. Each buffer item includes the number of the individual transfer that should be used next to send or receive data and the number of bytes that can be sent or received using this individual transfer. There are also two pointers indicating which item should be used next in the sending and receiving buffer by the library thread. If this item is empty (no individual transfer is available for sending or receiving), then the library thread must wait until the table is filled by the protocol thread. The library thread can send or receive less than the number of bytes available in some individual transfer. For instance, the parallel transfer schedule table can look as shown in Table and the application may want to read 1600 bytes. Then the application will read 1400 bytes from individual transfer 2 and 200 bytes from individual transfer 0. It will leave 1900 bytes available in individual transfer 0 for future reading.

Read item: 0, Write item: 1
Buffer item 0 1 2
Read next data from this individual transfer/Read bytes 2/1400 0/2100 -
Write next data to this individual transfer/Write bytes - 0/2400 1/1600

Table 1: Parallel transfer schedule table

6.2   Round-robin driver

The method of distributing data into individual transfers is determined by the parallel transfer driver. The driver is implemented as a set of functions divided into the protocol thread and the library thread. The type of the parallel transfer driver to be used is agreed upon using control messages at the parallel connection startup. We implemented two parallel transfer drivers.

The first is the round-robin parallel transfer driver. The sender splits data into blocks of the same size and sends them over individual transfers in a round-robin fashion. The receiver retrieves data blocks from individual transfers in the same round-robin order. The round-robin driver is simple, but its disadvantage is that blocking of one individual transfer blocks the whole parallel transfer.

Throughput of a parallel transfer c with the round-robin driver is limited by a multiple of throughput of the slowest individual transfer: c=n mini ci, where ci is throughput of i-th individual transfer and n is the number of individual transfers.

6.3   Poll-all driver

The second parallel transfer driver is called poll-all, because it uses poll() system call on all sockets of individual transfers. Data is distributed into individual transfers as they are available for sending. When some sending socket buffer become full because its individual transfer is slow, the sender will distribute further data into other individual transfers. We can utilize different and varying available bandwidth of individual transfers. Data blocks can arrive at the receiver in different order than they were sent by the sender. Therefore each data block is preceeded by a header with a block number. The protocol thread on the receiver side waits for events on individual transfers, reads block numbers and prepares the parallel transfer schedule table. If some block arrives so early that it cannot fit in the table, its number is added to farsched_list structure connected to the item in the table, where it should fit later.

Throughput of a parallel transfer c with the poll-all driver should be the sum of achievable throughputs of individual transfers:

c = c1 + c2 + ... + cn,

where ci is throughput of i-th individual transfer and n is the number of individual transfers.

6.4   Socket buffer sizing

Proper sizes of sending and receiving socket buffers are important in order to achieve maximum possible throughput of a parallel transfer.

6.4.1   Individual TCP connection

In an individual TCP connection, socket buffers have two purposes:

When a receiver detects missing packets, it continues to store further received data in its socket buffer and waits until missing data arrives. A sender must also store data already sent until acknowledgement is received. It takes one RTT until sender can detect a missing acknowledgement or until receiver can get retransmitted data. A NewReno type of TCP now commonly used tries to retransmit several missing packets in several consecutive RTT periods until all sent data is acknowledged or until sender retransmission timer expires. When a sequence of n consecutive data segments gets lost, this mechanism works when socket buffers can store the volume of data B = n×RTT×bw, where RTT is round-trip time and bw is achievable throughput. However, the problem is which RTT and bw should be taken in this calculation. When we exceed capacity of router buffers, we decrease throughput due to increased packet loss. An interesting analysis of socket buffer sizing in relation to the size of the buffer in a router before the bottleneck link was presented in [buffers].

Some operating systems separate a part of the receiving socket buffer used to store data until the application can retrieve it from the rest of the receiving socket buffer used for stream reassembly. For example, in Linux the division of the receiving socket buffer can be configured by app_window and window_clamp sysctl variables. The motivation is that applications usually retrieve data in chunks and the volume of unread data in socket buffer therefore fluctuates. A separate part of buffer then prevents these fluctuations to propagate to the sender in the receiver window (rwin) of available buffer space announced by the receiver. Linux also does several internal modifications to configured buffer sizes, which should be considered when setting buffer sizes, see Figure.

[Figure]

Figure 2: Internal arithmetics applied by Linux to TCP buffer and window sizes

6.4.2   Parallel transfer

In a parallel transfer, socket buffers have an additional purpose:

Suppose that we have two individual transfers with different delay between the sender and receiver. Packets will be distributed on the sender into the two individual transfers in proportion of their achievable throughputs (when poll-all driver is used). However, packets arriving on a transfer with shorter delay will always have to wait for packets arriving on a transfer with longer delay for reassembly of the original stream that was split into individual transfers. This situation is illustrated in Figure.

[Figure]

Figure 3: Individual transfers with unequal one-way delays

The total delay consists from a) network delay on lines and in router buffers, b) delay in the sending socket buffer. The latter depends on the size of the sending socket buffer and the speed by which it is emptied to the network. When we do not provide receiver buffer space for this waiting, one individual connection will be blocked by another.

7   Psock one-path evaluation

In this section we demonstrate how psock performs over a single path between the sender and receiver.

To test psock behaviour we used setup shown in Figure with network path characteristics emulated by NIST Net [nistnet]. We configured link throughput to 100 Mb/s and RTT to 40 ms (roughly RTT across Europe). The path capacity was therefore

C = 100 Mb/s × 40 ms = 4 Mb = 500 kB.

We configured sender and receiver socket buffers to 0.2, 0.8, 1.0 and 1.5 multiple of C. These were actual buffer sizes after Linux internal arithmetics [linux-buffers]. We transfered 100 MB of data in each test run.

[Figure]

Figure 4: Testbed with NIST Net emulator

We used psock-testperf utility included in the psock distribution to generate and receive traffic. We did memory to memory transfers using the following commands:

# Start receiver on PC2
psock-testperf -recv -out mem -listen
# Start sender on PC1
psock-testperf -xmit -in mem -len 100000000 -connect pc2

It is also possible to do a disk to disk test with psock-testperf utility:

# Start receiver on PC2
psock-testperf -recv -out file -file dumpfile -listen
# Start sender on PC1
psock-testperf -xmit -in file -file sourcefile -connect pc2

7.1   One path, no losses

When we have one physical path between the sender and the receiver and the connection without losses, such as over a dedicated circuit, psock can still increase throughput in cases when we cannot configure socket buffers to cover the path capacity for some reason.

The measured throughput for different number of individual transfers is shown in Figure. The case of one individual transfer is equivalent to using a single TCP connection (not a parallel transfer). We can see that the parallel transfer increased throughput when the socket buffer size of an individual transfer was less than C. There was also some small increase in throughput of the parallel transfer for buffer sizes greater than C, because the cummulative congestion window (cwnd) of the parallel transfer approached sooner the path capacity in the slow start phase, but this is not important for transfers of high volumes of data.

[Figure]

Figure 5: Throughput over one path without losses

7.2   One path, packet losses

When packet losses occur during a TCP connection, its congestion window is managed by TCP congestion avoidance algorithm. A parallel transfer has its cummulative congestion window increased at the speed of a sum of increases of congestion windows of individual transfers. Therefore, throughput of a parallel transfer increases with the number of individual transfers until it gets most of the available bandwidth or until it increases packet loss so much that injecting more traffic into the network actually decreases throughput.

To test psock behaviour we used the same testbed as in the previous example and we configured NIST Net to emulate packet loss of 10-5. The measured throughput is shown in Figure. We can see that increasing the number of individual transfers also significantly increased total throughput, until we reached about 90% of available bandwidth.

[Figure]

Figure 6: Throughput over one path with packet loss of 10-5.

8   Test of parallel transfer drivers

In this section we demonstrate how psock performs over multiple paths between the sender and receiver. We particularly show behaviour of different parallel transfer drivers, that is the methods of distributing data into individual transfers.

In order to use multiple physical paths, the psock-testperf utility must be directed to listen on multiple local interfaces or to connect to multiple remote IP addresses. An application can do this by calling setpsockopt() with PSOCK_ADD_LOCAL_ADDR:

# Start receiver on PC2
psock-testperf -recv -out mem -listen \
   -max 2 -rcmd 2 -add_addr 10.0.0.2 -add_addr 10.0.1.2
# Start sender on PC1
psock-testperf -xmit -in mem -len 100000000 \
   -max 2 -rcmd 2 -connect 10.0.0.2 -connect 10.0.1.2

8.1   NIST Net testbed

We first used NIST Net to emulate two links between the sender and receiver. The emulated bandwidth of the first link varied from 0.1 to 0.9 in 0.1 steps of the emulated bandwidth of the second link. The sum of bandwidth of both links was 100 Mb/s. Ideally, the parallel transfer should be able to utilize bandwidth of both links even when one link is slower than the other. The measured throughput is shown in Figure.

[Figure]

Figure 7: Throughput over two paths emulated by NIST Net

When the round-robin driver was used, throughput of each individual transfer was equal to the slowest individual transfer. In our case the total throughput was twice the throughput of the slower individual transfer. On the other hand, when the poll-all driver was used, the total throughput was independent of the ratio of bandwidth in each individual transfer and was constantly near the sum of the installed bandwidth at about 95 Mb/s.

8.2   Real network

We then tested psock over two distinct physical paths between the sender and receiver, which were located about 260 km apart in two cities. The setup is illustrated in Figure.

[Figure]

Figure 8: Testbed with two real network paths

In the first test, one path had bandwidth of 155 Mb/s and the other path had bandwidth of 620 Mb/s. The bandwidth was configured by using different channels in a SONET circuit. RTT on an empty network was 3.1 ms.

We measured TCP throughput using iperf with socket buffer sizes set to slightly above the bandwidth * RTT product. The measured results can be summarized as follows:

In the second test, we used the same testbed as in the previous example, but we configured both links for the same bandwidth of 310 Mb/s. We measured TCP throughput of a parallel transfer with each of the drivers for 15 seconds in three scenarious, when we added another TCP stream to the network as cross-traffic as follows:

  1. Cross-traffic stream added to link 1 from 5 to 10 seconds

  2. Cross-traffic stream added to link 1 from 5 to 8 seconds and to link 2 from 8 to 11 seconds (the two cross-traffic streams did not overlap)

  3. Cross-traffic stream added to link 1 from 5 to 8 seconds and to link 2 from 7 to 10 seconds (the two cross-traffic streams overlapped)

Measured throughput for both drivers and each of the above cases is shown in Figure through Figure. The figures also include throughput of cross-traffic streams at the bottom. You can see that the poll-all driver was able to keep significantly higher throughput of the parallel transfer than the round-robin driver, by utilizing available bandwidth of a less loaded link, with only little effect on cross-traffic streams. The throughput was also more stable with the poll-all driver.

[Figure]

Figure 9: Throughput over two real links, cross-traffic on one link

[Figure]

Figure 10: Throughput over two real links, cross-traffic on both links

[Figure]

Figure 11: Throughput over two real links, overlapping cross-traffic on both links

9   Code example

We present below a code example illustrating the use of psock library and its similarity to the standard BSD socket library interface. Note that most calls are the same except that the letter "p" is prepended. For clarity, we do not check returned values of called functions for errors, which should normally be done.

#define SERVADDR "127.0.0.1" /* server address */
#define SERVPORT 2323        /* server port */

/* Usage:
 * run server: psock_sample (without arguments)
 * run client: psock_sample c */

int main(int argc, char *argv[])
{
    struct sockaddr_in servaddr; /* server address */
    int ps;                      /* parallel socket descriptor*/
    int streams;                 /* for setting stream counts */

    /* Create a parallel socket. */
    ps = psocket(AF_INET, SOCK_STREAM, IPPROTO_PSOCK);

    /* Set server address. */
    memset(&servaddr, 0, sizeof servaddr);
    inet_aton(SERVADDR, &addr.sin_addr);
    
    servaddr.sin_family = AF_INET;
    servaddr.sin_port = htons(SERVPORT);

    /* Set the default recommended data stream count. */
    streams = 3;
    psock_setopt(PSOCK_RCMD_STREAM_CNT, &streams, sizeof streams);
    
    /* Set the default maximal data stream count. */
    streams = 5;
    psock_setopt(PSOCK_MAX_STREAM_CNT, &streams, sizeof streams);
    
    if (argc <= 1 || argv[1][0] != 'c') {
        /* Server */
        int i, ps2;
        struct pstream_int pi;
    
        pbind(ps, (struct sockaddr*)&servaddr, sizeof servaddr);
        
        servaddr.sin_port = htons(2324);
        plisten(ps, 1);
        
        /* Add the following 5 ports to be used partial data transfers. */
        for (i = 1; i <= 5; i++) {
            servaddr.sin_port = htons(SERVPORT + i);
            setpsockopt(ps, SOL_PSOCK, PSOCK_ADD_LOCAL_ADDR,
                            &servaddr, sizeof servaddr);
        }
        ps2 = paccept(ps, NULL, NULL);
        
        /* Set send socket buffer of the 1st partial data stream. */
        pi.index = 0;
        pi.value = 500000;
        setpsockopt(ps2, SOL_PSOCK, PSOCK_SNDBUF, &pi, sizeof pi);
        
#define MSG "Hello there!\n"
        r = psend(ps2, MSG, sizeof MSG - 1, 0));
        
        psclose(ps2;
    } else {
        /* Client */
        char buf[128];
        int r;

        pconnect(ps, (struct sockaddr*)&servaddr, sizeof servaddr);
        precv(ps, buf, sizeof buf, 0));
        
        /* Write received message to the output. */
        write(1, buf, r);
        psclose(ps2;
    }
    
    return 0;
}

10   Conclusion

We presented an architecture and implementation of a parallel socket library called psock. This library allows easy use of parallel data transfers in applications to increase throughput. In contrast to other approaches to parallel data transfers, psock can use alternative drivers to distribute data into individual transfers. A poll-all driver provided as part of psock distribution can utilize different and varying available bandwidth of individual transfers. Psock uses a networking stack installed on end stations without modifications (a standard TCP or some kind of a "fast TCP"), that is it does not require kernel patching. Psock can be downloaded from http://www.ces.net/project/qosip#psock.

11   References

References

[fasttcp] Yee-Ting Li, Douglas Leith, Robert N. Shorten. Experimental Evaluation of TCP Protocols for High-Speed Networks, Hamilton Institute, 2005.
[psockets] H. Sivakumar, S. Bailey, R. L. Grossman. PSockets: The Case for Application-level Network Striping for Data Intensive Applications using High Speed Wide Area Networks, Supercomputing 2000, Dallas, Texas, USA.
[sctp] R. Stewart et al. Stream Control Transmission Protocol, RFC2960, Internet Engineering Task Force, October 2000.
[gridftp] W. Allcock et al. GridFTP: Protocol Extension for FTP for the Grid, GGF Document Series GFD.20, April 2003.
[ptcp] Hung-Yun Hsieh, Raghupathy Sivakumar. pTCP: An End-to-End Transport Layer Protocol for Striped Connection, Proceedings of IEEE ICNP, 2002.
[buffers] Manish Jain, Ravi S. Prasad, Constantinos Dovrolis. The TCP Bandwidth-Delay Product revisited: network buffering, cross traffic, and socket buffer auto-sizing, Technical Report, GIT-CERCS-03-02, Georgia Tech, 2003.
[nistnet] Mark Carson, Darrin Santay. NIST Net - A Linux-based Network Emulation Tool, Computer Communication Review, June 2003. Available online.
[linux-buffers] Sven Ubik, Pavel Cimbál. Achieving Reliable High Performance in LFNs, TNC 2003, Zagreb, Croatia.
další weby:fond rozvojemetacentrumCzechLightpřenosyvideoservereduroameduID.cz