Psock: A Parallel Socket Library
CESNET
technical report number 12/2006
also available in PDF,
PostScript, and
XML formats.
Sven Ubik, Martin Čížek
30.11.2006
1 Abstract
We present design of a new parallel socket library called psock, which allows easy introduction of parallel transfers into network applications. Psock can effectively utilize unequal and varying available bandwidth of individual connections. We discuss requirements on socket buffer sizes, which are different from singular transfers. Finally, we present performance measurements in various scenarios.
Keywords: parallel transfers, network performance, buffer sizing, TCP performance
2 Why parallel data transfers?
Congestion control in standard TCP cannot utilize all available bandwidth in fast long-distance networks because its AIMD(1, 0.5) algorithm for adjusting congestion window (cwnd) is slow for networks with high bandwidth×delay products. One option is to use faster cwnd adjusting algorithm. This approach is used in various "fast TCP" variants. A comparison of several most commonly known TCP variants has been presented in [fasttcp]. Another option is to use several TCP connections in parallel.
A parallel transfer has several advantages. It can use several physical paths from the sender to the receiver and it does not require root access to patch, fine tune and reboot the operating system. On the other hand, its drawback lies in the necessity to modify the application that is to use parallel transfers. The underlying transport protocol for individual transfers of a parallel transfer can be standard TCP, some of the "fast TCP" variants or another transport protocol.
3 Possibilities of parallelism
Parallel communication can be implemented in different layers of a computer network:
- Link layer
- Frames can be distributed into alternative network links by a link-layer load sharing mechanism, such as Cisco EtherChannel or Spanning Tree Protocol (STP). The latter is a protocol primarily designed to create a loop-free layer 2 topology, but it can also be used for load balancing when multiple VLANs are created.
- Network layer
- Packets can be distributed into alternative network paths by static or dynamic routing. Routing protocols such as BGP or OSPF provide for multiple paths to a single destination IP subnet. Packets can also be sent to alternative destination IP addresses (and they may consequently use alternative network paths) with the help of dynamic DNS resolution.
- Transport layer
- Data can be transfered over several transport connections opened by the transport layer on the sending and receiving hosts.
- Application layer
- Data can be transfered over several transport connections opened by the sending and receiving parts of the application.
A link layer or network layer solution must be configured by a network operator. It cannot be influenced by application users, but it is transparent to the application.
A transport layer solution requires support in operating system on the communicating end hosts. It can be configured by users and it can be transparent to the application.
An application layer solution requires support inside the application and can therefore be easily configured by the user.
4 Related work
PSockets [psockets] is a C++ library, which implements
parallel data transfers at the transport layer using a round-robin
method of distributing data into individual transfers. As a result,
throughput of the parallel transfer is limited by n times
throughput of the slowest individual transfer, where n is the
number of individual transfers. The primary goal of PSockets was to
provide faster congestion window adjustment in the case when we cannot
modify congestion control of the underlying TCP implementation (such
as when we do not have root access).
SCTP [sctp] implements several individual transfers (called streams) using a single pair of source and destination ports, which allows easy passing through firewalls and NATs. It can use multiple IP addresses on the sending or receiving host, possibly interconnected by physically separate network paths. SCTP is a different transport protocol than TCP, applications must be therefore programmed to use SCTP.
Some applications use HTTP to implement parallel transfers at the
application layer by opening several HTTP connections and requesting
different parts of the transfered file by the Range header
line. This solution can be used only when transfering stored files, it
cannot be used when transfering data passed over a pipe from another
application.
Similarly, some applications (e.g., [gridftp]) use FTP to implement parallel transfers by requesting different parts of the transfered file in separate FTP data connections.
pTCP [ptcp] is a modified version of TCP that can stripe data into multiple connections and tries to utilize their possibly varying achievable throughput. A practical difficulty with pTCP is that it requires a replacement of the transport protocol implementation in operating systems on both end hosts.
pscp is a parallel version of a popular scp tool for secure data transfer. It can be used also for transfering data passed over a pipe from another application.
5 Requirements and design decisions
We defined the following set of requirements to be fulfilled by our implementation of parallel transfers:
-
It should be easy to introduce parallel transfers into any network application that uses standard BSD socket library. Such applications should require minimal modifications.
-
Distribution of data into individual transfers should efficiently utilize achievable throughput of individual transfers. A slower individual transfer should not limit a faster one.
-
The method of distribution of data into individual transfers should be configurable, alternative methods should be possible.
-
Waiting of one component of implementation should not block any other component. Parallel execution of components should be possible on a parallel computer.
-
There should be an easy to use mechanism to configure various parameters of parallel transfers.
Based on the above requirements we decided to implement parallel transfers at the transport layer by an alternative socket library, called psock. This parallel socket library is linked with the application code instead of the standard BSD socket library. The syntax of most calls to library functions remain unchanged, the application code requires only minimal modifications in order to use parallel transfers. Parameters of parallel transfers can be configured using standard setsocktopt() and getsockopt() calls.
We decided to use TCP for individual transfers. Using TCP has the following advantages:
-
It is a universally accepted protocol with proven stability
-
It provides congestion control for individual transfers, we do not have to provide it ourselves inside the parallel transfer protocol
-
It has known interactions with network infrastructure (routers, firewalls, NATs)
At the same time, TCP has also certain limitations:
-
Doing congestion control separately for individual transfers may not be optimal for the parallel transfer
-
Retransmissions of lost packets must be done over the same individual transfer over which they were originally sent and which may be suffering from overload
In principle, there are two ways that the application code and the parallel socket library code could communicate - synchronous and asynchronous.
In synchronous communication, the code of one part is called from
the code of the other part. That is, when the application code calls a
function from the parallel socket library, the code implementing
parallel transfers is executed. Synchronous communication would most
likely result in ineffient use of time. Individual transfers can only
be attended when the application code calls some function from the
parallel socket library. At that time, a lot of accumulated events
could delay return to the application code. The problem could be
alleviated by attaching code dealing with individual transfers to some
other system calls, such as poll(), select() or
sleep().
Therefore, we decided to use a more efficient asynchronous communication, where parts of the code that can wait for some event run in independent threads.
6 Psock implementation
In our implementation we use two kinds of threads. The first is a library thread, which runs the application code and the library API code. The second is a protocol thread, which runs code that distributes data into individual transfers and controls the protocol behaviour. There is one protocol thread for each parallel transfer on the given machine. The architecture is shown in Figure.
The library thread and the protocol thread communicate over interface that we call PSCIF (Parallel Socket Control Interface). This interface can be implemented using shared memory, UNIX sockets, message queues or in a future version it can also use TCP or UDP sockets or MPI (Message Passing Interface) library. In the current implementation we use message queues.
The protocol thread uses a control socket (1) to exchange control messages with the psock peer. These control messages serve to negotiate the number of individual transfers to use, to announce data socket port numbers, to agree on the parallel transfer driver (how is the data distributed into individual transfers, see below) and to exchange other information about a parallel transfer. Each peer announces its recommended nrcmd and maximum nmax number of individual transfers to establish. The number of individual transfers that will be used is determined as
min(max(nrcmd1,nrcmd2),nmax1,nmax2).
The protocol thread listens for events on data sockets (2) and reads block headers from them. The protocol thread then uses events on data sockets to prepare a parallel transfer schedule table. The library thread uses this table to control multiplexing and demultiplexing (3) of data to and from data sockets. In this way, events on data sockets are processed as soon as possible by the protocol thread and when the application wants to read or write data, the library thread already knows what individual transfers it should use (4). The schedule table also guarantees that the library thread cannot use a data socket at the same time when it is used by the protocol thread.
PSCIF communication includes three types of messages. Dispatching Instructions (5) are used for filling the parallel transfer schedule table. PSCIF Control Messages (6) are used for parallel socket manipulation (assigning a name, setting listen operations, accepting connections, setting/quering options, etc.). PSCIF Data Transfer (5) is a rarely used operation with the purpose to pass chunks of payload data when psock is about to be reconfigured (for example if some socket hangs).
6.1 Parallel transfer schedule table
The structure of the parallel transfer schedule table is shown in Table. It is formed by one circular buffer for data to be sent and one circular buffer for data to be received. The buffers do not store the data itself. Each buffer item includes the number of the individual transfer that should be used next to send or receive data and the number of bytes that can be sent or received using this individual transfer. There are also two pointers indicating which item should be used next in the sending and receiving buffer by the library thread. If this item is empty (no individual transfer is available for sending or receiving), then the library thread must wait until the table is filled by the protocol thread. The library thread can send or receive less than the number of bytes available in some individual transfer. For instance, the parallel transfer schedule table can look as shown in Table and the application may want to read 1600 bytes. Then the application will read 1400 bytes from individual transfer 2 and 200 bytes from individual transfer 0. It will leave 1900 bytes available in individual transfer 0 for future reading.
| Read item: 0, Write item: 1 | |||
| Buffer item | 0 | 1 | 2 |
|---|---|---|---|
| Read next data from this individual transfer/Read bytes | 2/1400 | 0/2100 | - |
| Write next data to this individual transfer/Write bytes | - | 0/2400 | 1/1600 |
Table 1: Parallel transfer schedule table
6.2 Round-robin driver
The method of distributing data into individual transfers is determined by the parallel transfer driver. The driver is implemented as a set of functions divided into the protocol thread and the library thread. The type of the parallel transfer driver to be used is agreed upon using control messages at the parallel connection startup. We implemented two parallel transfer drivers.
The first is the round-robin parallel transfer driver. The sender splits data into blocks of the same size and sends them over individual transfers in a round-robin fashion. The receiver retrieves data blocks from individual transfers in the same round-robin order. The round-robin driver is simple, but its disadvantage is that blocking of one individual transfer blocks the whole parallel transfer.
Throughput of a parallel transfer c with the round-robin driver is limited by a multiple of throughput of the slowest individual transfer: c=n mini ci, where ci is throughput of i-th individual transfer and n is the number of individual transfers.
6.3 Poll-all driver
The second parallel transfer driver is called poll-all,
because it uses poll() system call on all sockets of
individual transfers. Data is distributed into individual transfers as
they are available for sending. When some sending socket buffer become
full because its individual transfer is slow, the sender will
distribute further data into other individual transfers. We can
utilize different and varying available bandwidth of individual
transfers. Data blocks can arrive at the receiver in different order
than they were sent by the sender. Therefore each data block is
preceeded by a header with a block number. The protocol thread on the
receiver side waits for events on individual transfers, reads block
numbers and prepares the parallel transfer schedule table. If some
block arrives so early that it cannot fit in the table, its number is
added to farsched_list structure connected to the item in the
table, where it should fit later.
Throughput of a parallel transfer c with the poll-all driver should be the sum of achievable throughputs of individual transfers:
c = c1 + c2 + ... + cn,
where ci is throughput of i-th individual transfer and n is the number of individual transfers.
6.4 Socket buffer sizing
Proper sizes of sending and receiving socket buffers are important in order to achieve maximum possible throughput of a parallel transfer.
6.4.1 Individual TCP connection
In an individual TCP connection, socket buffers have two purposes:
-
to store data for stream reassembly when some packets are delayed or lost
-
to store data between the time when the application writes or reads it and when it can be sent to or received from the network
When a receiver detects missing packets, it continues to store further received data in its socket buffer and waits until missing data arrives. A sender must also store data already sent until acknowledgement is received. It takes one RTT until sender can detect a missing acknowledgement or until receiver can get retransmitted data. A NewReno type of TCP now commonly used tries to retransmit several missing packets in several consecutive RTT periods until all sent data is acknowledged or until sender retransmission timer expires. When a sequence of n consecutive data segments gets lost, this mechanism works when socket buffers can store the volume of data B = n×RTT×bw, where RTT is round-trip time and bw is achievable throughput. However, the problem is which RTT and bw should be taken in this calculation. When we exceed capacity of router buffers, we decrease throughput due to increased packet loss. An interesting analysis of socket buffer sizing in relation to the size of the buffer in a router before the bottleneck link was presented in [buffers].
Some operating systems separate a part of the receiving socket
buffer used to store data until the application can retrieve it from
the rest of the receiving socket buffer used for stream
reassembly. For example, in Linux the division of the receiving socket
buffer can be configured by app_window and
window_clamp sysctl variables. The motivation is that
applications usually retrieve data in chunks and the volume of unread
data in socket buffer therefore fluctuates. A separate part of buffer
then prevents these fluctuations to propagate to the sender in the
receiver window (rwin) of available buffer space announced by the
receiver. Linux also does several internal modifications to configured
buffer sizes, which should be considered when setting buffer sizes,
see Figure.
6.4.2 Parallel transfer
In a parallel transfer, socket buffers have an additional purpose:
-
to store data for reassembly from multiple connections in case that it arrives reordered due to different delay in individual transfers
Suppose that we have two individual transfers with different delay between the sender and receiver. Packets will be distributed on the sender into the two individual transfers in proportion of their achievable throughputs (when poll-all driver is used). However, packets arriving on a transfer with shorter delay will always have to wait for packets arriving on a transfer with longer delay for reassembly of the original stream that was split into individual transfers. This situation is illustrated in Figure.
The total delay consists from a) network delay on lines and in router buffers, b) delay in the sending socket buffer. The latter depends on the size of the sending socket buffer and the speed by which it is emptied to the network. When we do not provide receiver buffer space for this waiting, one individual connection will be blocked by another.
7 Psock one-path evaluation
In this section we demonstrate how psock performs over a single path between the sender and receiver.
To test psock behaviour we used setup shown in Figure with network path characteristics emulated by NIST Net [nistnet]. We configured link throughput to 100 Mb/s and RTT to 40 ms (roughly RTT across Europe). The path capacity was therefore
C = 100 Mb/s × 40 ms = 4 Mb = 500 kB.
We configured sender and receiver socket buffers to 0.2, 0.8, 1.0 and 1.5 multiple of C. These were actual buffer sizes after Linux internal arithmetics [linux-buffers]. We transfered 100 MB of data in each test run.
We used psock-testperf utility included in the psock
distribution to generate and receive traffic. We did memory to memory
transfers using the following commands:
# Start receiver on PC2 psock-testperf -recv -out mem -listen # Start sender on PC1 psock-testperf -xmit -in mem -len 100000000 -connect pc2
It is also possible to do a disk to disk test with psock-testperf utility:
# Start receiver on PC2 psock-testperf -recv -out file -file dumpfile -listen # Start sender on PC1 psock-testperf -xmit -in file -file sourcefile -connect pc2
7.1 One path, no losses
When we have one physical path between the sender and the receiver and the connection without losses, such as over a dedicated circuit, psock can still increase throughput in cases when we cannot configure socket buffers to cover the path capacity for some reason.
The measured throughput for different number of individual transfers is shown in Figure. The case of one individual transfer is equivalent to using a single TCP connection (not a parallel transfer). We can see that the parallel transfer increased throughput when the socket buffer size of an individual transfer was less than C. There was also some small increase in throughput of the parallel transfer for buffer sizes greater than C, because the cummulative congestion window (cwnd) of the parallel transfer approached sooner the path capacity in the slow start phase, but this is not important for transfers of high volumes of data.
7.2 One path, packet losses
When packet losses occur during a TCP connection, its congestion window is managed by TCP congestion avoidance algorithm. A parallel transfer has its cummulative congestion window increased at the speed of a sum of increases of congestion windows of individual transfers. Therefore, throughput of a parallel transfer increases with the number of individual transfers until it gets most of the available bandwidth or until it increases packet loss so much that injecting more traffic into the network actually decreases throughput.
To test psock behaviour we used the same testbed as in the previous example and we configured NIST Net to emulate packet loss of 10-5. The measured throughput is shown in Figure. We can see that increasing the number of individual transfers also significantly increased total throughput, until we reached about 90% of available bandwidth.
Figure 6: Throughput over one path with packet loss of 10-5.
8 Test of parallel transfer drivers
In this section we demonstrate how psock performs over multiple paths between the sender and receiver. We particularly show behaviour of different parallel transfer drivers, that is the methods of distributing data into individual transfers.
In order to use multiple physical paths, the psock-testperf utility must be directed to listen on multiple local interfaces or to connect to multiple remote IP addresses. An application can do this by calling setpsockopt() with PSOCK_ADD_LOCAL_ADDR:
# Start receiver on PC2 psock-testperf -recv -out mem -listen \ -max 2 -rcmd 2 -add_addr 10.0.0.2 -add_addr 10.0.1.2 # Start sender on PC1 psock-testperf -xmit -in mem -len 100000000 \ -max 2 -rcmd 2 -connect 10.0.0.2 -connect 10.0.1.2
8.1 NIST Net testbed
We first used NIST Net to emulate two links between the sender and receiver. The emulated bandwidth of the first link varied from 0.1 to 0.9 in 0.1 steps of the emulated bandwidth of the second link. The sum of bandwidth of both links was 100 Mb/s. Ideally, the parallel transfer should be able to utilize bandwidth of both links even when one link is slower than the other. The measured throughput is shown in Figure.
When the round-robin driver was used, throughput of each individual transfer was equal to the slowest individual transfer. In our case the total throughput was twice the throughput of the slower individual transfer. On the other hand, when the poll-all driver was used, the total throughput was independent of the ratio of bandwidth in each individual transfer and was constantly near the sum of the installed bandwidth at about 95 Mb/s.
8.2 Real network
We then tested psock over two distinct physical paths between the sender and receiver, which were located about 260 km apart in two cities. The setup is illustrated in Figure.
In the first test, one path had bandwidth of 155 Mb/s and the other path had bandwidth of 620 Mb/s. The bandwidth was configured by using different channels in a SONET circuit. RTT on an empty network was 3.1 ms.
We measured TCP throughput using iperf with socket buffer sizes set to slightly above the bandwidth * RTT product. The measured results can be summarized as follows:
-
TCP throughput over each of the links was 142 Mb/s and 567 Mb/s, respectively, which was close to the installed bandwidth of 155 Mb/s and 620 Mb/s.
-
TCP throughput of a parallel transfer using the round-robin driver was 282.6 Mb/s. As expected this was very close to double of throughput of the slower link, which was 2*142 Mb/s=284 Mb/s.
-
TCP throughput of a parallel transfer using the poll-all driver was 707.7 Mb/s. As expected this was very close to the sum of throughput of both links, which was 142 Mb/s+567 Mb/s=709 Mb/s.
In the second test, we used the same testbed as in the previous example, but we configured both links for the same bandwidth of 310 Mb/s. We measured TCP throughput of a parallel transfer with each of the drivers for 15 seconds in three scenarious, when we added another TCP stream to the network as cross-traffic as follows:
-
Cross-traffic stream added to link 1 from 5 to 10 seconds
-
Cross-traffic stream added to link 1 from 5 to 8 seconds and to link 2 from 8 to 11 seconds (the two cross-traffic streams did not overlap)
-
Cross-traffic stream added to link 1 from 5 to 8 seconds and to link 2 from 7 to 10 seconds (the two cross-traffic streams overlapped)
Measured throughput for both drivers and each of the above cases is shown in Figure through Figure. The figures also include throughput of cross-traffic streams at the bottom. You can see that the poll-all driver was able to keep significantly higher throughput of the parallel transfer than the round-robin driver, by utilizing available bandwidth of a less loaded link, with only little effect on cross-traffic streams. The throughput was also more stable with the poll-all driver.
Figure 10: Throughput over two real links, cross-traffic on both links
9 Code example
We present below a code example illustrating the use of psock library and its similarity to the standard BSD socket library interface. Note that most calls are the same except that the letter "p" is prepended. For clarity, we do not check returned values of called functions for errors, which should normally be done.
#define SERVADDR "127.0.0.1" /* server address */
#define SERVPORT 2323 /* server port */
/* Usage:
* run server: psock_sample (without arguments)
* run client: psock_sample c */
int main(int argc, char *argv[])
{
struct sockaddr_in servaddr; /* server address */
int ps; /* parallel socket descriptor*/
int streams; /* for setting stream counts */
/* Create a parallel socket. */
ps = psocket(AF_INET, SOCK_STREAM, IPPROTO_PSOCK);
/* Set server address. */
memset(&servaddr, 0, sizeof servaddr);
inet_aton(SERVADDR, &addr.sin_addr);
servaddr.sin_family = AF_INET;
servaddr.sin_port = htons(SERVPORT);
/* Set the default recommended data stream count. */
streams = 3;
psock_setopt(PSOCK_RCMD_STREAM_CNT, &streams, sizeof streams);
/* Set the default maximal data stream count. */
streams = 5;
psock_setopt(PSOCK_MAX_STREAM_CNT, &streams, sizeof streams);
if (argc <= 1 || argv[1][0] != 'c') {
/* Server */
int i, ps2;
struct pstream_int pi;
pbind(ps, (struct sockaddr*)&servaddr, sizeof servaddr);
servaddr.sin_port = htons(2324);
plisten(ps, 1);
/* Add the following 5 ports to be used partial data transfers. */
for (i = 1; i <= 5; i++) {
servaddr.sin_port = htons(SERVPORT + i);
setpsockopt(ps, SOL_PSOCK, PSOCK_ADD_LOCAL_ADDR,
&servaddr, sizeof servaddr);
}
ps2 = paccept(ps, NULL, NULL);
/* Set send socket buffer of the 1st partial data stream. */
pi.index = 0;
pi.value = 500000;
setpsockopt(ps2, SOL_PSOCK, PSOCK_SNDBUF, &pi, sizeof pi);
#define MSG "Hello there!\n"
r = psend(ps2, MSG, sizeof MSG - 1, 0));
psclose(ps2;
} else {
/* Client */
char buf[128];
int r;
pconnect(ps, (struct sockaddr*)&servaddr, sizeof servaddr);
precv(ps, buf, sizeof buf, 0));
/* Write received message to the output. */
write(1, buf, r);
psclose(ps2;
}
return 0;
}
10 Conclusion
We presented an architecture and implementation of a parallel
socket library called psock. This library allows easy use of parallel
data transfers in applications to increase throughput. In contrast to
other approaches to parallel data transfers, psock can use alternative
drivers to distribute data into individual transfers. A poll-all
driver provided as part of psock distribution can utilize different
and varying available bandwidth of individual transfers. Psock uses a
networking stack installed on end stations without modifications (a
standard TCP or some kind of a "fast TCP"), that is it does not
require kernel patching. Psock can be downloaded from
http://www.ces.net/project/qosip#psock.
11 References
References
| [fasttcp] | Yee-Ting Li, Douglas Leith, Robert N. Shorten. Experimental Evaluation of TCP Protocols for High-Speed Networks, Hamilton Institute, 2005. |
| [psockets] | H. Sivakumar, S. Bailey, R. L. Grossman. PSockets: The Case for Application-level Network Striping for Data Intensive Applications using High Speed Wide Area Networks, Supercomputing 2000, Dallas, Texas, USA. |
| [sctp] | R. Stewart et al. Stream Control Transmission Protocol, RFC2960, Internet Engineering Task Force, October 2000. |
| [gridftp] | W. Allcock et al. GridFTP: Protocol Extension for FTP for the Grid, GGF Document Series GFD.20, April 2003. |
| [ptcp] | Hung-Yun Hsieh, Raghupathy Sivakumar. pTCP: An End-to-End Transport Layer Protocol for Striped Connection, Proceedings of IEEE ICNP, 2002. |
| [buffers] | Manish Jain, Ravi S. Prasad, Constantinos Dovrolis. The TCP Bandwidth-Delay Product revisited: network buffering, cross traffic, and socket buffer auto-sizing, Technical Report, GIT-CERCS-03-02, Georgia Tech, 2003. |
| [nistnet] | Mark Carson, Darrin Santay. NIST Net - A Linux-based Network Emulation Tool, Computer Communication Review, June 2003. Available online. |
| [linux-buffers] | Sven Ubik, Pavel Cimbál. Achieving Reliable High Performance in LFNs, TNC 2003, Zagreb, Croatia. |