
Connection-less TCP

Patricia Gilfeather
Scalable Systems Lab
Department of Computer Science
University of New Mexico
pfeather@cs.unm.edu

Arthur B. Maccabe
Scalable Systems Lab
Department of Computer Science
University of New Mexico
maccabe@cs.unm.edu

Abstract

TCP is an important protocol in high-performance computing. It is used extensively in graphics programs and file systems, and it is often the protocol used for the cluster control mechanism. As the breadth of applications increases, the need for a scalable and efficient implementation of TCP becomes more important. In addition to other bottlenecks that must be alleviated, TCP connection management must be made scalable. This becomes critical as we consider offloading TCP processing onto TCP offload engines (TOEs) or intelligent network interface cards (iNICs). In this paper, we show how to take advantage of special characteristics of the high-performance computing environment and apply existing operating system mechanisms in a unique way to address some of the scalability concerns in offloaded TCP. Specifically, we implement methods for activating and deactivating TCP connections. These allow us to maintain a large store of open TCP connections without a large amount of storage overhead.

1. Introduction
Clusters are getting bigger. In clusters of hundreds of thousands of nodes, resource management of communication will become more critical. One aspect of this scalability bottleneck is the amount of memory necessary to maintain communication. This problem is severe when we consider offloading protocol processing onto different architectures like iNICs, TOEs and processors in memory (PIMs).

TCP/IP implementations and applications are widespread and therefore inherently appealing for use in cluster computing. Additionally, if TCP/IP can be made competitive with respect to performance attributes like latency and overhead, message-passing libraries like MPI can be implemented over TCP and cluster administrators can maintain fewer protocols. Finally, TCP/IP is well-maintained, well-tested, well-understood and highly interoperable.

One way to make TCP competitive with respect to latency and overhead for large clusters is to offload some protocol processing. However, offload engines can become very expensive. If a cluster designer wants to create a competitive large cluster using a TCP offload engine, the amount of memory devoted to connection management must remain small. Ideally, large-scale systems will only need to provide resources for a small working set of active connections.

This paper describes mechanisms to facilitate the working-set method of managing TCP connections. Our goal is to lessen the overhead of inactive connections by deactivating the heavy-weight socket and replacing it with a placeholder that allows the connection to be reactivated when it is needed. This decreases the amount of memory needed to maintain connections and facilitates offloading TCP communication processing because only a small working set of active communications is fully instantiated at any time.

One of the advantages of working with commodity protocols is that the Linux implementation of TCP has data structures and methods that can be leveraged to accomplish deactivation and reactivation of connections. Originally, minisocks and open requests were created to decrease resource usage for connections in the TIME_WAIT state of TCP and for connections in the process of being created. The former is essential for large-scale web servers in order to maintain protocol correctness during the connection shutdown process. The latter is used to survive denial-of-service attacks. In this paper, we show how to modify these existing data structures and methods to create the deactivation and reactivation methods that drastically decrease the memory requirements for TCP in large-scale clusters. This is accomplished by creating a small working set of active connections.
Supported by the Los Alamos Computer Science Institute (SC R71700H-29200001) and the Albuquerque High Performance Computing Center through an IBM SUR grant.

The first part of this paper reviews TCP connection management and offloading TCP. Next, we measure the resource constraints associated with TCP offload. Then, we outline the modifications we implemented to deactivate and reactivate sockets and present the results. Finally, we discuss future plans for further addressing TCP/IP bottlenecks.

2. TCP Working Sets


One problem with TCP is the question of connection state. Applications or libraries must make decisions about how best to allocate and maintain TCP connections. There are three options: 1) open a TCP connection when it is needed and close it again after the message is sent; 2) open a TCP connection when it is needed and keep the connection open in case it is needed again; 3) open all possible TCP connections at application launch and use them as needed. There are inefficiencies associated with each of these three methods for handling TCP connection state. Option one, opening and closing a connection with each message, is the most inefficient: the cost of opening a connection is incurred with each message, which adds unacceptable latency. Option two is often used now. However, as we move either into larger clusters or onto NICs or PIMs that do not have large memories, this method will suffer from resource overcommitment pressure. Option three, opening all connections at application load, makes application startup drastically slower; also, connections may be created that are never used. We will review the process of opening and closing a connection in TCP in order to more fully understand the costs associated with each of the three methods for handling TCP connection state discussed above. Furthermore, this review will provide background needed to explain the methods and data structures we used to deactivate and reactivate connections.
Figure 1. Three-way handshake of TCP connection startup

Figure 2. Close of a TCP connection

2.1. TCP Connection Establishment


When a sender wants to send data to a receiver, the sender must establish a TCP connection with the receiver. As illustrated in Figure 1, TCP connection establishment occurs in three steps. Data is not allowed to be sent with the SYN message or with the SYN-ACK message. This means that the cost of connection startup between two hosts is a full round-trip time. Clearly, the round-trip cost of connection startup should be paid only once, as it is a high-latency operation. A security concern associated with opening a connection, the SYN flood, was the motivation behind the open_request data structure used in Linux to hold the place of a potential fully-instantiated connection. We discuss this in detail below, as well as how we modified it to create reactivation of sockets.
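Since connect() on a blocking socket returns only after the SYN / SYN-ACK exchange completes, the round-trip cost of connection startup is directly observable from user space. The following is a minimal sketch; the address and port are placeholders, not values from our experiments.

    /* Time the three-way handshake from user space: connect() on a
     * blocking TCP socket returns only once the handshake completes,
     * so the elapsed time approximates one round trip. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in dst = { 0 };
        struct timeval t0, t1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        dst.sin_family = AF_INET;
        dst.sin_port = htons(5000);                        /* placeholder port */
        inet_pton(AF_INET, "192.168.1.2", &dst.sin_addr);  /* placeholder host */

        gettimeofday(&t0, NULL);
        if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
            return 1;
        gettimeofday(&t1, NULL);

        printf("handshake: %ld usec\n",
               (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec));
        close(fd);
        return 0;
    }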

2.2. TCP Connection Close


Figure 2 shows a TCP connection close. Source sends a message with the FIN flag set in the TCP header. The connection on Source must remain in the TIME_WAIT state until there is no possibility that a message intended for this connection can be received. This wait is called the 2MSL wait and is equal to twice the maximum segment lifetime (MSL) of a segment. The MSL is implementation specific and can be as short as 30 seconds or as long as about 2 minutes. The MSL of Linux is 30 seconds. Because the initiator of a close must remain in the TIME_WAIT state for the full 2MSL (60 seconds on Linux), clusters must account for this. Linux uses minisocks as placeholders to alleviate the memory costs associated with the TIME_WAIT state. We explain these in detail below, as well as how we modify minisocks to create deactivated sockets.
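For reference, the 2MSL wait appears in 2.4/2.6-era Linux sources as a compile-time constant in include/net/tcp.h (the comment wording varies slightly by kernel version):

    /* How long to hold TIME_WAIT state before destroying it:
     * 60 seconds, i.e. twice the 30-second MSL noted above. */
    #define TCP_TIMEWAIT_LEN (60 * HZ)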

2.3. Connection Working Sets


Ideally, an application or library would be able to maintain a working set of currently active TCP connections. For example, after a period of inactivity, a connection between two hosts is deactivated and only a small amount of state is stored. When the host sends a message on the inactive connection, that connection is reactivated into a fully instantiated socket without paying the cost of a TCP three-way handshake. Connection working sets are especially powerful in environments with limited memory, where the amount of memory available can significantly reduce the number of active connections that can be supported. Regardless of how connections are opened (at the beginning of an application or on first use) and regardless of how long they live, there will only be a small set of fully instantiated, active TCP connections. If the working set is small, the amount of memory needed to maintain the state of each connection will be manageable for TCP implementations that are offloaded onto commodity NICs or TOEs with a limited amount of memory. A sketch of such a policy appears below.
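The following is a minimal library-level sketch of a working-set policy, assuming the tcp_deactivate_connection() system call introduced in Section 4.3. The syscall number is hypothetical, and reactivation needs no explicit call at all: the next message on the connection reconstitutes the full socket in the kernel.

    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    #define __NR_tcp_deactivate 333   /* hypothetical syscall number */
    #define MAX_CONNS  1024
    #define IDLE_LIMIT 5              /* seconds of idleness before deactivation */

    struct conn {
        int    fd;
        time_t last_used;
        int    active;                /* fully instantiated socket? */
    };

    static struct conn conns[MAX_CONNS];

    /* Called periodically, or before opening a new connection under
     * memory pressure, to shrink the active working set. Deactivated
     * connections keep their descriptor; the next send or arriving
     * message reactivates the connection in-kernel. */
    static void shrink_working_set(void)
    {
        time_t now = time(NULL);

        for (int i = 0; i < MAX_CONNS; i++) {
            if (conns[i].active && now - conns[i].last_used > IDLE_LIMIT) {
                syscall(__NR_tcp_deactivate, conns[i].fd);
                conns[i].active = 0;
            }
        }
    }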

3. Offloading TCP
There is a great deal of work being done to offload TCP either onto iNICs or TOEs. The most well-known work concerns offloaded TCP as an end-system optimization [5]. Our research has shown that TCP latency can be reduced by as much as 40% when parts of the TCP stack are offloaded [6]. Also, offloading all or part of the TCP stack decreases the overhead associated with communication processing. This decrease is due to reduced memory copy overhead [5] and interrupt pressure overhead [8, 3], and to the offloading of the communication progress thread, which makes an event-driven model easy to implement [9].

Figure 3 shows the number of open connections possible as the number of 4KB pages of memory allocated for the TCP stack decreases. As memory becomes more limited, the number of possible active connections is reduced. We measured the growth of memory with respect to the number of open connections for the Linux 2.4.25 TCP stack. First, we limited the memory associated with TCP connections by modifying the tcp_mem proc filesystem file. This interface allows us to place a maximum memory limit, in terms of pages, on the TCP stack. We measured the memory overhead associated with a fully instantiated socket without attached data buffers.

Protocol offloading exacerbates these issues by moving connection management onto a memory-limited resource. Current iNICs have a memory of about 2MB [1]. While there are iNICs now that have up to 4GB of memory, we are interested in the commodity NIC market, where memory will continue to be a constrained resource. Figure 3 shows that if there were no other firmware and no data buffers, the maximum number of offloaded TCP connections on a 2MB (512 4KB-page) iNIC would be less than 1000. In fact, Linux will not run with TCP stacks of less than about 3000 pages. Offloaded TCP onto commodity NICs using traditional stacks is not scalable.
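For reproducibility, the tcp_mem limit can be set through the standard sysctl file, which holds three thresholds (low, pressure, high) in pages; pinning all three to 512 pages enforces the 2MB ceiling used in the iNIC comparison. A minimal sketch (requires root):

    /* Cap the memory available to the TCP stack via the tcp_mem
     * sysctl. The three values are the low, pressure, and high
     * thresholds in 4KB pages; 512 pages = 2MB. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/net/ipv4/tcp_mem", "w");
        if (!f)
            return 1;
        fprintf(f, "512 512 512\n");
        fclose(f);
        return 0;
    }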

Figure 3. Active connections versus available memory (number of open connections as a function of the size of tcp_mem)

4. Connection-less TCP
The WAN world has been able to reduce the resource pressure associated with connection management by decreasing the amount of memory needed for a TCP connection during startup and during tear-down. Can we use these techniques to reduce the memory footprint for a socket during its lifetime?

4.1. Characteristics of High-Performance Networks


There has been a great deal of research in the WAN world on decreasing the cost of connection setup in TCP. HTTP 1.1 made some web-serving traffic connection-oriented by default. Also, methods of caching and sharing connection information in order to avoid connection startup costs have been proposed [14, 10, 13, 2, 4]. All of this research concentrates on working around the inconsistencies of large-scale heterogeneous networks. Specifically, routes change, so RTT estimations cannot be cached for long, and congestion is highly variable, so the congestion window must be regularly recalculated. Large-scale clusters, however, are generally homogeneous with static routing. This allows us to reuse routes and RTT estimations. Because we are in a high-performance networking environment, we are able to move sockets into and out of an inactive state without paying a performance penalty due to stale flow control information. The result is a connection-less version of TCP.

4.2. Connection Management in Linux TCP


Because we are working with a commodity protocol in a commodity operating system, it is important to show that we can reuse methods and data structures created by the community at large. We use data structures created to protect servers against time-wait loading to deactivate connections and maintain a placeholder, and we use methods created to protect servers against denial-of-service attacks to reactivate the connections. We introduce the mechanisms used by the Linux implementation of TCP to protect resources in two common scenarios for high-traffic web servers, a denial-of-service attack and time-wait loading. Then we explain the modifications we made to protect resources in a common scenario for high-performance clusters, offloading.

4.2.1. Denial of Service

One of the most common denial-of-service attacks on the Web is a SYN-flooding attack. In this attack, a source floods a destination with SYN messages. The destination opens thousands of connections and sends SYN-ACK messages back. Since the SYN-ACK messages are ignored, thousands of half-open connections must time out. Any definitive protection against SYN-flooding must occur at the routers, since some resources will always need to be allocated at the server when handling an open connection request [11]. However, the most successful defense against a denial-of-service attack is still to simply survive it. This is the reasoning behind one of the earliest defenses the Linux stack implemented, the open_request data structure. When the Linux TCP stack receives a SYN request, instead of opening an entire socket, the stack creates a smaller data structure with just enough information to eventually open a large socket if the three-way handshake completes. This smaller data structure is called an open_request. Open requests are held in a separate hash table. During a SYN-flood the hash table of established connections remains small. This allows legitimate connections to remain open as long as they don't time out during the flood. Additionally, the size of an open_request is approximately 64 bytes, whereas a fully-instantiated socket is 832 bytes. The memory savings are significant.

Syncookies are a further defense against SYN flooding; they remove the need even to create the open_request data structure. A web server generates a cookie based on the IP address, port number, write sequence number and MSS of the client and uses the cookie as the sequence number in the acknowledgment of the SYN. Upon receipt of the final ACK of the handshake, the server creates the open_request data structure from the listening socket, the cookie and the final acknowledgment. With the listening socket, the open_request structure, the incoming message, and the route table entry, a new socket can be created (a simplified sketch of the cookie computation appears at the end of this section).

4.2.2. Time-Wait Loading

Originally, clients were expected to actively close connections. It was assumed that small clients with few resource constraints would do most of the waiting in the TIME_WAIT state [12]. However, because of the still widely-used HTTP 1.0 protocol (which is message oriented), web servers actually initiate most of the active closes. In the HTTP 1.0 protocol, the client issues a GET message, the server sends the requested information, and the interaction ends there. The server is forced to close the connection that was made and therefore must provide large amounts of resources for maintaining connections in TIME_WAIT [7]. In order to lighten the memory load of time-wait on high-traffic web servers, minisocks were introduced into the Linux 2.4.0 TCP stack. The data structure associated with minisocks is called a tcp_tw_bucket. Because it is supposedly possible to move from a closing state back to an established state, the original large socket is not destroyed as long as there is a context to it. If an application closes or dies without closing its sockets, then the sockets are orphaned. Orphaned sockets must also go through the TIME_WAIT state, so tcp_tw_buckets must be created and ports must remain bound; but because there is no context with which to re-establish the connection, the large socket can be destroyed. Additionally, tcp_tw_buckets are hashed into a separate hash table. This is ostensibly done to keep the established-connection hash table from growing very large during time-wait loading, since it was assumed that a small established-connection hash table would keep demultiplexing latency low. Generally, web servers use individual threads to service a request. When the thread dies, the large socket becomes orphaned and only the tcp_tw_bucket remains. In this way high-traffic web servers are able to avoid the heavy load of fully-instantiated sockets being held during TIME_WAIT. Again, the savings are substantial: a tcp_tw_bucket occupies 96 bytes versus 832 bytes for a full socket.
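The following toy sketch shows only the shape of the syncookie computation described in Section 4.2.1. The real Linux implementation uses keyed cryptographic hashes and encodes a timestamp; none of the constants or the hash function below come from the actual code.

    /* Illustration of the syncookie idea: fold the connection identity
     * and the client's MSS into the server's initial sequence number,
     * so no per-connection state is needed until the final ACK. */
    #include <stdint.h>

    static uint32_t mix(uint32_t x)   /* toy hash, not cryptographic */
    {
        x ^= x >> 16; x *= 0x45d9f3b; x ^= x >> 16;
        return x;
    }

    uint32_t make_cookie(uint32_t saddr, uint32_t daddr,
                         uint16_t sport, uint16_t dport,
                         uint32_t client_isn, uint16_t mss_index,
                         uint32_t secret)
    {
        uint32_t h = mix(saddr ^ secret) ^ mix(daddr) ^
                     mix(((uint32_t)sport << 16) | dport);

        /* The low bits carry an encoded MSS so the server can recover
         * it from the final ACK when rebuilding the open_request. */
        return ((h + client_isn) & ~0x7u) | (mss_index & 0x7u);
    }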

4.3. Deactivating and Reactivating Connections


The methods and data structures associated with tcp_tw_buckets and syncookies were created by the general networking community to address memory-constrained situations in the normal TCP stack. We can leverage these methods and data structures to address similar memory constraints in high-performance computing.

4.3.1. Deactivation

Deactivating a connection simply means putting the connection in the time-wait state. There are, however, some additions that must be made to the timewait bucket structure. These additions are necessary in order to reconstitute the route information when the connection is reactivated. The additional overhead is: 4 bytes for the connection flags, 20 bytes for the IP options, and 4 bytes for the pointer to the route table entry. Additionally, we added a new state allowed for the tw_substate field. The INACTIVE state is set so that we may later determine that a new message is valid for this tcp_tw_bucket. We attempted to reuse the tcp_time_wait function to create the modified timewait bucket, but were unable to reuse it because we needed to initialize the extra data fields and because we needed to move the socket into the closed state so that the process of reclaiming memory on orphaned timewait sockets would proceed directly rather than waiting until the thread is closed. Because we determined that latency will not be affected by the size of the established-connection hash table, we chose not to move the time-wait bucket into the timewait hash. This saves us the cost of rehashing the connection on reactivation and will not cost us extra latency during demultiplexing. Finally, we wrap the tcp_deactivate call in a system call, tcp_deactivate_connection. This system call resolves the socket descriptor to the socket and inode, which are the wrappers for the sock data structure that has been replaced by the timewait bucket. Another implementation could allow the kernel to decide when to deactivate a connection, but we feel that the application or library using TCP may have better information about deactivation. Therefore, the differences between tcp_time_wait and tcp_deactivate are: the close on the fully-instantiated socket, which removes the context from the sock data structure and allows the tcp_done method to free the socket memory; the initialization of the added fields discussed above; and the removal of the hashing and scheduling of timers.

4.3.2. Reactivation

We leverage the methods used by the syncookies mechanism to reactivate a connection upon receipt of a message. First, a message follows the path of a message bound for a socket in the time-wait state. We added a check on the tw_substate field at the beginning of the tcp_timewait_state_process function. If the connection is inactive, TCP_TW_REACTIVATE is returned to the main receive process. We pass the timewait bucket and the incoming message to a process modeled after the cookie_v4_check method. An open_request data structure is created. The open_request, the pointer to the route table entry, the timewait bucket and the incoming message are all sent to the tcp_v4_syn_recv_sock method. We cannot fully reuse this method because we are not rehashing the newly constituted socket and because the call to create the child socket is different. Otherwise, this method is identical. The call that allocates the socket from the timewait bucket is tcp_create_timewait_child. It is modeled after the tcp_create_openreq_child method. There are some substantial differences between these two methods. The tcp_create_openreq_child method copies all data and initializes the new socket with data from the listening socket. We do not have that socket; we initialize the newly allocated socket from a mixture of timewait bucket and static data. A sketch of the modified structures and the reactivation check appears below.
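The sketch below summarizes these changes. The field and constant names are ours, and the stand-in type definitions replace the real kernel headers; only the field sizes, the INACTIVE substate, and the TCP_TW_REACTIVATE result come from the description above (the numeric values are assumed).

    #include <stdint.h>

    /* Stand-ins for kernel types so the sketch is self-contained; in
     * the real patch these come from <net/tcp.h> and <net/route.h>. */
    struct tcp_tw_bucket { int tw_substate; /* ... 96 bytes total ... */ };
    struct rtable;

    #define TCP_TW_INACTIVE   0x10  /* new tw_substate value (assumed) */
    #define TCP_TW_REACTIVATE 4     /* new result code (value assumed) */

    /* Deactivated-connection placeholder: the standard minisock plus
     * the state needed to reconstitute the route on reactivation. */
    struct tcp_tw_bucket_ext {
        struct tcp_tw_bucket tw;    /* standard 96-byte minisock */
        uint32_t conn_flags;        /* 4 bytes: saved connection flags */
        uint8_t  ip_options[20];    /* 20 bytes: saved IP options */
        struct rtable *dst_route;   /* 4 bytes: route table entry */
    };

    /* Check added at the top of tcp_timewait_state_process(): a segment
     * arriving for an INACTIVE bucket triggers reactivation rather than
     * normal TIME_WAIT processing. */
    int tw_check_inactive(const struct tcp_tw_bucket_ext *tw)
    {
        if (tw->tw.tw_substate == TCP_TW_INACTIVE)
            return TCP_TW_REACTIVATE;
        return 0;
    }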

5. Results
All measurements were made using a modified version of Linux on one host and an unmodified version of the same kernel on the other host. The tests were performed on 993MHz Pentium IIIs with 1Gb/s AceNIC Ethernet cards connected directly with a crossover cable. The server code simply listens on a well-known port and accepts requests as they are received.

5.1. Memory Usage


The client code (running on the modified Linux 2.4.25) first reads /proc/slabinfo in order to get a baseline measurement of cache usage. When the control test is run, the client code loops, creating a socket, connecting it, and reading /proc/slabinfo. During the deactivated run, the client loops, creating a socket, connecting it, deactivating it, and reading /proc/slabinfo. The memory measurements are produced by multiplying the number of tcp_tw_buckets by the size of a tcp_tw_bucket and adding the product of the number of sockets and their size. The baseline values are then subtracted to give a relative measure of memory used per socket. In addition to showing memory usage per socket for regular sockets (called connected sockets), we created a system call that either simply puts a socket in time-wait or orphans a socket and puts it in time-wait. Figure 4 shows memory use for connected sockets, sockets in the timewait state, orphaned sockets in the timewait state, and deactivated sockets. A sketch of the accounting appears below.
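The following sketch shows this accounting, assuming 2.4-era slab cache names ("sock" and "tcp_tw_bucket") and the 2.4 slabinfo column layout (name, active objects, total objects, object size); both vary across kernel versions.

    #include <stdio.h>
    #include <string.h>

    /* Returns active objects * object size (bytes) for one slab cache,
     * or -1 if the cache is not found. */
    static long slab_bytes(const char *cache)
    {
        FILE *f = fopen("/proc/slabinfo", "r");
        char line[256], name[64];
        long active, total, objsize, bytes = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f)) {
            if (sscanf(line, "%63s %ld %ld %ld",
                       name, &active, &total, &objsize) == 4 &&
                strcmp(name, cache) == 0) {
                bytes = active * objsize;
                break;
            }
        }
        fclose(f);
        return bytes;
    }

    int main(void)
    {
        /* Per-run memory = sockets + timewait buckets; the experiment
         * subtracts a baseline taken before the loop (omitted here). */
        long total = slab_bytes("sock") + slab_bytes("tcp_tw_bucket");
        printf("TCP connection memory: %ld bytes\n", total);
        return 0;
    }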

Figure 4. Memory used as a function of the number of active sockets (time-wait, connected, deactivated, and time-wait orphan)

5.2. Demultiplexing Latency


Minisocks were originally introduced into the Linux kernel in order to move sockets that were in the time-wait state out of the main hash table. The idea was that a smaller hash table would decrease the time it took to demultiplex established connections. The established hash table and the timewait hash table occupy the top half and the bottom half of the connection hash table. We wanted to test the hypothesis that the smaller hash table really does decrease demultiplexing latency. To test the latency of a connection as a function of the number of active connections, we first opened x connections and measured the ping-pong latency of the first connection made. This measures the demultiplexing speed of the active-connection hash table; the timewait portion of the hash table is empty. Second, we measured the ping-pong latency of the first connection made when all other connections were moved into the timewait state as soon as they were opened. This measures the demultiplexing speed when the timewait hash table is full. Finally, we measured the ping-pong latency of the first connection when all other connections were moved into time-wait and orphaned. This measures the demultiplexing speed with an empty active-connection hash table and an empty timewait hash table. Figure 5 shows the demultiplexing latency of the Linux 2.6.9 implementation of the TCP stack for the above configurations. The ping-pong measurement is sketched below.
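The ping-pong timing used in these tests takes the following shape; fd is an already-connected socket, the peer is assumed to echo each message back, and the 8-byte message size matches Figure 5.

    #include <sys/time.h>
    #include <unistd.h>

    #define ROUNDS 10000

    /* Average round-trip latency of a small message, in microseconds. */
    double pingpong_usec(int fd)
    {
        char buf[8] = { 0 };
        struct timeval t0, t1;

        gettimeofday(&t0, NULL);
        for (int i = 0; i < ROUNDS; i++) {
            write(fd, buf, sizeof(buf));   /* ping */
            read(fd, buf, sizeof(buf));    /* pong (peer echoes) */
        }
        gettimeofday(&t1, NULL);

        return ((t1.tv_sec - t0.tv_sec) * 1e6 +
                (t1.tv_usec - t0.tv_usec)) / ROUNDS;
    }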

5.3. Reactivation Latency


Reactivation is only a legitimate method of opening a connection if it is faster than opening a connection from scratch. Figure 6 shows the start-up latency of a single connection on Linux 2.6.9 for active connections, deactivated connections and closed connections. We looped through the process of opening a connection, sending and receiving a message, and closing the connection. Next, we repeated the experiment, but instead of opening and closing the connection, we simply deactivated and reactivated the connection. Finally, we measured the latency of a connection that remains open. The two loop variants are sketched below.
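The sketch below shows the two loop variants, reusing the hypothetical syscall number from the earlier working-set sketch; connect_to_server() is a stub standing in for the usual socket()/connect() sequence.

    #include <sys/syscall.h>
    #include <unistd.h>

    #define __NR_tcp_deactivate 333        /* hypothetical syscall number */

    extern int connect_to_server(void);    /* stub: opens a TCP connection */

    /* Variant 1: pay the full three-way handshake on every message. */
    void run_open_close(char *msg, int len, int rounds)
    {
        for (int i = 0; i < rounds; i++) {
            int fd = connect_to_server();
            write(fd, msg, len);
            read(fd, msg, len);
            close(fd);
        }
    }

    /* Variant 2: keep a placeholder between messages; the first send on
     * a deactivated connection reactivates it in-kernel. */
    void run_deactivate(int fd, char *msg, int len, int rounds)
    {
        for (int i = 0; i < rounds; i++) {
            write(fd, msg, len);
            read(fd, msg, len);
            syscall(__NR_tcp_deactivate, fd);
        }
    }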

Figure 5. Demultiplexing latency of an 8-byte message as a function of the number of connections (open, timewait, and timewait/orphan)

Figure 6. Ping-pong latency as a function of message size (open, open/close, and deactivate)

6. Discussion
As we see in Figure 4, deactivated sockets save substantial memory. Because sockets with a context are not released, the time-wait run shows both the memory overhead of the sockets and the memory overhead of the tcp_tw_buckets. The deactivated sockets require slightly more memory than the orphaned, time-wait sockets because of the additional state necessary to reconstitute the socket. The traditional Linux stack only allows around 2000 connections as measured by the slab cache information. If we use deactivated sockets, we increase the number of connections allowed in 2MB of memory to over 20,000. This is a ten-fold increase in scalability. Figure 5 shows no significant correlation between demultiplexing speed and the number of connections populating the hash table. The speed appears to be constant. No differences are found between the various methods for storing inactive sockets. These findings are significant because they call into question the reasoning behind the use of minisocks in the standard Linux kernel. The timewait-and-orphan measurement shows the performance of standard Linux servers. There is no latency advantage; the only significant advantage of minisocks is the memory savings. Furthermore, there is no reason to remove deactivated sockets from the established-connection hash table. As we see in Figure 6, reactivation moderately increases startup latency over leaving connections open. On the other hand, reactivation decreases latency by approximately 40% for the first arriving packet of a message compared to closing and reopening a connection. When memory pressure requires space-saving methods, reactivation is clearly a viable method. In addition, note that this latency cost is only paid on the packet that initiates reactivation. The most significant result of these experiments is that there are methods that allow full offloading of TCP processing for large clusters using commodity NICs with limited memory resources. We drastically decreased the memory used for inactive sockets while only moderately increasing the latency of the reactivation message.

7. Future Work

Deactivation and reactivation of sockets is a good beginning in our research to decrease latency and increase scalability of TCP on large clusters. First, we must streamline the reactivation process in an attempt to decrease the latency cost. The next step is to use this mechanism to offload a small subset of highly latency-sensitive connections to a NIC to allow for polling and direct access without interrupts or memory copies.

8. Conclusions

We were able to drastically reduce memory usage for open TCP connections, thus increasing the scalability of TCP, especially for offloaded TCP. The cost in latency of the first arriving packet on an inactive connection is significant, but it is much lower than the cost of closing and re-opening the connection. We were able to reuse mechanisms created in other areas of network research that reduce resource commitment for communication. By making small modifications to an existing operating system, we were able to drastically reduce resource usage. This is the great advantage of working with commodity protocols: often a good deal of research has already been implemented. Commodity protocols, especially TCP/IP, will always be an important part of cluster computing. Certainly, as more scientific fields come to rely on high-performance computation, there will be more rather than less of a dependency on this interoperable, easy-to-program protocol. We must push TCP to the limits of its performance and efficiency with respect to high-performance environments and commodity hardware, as we cannot expect commodity protocols or components to disappear. Deactivation is one way of rethinking the TCP stack in the context of high-performance computing. It begins to address a key problem with TCP: scalability of state management for TCP implementations that are offloaded onto commodity hardware.

References
[1] AceNIC Gigabit Ethernet for Linux. Web: http://jes.home.cern.ch/jes/gige/acenic.html, August 2001.
[2] M. Allman, S. Floyd, and C. Partridge. RFC 2414: Increasing TCP's initial window, September 1998. Status: EXPERIMENTAL.
[3] A. Barak, I. Gilderman, and I. Metrik. Performance of the communication layers of TCP/IP with the Myrinet gigabit LAN. Computer Communications, 22(11), July 1999.
[4] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz. TCP behavior of a busy Internet server: Analysis and improvements. In IEEE INFOCOM, March 1998.
[5] J. Chase, A. Gallatin, and K. Yocum. End-system optimizations for high-speed TCP. In IEEE Communications, special issue on TCP Performance in Future Networking Environments, volume 39, 2000.
[6] B. Duncan. Splinter TCP to decrease small message latency in high-performance computing. Technical Report TR-CS2003-27, University of New Mexico, 2003.
[7] T. Faber, J. Touch, and W. Yue. The TIME-WAIT state in TCP and its effect on busy servers, 1999.
[8] P. Gilfeather and T. Underwood. Fragmentation and high performance IP. In Proc. of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[9] S. Majumder and S. Rixner. Comparing Ethernet and Myrinet for MPI communication. In Proc. of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, October 2004.
[10] V. Padmanabhan and R. Katz. TCP Fast Start: A technique for speeding up web transfers. In IEEE Globecom '98 Internet Mini-Conference, November 1998.
[11] C. L. Schuba, I. V. Krsul, M. G. Kuhn, E. H. Spafford, A. Sundaram, and D. Zamboni. Analysis of a denial of service attack on TCP. In Proceedings of the 1997 IEEE Symposium on Security and Privacy, pages 208-223. IEEE Computer Society Press, May 1997.
[12] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, MA, 1994.
[13] J. Touch. RFC 2140: TCP control block interdependence, April 1997. Status: EXPERIMENTAL.
[14] Y. Zhang, L. Qiu, and S. Keshav. Optimizing TCP start-up performance. Technical Report TR99-1731, Cornell University, 1999.
