Patricia Gilfeather
Scalable Systems Lab
Department of Computer Science
University of New Mexico
pfeather@cs.unm.edu

Arthur B. Maccabe
Scalable Systems Lab
Department of Computer Science
University of New Mexico
maccabe@cs.unm.edu

Abstract

TCP is an important protocol in high-performance computing. It is used extensively in graphics programs and file systems, and it is often the protocol used for the cluster control mechanism. As the breadth of applications increases, the need for a scalable and efficient implementation of TCP becomes more important. In addition to other bottlenecks that must be alleviated, TCP connection management must be made scalable. This becomes critical as we consider offloading TCP processing onto TCP offload engines (TOEs) or intelligent network interface cards (iNICs). In this paper, we show how to take advantage of special characteristics of the high-performance computing environment and apply existing operating system mechanisms in a unique way to address some of the scalability concerns in offloaded TCP. Specifically, we implement methods for activating and deactivating TCP connections. These allow us to maintain a large store of open TCP connections without a large amount of storage overhead.

1. Introduction

Clusters are getting bigger. In clusters of hundreds of thousands of nodes, resource management of communication will become more critical. One aspect of this scalability bottleneck is the amount of memory necessary to maintain communication. The problem is severe when we consider offloading protocol processing onto different architectures such as iNICs, TOEs and processors in memory (PIMs). TCP/IP implementations and applications are widespread and therefore inherently appealing for use in cluster computing. Additionally, if TCP/IP can be made competitive with respect to performance attributes like latency and overhead, message-passing libraries like MPI can be implemented over TCP, and cluster administrators can maintain fewer protocols. Finally, TCP/IP is well-maintained, well-tested, well-understood and highly interoperable.

One way to make TCP competitive with respect to latency and overhead for large clusters is to offload some protocol processing. However, offload engines can become very expensive. If a cluster designer wants to create a competitive large cluster using a TCP offload engine, the amount of memory devoted to connection management must remain small. Ideally, large-scale systems will only need to provide resources for a small working set of active connections.

This paper describes mechanisms to facilitate the working-set method of managing TCP connections. Our goal is to lessen the overhead of inactive connections by deactivating the heavy-weight socket and replacing it with a placeholder that allows the connection to be reactivated when it is needed. This decreases the amount of memory needed to maintain connections and facilitates offloading TCP communication processing, because only a small working set of active connections is fully instantiated at any time.

One of the advantages of working with commodity protocols is that the Linux implementation of TCP has data structures and methods that can be leveraged to accomplish deactivation and reactivation of connections. Originally, minisocks and open requests were created to decrease resource usage for connections in the time-wait state of TCP and for connections in the process of being created. The former is essential for large-scale web servers in order to maintain protocol correctness during the connection shutdown process. The latter is used to survive denial-of-service attacks.

In this paper, we show how to modify these existing data structures and methods to create the deactivation and reactivation methods that drastically decrease the memory requirements for TCP in large-scale clusters. This is accomplished by creating a small working set of active connections.

The first part of this paper reviews TCP connection management and offloading TCP. Next, we measure the resource constraints associated with TCP offload. Then, we outline the modifications we implemented to deactivate and reactivate sockets, and present the results. Finally, we discuss future plans for further addressing TCP/IP bottlenecks.

This work was supported by the Los Alamos Computer Science Institute (SC R71700H-29200001) and the Albuquerque High Performance Computing Center through IBM SUR.
(Figure: TCP connection shutdown. The ACK seq# U+1, FIN seq# V, and ACK seq# V+1 exchange moves the endpoints through the FIN_WAIT_2, CLOSE_WAIT, and TIME_WAIT states before the socket is closed.)
Because sockets must remain in the time-wait state for at least 30 seconds, clusters must account for this cost. Linux uses minisocks as placeholders to alleviate the memory costs associated with the time-wait state. We explain these in detail below, as well as how we modify minisocks to create deactivated sockets.
3. Offloading TCP

There is a great deal of work being done to offload TCP either onto iNICs or TOEs. The most well-known work on offloaded TCP is [5]. Our research has shown that TCP latency can be reduced by as much as 40% when parts of the TCP stack are offloaded[6]. Also, offloading all or part of the TCP stack decreases overhead associated with communication processing. This decrease is due to reduced memory copy overhead[5], reduced interrupt pressure[8, 3], and to the offloading of the communication progress thread, which makes an event-driven model easy to implement[9].

Figure 3 shows the number of open connections possible as the number of 4Kb pages of memory allocated for the TCP stack decreases. As memory becomes more limited, the number of possible active connections is reduced. We measured the growth of memory with respect to the number of open connections for the Linux 2.4.25 TCP stack. First, we limited the memory associated with TCP connections by modifying the tcp_mem proc filesystem file. This interface allows us to place a maximum memory limit, in terms of pages, on the TCP stack. We measured the memory overhead associated with a fully instantiated socket without attached data buffers.

Protocol offloading exacerbates these issues by moving connection management onto a memory-limited resource. Current iNICs have a memory of about 2MB [1]. While there are iNICs now that have up to 4GB of memory, we are interested in the commodity NIC market, where memory will continue to be a constrained resource. Figure 3 shows that if there were no other firmware and no data buffers, the maximum number of offloaded TCP connections on a 2MB (512 4Kb-page) iNIC would be less than 1000. In fact, Linux will not run with TCP stacks of less than about 3000 pages. Offloading TCP onto commodity NICs using traditional stacks is not scalable.

(Figure 3: number of open connections versus size of tcp_mem.)
4. Connection-less TCP
The WAN world has been able to reduce the resource pressure associated with connection management by decreasing the amount of memory needed for a TCP connection during startup and during tear-down. Can we use these techniques to reduce the memory footprint for a socket during its lifetime?
Several mechanisms for reducing TCP startup costs have been proposed[14, 10, 13, 2, 4]. All of this research concentrates on working around the inconsistencies of large-scale heterogeneous networks. Specifically, routes change, so RTT estimations cannot be cached for long, and congestion is highly variable, so the congestion window must be regularly re-calculated. Large-scale clusters, however, are generally homogeneous with static routing. This allows us to reuse routes and RTT estimations. Because we are in a high-performance networking environment, we are able to move sockets into and out of an inactive state without paying a performance penalty due to stale flow-control information. The result is a connection-less version of TCP.
Open requests thus allow a listening socket to survive a SYN flood. Additionally, the size of an open request is approximately 64 bytes, whereas a fully-instantiated socket is 832 bytes. The memory savings are significant.

Syncookies are a further defense against SYN flooding. They remove the need to even create the open request data structure. A web server generates a cookie based on the IP address, port number, write sequence number and MSS of the client, and uses the cookie as the sequence number in its acknowledgment of the SYN. Upon receipt of the final ACK of the handshake, the server creates the open request data structure from the listening socket, the cookie and the final acknowledgment. With the listening socket, the open request structure, the incoming message, and the route table entry, a new socket can be created.

4.2.2. Time-Wait loading

Originally, clients were expected to actively close connections. It was assumed that small clients with few resource constraints would do most of the waiting in the time-wait state[12]. However, because of the still widely-used HTTP 1.0 protocol (which is message oriented), web servers actually do most of the active closes. In the HTTP 1.0 protocol, the client issues a GET message, the server sends the requested information, and that is the end of the interaction. The server is forced to close the connection and therefore must provide large amounts of resources for maintaining connections in time-wait[7].

In order to lighten the memory load of time-wait on high-traffic web servers, minisocks were introduced into the Linux 2.4.0 TCP stack. The data structure associated with minisocks is called a tcp_tw_bucket. Because it is supposedly possible to move from a closing state back to an established state, the original large socket is not destroyed as long as there is a context for it. If an application closes or dies without closing its sockets, then the sockets are orphaned.
Orphaned sockets must also go through the time-wait state, so tcp_tw_buckets must be created and ports must remain bound; but because there is no context with which to re-establish the connection, the large socket can be destroyed. Additionally, tcp_tw_buckets are hashed into a separate hash table. This is ostensibly done to keep the established-connection hash table from growing very large during time-wait loading, since it was assumed that a small established-connection hash table keeps demultiplexing latency low. Generally, web servers use individual threads to service a request. When the thread dies, the large socket becomes orphaned and only the tcp_tw_bucket remains. In this way, high-traffic web servers avoid the heavy load of fully-instantiated sockets being held during time-wait. Again, the savings are substantial: 96 bytes versus 832 bytes.
Reactivation is implemented as part of the tcp_timewait_state_process method. If the connection is inactive, TCP_TW_REACTIVATE is returned to the main receive process. We pass the timewait bucket and the incoming message to a process modeled after the cookie_v4_check method, and an open request data structure is created. The open request, the pointer to the route table entry, the timewait bucket and the incoming message are all sent to the tcp_v4_syn_recv_sock method. We cannot fully reuse this method, because we do not rehash the newly constituted socket and because the call to create the child socket is different; otherwise, the method is identical. The call that allocates the socket from the timewait bucket is tcp_create_timewait_child. It is modeled after the tcp_create_openreq_child method, with some substantial differences: tcp_create_openreq_child copies all data and initializes with data from the listening socket, but we do not have that socket, so we initialize the newly allocated socket from a mixture of timewait bucket and static data.
5. Results
All measurements were made using a modified version of Linux on one host and an unmodified version of the same kernel on the other host. The tests were performed on 993MHz Pentium IIIs with 1Gb AceNIC Ethernet cards connected in a cross-over pattern. The server code simply listens on a well-known port and accepts requests as they are received.
(Figure 5: average latency in microseconds versus number of connections, for the Time Wait, Connected, Deactivated, and Time Wait Orphan runs.)
6. Discussion
As we see in Figure 4, deactivated sockets save substantial memory. Because sockets with a context are not released, the time-wait run shows both the memory overhead of sockets and the memory overhead of the tcp_tw_buckets. The deactivated sockets require slightly more memory than the orphaned, time-wait sockets because of the additional state necessary to reconstitute the socket. The traditional Linux stack only allows around 2000 connections, as measured by the slab cache information. If we use deactivated sockets, we increase the number of connections allowed in 2MB of memory to over 20,000. This is a ten-fold increase in scalability.

Figure 5 shows no significant correlation between demultiplexing speed and the number of connections populating the hash table. The speed appears to be constant, and no differences are found between the various methods for storing inactive sockets. These findings are significant because they call into question the reasoning behind the use of minisocks in the standard Linux kernel. The time-wait-and-orphan measurement shows the performance of standard Linux servers. There is no latency advantage; the only significant advantage of minisocks is the memory savings. Furthermore, there is no reason to remove deactivated sockets from the established-connection hash table.

As we see in Figure 6, reactivation moderately increases startup latency over leaving connections open. On the other hand, reactivation decreases latency by approximately 40% for the first arriving packet of a message compared to closing and reopening a connection. When memory pressure requires space-saving methods, reactivation is clearly a viable method. In addition, note that this latency cost is only paid on the packet that initiates reactivation.

The most significant result of these experiments is that there are methods that allow full offloading of TCP processing for large clusters using commodity NICs with limited memory resources.
We drastically decreased the memory used for inactive sockets while only moderately increasing latency of the reactivation message.
7. Future Work

Deactivation and reactivation of sockets is a good beginning in our research to decrease latency and increase the scalability of TCP on large clusters. First, we must streamline the reactivation process in an attempt to decrease the latency cost. The next step is to use this mechanism to offload a small subset of highly latency-sensitive connections to a NIC, to allow for polling and direct access without interrupts or memory copies.

8. Conclusions

We were able to drastically reduce memory usage for open TCP connections, thus increasing the scalability of TCP, especially for offloaded TCP. The cost in latency of the first arriving packet on an inactive connection is significant, but it is much lower than the cost of closing and re-opening the connection. We were able to reuse mechanisms created in other areas of network research that reduce resource commitment for communication. By making small modifications to an existing operating system, we were able to drastically reduce resource usage. This is the great advantage of working with commodity protocols: often a good deal of research has already been implemented.

Commodity protocols, especially TCP/IP, will always be an important part of cluster computing. Certainly, as more scientific fields come to rely on high-performance computation, there will be more rather than less of a dependency on this interoperable, easy-to-program protocol. We must push TCP to the limits of its performance and efficiency with respect to high-performance environments and commodity hardware, as we cannot expect commodity protocols or components to fully disappear. Deactivation is a method for understanding the TCP stack in the context of high-performance computing. It begins to address a key problem with TCP: scalability with respect to state management for implementations that are offloaded onto commodity hardware.

References

[1] AceNIC gigabit ethernet for Linux. Web: http://jes.home.cern.ch/jes/gige/acenic.html, August 2001.
[2] M. Allman, S. Floyd, and C. Partridge. RFC 2414: Increasing TCP's initial window, September 1998. Status: EXPERIMENTAL.
[3] A. Barak, I. Gilderman, and I. Metrik. Performance of the communication layers of TCP/IP with the Myrinet gigabit LAN. Computer Communications, 22(11), July 1999.
[4] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz. TCP behavior of a busy internet server: Analysis and improvements. In IEEE INFOCOM, March 1998.
[5] J. Chase, A. Gallatin, and K. Yocum. End-system optimizations for high-speed TCP. In IEEE Communications, special issue on TCP Performance in Future Networking Environments, volume 39, 2000.
[6] B. Duncan. Splinter TCP to decrease small message latency in high-performance computing. Technical Report TR-CS2003-27, University of New Mexico, 2003.
[7] T. Faber, J. Touch, and W. Yue. The time-wait state in TCP and its effect on busy servers, 1999.
[8] P. Gilfeather and T. Underwood. Fragmentation and high performance IP. In Proc. of the 15th International Parallel and Distributed Processing Symposium, April 2001.
[9] S. Majumder and S. Rixner. Comparing Ethernet and Myrinet for MPI communication. In Proc. of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, October 2004.
[10] V. Padmanabhan and R. Katz. TCP Fast Start: a technique for speeding up web transfers. In IEEE Globecom '98 Internet Mini-Conference, November 1998.
[11] C. L. Schuba, I. V. Krsul, M. G. Kuhn, E. H. Spafford, A. Sundaram, and D. Zamboni. Analysis of a denial of service attack on TCP. In Proceedings of the 1997 IEEE Symposium on Security and Privacy, pages 208-223. IEEE Computer Society Press, May 1997.
[12] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, 1994.
[13] J. Touch. RFC 2140: TCP control block interdependence, April 1997. Status: INFORMATIONAL.
[14] Y. Zhang, L. Qiu, and S. Keshav. Optimizing TCP start-up performance. Technical Report TR99-1731, Cornell University, 1999.