Part No. 817-1046-10 June 2004, Revision A Submit comments about this document at: http://www.sun.com/hwdocs/feedback
Copyright 2004 Sun Microsystems, Inc., 4150 Network Circle, Santa Clara, California 95054, U.S.A. All rights reserved. Sun Microsystems, Inc. has intellectual property rights relating to technology that is described in this document. In particular, and without limitation, these intellectual property rights may include one or more of the U.S. patents listed at http://www.sun.com/patents and one or more additional patents or pending patent applications in the U.S. and in other countries. This document and the product to which it pertains are distributed under licenses restricting their use, copying, distribution, and decompilation. No part of the product or of this document may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any. Third-party software, including font technology, is copyrighted and licensed from Sun suppliers. Parts of the product may be derived from Berkeley BSD systems, licensed from the University of California. UNIX is a registered trademark in the U.S. and in other countries, exclusively licensed through X/Open Company, Ltd. Sun, Sun Microsystems, the Sun logo, AnswerBook2, docs.sun.com, iPlanet, Java, JavaDataBaseConnectivity, JavaServer Pages, Enterprise JavaBeans, Netra Sun ONE , Sun Trunking, JumpStart, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and in other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and in other countries. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc. The OPEN LOOK and Sun Graphical User Interface was developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. 
Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements. U.S. Government Rights - Commercial use. Government users are subject to the Sun Microsystems, Inc. standard license agreement and applicable provisions of the FAR and its supplements. DOCUMENTATION IS PROVIDED "AS IS" AND ALL EXPRESS OR IMPLIED CONDITIONS, REPRESENTATIONS AND WARRANTIES, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT, ARE DISCLAIMED, EXCEPT TO THE EXTENT THAT SUCH DISCLAIMERS ARE HELD TO BE LEGALLY INVALID.
Acknowledgements
Deepak says, I would like to thank the many people who have gone out of their way to help me not only with writing this book, but also with teaching me about various aspects of my professional and academic career. First, I am very grateful for the tremendous corporate support I received from Scott McNealy, Clark Masters, Gary Beck, Brad Carlile, and Bill Sprouse. I feel extremely fortunate to be a part of a team with the greatest, most unselfish corporate leadership of modern times. I would like to thank Kemer Thomson, Vicky Hardman, Gary Rush, Barb Jugo, Alice Kemp, a veteran book writer and also my book mentor, and Diana Lins, who did the illustrations in this book. Much of the data center work developed in this book was built on the shoulders of giants: Richard Croucher, a world-renowned expert in data center technologies, Dr. Jim Baty, Dr. Joseph Williams, Mikael Lofstrand, and Jason Carolan. Frank and I are very grateful to the technical reviewers who spent considerable time reviewing and providing feedback and comments: David Auslander, Martin Lorenz, Ken Pepple, Mukund Buddhikot, and my good friends and colleagues Mark Garner, who sacrificed many pub nights for me, David Deeths, Don Devitt, and John Howard. I would also like to thank Dr. Nick McKeown, professor at Stanford University, Rui Zhang-Shen, Ph.D. student at Stanford University, John Fong of Nortel Networks, John Reuter and David Bell of Foundry Networks, Dan Mercado and Bill Cormier of Extreme Networks, and Sunil Cherian of Array Networks. Above all, says Deepak, I thank my wife, Jagruti, and daughters, Angeli and Kristina, for their patience, sacrifice, and understanding of why I had to miss both school programs and family events. Finally, I want to thank my mother, mother-in-law, and father-in-law for helping out at home during my many absences.
Frank says, We must also remember the unsung heroes who implement, test, sustain, and measure the performance of the Sun network technology, device drivers, and device driver framework, for their outstanding contribution to Sun networking technology. Their assistance and support helped build the collective experience that has been essential to providing the best possible networking capability for Sun and, in turn, the material necessary for some parts of this book. These include the Network ASIC development team, in particular Shimon Muller, Binh Pham, George Chu, and Carlos Castil; the device driver development team, including Sumanth Kamatala, Joyce Yu, Paul Simons, Raghunath Shenbagam, David Gordon, and Paul Lodrige; the Networking Quality Assurance team, including Lalit Bhola, Benny Chin, Jie Zhu, Alan Hanson, Neeraj Gupta, Deb Banerjee, Charleen Yee, and Ovid Jacob; the Solaris Device Driver Framework Development team, including Adi Masputra, Jerry Chu, Priyanka Agarwal, and Paul Durrant; and the System Performance Measurement team: Patrick Ong, Jian Huang, Paul Rithmuller, Charles Suresh, and Roch Borbonnais. I would love to say a big thank you to my wife, Bridget, Frank says, for her patience and encouragement as I progressed with my contribution to this book. Thanks to my two sons, Francesco and Antonio, for distracting me from time to time and forcing me to take a break from the book and play. For them, Dad's writing his book was not a reasonable excuse. God bless them, for they were right. Who would imagine someone three feet tall could have that much insight? Those breaks made all the difference.
Contents

1. Overview
    End-to-End Session: Tuning the Transport Layer
    Network Edge Traffic Steering: IP Services
    Server Networking Internals

    Mapping Tiers to the Network Architecture
    Inter-tier Traffic Flows
    Web Services Tier
    Designing for Vertical Scalability and Performance
    Designing for Security and Vertical Scalability

    Connection Setup
    TCP Congestion Control and Flow Control
    Sliding Windows
    TCP Tuning for ACK Control
    TCP Example Tuning Scenarios
    Tuning TCP for Optical Networks (WANs)
    Tuning TCP for Slow Links
    TCP and RDMA
    Future Data Center Transport Protocols

4. Routers, Switches, and Appliances - IP-Based Services: Network Layer
    Packet Switch Internals
    Round-Robin
    Smallest Queue First / Least Connections
    Finding the Best SLB Algorithm
    How the Proxy Mode Works
    Extreme Networks BlackDiamond 6800 Integrated SLB Proxy Mode
    Layer 7 Switching
    Implementation Approaches
    Deployment of Data and Control Planes
    Packet Classifier
    Metering
    Marking
    The Crypto Accelerator Board - Packet Flow
    SSL Accelerator Appliance - Packet Flow
    SSL Performance Tests
        Test 1: SSL Software Libraries versus SSL Accelerator Appliance - Netscaler 9000
        Test 2: Sun Crypto Accelerator 1000 Board
        Test 3: SSL Software Libraries versus SSL Accelerator Appliance - Array Networks
        Conclusions Drawn from the Tests

5. Server Network Interface Cards: Datalink and Physical Layer
    Token Ring Networks
    Configuring the SunTRI/S Adapter with TCP/IP
        Setting the Maximum Transmission Unit
        Disabling Source Routing
        Disabling ARI/FCI Soft Error Reporting
        Configuring the Operating Mode
    Configuring the SunTRI/P Adapter with TCP/IP
        Setting the Maximum Transmission Unit
        Configuring the Ring Speed
        Configuring the Locally Administered Address
    Fiber Distributed Data Interface Networks
        FDDI Stations
    Configuring the SunFDDI/S Adapter with TCP/IP
        Setting the Maximum Transmission Unit
        Target Token Rotation Time
    Configuring the SunFDDI/P Adapter with TCP/IP
        Setting the Maximum Transmission Unit
        Target Token Rotation Time
    Ethernet Technology
    Software Device Driver Layer
        Transmit
        Receive
    Jumbo Frames
    Link-Partner Auto-negotiation Advertisement Register
    Gigabit Media Independent Interface
    Ethernet Flow Control
        Example 1
        Example 2
    Current Device Instance in View for ndd
        Operational Mode Parameters
        Transceiver Control Parameter
        Inter-Packet Gap Parameters
        Local Transceiver Auto-negotiation Capability
        Link Partner Capability
    Current Device Instance in View for ndd
        Operational Mode Parameters
        Transceiver Control Parameter
        Inter-Packet Gap Parameters
    Current Device Instance in View for ndd
        Operational Mode Parameters
        Transceiver Control Parameter
        Inter-Packet Gap Parameters
    Current Physical Layer Status
    Fiber Gigabit Ethernet
    Current Device Instance in View for ndd
        Operational Mode Parameters
        Transceiver Control Parameter
        Inter-Packet Gap Parameters
        Local Transceiver Auto-negotiation Capability
        Link Partner Capability
    Current Device Instance in View for ndd
        Operational Mode Parameters
        Flow Control Parameters
        Gigabit Link Clock Mastership Controls
        Transceiver Control Parameter
        Inter-Packet Gap Parameters
        Receive Interrupt Blanking Parameters
        Random Early Drop Parameters
        PCI Bus Interface Parameters
    10/100/1000 bge Broadcom BCM 5704 Gigabit Ethernet
        Operational Mode Parameters
        Current Physical Layer Status
    Sun VLAN Technology
        VLAN Configuration
    Sun Trunking Technology
        Trunking Configuration
        Trunking Policies
    Network Configuration
        Configuring the System to Use the Embedded MAC Address
        Configuring the Network Host Files
        Setting Up a GigaSwift Ethernet Network on a Diskless Client System
        Installing the Solaris Operating System Over a Network
    Configuring Driver Parameters
        Using the ndd Utility in Non-interactive Mode
        Using the ndd Utility in Interactive Mode
        Reboot Persistence Using driver.conf
        Global driver.conf Parameters
        Per-Instance driver.conf Parameters
        Using /etc/system to Tune Parameters
    Network Interface Card General Statistics
    Maximizing the Performance of an Ethernet NIC Interface
    Ethernet Physical Layer Troubleshooting
        Deviation from General Ethernet MII/GMII Conventions
    Ethernet Performance Troubleshooting
        ge Gigabit Ethernet
        ce Gigabit Ethernet

6. Network Availability Design Strategies
    Network Architecture and Availability
    Layer 2 Strategies
    Trunking Approach to Availability
        Theory of Operation
        Availability Issues
    Load-Sharing Principles
    Availability Strategies Using SMLT and DMLT
    Availability Using Spanning Tree Protocol
        Availability Issues
    Layer 3 Strategies
    Conclusions Drawn from Evaluating Fault Detection and Recovery Times

7. Reference Design Implementations
    Logical Network Architecture
    IP Services
        Stateless Server Load Balancing
        Stateless Layer 7 Switching
        Stateful Layer 7 Switching
        Stateful Secure Sockets Layer Session ID Persistence
        Stateful Cookie Persistence
    Collapsed Layer 2/Layer 3 Network Design
    Multi-Tier Data Center Logical Design
    How Data Flows Through the Service Modules
    Physical Network Implementations
        Secure Multi-Tier
        Flat Architecture Using Collapsed Large Chassis Switches
        Physical Network Connectivity
    Switch Configuration
        Configuring the Extreme Networks Switches
        Configuring the Foundry Networks Switches
        Master Core Switch Configuration
        Standby Core Switch Configuration
        Server Load Balancer
        Server Load Balancer
    Network Security
Figures

High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (a)
High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (b)
Influence of Multi-Tier Software Architectures on Network Architecture
Transport Layer Traffic Flows Tuned According to Client Links
Data Center Edge IP Services
Data Center Networking Considerations on the Server
Availability Strategies in the Data Center
Example Implementation of an Enterprise Multi-Tier Data Center
Main Components of Multi-Tier Architecture
Logical View of Multi-Tier Service on Demand Architecture Network
Inter-tier Traffic Flows of a Web-based Transaction
Model of Presentation/Web Tier Components and Interfacing Elements
High-Level Survey of EJB Availability Mechanisms
Tightly Coupled Web Tier and Application Server Tier - Vertically Scaled
Decoupled Web Tier and Application Server Tier - Horizontally Scaled
Tested and Implemented Architecture Solution
Overview of Overlapping Tuning Domains
Perfectly Tuned TCP/IP System
Tuning Required to Compensate for Faster Links
Tuning Required to Compensate for Slower Links
Complete TCP/IP Stack on Computing Nodes
TCP and STREAM Head Data Structures Tunable Parameters
TCP State Engine Server and Client Node
TCP Startup Phase
Comparison between Normal LAN and WAN Packet Traffic
Tuning Required to Compensate for Optical WAN
Comparison between Normal LAN and WAN Packet Traffic - Long Low Bandwidth Pipe
Increased Performance of InfiniBand/RDMA Stack
Internal Architecture of a Multi-Layer Switch
High-Level Model of Server Load Balancing
High-Level Model of the Shortest Queue First Technique
Round-Robin and Weighted Round-Robin
Server Load Balanced System Modeled as N - M/M/1 Queues
System Model of One Queue
Server Load Balance - Packet Flow: Proxy Mode
Direct Server Return Packet Flow
Content Switching Functional Model
Overview of End-to-End Network and Systems Architecture
One-Way End-to-End Packet Data Path Transversal
QoS Functional Components
Traffic Burst Graphic
Congestion Control: RED, WRED Packet Discard Algorithms
High-Level Condensed Protocol Overview
SSL Appliance Offloads Frontend Client SSL Processing
SSL Test Setup with No Offload
Throughput Increases Linearly with More Processors
SSL Test Setup for SSL Software Libraries
SSL Test Setup for an SSL Accelerator Appliance
Effect of Number of Threads on SSL Performance
Effect of File Size on SSL Performance
Token Ring Network
Typical FDDI Dual Counter-Rotating Ring
SAS Showing Primary Output and Input
DAS Showing Primary Input and Output
SAC Showing Multiple M-ports with Single-Attached Stations
DAC Showing Multiple M-ports with Single-Attached Stations
Communication Process between the NIC Software and Hardware
Transmit Architecture
Hardware Receive Checksum
Software Load Balancing
Hardware Load Balancing
Basic Mode Control Register
Basic Mode Status Register
Link Partner Auto-negotiation Advertisement
Link Partner Priority for Hardware Decision Process
Auto-negotiation Expansion Register
Extended Basic Mode Control Register
Basic Mode Status Register
Gigabit Extended Status Register
Gigabit Control Status
Gigabit Status Register
Flow Control Pause Frame Format
Link Partner Auto-negotiation Advertisement Register
Rx/Tx Flow Control in Action
Typical hme External Connectors
Typical qfe External Connectors
Typical vge and ge MMF External Connectors
Sun GigaSwift Ethernet MMF Adapter Connectors
Sun GigaSwift Ethernet UTP Adapter Connectors
Example of Servers Supporting Multiple VLANs with Tagging Adapters
Network Topologies and Impact on Availability
Trunking Software Architecture
Trunking Failover Test Setup
Correct Trunking Policy on Switch
Incorrect Trunking Policy on Switch
Correct Trunking Policy on Server
Incorrect Trunking Policy on a Server
Incorrect Trunking Policy on a Server
Layer 2 High-Availability Design Using SMLT
Layer 2 High-Availability Design Using DMLT
Spanning Tree Network Setup
High-Availability Network Interface Cards on Sun Servers
Design Pattern - IPMP and VRRP Integrated Availability Solution
Design Pattern - OSPF Network
RIP Network Setup
IP Services - Switch Functions Operate on Incoming Packets
Application Redirection Functional Model
Content Switching Functional Model
Tested SSL Accelerator Configuration - RSA Handshake and Bulk Encryption
Network Availability Strategies
Logical Network Architecture - Design Details
Traditional Availability Network Design Using Separate Layer 2 Switches
Availability Network Design Using Large Chassis-Based Switches
Logical Network Architecture with Virtual Routers, VLANs, and Networks
Logical Network Secure Multi-Tier
Multi-Tier Data Center Architecture Using Many Small Switches
Network Configuration with Extreme Networks Equipment
Sun ONE Network Configuration with Foundry Networks Equipment
Physical Network Connections and Addressing
Collapsed Design Without Layer 2 Switches
Foundry Networks Implementation
Firewalls between Service Modules
Virtual Firewall Architecture Using Netscreen and Foundry Networks Products
Tables

Network Inter-tier Traffic Flows of a Web-based Transaction
tr.conf Parameters
MTU Sizes
trp.conf Parameters
Maximum Transmission Unit
Ring Speed
nf.conf Parameters
Maximum Transmission Unit
Request Operating TTRT
pf.conf Parameters
Request Operating Target Token Rotation Time
Multi-Data Transmit Tunable Parameter
Possibilities for Resolving Pause Capabilities for a Link
Driver Parameters and Status
Instance Parameter
Operational Mode Parameters
Local Transceiver Auto-negotiation Capability Parameters
Link Partner Capability Parameters
Current Physical Layer Status Parameters
Driver Parameters and Status
Instance Parameter
Operational Mode Parameters
Inter-Packet Gap Parameter
Local Transceiver Auto-negotiation Capability Parameters
Link Partner Capability Parameters
Current Physical Layer Status Parameters
Driver Parameters and Status
Instance Parameter
Operational Mode Parameters
Inter-Packet Gap Parameters
Local Transceiver Auto-negotiation Capability Parameters
Link Partner Capability Parameters
Current Physical Layer Status Parameters
Driver Parameters and Status
Operational Mode Parameters
Local Transceiver Auto-negotiation Capability Parameters
Link Partner Capability Parameters
Current Physical Layer Status Parameters
Driver Parameters and Status
Instance Parameter
Operational Mode Parameters
Inter-Packet Gap Parameter
Local Transceiver Auto-negotiation Capability Parameters
Link Partner Capability Parameters
Current Physical Layer Status Parameters
Performance Tunable Parameters
Driver Parameters and Status
Instance Parameter
Operational Mode Parameters
Read-Write Flow Control Keyword Descriptions
Gigabit Link Clock Mastership Controls
Inter-Packet Gap Parameter
Rx Random Early Detecting 8-Bit Vectors
PCI Bus Interface Parameters
Jumbo Frames Enable Parameter
Performance Tunable Parameters
Driver Parameters and Status
Operational Mode Parameters
Local Transceiver Auto-negotiation Capability Parameters
Link Partner Capability Parameters
Current Physical Layer Status Parameters
General Network Interface Statistics
General Network Interface Statistics
Physical Layer Configuration Properties
List of ge Specific Interface Statistics
List of ce Specific Interface Statistics
Network and VLAN Design
Preface
Networking Concepts and Technology: A Designer's Resource is a resource for network architects who must create solutions for emerging network environments in enterprise data centers. You'll find information on how to leverage Sun Open Network Environment (Sun ONE) technologies to create Services on Demand solutions, as well as technical details about the networking internals. You'll also learn how to integrate your environment with advanced network switching equipment, providing sophisticated Internet Protocol (IP) services beyond plain-vanilla Layer 2 and Layer 3 routing. Based upon industry standards, expert knowledge, and hands-on experience, this book provides a detailed technical overview of the following:
- Design of highly available, scalable, manageable gigabit network architectures, with a focus on the server-to-switch tier. We share key ingredients for successful deployments based on actual experiences.

- Emerging IP services that vastly improve Sun ONE-based solutions, giving you a centralized source of concise information about these services, the benefits they provide, how to implement them, and where to use them. Example services include quality of service (QoS), server load balancing (SLB), Secure Sockets Layer (SSL), and IPSec.

- Available Sun networking software and hardware technologies. We describe how Sun differs from the competition in the networking arena, then summarize the internal operations and technical details that lead into the tuning sections. Most tuning advice today consists of blind recommendations with no explanations; this book fills that void by first describing the networking technology, then explaining which variables serve what purpose, what tuning will do, and why.
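As a small taste of the reasoning behind that tuning guidance, consider sizing a TCP window to the bandwidth-delay product of a link. The sketch below is our illustration with hypothetical link figures, not an example taken from this book:

```python
def tcp_window_bytes(bandwidth_bps: float, rtt_s: float) -> int:
    """Bandwidth-delay product: bytes that must be in flight to keep the pipe full."""
    return int(bandwidth_bps / 8 * rtt_s)

# A fast optical WAN needs a far larger window than a slow access link,
# which is why one tuning recipe cannot fit both.
optical_wan = tcp_window_bytes(1e9, 0.040)   # 1 Gbit/s at 40 ms RTT
slow_link   = tcp_window_bytes(56e3, 0.200)  # 56 kbit/s modem at 200 ms RTT
```

The optical WAN case calls for roughly 5 MB of window, far beyond old 64 KB defaults, while the slow link needs well under 2 KB.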
Chapter 2 explores the main components of a typical enterprise Services on Demand network architecture and some of the more important underlying issues that impact network architecture design decisions.

Chapter 3 describes some of the key Transmission Control Protocol (TCP) tunable parameters related to performance tuning: how these tunables work, how they interact with each other, and how they impact network traffic when they are modified.

Chapter 4 describes the internal architecture of a basic network switch and provides a comprehensive discussion of server load balancing.

Chapter 5 discusses the networking technologies that are regularly found in a data center.

Chapter 6 provides an overview of the various network availability approaches and describes where it makes sense to apply each solution.

Chapter 7 describes network implementation concepts and details.

Appendix A provides an example of the Lyapunov function.

The Glossary provides definitions for the technical terms and acronyms used in this book.
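As a preview of the server load balancing discussion in Chapter 4, here is a minimal sketch (our illustration, not code from the book) contrasting round-robin selection with a least-connections policy:

```python
from itertools import cycle

servers = ["web1", "web2", "web3"]

# Round-robin: rotate through the pool regardless of server load.
rr = cycle(servers)
rr_picks = [next(rr) for _ in range(4)]  # wraps back to web1 on the 4th pick

# Least connections: pick the server with the fewest active sessions,
# so a busy server is skipped until it drains.
active = {"web1": 12, "web2": 3, "web3": 7}

def least_connections(conns):
    return min(conns, key=conns.get)

pick = least_connections(active)  # chooses "web2"
```

Round-robin is stateless and cheap; least connections tracks per-server session counts but adapts when requests have very uneven service times.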
Shell Prompts

Shell                                      Prompt

C shell                                    machine-name%
C shell superuser                          machine-name#
Bourne shell and Korn shell                $
Bourne shell and Korn shell superuser      #
Typographic Conventions

Typeface     Meaning                                           Examples

AaBbCc123    The names of commands, files, and directories;    Edit your .login file.
             on-screen computer output                         Use ls -a to list all files.
                                                               % You have mail.

AaBbCc123    What you type, when contrasted with on-screen     % su
             computer output                                   Password:

AaBbCc123    Book titles, new words or terms, words to be      Read Chapter 6 in the User's Guide.
             emphasized. Replace command-line variables        These are called class options.
             with real names or values.                        You must be superuser to do this.
                                                               To delete a file, type rm filename.
CHAPTER 1
Overview
This book provides a resource for network architects who design IP network architectures for the typical data center. It provides abstractions as well as detailed insights based on actual network engineering experience, including network product development and real-world customer experiences. The focus of this book is limited to network architectures that support Web services-based multi-tier architectures. However, this includes everything from the edge data center switch (which connects the data center to the existing backbone network) down to the server network protocol stacks.

While there is tremendous acceptance of Web services technologies and multi-tier software architectures, there is limited information about how to create the network infrastructures required in the data center to optimally support these new architectures. This book also provides a new perspective on how to think about solving this problem, leveraging new emerging technologies that help create superior solutions. It explains in detail how certain key technologies work and why certain procedures are recommended.

One of the complexities of networking in general is that the technology requires breadth of knowledge. Networking connectivity spans many completely different technologies: it is a complex, interconnected, commingled, interrelated set of components and devices, including hardware, software, and different solution approaches for each segment. We try to simplify this complexity by extracting key segments and taking a layered approach to describing an end-to-end solution, while limiting the scope of the material to the data center.
hardware and software neutral. Thus enterprises can easily communicate with employees, customers, business partners, and vendors while maintaining the specific security, access, and privilege needs of all. The examples used in this book focus on Web-based systems, which simplifies our discussion from a networking perspective.
FIGURE 1-1 shows a conceptual model of how this new paradigm impacts the data center network architecture, which must efficiently support this infrastructure. The Web services-based infrastructure allows different applications to integrate through the exposed interface, which is the advertised service. The internal details of the service and subservices required to provide the exposed service are hidden. This approach has a profound impact on the various networks, including the service provider network and the data center network that support the bulk of the intelligence required to deliver these Web-based services.
FIGURE 1-1 [Figure showing legacy applications (1 and 2), Web-based services A and B, and customer and vendor applications integrating over XML through a middleware infrastructure and business federated networks: partner, customer, and vendor networks]
Data center network architectures are driven by computing paradigms. One can argue that the computing paradigm has now come full circle. From the 1960s to the 1980s, the industry was dominated by a centralized data center architecture that
revolved around a mainframe with remote terminal clients. Systems Network Architecture (SNA) and Binary Synchronous Communication (BSC) were the dominant protocols. In the early to mid-1990s, client-server computing influenced a distributed network architecture: departments had a local workgroup server with local clients and an occasional link to the corporate database or mainframe. Now computing has returned to a centralized architecture (where the enterprise data center is more consolidated) for improved manageability and security. This centralized data center architecture must provide access to intranet and Internet clients with different devices, link speeds, protocols, and security levels. Clients include internal corporate employees, external customers, partners, and vendors, each with different security requirements. A single flexible and scalable architecture is required to provide all these different services. The network architect now requires a wider and deeper range of knowledge, spanning Layer 2 and Layer 3 networking equipment vendors, emerging startup appliance makers, and server-side networking features. Creating optimal data center edge architectures is not only about routing packets from the client to the target server or set of servers that collectively expose a service, but also about processing, steering, and providing cascading services at various layers. For the purposes of this book, we distinguish network design from architecture as follows:
- Architecture is a high-level description of how the major components of the system interconnect from a logical and physical perspective.
- Design is a process that specifies, in sufficient detail for implementation, how to construct a network of interconnected nodes that meets or exceeds functional and non-functional requirements (performance, availability, scalability, and such).
Advances in networking technologies, combined with the rapid deployment of Web-based, mission-critical applications, brought growth and significant changes in enterprise IP network architectures. The ubiquitous deployment of Web-based applications that has streamlined business processes has further accelerated Web services deployments. These deployments have a profound impact on the supporting infrastructures, often requiring a complete paradigm shift in the way we think about building the network architectures. Early client-server deployments had network traffic pattern characteristics that were predominantly localized traffic over large Layer 2 networks. As the migration towards Web-based applications accelerated, client-server deployments evolved to multi-tier architectures, resulting in different network traffic patterns, often outgrowing the old architectures. Traditional network architectures were designed on the assumption that the bulk of the traffic would be local or Layer 2, with proportionately less inter-company or Internet traffic. Now, traffic is very different due to the changing landscape of corporate business policies towards virtual private networks (VPNs), consumer-to-business, and business-to-business e-commerce. These innovations have also given rise to new challenges and opportunities in the design and deployment of emerging enterprise data center IP network architectures.
Chapter 1
Overview
This book describes why these network traffic patterns have changed, defines the multi-tier data centers that support these emerging applications, and then describes how to design and build network architectures that optimally support multi-tier data centers. The focus of this book spans the edge of the data center network to the servers. The scope is limited to the data center edge, so the book does not cover the core of the enterprise network.
- The Internet Service Provider (ISP), which provides connectivity to the public Internet for both clients and enterprises
- The owners of the physical plant and communications equipment, which fall into one of the following categories:
  - The Incumbent Local Exchange Carrier (ILEC), which provides local access to subscribers in a local region
  - The Inter-Exchange Carrier (IXC), which provides national and international access to subscribers
  - The Tier 2 ISP, which is usually a private company that leases lines and cage space at an ILEC facility, or can be the ILEC or IXC itself
The diagram shows Tier 2 ISPs as relatively local ISPs situated in the access networks, whereas Tier 1 ISPs have their own long-haul backbones and provide wider regional coverage, situated in the core, or national backbone. A Tier 1 ISP often aggregates the traffic of many Tier 2 ISPs, in addition to providing services directly to individual subscribers. Large networks connect to each other through peering points, such as MAE-East and MAE-West, which are public peering points, or through Network Access Points (NAPs), such as Sprint's NAP, which are private.
FIGURE 1-2 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (a), showing ATM, IP, and Ethernet segments
FIGURE 1-3 High-Level Overview of Networks Spanning Clients, Data Center, Vendors, and Partners (b)
A client can be any software that initiates a request for a service. This means that a Web server itself can be a client: for example, while replying to a client's request for a Web page, it may need to fetch images from an image server. FIGURE 1-2 and FIGURE 1-3 show remote dial-up clients as well as corporate clients. Depending on the distance between the client and the server hosting the Web service, data might need to traverse a variety of networks for end-to-end communication. The focus of this book is the network that interconnects the servers located in an enterprise data center or in the data center of an ISP offering collocation services. We describe the features and functions of the networking equipment and servers in sufficient depth to help a network architect design enterprise IP network architectures. We take a layered approach, starting from the application layer and working down to the physical layer, to describe the implications for network architecture design. We describe not only high-level architectural principles but also key details, such as tuning the transport layer for optimally performing networks. We discuss in detail how the building blocks of the network architecture are constructed and how they work, so that you can make more informed design decisions. Finally, we present actual tested configurations, providing a baseline for customizing and extending these configurations to meet actual customer requirements.
FIGURE 1-4
Chapter 2 offers insight into the applications that generate the traffic flows across the tiers. Inter-tier traffic starts with a client request, which can originate from a remote dial-up user, an intranet corporate employee, an Internet partner, and so on. This Hypertext Transfer Protocol (HTTP) or HTTP over SSL (HTTPS) request packet is usually about a hundred bytes. The server response is usually a 1000-byte to 200-kilobyte file, often consisting of Web page images. Chapter 2 describes the key components and technologies used at the application layer and provides some deeper insights into achieving availability.
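The size claim above is easy to check: a complete HTTP GET request often fits in roughly a hundred bytes. The sketch below builds a minimal request (the host name and headers are hypothetical) and prints its size.

```java
// Sketch of the small client request described in the text: a complete
// HTTP GET request is on the order of 100 bytes. Host and headers are
// hypothetical examples, not from the book.
public class RequestSize {
    public static String minimalGet() {
        return "GET /index.html HTTP/1.1\r\n"
             + "Host: shop.example.com\r\n"
             + "User-Agent: Mozilla/4.0\r\n"
             + "Accept: text/html\r\n"
             + "\r\n";
    }

    public static void main(String[] args) {
        // The entire request is on the order of a hundred bytes.
        System.out.println(minimalGet().length());
    }
}
```

The asymmetry with the 1000-byte to 200-kilobyte response is what makes inter-tier egress bandwidth the dominant sizing concern.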
Networking Concepts and Technology: A Designer's Resource
The processing of client Web requests and generation of an HTTP response may require significant processing across various Web and application or legacy servers. Examples of typical applications include business applications implemented using Enterprise JavaBeans (EJBs) on an application server, mail messaging, and dynamic Web page generation using JavaServer Pages (JSP) and servlets. The nature of the traffic requirements should be clearly identified and quantified. Most important are the identification and specification of handling peaks or bursts. We provide detailed Web, application, and database tier traffic flows and availability strategies, which directly impact inter-tier traffic flows.
FIGURE 1-5
a multi-tier data center network architecture. These services are essentially packet-processing functions that alter the flow of traffic from client to server to improve certain aspects of the architecture. Firewalls and Secure Sockets Layer (SSL) are added to increase security. Server load balancing (SLB) is used to increase availability, scalability, flexibility, and performance. In Chapter 4, we describe key services that the architect can leverage, including in-depth explanations of how they work, which variant is best, and why. A question that is often asked is which server load balancing algorithm is best; we provide a detailed technical analysis explaining exactly why one algorithm wins. We also provide a detailed
explanation of the emerging quality of service (QoS) technologies, which have gained importance because of the increasing deployment of time-dependent applications such as Voice over IP (VoIP) and multimedia. Most enterprise networks are overprovisioned, so normal steady-state network flows are usually not an issue. What really concerns most competent network architects is how to handle peaks or bursts. Every potential incoming HTTP request could be a revenue-producing opportunity that cannot be discarded, and this is where QoS plays an essential role. One of the missing pieces in most network architectures is planning for peak workloads and providing differentiated services. When there is congestion, we absolutely must prioritize and service the important customers ahead of casual Web surfers. Quality of service is discussed in detail: its importance, where to use it, and how it works.
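As one illustration of how an SLB decision can work (a sketch only; Chapter 4 presents the full analysis of which algorithm is preferable), the following least-connections selector routes each new request to the real server with the fewest active connections. The server names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a least-connections server load balancing (SLB) decision:
// each new request is routed to the real server with the fewest active
// connections, then that server's count is incremented.
public class LeastConnections {
    private final Map<String, Integer> active = new LinkedHashMap<>();

    public void addServer(String name) {
        active.put(name, 0);
    }

    // Pick the least-loaded server and account for the new connection.
    public String route() {
        String best = null;
        for (Map.Entry<String, Integer> e : active.entrySet()) {
            if (best == null || e.getValue() < active.get(best)) {
                best = e.getKey();
            }
        }
        active.put(best, active.get(best) + 1);
        return best;
    }

    // Called when a connection closes, freeing capacity on that server.
    public void finish(String name) {
        active.put(name, active.get(name) - 1);
    }
}
```

Unlike round-robin, this policy adapts to long-lived connections, which is one reason connection-aware algorithms are usually preferred for bursty Web traffic.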
FIGURE 1-6 IP Services at the Enterprise Multi-Tier Data Center Service Access Points (SLB, QoS, NAT, SSL, and firewall)
FIGURE 1-7
- Trunking, NIC (server side)
- Trunking, switch side, including Distributed Multi-Link Trunking (DMLT)
- Spanning Tree Protocol (STP)
- Virtual Router Redundancy Protocol (VRRP) default router redundancy mechanisms
- IP Multipathing (IPMP) NIC redundancy
- Open Shortest Path First (OSPF) and Routing Information Protocol (RIP) data center routing protocol availability features
The advantages and disadvantages will be described, along with suggestions on which approach makes sense for which situation.
FIGURE 1-8
Reference Implementations
The final chapter ties together all the concepts from previous chapters and describes actual network architectures from a complete solution standpoint. The material in Chapter 7 is based on actual tested configurations. FIGURE 1-9 shows an example of the tested configurations. The solution described in Chapter 7 is generic enough to be useful for actual solutions, yet customizable for specific requirements. The logical network architecture describes a high-level overview of the Layer 3 networks and segregates the various service tiers. IP services, which are implemented at key boundary points of the architecture, are then reviewed. We describe the design considerations that lead to the physical architecture. We discuss two different architectural approaches and show how different network switch vendors can be deployed, including the advantages and disadvantages of each. We present detailed descriptions of configurations and implementations. A hardware-based firewall is used to show a logical firewall solution, providing security between each tier, yet using only one appliance. For increased availability, a second appliance is optionally added.
FIGURE 1-9 Example Tested Configuration: Active and Standby Core Switches (192.168.10.1, 192.168.10.3), Server Load-Balancer Switches, Web Tier Sun Fire 280R Servers (10.10.0.100 through 10.10.0.103), Directory Service Tier Sun Fire 280R Servers (10.20.0.100 through 10.20.0.103), and T3 Storage
CHAPTER 2

Network Traffic Patterns: Application Layer

Presentation Tier, Web Tier, Application Tier, Naming Services Tier, Data Tier
In this chapter we will explore the main components of a typical enterprise Services on Demand network architecture and some of the more important underlying issues that impact network architecture design decisions. We describe in detail the Web tier and Application tier, pointing out issues that impact design decisions for availability, performance, manageability, security, and scalability. We then describe some example architectures that were actually deployed in industry. Topics include:
- Services on Demand Architecture on page 18 describes the overall architecture from a software perspective, showing the applications that generate the network traffic.
- Multi-Tier Architecture and Traffic Patterns on page 20 describes the mapping process from the logical architecture to its physical realization in the network architecture. It also describes the inter-tier traffic patterns and the reasons behind the network traffic.
- Web Services Tier on page 23 describes the most important tier, which is present in all Web-based architectures. This section provides detailed insights into the applications that run on this tier and directly impact the network architecture.
- Application Services Tier on page 26 describes the Application tier and its relationship to the Web services tier. This section provides detailed insights into the applications that run on this tier and directly impact the network architecture.
- Architecture Examples on page 29 provides examples of various architectures based on design trade-offs and the reasons behind them. It is important to note that it is the application characteristics that influence the design of the network architecture.
- Example Solution on page 34 describes an actual tested and implemented multi-tier architecture, using the design concepts and principles described in this chapter.
FIGURE 2-1 Sun ONE Services on Demand Architecture (Internet clients reach HTTP servers, Web servers, servlet containers, and EJB/ORB containers through routers and firewalls; supporting services include messaging and JavaMail, security, naming, JDBC connections to Oracle, administration and monitoring, and legacy integration; platform services span service creation, assembly, and deployment, service delivery, service integration, the service container, the Web application container, core Web services, and identity and policy)
Web Tier Network 10.10.0.0
Directory Tier Network 10.20.0.0
Application Server Tier Network 10.30.0.0
FIGURE 2-2 shows how the various services map directly to a corresponding logical Layer 3 network cloud, shown above in boxes, which then maps directly onto a Layer 2 VLAN. The mapping process starts with the high-level model of the services to be deployed onto the physical model. This top-down approach allows network architects to maintain some degree of platform neutrality. The target hardware can change or scale, but the high-level model remains intact.1
Layer 2 VLANs segregate Layer 2 broadcast domains and service domains. An example of a service domain would be a group of Web servers, load balanced, horizontally scaled, and aggregated to provide a highly available service with a single IP access point, commonly deployed in actual practice as a VIP on a load balancer. Layer 3 IP networking segregates Layer 3 routed domains and service domains. Segregating service domains based on IP addresses makes this service network accessible to any host on any Layer 3 IP network. One advantage of this approach is that the service interface for each cloud only needs to be one endpoint, which is easily implemented by a virtual IP (VIP) address. The service is actually provided across many subservice instances running on physically separated servers, collectively forming a logical cluster. The external world does not need to know (and should not know for many reasons, especially security) about the individual servers that provide the service. By creating a layer of indirection, the requesting client need not be modified if any one server is removed or replaced. This decoupling improves manageability and serviceability.
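The layer of indirection described above can be sketched as follows: clients address only the stable VIP, while the real servers behind it can be removed or replaced without any client-side change. The addresses are hypothetical, and the round-robin dispatch is a stand-in for whatever policy the load balancer actually applies.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the indirection a virtual IP (VIP) provides: one stable,
// advertised endpoint in front of a pool of real servers that can be
// changed without the client knowing.
public class VipEndpoint {
    public final String vip;                        // the single advertised address
    private final List<String> realServers = new ArrayList<>();
    private int next = 0;                           // simple round-robin index

    public VipEndpoint(String vip) {
        this.vip = vip;
    }

    public void add(String server)    { realServers.add(server); }
    public void remove(String server) { realServers.remove(server); }

    // The client only ever sees 'vip'; the dispatch target is hidden.
    public String dispatch() {
        String target = realServers.get(next % realServers.size());
        next++;
        return target;
    }
}
```

Removing a server changes only the hidden pool; the advertised endpoint, and therefore the client, is untouched, which is exactly the manageability and serviceability benefit described above.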
This mapping process allows better control of the network traffic by providing a mechanism for routers and switches to steer the traffic according to user-defined rules. In actual practice, these user-defined rules are accomplished by configuring VLANs, static routes, and access control lists (ACLs). A further benefit allows traffic to be filtered at wirespeed to identify flows for other services such as Quality of Service (QoS).
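The user-defined steering rules mentioned above can be illustrated with a first-match rule table of the kind an ACL implements. This is a plain-Java sketch, not any vendor's ACL syntax; the prefixes and action names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of ACL-style traffic steering: rules are evaluated in order,
// and the first rule whose source prefix matches decides the action.
public class AccessControlList {
    // Insertion order matters, so use a LinkedHashMap: prefix -> action.
    private final Map<String, String> rules = new LinkedHashMap<>();

    public void addRule(String sourcePrefix, String action) {
        rules.put(sourcePrefix, action);
    }

    public String evaluate(String sourceIp) {
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (sourceIp.startsWith(rule.getKey())) {
                return rule.getValue();
            }
        }
        return "deny"; // implicit deny when no rule matches, as on most real ACLs
    }
}
```

Real switches evaluate equivalent rule tables in hardware, which is what allows the wire-speed filtering and QoS flow identification described above.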
1. Keep in mind the physical constraints imposed by the actual target hardware. Examples of physical constraints
FIGURE 2-3 Inter-Tier Traffic Flows: Clients, Switching Services, Web Services, Directory Services, Application Services, and Database Services (flow numbers 1 through 10 correspond to TABLE 2-1)
The Item column in TABLE 2-1 corresponds with the numbers in FIGURE 2-3.
TABLE 2-1 Inter-Tier Traffic Flows

1. Client (HTTP): Client initiates Web request.
2. Switch (HTTP): Switch redirects the client request to a particular Web server based on L2-L7 and SLB configuration.
3. Web service requests directory service.
4. Directory service resolves the request.
5. Servlet obtains a handle to an EJB component and invokes a method on the remote object. The Web server talks to the iAS through a Web connector, which uses NSAPI, ISAPI, or optimized CGI.
6. Entity Bean requests retrieval or update of a row in a database table.
7. Entity Bean request completed.
8. Application server returns dynamic content to the Web server.
9. Switch receives the reply from the Web server.
10. Switch rewrites the IP header and returns the HTTP response to the client.
FIGURE 2-4 provides an overview of a high-level model of the Presentation, Web, Application, and Data-tier components and interfacing elements. The following describes the sequence of interaction between the client and the multi-tier architecture:
1. The client initiates a Web request; the HTTP request reaches the Web server.
2. The Web server processes the client request and passes it to the back-end application server, which contains another Web server with a servlet engine.
3. The servlet processes a portion of the request and requests a supporting service from an Enterprise JavaBeans (EJB) component running on an application server containing an EJB container.
4. The EJB component retrieves data from the database.

The Sun ONE Application Server comes with a bundled Web server container. However, reasons for deploying a separate Web tier include security, load distribution, and functional distribution. There are two availability strategies, depending on the type of operations executed between the client and Web server processes:
- Stateless and idempotent: If the transactions are idempotent (repeating a transaction produces the same result, so requests do not depend on one another), then the availability strategy at the Web tier is trivial. Both availability and scalability are achieved by replication: Web servers are added behind a load-balancer switch. This class of transactions includes static Web pages and simple servlets that perform a single computation.
- Stateful: If the transactions between the client and server require that state be maintained between individual client HTTP requests and server HTTP responses, then the availability problem is more complicated; it is discussed in this section. Examples of this class of applications include shopping carts, banking transactions, and the like.
The Sun ONE Web servers provide various services, including SSL, a Web container that serves static content, JSP software, and a servlet engine. Availability strategies include a front-end multilayer switch with load balancing capabilities and the ability to switch based on SSL session IDs and cookies. If the Web servers are only serving static pages, the load balancer provides sufficient availability: if any Web server fails, subsequent client requests are forwarded to the remaining surviving servers.

However, if the Web servers are running JSP software or servlets that require session persistence, the availability strategy is more complex. Implementing session failover capabilities can be accomplished by coding, Web container support, or a combination of both. There are several complications, including the fact that even if the transparent session failover problem is solved for failures that occur at the beginning of transactions, idempotent transactions still pose a problem for transactions that have started and then failed, because the client is unaware of the server state.

A programmatic session failover solution can involve leveraging the javax.servlet.http.HttpSession object, storing and retrieving user session state to or from an LDAP directory or database, using cookies in the client's HTTP request. Some Web containers provide the ability to cluster HttpSession objects using elaborate schemes, but these still have flaws, such as failures in the middle of a transaction. These clustering schemes involve memory-based or database-based session persistence and a replicated HttpSession object on a backup server; if the primary server fails, the replica takes over.

The Sun ONE Web server availability strategy for HttpSession persistence offers extending the IWSSessionManager, which in multiprocess mode can share session information across multiple processes running on multiple Web servers.
This means that a client request has an associated session ID, which identifies the specific client. This information can be saved and subsequently retrieved either in a file that resides on a Network File System (NFS) mounted directory or in a database; the IWSSessionManager creates an IWSHttpSession object for each client session. The IWSSessionManager requires some coding effort to support distributed sessions, so that if the primary server that maintained a particular session fails, the standby server running another IWSSessionManager can retrieve the persistent session information from the persistent store based on the session ID. Logic is also required to ensure that the load balancer redirects the client's HTTP request to the backup Web server based on additional cookie information.
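The shared-store idea behind this strategy can be sketched as follows, with a map standing in for the NFS file or database and plain strings standing in for real session state. This is an illustration of the pattern only, not the Sun ONE IWSSessionManager API; the class and key names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of session failover through a shared persistent store: each
// Web server writes session state keyed by the session ID carried in
// the client's cookie, so a standby server that has never seen the
// client can still resume the session after a failure.
public class SessionStore {
    // Stand-in for the shared NFS file or database table.
    private static final Map<String, Map<String, String>> store = new HashMap<>();

    // Primary server persists the session state after each request.
    public static void save(String sessionId, Map<String, String> state) {
        store.put(sessionId, new HashMap<>(state));
    }

    // Standby server recovers the state by session ID on failover.
    public static Map<String, String> recover(String sessionId) {
        return store.getOrDefault(sessionId, new HashMap<>());
    }
}
```

The load balancer's cookie-based redirection supplies the missing half of the mechanism: it ensures the failover request, with its session ID, reaches a server that can perform the recovery.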
Chapter 2   Network Traffic Patterns: Application Layer
Currently there is no support for SSL session failover in Sun ONE Web Server 6.0. HttpSession failover can be implemented by extending IWSSessionManager using a shared NFS file or database session persistence strategies, providing user control and flexibility.
- HttpSession: This is the client session object that the Web container creates and manages for each client HTTP request. Session failover mechanisms were described in the previous section.
- Stateless session bean: This type of EJB component does not require any session failover services. If a client request requires logic to be executed in a stateless session bean, and the server where that bean is deployed fails, an alternative server can redo the operation correctly without any knowledge of the failed bean. Failure detection by the client plug-in or application logic must detect when the operation has failed and reinitiate the same operation on a secondary server with the appropriately deployed EJB component.
- Stateful session bean: This type of EJB component requires sophisticated mechanisms to maintain state between the primary and the backup, in addition to the failover mechanisms required in the stateless session bean case.
- Entity bean: There are two types of entity beans: Container-Managed Persistence (CMP) and Bean-Managed Persistence (BMP). These differ in whether the container or the user code is responsible for ensuring persistence. In either case, session failover mechanisms beyond those already provided in the EJB 2.0 specification are not required, because entity beans represent a row in a database and the notion of session is replaced by transaction. Clients usually access entity beans at the start of transactions. If a failure occurs, the entire transaction is rolled back; an alternative server can redo the transaction, resulting in correct operation.
The degree of transparency of the failover requires some consideration. In some cases, the client is completely unaware that a failure occurred and an automatic failover action took place. In other situations, the client times out and must reinitiate a transaction.
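The retry behavior described here can be sketched as a client-side wrapper: try the primary, and on failure reinitiate the same operation on a secondary. As the text notes, this is only safe when the operation can be redone without side effects; the method and class names are hypothetical.

```java
import java.util.concurrent.Callable;

// Sketch of client-side failover: invoke the operation on the primary
// server, and on failure detection redo the same operation on a
// secondary. This is only correct for operations that can safely be
// re-executed, since the client cannot know how far the failed attempt
// progressed.
public class FailoverClient {
    public static <T> T invoke(Callable<T> primary, Callable<T> secondary)
            throws Exception {
        try {
            return primary.call();
        } catch (Exception failure) {
            // Failure detected: reinitiate the same operation elsewhere.
            return secondary.call();
        }
    }
}
```

When the wrapper succeeds on the secondary, the failover is transparent to the caller; when both attempts fail, the caller sees the timeout case described above and must reinitiate the transaction itself.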
FIGURE 2-5 EJB Availability Mechanisms: Web Servers WS1 and WS2 (static HTML, JSP/servlet, Web plug-in) and Application Server Instances (AS1) with EJB Containers Holding Home and Remote Object Stubs, EJB Objects, Primary and Replicated EJB Class Instances, and Namespaces with JNDI Local Trees, JNDI Global Trees, and Modified JNDI
Scenario 1
1. A client makes an HTTP request, which may contain some cookie state information to preserve state between that individual's HTTP requests to a particular server.
2. The load balancer switch ensures that the client's request is forwarded to the appropriate server.
3. The JSP software or servlet retrieves a handle to a remote EJB object residing in the application server instance.
4. The client must first find the home object using a naming service such as the Java Naming and Directory Interface (JNDI). The returned object is cast to the home interface type.
5. The client uses this home interface reference to create instances.
6. The client continues to create instances.
7. The application server provides replication services. When an EJB object is updated on the active application server instance, the standby server updates the corresponding backup EJB object's state information. These replication services are provided by the application server system services.
Scenario 2
8. A JNDI tree cluster manages replication of the EJB state updates and keeps track of the primary and replicated objects. This scenario occurs when vendor implementations use a modified JNDI as a clustering mechanism. In the standard JNDI implementation, multiple objects cannot bind to a single name, but with added logic, each member of a cluster can have a local and a shared global JNDI tree. If the primary object fails, the JNDI lookup returns the backup object bound to that name. If a failure occurs after a client has performed a JNDI lookup, the client hangs or times out and tries again. The subsequent request is directed to a secondary server, which has the correct state of the failed node for a particular entity.
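The modified-JNDI clustering idea can be sketched as a registry that binds a primary and a backup object to one name and returns the backup once the primary is known to have failed. The map-based tree and the explicit failure-marking method are hypothetical simplifications of the vendor mechanisms described above, not a real JNDI provider.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of "modified JNDI" clustering: the global tree binds a primary
// and a backup object to a single name, and lookups after a primary
// failure return the backup. (Standard JNDI binds one object per name;
// the pair binding is the added cluster logic.)
public class ClusteredNaming {
    private final Map<String, Object[]> tree = new HashMap<>(); // name -> {primary, backup}
    private final Map<String, Boolean> primaryUp = new HashMap<>();

    public void bind(String name, Object primary, Object backup) {
        tree.put(name, new Object[] { primary, backup });
        primaryUp.put(name, true);
    }

    // In a real cluster this would be driven by health monitoring.
    public void markPrimaryFailed(String name) {
        primaryUp.put(name, false);
    }

    public Object lookup(String name) {
        Object[] pair = tree.get(name);
        return primaryUp.get(name) ? pair[0] : pair[1];
    }
}
```

A client that retries after a timeout therefore transparently receives the backup object on its second lookup, which is the behavior described in Scenario 2.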
Scenario 3
9. This scenario simply forwards the HTTP request to the application server using a plug-in. The HTTP request is received by the application server's HTTP server and recursively arrives at point 2 in FIGURE 2-5. Another mechanism includes adding a replica-aware or cluster-aware stub to the EJB objects, along with system services including a cluster module that runs on the application server and is loaded through the deployment descriptor, if specified. The cluster module might
consist of various subsystems that provide data synchronization services, keep the state of the backup EJB object synchronized with the primary, manage cluster failovers, and monitor the health of the application server instances. If the primary application server instance fails, the cluster failover manager can redirect client-side EJB method invocations to the backup node.

Another approach involves the primary and secondary cluster nodes inserting and altering a cookie on the client's HTTP request, which applies in the case where the Web server and application server reside on the same server. If the primary node of the cluster fails, the load-balancing switch must be configured to redirect the request to the backup node of the cluster. The backup node must inspect the client's cookie and retrieve the state information.

However, most of these solutions suffer one drawback: idempotent transactions are not handled transparently or properly in the event that a failure occurs after a method invocation has commenced. At the time of this writing, the Sun ONE Application Server 7 Enterprise Edition is expected to provide a highly available and scalable EJB clustering solution that allows enterprise customers to create solutions with minimal downtime.
Architecture Examples
This section describes three architecture designs. Deciding which architecture to choose can be reduced to identifying the following design objectives:
- Application partitioning: The application itself might make better use of resources by segregating or collapsing the Web tier and the Application tier. If an application makes heavy use of static Web pages, JSP software, or servlet code, and minimal use of EJB components, it might make sense to horizontally scale the Web tier and have only one or two small application servers. Similarly, at the other end of the spectrum, it might make sense to deploy all the servlet and EJB WAR, JAR, and EAR files on the same application server if there is a lot of servlet-to-EJB communication.
- Security level: Separating the Web tier and Application Server tier with a firewall creates a more secure solution. The potential drawbacks include hardware and software costs, increased communication latencies between servlets and EJB components, and increased manageability costs.
- Performance: In some cases, customers are willing to forgo tight security advantages for increased performance. For example, the firewall between the Web tier and the Application Server tier might be considered overkill because the ingress traffic is already firewalled in front of the Web tier.
- Scalability: Applications can be partitioned and deployed in two ways:
  - Horizontally scaled, where many small separate Web systems are utilized
  - Vertically scaled, where a few monolithic systems support many instances of Web servers
- Manageability: In general, the fewer the servers, the lower the total cost of operation (TCO).
- Designing for Vertical Scalability and Performance on page 31 describes a vertically scaled design where the primary objectives are security and vertical scalability.
- Designing for Security and Vertical Scalability on page 32 describes a tightly coupled design where the primary objectives are performance between the Web tier and Application tier and vertical scalability.
- Designing for Security and Horizontal Scalability on page 33 describes a highly distributed solution with the primary design objectives of horizontal scalability and security.

It is the application characteristics that directly influence the network architecture.
FIGURE 2-6 Decoupled Architecture: Ingress Network, Web/Presentation Tier with Servlet Containers, and Separate Application Server Tier
The architecture example shown in FIGURE 2-6 provides enhanced security. The Web server can be configured as a reverse proxy by receiving an HTTP request on the ingress network side from a client, then opening another socket connection on the appserver side to send an HTTP request to the Web server running inside the Sun ONE Application Server instance. Alternatively, the Web server instance could instantiate EJB components after performing a lookup on the home interface of a particular EJB component. One advantage of this decoupled architecture is independent scaling. If it turns out that the Web server servlets need to scale horizontally, they can do so independently of the application server logic. Similarly, if the EJB architectures logic needs to scale or be modified, it can do so
independently of the Web tier. Potential disadvantages include increased latency between the Web tier and Application Server tier communications and increased maintenance.
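The reverse-proxy behavior described for this architecture can be sketched with the JDK's built-in com.sun.net.httpserver classes: the front-end server accepts the client's request, opens a second connection to a back-end server standing in for the application server, and relays the response. The ports, paths, and payloads are hypothetical, and real deployments would use the Web server's proxy facilities rather than hand-rolled code.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

// Minimal sketch of the reverse-proxy pattern: accept a client request
// on one socket, open a second socket to the back-end, relay the reply.
public class ReverseProxySketch {
    // Stand-in for the application server's bundled HTTP server.
    public static HttpServer startBackend() throws Exception {
        HttpServer backend = HttpServer.create(new InetSocketAddress(0), 0);
        backend.createContext("/", ex -> {
            byte[] body = "dynamic content".getBytes();
            ex.sendResponseHeaders(200, body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        backend.start();
        return backend;
    }

    // Front-end Web server acting as the reverse proxy.
    public static HttpServer startProxy(int backendPort) throws Exception {
        HttpServer proxy = HttpServer.create(new InetSocketAddress(0), 0);
        proxy.createContext("/", ex -> {
            // Second connection: forward the request to the back-end.
            HttpURLConnection conn = (HttpURLConnection) new URL(
                "http://127.0.0.1:" + backendPort + ex.getRequestURI())
                .openConnection();
            byte[] body = readAll(conn.getInputStream());
            ex.sendResponseHeaders(conn.getResponseCode(), body.length);
            try (OutputStream os = ex.getResponseBody()) { os.write(body); }
        });
        proxy.start();
        return proxy;
    }

    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int b;
        while ((b = in.read()) != -1) out.write(b);
        return out.toByteArray();
    }
}
```

The client only ever talks to the proxy's address, which is what lets the Web tier and the application server logic scale or change independently, as described above.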
FIGURE 2-7 Collapsed Architecture: Ingress Network, Firewall, Combined Web/Presentation Tier, Firewall, and Enterprise Information Tier
The example shown in FIGURE 2-7 represents a collapsed architecture that takes advantage of the Web server already included in the Sun ONE Application Server instance process. This architecture is suitable for applications that have relatively intensive servlet-to-EJB communications and less stringent security requirements.
From an availability standpoint, fewer horizontal servers result in lower availability. A potential advantage of this architecture is lower maintenance cost because there are fewer servers to manage and configure.
FIGURE 2-8 Horizontally Scaled Design: Ingress Network, Web/Presentation Tier (Servlet Container), and Enterprise Information Tier Separated by Firewalls
The architecture shown in FIGURE 2-8 is a more horizontally scaled variant of the architecture shown in FIGURE 2-6. The result is increased availability: more server failures can be tolerated without bringing down services in this configuration.
Example Solution
This section describes an example of a tested and implemented data center, multitier network architecture, shown in FIGURE 2-9. The network design is composed of segregated networks, implemented physically using VLANs configured by the network switches. This internal network used the 10.0.0.0 private IP address space for its security and portability advantages. This design is an implementation of the design described in Designing for Security and Horizontal Scalability on page 33. It includes availability design principles, which will be discussed further in Chapter 6. The management network allows centralized data collection and management of all devices. Each device has a separate interface to the management network to avoid contaminating the production network performance measurements. The management network is also used for JumpStart installation and terminal server access. Although several networks physically reside on a single active core switch, network traffic is segregated and secured using static routes, access control lists (ACLs), and VLANs. From a practical perspective, this can be as secure as separate individual switches, depending on the switch manufacturer's implementation of VLANs.
FIGURE 2-9 Example Multitier Data Center Network: active core switch (192.168.10.1) and standby core switch (192.168.10.3) front server load-balancer switches and segregated internal networks (10.10.0.1, 10.20.0.1, 10.30.0.1, 10.40.0.1, 10.50.0.1); the Web tier comprises four Sun Fire 280R servers (10.10.0.100 through 10.10.0.103), the directory service tier comprises four Sun Fire 280R servers (10.20.0.100 through 10.20.0.103), backed by a T3 storage array
CHAPTER 3

Tuning TCP: Transport Layer
TCP Tuning Domains on page 38 provides an overview of TCP from a tuning perspective, describing the various components that contain tunable parameters and showing where they fit together at a high level, thus conveying the complexities of tuning TCP. TCP State Model on page 48 proposes a model of TCP that illustrates the behavior of TCP and the impact of tunable parameters. The system model then projects a network traffic diagram baseline case showing an ideal scenario. TCP Congestion Control and Flow Control: Sliding Windows on page 53 shows various conditions to help explain how and why TCP tuning is needed and which TCP tunable parameters are the most effective in compensating for adverse conditions. TCP and RDMA: Future Data Center Transport Protocols on page 62 describes TCP and RDMA, promising future networking protocols that may overcome the limitations of TCP.
FIGURE 3-1 Overview of TCP Tuning Domains: STREAMS framework, congestion control, and timers
FIGURE 3-1 shows a high-level view of the different components that impact TCP processing and performance. While the components are interrelated, each has its own function and optimization strategy.
The STREAMS framework looks at raw bytes flowing up and down the streams modules. It has no notion of TCP, congestion in the network, or the client load; it only looks at how congested the STREAMS queues are, and it has its own flow control mechanisms. TCP-specific flow control mechanisms are not directly tunable, but they are computed based on algorithms that are tunable. Flow control mechanisms and congestion control mechanisms are functionally different: one is concerned with the endpoints, and the other is concerned with the network. Both impact how TCP data is transmitted. Another class of tunable parameters controls scalability. TCP requires certain static data structures that are backed by non-swappable kernel memory. Avoid the following two scenarios:
- Allocating large amounts of memory. If the actual number of simultaneous connections is fewer than anticipated, memory that could have been used by other applications is wasted.
- Allocating insufficient memory. If the actual number of connections exceeds the anticipated TCP load, there will not be enough free TCP data structures to handle the peak load.
This class of tunable parameters directly impacts the number of simultaneous TCP connections a server can handle at peak load and thus controls scalability.
- Server: the focus of this chapter.
- Network: the endpoints can only infer the state of the network by measuring and computing various delays, such as round-trip times, timers, receipt of acknowledgments, and so on.
- Client: the remote client endpoint of the TCP connection.
FIGURE 3-2 Server and Client Modeled as Queues: the server application writes into the TCP send path, packets cross the network into the client receive buffer, and the client application reads them, with a feedback loop of ACKs and receive windows
This section requires basic background in queueing theory. For more information, refer to Queueing Systems, Volume 1, by Leonard Kleinrock, 1975, Wiley, New York. In FIGURE 3-2, we model each component as an M/M/1 queue. An M/M/1 queue is a simple queue where packets arrive at a certain rate, which we designate as λ, and are processed at a certain service rate, which we designate as μ. TCP is a full-duplex protocol; for the sake of simplicity, only one side of the duplex communication process is shown. Starting from the server side on the left in FIGURE 3-2, the server application writes a byte stream to a TCP socket. This is modeled as messages arriving at the M/M/1 queue at the rate λ. These messages are queued and processed by the TCP engine. The TCP engine implements the TCP protocol and consists of various timers, algorithms, retransmit queues, and so on, modeled as the server process μ, which is also controlled by the feedback loop as shown in FIGURE 3-2. The feedback loop represents acknowledgements (ACKs) from the client side and receive windows. The server process sends packets to the network, which is also modeled as an M/M/1 queue. The network can be congested, hence packets are queued up. This captures latency issues in the network, which are a result of propagation delays, bandwidth limitations, or congested routers. In FIGURE 3-2 the client side is also represented as an M/M/1 queue, which receives packets from the network and the client TCP stack, processes the packets as quickly as possible, forwards them to the client application process, and sends feedback information to the server. The feedback represents the ACK and receive window, which provide flow control capabilities to this system.
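The M/M/1 model above has a well-known closed form, which makes the tuning intuition concrete: as the arrival rate λ approaches the service rate μ, queueing delay grows without bound. The following is a minimal sketch (the function name and sample rates are illustrative, not from the text):

```python
def mm1_stats(lam, mu):
    """Closed-form steady-state metrics for an M/M/1 queue.

    lam: mean packet arrival rate (packets/sec); mu: mean service rate
    (packets/sec). Requires lam < mu, otherwise the queue is unstable.
    """
    if lam >= mu:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    rho = lam / mu                  # utilization of the server
    l_sys = rho / (1.0 - rho)      # mean number of packets in the system
    w_sys = 1.0 / (mu - lam)       # mean time a packet spends in the system (sec)
    return rho, l_sys, w_sys

# As lambda approaches mu, delay explodes; this is why tuning aims to inject
# traffic at close to, but not above, the rate the receiver can service.
for lam in (500.0, 900.0, 990.0):
    rho, l_sys, w_sys = mm1_stats(lam, 1000.0)
    print(f"rho={rho:.2f}  mean queue={l_sys:.1f}  mean delay={w_sys * 1000:.1f} ms")
```

Note how utilization 0.5 gives a 2 ms delay while utilization 0.99 gives 100 ms with the same service rate: the nonlinearity is the whole argument for not filling buffers.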
FIGURE 3-3 shows the flow of packets from the server to the client of an ideally tuned system. Send window-sized packets are sent one after another in a pipelined fashion, continuously, to the client receiver. Simultaneously, the client sends back ACKs and receive windows in unison with the server. This is the goal we are trying to achieve by tuning TCP parameters. Problems crop up when delays vary because of network congestion, asymmetric network capacities, dropped packets, or asymmetric server/client processing capacities; hence, tuning is required. To see the TCP default values for your version of Solaris, refer to the Solaris documentation at docs.sun.com.
FIGURE 3-3 Perfectly Tuned TCP/IP System Over a Short, Slow Link (Dial-Up/POTS): data packets D1 through D8 cross Router 1 and Router 2, and the client sends back ACKs A1 through A4 in synchronization with the sender
In a perfectly tuned TCP system spanning several network links of varying distances and bandwidths, the clients send back ACKs to the sender in perfect synchronization with the start of sending the next window. The objective of an optimal system is to maximize the throughput of the system. In the real world, asymmetric capacities require tuning on both the server and client side to achieve optimal throughput. For example, if the network latency is excessive, the amount of traffic injected into the network will be reduced to more closely maintain a flow that matches the capacity of the network. If the network is fast enough but the client is slow, the feedback loop will alert the sender TCP process to reduce the amount of traffic injected into the network. Later sections build on these concepts to describe how to tune for wireless, high-speed wide area networks (WANs), and other types of networks that vary in bandwidth and distance.
FIGURE 3-4 shows the impact of the links increasing in bandwidth; tuning is needed to take advantage of the faster links. The opposite case is shown in FIGURE 3-5, where the links are slower. Similarly, if the distances increase or decrease, the resulting changes in propagation delay require tuning for optimal performance.
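The quantity that ties bandwidth and distance together is the bandwidth-delay product: the number of bytes that must be in flight to keep the pipe full. A small sketch (the link figures below are illustrative examples, not values from the figures):

```python
def bdp_bytes(bandwidth_bits_per_sec, rtt_sec):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return bandwidth_bits_per_sec * rtt_sec / 8.0

# Hypothetical links: a dial-up line, a fast LAN, and a coast-to-coast optical WAN.
links = {
    "56 kbit/s dial-up, 100 ms RTT": bdp_bytes(56_000, 0.100),
    "100 Mbit/s LAN, 71 us RTT":     bdp_bytes(100_000_000, 0.000071),
    "1 Gbit/s WAN, 100 ms RTT":      bdp_bytes(1_000_000_000, 0.100),
}
for name, bdp in links.items():
    print(f"{name}: {bdp:,.0f} bytes in flight")
```

The WAN case needs on the order of 12.5 MB in flight, far more than the 65535-byte default TCP window, which is exactly why the window scaling discussed later in this chapter exists.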
FIGURE 3-4 Tuning Required to Compensate for Faster Links: data packets D1 through D8 cross Router 1 as D1' through D8'

FIGURE 3-5 Tuning Required to Compensate for Slower Links: data packets D1 through D3 cross Router 1 as D1' through D3'
FIGURE 3-6 Complete STREAMS-Based TCP Stack on Server and Client Nodes: the server application (socket(), bind(), listen(), accept(), read(), write(), close()) and client application (socket(), connect(), read(), write(), close()) sit above the socket library (libsocket, libnsl), which sits above the TCP, IP, and NIC driver STREAMS modules, each with put routines (rput, wput), service routines (rsrv, wsrv), and write queues (wq), connected across the network
To gain a better understanding of TCP protocol processing, we will describe how a packet is sent up and down a typical STREAMS-based TCP implementation. Consider the server application on the left side of FIGURE 3-6 as a starting point. The following describes how data is moved from the server to the client on the right.
1. The server application opens a socket. (This triggers the operating system to set up the STREAMS stack, as shown.) The server then binds to a transport layer port, executes listen, and waits for a client to connect. Once the client connects, the server completes the TCP three-way handshake, establishes the socket, and both server and client can communicate.
2. The server sends a message by filling a buffer, then writing to the socket.
3. The message is broken up into packets and sent down from the stream head, down the write side of each STREAMS module, by invoking the wput routine. If the module is congested, the packets are placed on the service routine for deferred processing. Each network module prepends the packet with an appropriate header.
4. Once the packet reaches the NIC, the packet is copied from system memory to the NIC memory, transmitted out of the physical interface, and sent into the network.
5. The client reads the packet into the NIC memory, and an interrupt is generated that copies the packet into system memory and sends it up the protocol stack, as shown on the right in the Client Node.
6. The STREAMS modules read the corresponding header to determine the processing instructions and where to forward the packet. Headers are stripped off as the packet is moved upwards on the read side of each module.
7. The client application reads in the message as the packet is processed and translated into a message, filling the client read buffer.
The Solaris operating system (Solaris OS) offers many tunable parameters in the TCP, User Datagram Protocol (UDP), and IP STREAMS module implementations of these protocols. It is important to understand the goals you want to achieve so that you can tune accordingly. In the following sections, we provide a high-level model of the various protocols and describe deployment scenarios to better understand which parameters are important to tune and how to go about tuning them. We start with TCP, which is by far the most complicated module to tune and has the greatest impact on performance. We then describe how to modify these tunable parameters for different types of deployments. Finally, we describe IP and UDP tuning.
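The socket lifecycle in step 1 can be sketched with the portable BSD socket API; this is an illustrative loopback echo in Python, not Solaris-specific code (the thread/event plumbing is just scaffolding for a self-contained example):

```python
import socket
import threading

# Server side: socket -> bind -> listen -> accept -> read -> write, as in step 1.
def server(ready, port_holder):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))           # bind to an ephemeral transport-layer port
    srv.listen(5)                         # listen backlog: pending-connection queue depth
    port_holder.append(srv.getsockname()[1])
    ready.set()                           # tell the client which port to use
    conn, _ = srv.accept()                # completes the TCP three-way handshake
    data = conn.recv(1024)                # read the client's message
    conn.sendall(b"echo:" + data)
    conn.close()
    srv.close()

ready, port_holder = threading.Event(), []
t = threading.Thread(target=server, args=(ready, port_holder))
t.start()
ready.wait()

# Client side: socket -> connect -> write -> read.
cli = socket.create_connection(("127.0.0.1", port_holder[0]))
cli.sendall(b"hello")
reply = cli.recv(1024)
cli.close()
t.join()
print(reply)
```

The `listen(5)` argument is the application-level override of the accept-queue depth mentioned later in this chapter.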
FIGURE 3-7 TCP STREAMS Module: the stream head read side is bounded by tcp_sth_rcv_lowat and tcp_sth_rcv_hiwat; the TCP module maintains state lookup hash tables (tcp_bind_hash, tcp_conn_hash, tcp_acceptor_hash, tcp_listen_hash), a queue of connections pending the three-way handshake bounded by tcp_conn_req_max_q0, and the queue for the server process listening on the socket
At the top is the streamhead, which has a separate queue for TCP traffic, where an application reads data. STREAMS flow control starts here. If the operating system is sending data up the stack to the application and the application cannot read it as fast as the sender is sending it, the stream read queue starts to fill. Once the number of packets in the queue exceeds the high-water mark, tcp_sth_rcv_hiwat, STREAMS-based flow control triggers and prevents the TCP module from sending any more packets up to the streamhead. There is some space available for critical control messages (M_PROTO, M_PCPROTO). The TCP module remains flow controlled as long as the number of packets is above tcp_sth_rcv_lowat. In other words, the streamhead queue must drain below the low-water mark to reactivate TCP to forward data messages destined for the application. Note that the write side of the streamhead does not require any high-water or low-water marks because it is injecting packets into the downstream, and TCP will flow control the streamhead write side by its own high-water and low-water marks, tcp_xmit_hiwat and tcp_xmit_lowat. Refer to the Solaris AnswerBook2 at docs.sun.com for the default values of your version of the Solaris OS. TCP has a set of hash tables. These tables are used to search for the associated TCP socket state information on each incoming TCP packet, to maintain the state engine for each socket, and to perform other TCP tasks needed to maintain that connection, such as updating sequence numbers, updating windows, computing round-trip time (RTT), managing timers, and so on. The TCP module has two queues for server processes. The first queue, shown on the left in FIGURE 3-7, is the set of packets belonging to sockets that have not yet established a connection: the server side has not yet received and processed a client-side ACK. If the client does not send an ACK within a certain window of time, the packet is dropped. This was designed to prevent synchronization (SYN) flood attacks, where a stream of unacknowledged client SYN requests overwhelms a server and prevents valid client connections from being processed. The second queue is the listen backlog queue, where the client has sent back the final ACK, completing the three-way handshake; the server socket for this client will move the connection from LISTEN to ACCEPT, but the server has not yet processed this packet. If the server is slow, this queue fills up. The server can override this queue size with the listen backlog parameter. TCP will flow control on IP on the read side with its parameters tcp_recv_lowat and tcp_recv_hiwat, similar to the streamhead read side.
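The two server-side queues can be pictured with a toy model. This is only a sketch of the bookkeeping described above (class and method names are invented for illustration; q0 stands in for the tcp_conn_req_max_q0-bounded queue, and accept_q for the listen backlog queue):

```python
from collections import deque

class ListenerModel:
    """Toy model of the two server-side TCP queues for a listening socket."""
    def __init__(self, max_q0, backlog):
        self.q0 = deque()                   # half-open: awaiting client's final ACK
        self.accept_q = deque()             # established: awaiting accept()
        self.max_q0, self.backlog = max_q0, backlog
        self.dropped = 0

    def on_syn(self, conn):
        if len(self.q0) >= self.max_q0:     # SYN-flood protection: drop, don't queue
            self.dropped += 1
        else:
            self.q0.append(conn)

    def on_final_ack(self, conn):
        if conn in self.q0:
            self.q0.remove(conn)
            if len(self.accept_q) < self.backlog:
                self.accept_q.append(conn)  # handshake complete, LISTEN -> ACCEPT
            else:
                self.dropped += 1           # slow server: accept queue overflow

    def accept(self):
        return self.accept_q.popleft() if self.accept_q else None

lm = ListenerModel(max_q0=2, backlog=1)
for c in ("c1", "c2", "c3"):
    lm.on_syn(c)                            # c3 is dropped: q0 is already full
lm.on_final_ack("c1")
lm.on_final_ack("c2")                       # accept queue full, so c2 is dropped
print(lm.dropped, lm.accept())
```

The model shows the two distinct failure modes: SYN-queue overflow (an attack or burst of new clients) and accept-queue overflow (an application too slow to call accept()).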
FIGURE 3-8 TCP State Engine, Server and Client Nodes: the server application walks through socket(), bind(), listen(), and accept() while the client application uses socket(), bind(), and connect(); both sides pass through the connection setup states (Socket Opened, Port Bound, Listen, SYN SENT, SYN RECVD, with RST on expiry of the abort interval), the established state, and the connection shutdown states (Fin_Wait_1, Fin_Wait_2, Close_Wait, Closing, Last_Ack, and Time_Wait with its 2MSL timeout) before the socket is closed
This figure shows the server and client socket API at the top and the TCP module with the following three main states:
Connection Setup
This includes the collection of substates that collectively set up the socket connection between the two peer nodes. In this phase, the set of tunable parameters includes:
- tcp_ip_abort_cinterval: the time a connection can remain in a half-open state during the initial three-way handshake, just prior to entering an established state. This is used on the client (active connect) side.
- tcp_ip_abort_linterval: the time a connection can remain in a half-open state during the initial three-way handshake, just prior to entering an established state. This is used on the server (passive listen) side.
Long abort intervals: the longer the abort interval, the longer the server will wait for the client to send information pertaining to the socket connection. This can result in increased kernel memory consumption and possibly kernel memory exhaustion, because each client socket connection requires state information using approximately 12 kilobytes of kernel memory. Remember that kernel memory is not swappable, and as the number of connections increases, so do the amount of consumed memory and the time delays for connection lookups. Hackers exploit this fact to initiate Denial of Service (DoS) attacks, where attacking clients constantly send only SYN packets to a server, eventually tying up all kernel memory and not allowing real clients to connect.
Short abort intervals: if the interval is too short, valid clients that have a slow connection or that go through slow proxies and firewalls could be aborted prematurely. This might help reduce the chances of DoS attacks, but slow clients might also be mistakenly terminated.
Connection Established
This includes the main data transfer state (the focus of our tuning explanations in this chapter). The tuning parameters for congestion control, latency, and flow control will be described in more detail. FIGURE 3-8 shows two concurrent processes that read and write to the bidirectional full-duplex socket connection.
Connection Shutdown
This includes the set of substates that work together to shut down the connection in an orderly fashion. We will see important tuning parameters related to memory. Tunable parameters include:
- tcp_time_wait_interval: how long a connection remains in the TIME_WAIT state after being closed. Stale packets from the old connection cannot be confused with a new connection until this time has expired. However, if this value is too short and there have been many routing changes, lingering packets in the network might be lost.
- tcp_fin_wait_2_flush_interval: how long this side will wait for the remote side to close its side of the connection and send a FIN packet to close the connection. There are cases where the remote side crashes and never sends a FIN. So, to free up resources, this value puts a limit on the time the remote side has to close the socket. This means that half-open sockets cannot remain open indefinitely.
Startup Phase
In Startup Phase tuning, we describe how the TCP sender starts to initially send data on a particular connection. One of the issues with a new connection is that there is no information about the capabilities of the network pipe. So we start by blindly injecting packets at a faster and faster rate until we understand the capabilities and adjust accordingly. Manual TCP tuning is required to change macro behavior, such as when we have very slow pipes as in wireless or very fast pipes such as 10 Gbit/sec. Sending an initial maximum burst has proven disastrous. It is better to slowly increase the rate at which traffic is injected based on how well the traffic is absorbed. This is similar to starting from a standstill on ice. If we initially floor the gas pedal, we will skid, and then it is hard to move at all. If, on the other hand, we start slowly and gradually increase speed, we can eventually reach a very fast speed. In networking, the key concept is that we do not want to fill buffers. We want to inject traffic as close as possible to the rate at which the network and target receiver can service the incoming traffic. During this phase, the congestion window is much smaller than the receive window. This means the sender controls the traffic injected into the receiver by computing the congestion window and capping the injected traffic amount by the size of the congestion window. Any minor bursts can be absorbed by queues. FIGURE 3-9 shows what happens during a typical TCP session starting from idle.
FIGURE 3-9 TCP Startup Phase: the congestion window ramps up over time, governed by tcp_slow_start_initial, tcp_slow_start_after_idle, and tcp_cwnd_max, and restarts after a timeout
The sender does not know the capacity of the network, so it starts by sending more and more packets into the network, trying to estimate the state of the network by measuring the arrival time of the ACKs and the computed RTT times. This results in a self-clocking effect. In FIGURE 3-9, we see the congestion window initially start at the minimum size of one maximum segment size (MSS), as negotiated in the three-way handshake during the socket connection phase. The congestion window is doubled every round trip in which the ACKs return within the timeout, capped by the TCP tunable variable tcp_cwnd_max, until a timeout occurs. At that point, the internal ssthresh variable is set to half the congestion window. ssthresh is the threshold at which the growth mode changes: upon a retransmit, the congestion window grows exponentially up to ssthresh, and after this point it grows additively, as shown in FIGURE 3-9. Once a timeout occurs, the packet is retransmitted and the cycle repeats.
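The two growth regimes can be sketched in a few lines; this is a simplification of the behavior in FIGURE 3-9 (per-RTT doubling below ssthresh, one MSS per RTT above it), with illustrative thresholds rather than Solaris defaults:

```python
def cwnd_growth(mss, ssthresh, cwnd_max, rtts):
    """Congestion-window size at the start of each RTT, in bytes:
    exponential below ssthresh (slow start), additive above it
    (congestion avoidance), capped at cwnd_max."""
    cwnd, history = mss, []
    for _ in range(rtts):
        history.append(cwnd)
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, cwnd_max)     # slow start: double each RTT
        else:
            cwnd = min(cwnd + mss, cwnd_max)   # congestion avoidance: +1 MSS per RTT
    return history

# 1460-byte MSS; ssthresh and the cap are hypothetical values for illustration.
print(cwnd_growth(mss=1460, ssthresh=16 * 1460, cwnd_max=64 * 1460, rtts=8))
```

Notice how few round trips slow start needs to reach ssthresh, and how slowly the window grows afterward: on a long-RTT link, every extra round trip spent ramping up is expensive, which motivates the startup tunables below.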
FIGURE 3-9 shows that there are three important TCP tunable parameters:
- tcp_slow_start_initial: sets the initial congestion window just after the socket connection is established.
- tcp_slow_start_after_idle: initializes the congestion window after a period of inactivity. Since there is now some knowledge about the capabilities of the network, we can take a shortcut to grow the congestion window rather than starting from the minimum, which would be unnecessarily conservative.
- tcp_cwnd_max: places a cap on the running maximum congestion window. If the receive window grows, then tcp_cwnd_max grows to the receive window size.
In different types of networks, you can tune these values slightly to impact the rate at which you can ramp up. If you have a small network pipe, you want to reduce the packet flow, whereas if you have a large pipe, you can fill it up faster and inject packets more aggressively.
- Propagation delay: primarily influenced by distance, this is the time it takes one packet to traverse the network. In WANs, tuning is required to keep the pipe as full as possible, increasing the number of allowable outstanding packets.
- Link speed: the bandwidth of the network pipe. Tuning guidelines for 56 kbit/sec dial-up connections differ from those for 10 Gbit/sec optical local area networks (LANs).
In short, tuning is adjusted according to the type of network and its key properties: propagation delay, link speed, and error rate. These properties actually self-adjust in some instances by measuring the return of acknowledgments. We will look at various emerging network technologies: optical WAN, LAN, wireless, and so on, and describe how to tune TCP accordingly.
Flow control is accomplished by the receiver sending back a window to the sender. The size of this window, called the receive window, tells the sender how much data to send. Often, when the client is saturated, it might not be able to send back a receive window to the sender to signal it to slow down transmission. However, the sliding windows protocol is designed to let the sender know, before reaching a meltdown, to start slowing down transmission by a steadily decreasing window size. At the same time these flow control windows are going back and forth, the speed at which ACKs come back from the receiver to the sender provides additional information to the sender that caps the amount of data to send to the client. This is computed indirectly. The amount of data that is to be sent to the remote peer on a specific connection is controlled by two concurrent mechanisms:
- Network congestion: the degree of network congestion is inferred from the calculation of changes in round-trip time (RTT), that is, the amount of delay attributed to the network. This is measured by computing how long it takes a packet to go from sender to receiver and back to the client. The figure is calculated using a running smoothing algorithm because of the large variances in time. The RTT value is an important input to the congestion window, which is used to control the amount of data sent out to the remote client. It tells the sender how much traffic should be sent to this particular connection based on network congestion.
- Client load: the rate at which the client can receive and process incoming traffic. The client sends back a receive window that tells the sender how much traffic should be sent to this connection based on client load.
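The "running smoothing algorithm" for RTT can be sketched with the classic Jacobson/Karels estimator; the constants below are the usual textbook values (alpha = 1/8, beta = 1/4), not Solaris internals, so treat this as an illustration of the idea rather than the Solaris implementation:

```python
def make_rtt_estimator(alpha=0.125, beta=0.25):
    """Smoothed RTT and mean-deviation tracker; returns an update function
    that takes an RTT sample and returns the resulting retransmission timeout."""
    state = {"srtt": None, "rttvar": None}
    def update(sample):
        if state["srtt"] is None:
            state["srtt"], state["rttvar"] = sample, sample / 2.0
        else:
            err = sample - state["srtt"]
            state["srtt"] += alpha * err                       # smooth the mean
            state["rttvar"] += beta * (abs(err) - state["rttvar"])  # smooth the deviation
        # RTO: smoothed RTT plus four mean deviations.
        return state["srtt"] + 4 * state["rttvar"]
    return update

update = make_rtt_estimator()
for sample_ms in (100, 104, 98, 250):     # a congestion spike at the end
    print(f"RTT sample {sample_ms} ms -> RTO {update(sample_ms):.1f} ms")
```

The smoothing explains why a single delayed ACK does not trigger retransmission: the deviation term widens the timeout gradually rather than reacting to one noisy sample.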
FIGURE 3-10 TCP Data Transfer, Sender and Receiver Timers: the sender's retransmission timers (the tcp_rexmit_interval family and tcp_ip_abort_interval, after which the connection is reset) govern how long to wait for an ACK of sent data, while the receiver's delayed-ACK controls (tcp_deferred_ack_max for non-directly-connected endpoints and tcp_local_dacks_max for local ones) bound how many MSS-sized segments may be received before an ACK is forced out
There are two mechanisms that senders and receivers use to control performance:
- Senders: timeouts waiting for ACKs. This class of tunable parameters controls various aspects of how long to wait for the receiver to send back an ACK of the data that was sent. If tuned too short, excessive retransmissions occur. If tuned too long, excess idle time is wasted before the sender realizes the packet was lost and retransmits.
- Receivers: timeouts and number of bytes received before sending an ACK to the sender. This class of tunable parameters allows the receiver to control the rate at which the sender sends data. The receiver does not want to send an ACK for every packet received, because then the sender would send many small packets, increasing the ratio of overhead to useful data and reducing the efficiency of the transmission. However, if the receiver waits too long, the added latency increases the burstiness of the communication. The receiver side can control ACKs with two overlapping mechanisms, based on timers and on the number of bytes received.
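The receiver's two overlapping delayed-ACK mechanisms can be sketched as "ACK when enough MSS-sized segments have accumulated, or when a delay timer fires, whichever comes first." This is a hypothetical simulation (function name, trigger count, and timer value are illustrative, not Solaris defaults):

```python
def ack_times(arrivals_ms, mss_count_trigger=2, timer_ms=50, mss=1460):
    """Return the times (ms) at which the receiver emits ACKs, given segment
    arrival times, a byte-count trigger, and a delayed-ACK timer."""
    acked_at, unacked_bytes, oldest_unacked = [], 0, None
    for t in arrivals_ms:
        # Timer mechanism: ACK if the oldest unacknowledged segment waited too long.
        if oldest_unacked is not None and t - oldest_unacked >= timer_ms:
            acked_at.append(oldest_unacked + timer_ms)
            unacked_bytes, oldest_unacked = 0, None
        unacked_bytes += mss
        if oldest_unacked is None:
            oldest_unacked = t
        # Byte-count mechanism: ACK once enough MSS-sized segments accumulate.
        if unacked_bytes >= mss_count_trigger * mss:
            acked_at.append(t)
            unacked_bytes, oldest_unacked = 0, None
    if oldest_unacked is not None:
        acked_at.append(oldest_unacked + timer_ms)   # final timer fire
    return acked_at

# Two quick segments are ACKed together; a lone trailing segment waits for the timer.
print(ack_times([0, 1, 200]))
```

The trade-off described above is visible here: a larger trigger count or timer means fewer ACKs (less overhead) but more latency before the sender learns its data arrived.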
FIGURE 3-11 Packet Timing on a Long, Slow Pipe: the sender sends fewer data packets because of higher error rates, and time slots are wasted until the first ACK returns; with few packets in flight, line delay incurs a huge cost per packet transmission, hence selective ACK is a major improvement
Consider the difference between a typical LAN (10 Mbit/sec over 100 meters, with an RTT of 71 microseconds), which is what TCP was originally designed for, and an optical WAN spanning New York to San Francisco at 1 Gbit/sec with an RTT of 100 milliseconds. The bandwidth-delay product represents the number of packets actually in the network and implies the amount of buffering the network must provide. It also gives some insight into the minimum window size, which we discussed earlier. The fact that the optical WAN has a very large bandwidth-delay product compared to a normal network requires tuning as follows:
- The window size must be much larger. The basic window size field allows for 2^16 bytes. To achieve larger windows, RFC 1323 was introduced to allow the window size to scale to larger sizes while maintaining backwards compatibility. This is negotiated during the initial socket connection: during the SYN/SYN-ACK three-way handshake, window scaling capabilities are exchanged by both sides, which try to agree on the largest common capability. The scaling parameter is an exponent of base 2, and the maximum scaling factor is 14, allowing a maximum window size of 2^30 bytes. The window scale value is used to shift the window size field value up to a maximum of 1 gigabyte. Like the MSS option, the window scale option should appear only in SYN and SYN-ACK packets during the initial three-way handshake. Tunable parameters include:
- tcp_wscale_always: controls who should ask for scaling. If set to zero, the remote side needs to request scaling; otherwise, the receiver requests it.
- tcp_tstamp_if_wscale: controls adding timestamps to the window scale option. This parameter is defined in RFC 1323 and is used to track the round-trip delivery time for data in order to detect variations in latency, which impact timeout values. Both ends of the connection must support this option.
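The scale factor the two sides need to agree on follows directly from the bandwidth-delay product: the advertised window (65535 bytes shifted left by the scale) must cover the bytes in flight. A sketch of the arithmetic (the helper name is illustrative):

```python
import math

def window_scale_for(bdp_bytes, base_window=65535):
    """Smallest RFC 1323 window-scale shift whose scaled window covers the
    bandwidth-delay product; the shift is capped at 14 (2**30-byte window)."""
    if bdp_bytes <= base_window:
        return 0
    return min(14, math.ceil(math.log2(bdp_bytes / base_window)))

# 1 Gbit/s coast-to-coast WAN with 100 ms RTT: BDP = 12.5 MB.
bdp = 1_000_000_000 * 0.100 / 8
scale = window_scale_for(bdp)
print(scale, 65535 << scale)
```

For the optical WAN above this yields a shift of 8, advertising windows up to roughly 16 MB, comfortably above the 12.5 MB in flight; a LAN with a tiny bandwidth-delay product needs no scaling at all.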
During the slow start and retransmissions, the minimum initial window size, which can be as small as one MSS, is too conservative. The send window size grows exponentially, but starting at the minimum is too small for such a large pipe. Tuning in this case requires that the following tunable parameters be adjusted to increase the minimum start window size:
- tcp_slow_start_initial: controls the starting window just after the connection is established.
- tcp_slow_start_after_idle: controls the starting window after a lengthy period of inactivity on the sender side.
Both of these parameters must be manually increased according to the actual WAN characteristics. Delayed ACKs on the receiver side should also be minimized, because delayed ACKs slow the growth of the window size when the sender is trying to ramp up. Because RTTs are long, RTT measurements are updated less frequently, so interim additional RTT values should be computed. The tcp_rtt_updates tunable parameter is somewhat related: the TCP implementation knows when enough RTT values have been sampled, and the RTT estimate is then cached. tcp_rtt_updates is nonzero by default; a value of 0 forces the estimate never to be cached, which is treated the same as not yet having enough samples for an accurate RTT estimate for this particular connection.
- tcp_recv_hiwat and tcp_xmit_hiwat: control the size of the STREAMS queues before STREAMS-based flow control is activated. With more packets in flight, the size of the queues must be increased to handle the larger number of outstanding packets in the system.
FIGURE 3-13 Comparison Between Normal LAN and WAN Packet Traffic, Long Low-Bandwidth Pipe: on the normal link, data is sent and ACKs are received in synchronized timings; on the long, slow pipe, ACKs are delayed by the slow link, the sender sends fewer data packets because of higher error rates, and time slots are wasted until the first ACK returns, hence selective ACK is a major improvement
Another problem introduced on these slow links is that ACKs play a major role. If ACKs are not received by the sender in a timely manner, the growth of windows is impacted. During initial slow start, and even slow start after an idle period, the send window needs to grow exponentially, adjusting to the link speed as quickly as possible for coarse tuning; it then grows linearly after reaching ssthresh, for finer-grained tuning. However, if an ACK is lost, which has a higher probability in these types of links, then throughput is again degraded. Tuning TCP for slow links includes the following parameters:
- tcp_sack_permitted: activates and controls how selective ACK (SACK) is negotiated during the initial three-way handshake:
  0 = SACK disabled.
  1 = TCP will not initiate a connection with SACK information, but if an incoming connection has the SACK-permitted option, TCP will respond with SACK information.
  2 = TCP will both initiate and accept connections with SACK information.
TCP SACK is specified in RFC 2018, TCP Selective Acknowledgment Options. With SACK, TCP need not retransmit the entire send buffer, only the missing bytes. Because of the higher cost of retransmission, it is far more efficient to re-send only the missing bytes to the receiver. Like optical WANs, satellite links also require the window scale option to increase the number of packets in flight to achieve higher overall throughput. However, satellite links are more susceptible to bit errors, so too large a window is not a good idea, because one bad byte will force a retransmission of one enormous window. TCP SACK is particularly useful in satellite transmissions to avoid this problem, because it allows the sender to re-send only the missing packets rather than the entire window that contained the one bad byte.
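The saving from SACK is easy to see in a toy comparison: without selective acknowledgment the sender must re-send from the first hole forward, while with SACK it re-sends only the holes. Segment numbers here are illustrative, and the go-back style shown for the non-SACK case is a simplification:

```python
def retransmit_segments(window, acked, sack_enabled):
    """Which in-flight segments must be re-sent after a loss."""
    lost = [seg for seg in window if seg not in acked]
    if sack_enabled:
        return lost                                         # re-send only the holes
    first_hole = min(lost)
    return [seg for seg in window if seg >= first_hole]     # go-back from first hole

window = list(range(1, 11))                 # ten segments in flight
acked = {1, 2, 3, 5, 6, 7, 8, 9, 10}        # segment 4 was corrupted in transit
print(retransmit_segments(window, acked, sack_enabled=False))
print(retransmit_segments(window, acked, sack_enabled=True))
```

One corrupted segment out of ten costs seven retransmissions without SACK and one with it; on a satellite link where each retransmission pays the full propagation delay, that difference dominates throughput.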
- tcp_dupack_fast_retransmit: controls the number of duplicate ACKs received before triggering the fast recovery algorithm. Instead of waiting for lengthy timeouts, fast recovery allows the sender to retransmit certain packets, depending on the number of duplicate ACKs received by the sender from the receiver. Duplicate ACKs are an indication that later packets have probably been received, but that the packet immediately after the acknowledged data might have been corrupted or lost.
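The duplicate-ACK counting behind this parameter can be sketched as follows. The class and the default threshold of 3 (a common default for tcp_dupack_fast_retransmit) are illustrative assumptions, not an implementation from the Solaris stack.

```python
class FastRetransmit:
    """Sketch of the duplicate-ACK counter that drives fast retransmit.
    `threshold` plays the role of tcp_dupack_fast_retransmit."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.last_ack = None
        self.dupacks = 0

    def on_ack(self, ack_no):
        """Return True when fast retransmit should fire for `ack_no`."""
        if ack_no == self.last_ack:
            self.dupacks += 1
            if self.dupacks == self.threshold:
                return True          # retransmit the segment starting at ack_no now
        else:
            self.last_ack = ack_no   # new data acknowledged; reset the counter
            self.dupacks = 0
        return False

fr = FastRetransmit()
acks = [1000, 2000, 2000, 2000, 2000]        # three duplicates of ACK 2000
fired = [fr.on_ack(a) for a in acks]
print(fired)  # [False, False, False, False, True]
```

On a long-delay link, raising or lowering this threshold trades off spurious retransmissions against how quickly a loss is repaired.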
Also adjust all timeout values to compensate for the long delays of satellite transmissions and possibly of longer-distance WANs.
Chapter 3
61
Interrupts Generated to the CPU

The CPU must be fast enough to service all incoming interrupts to prevent losing any packets. Multi-CPU machines can be used to scale. However, the PCI bus then introduces some limitations. It turns out that the real bottleneck is memory.

Memory Speed

An incoming packet must be written and read from the NIC to the operating system kernel address space to the user address space. You can reduce the number of memory-to-memory copies, approaching zero-copy TCP, by using workarounds such as page flipping, direct data placement, and scatter-gather I/O. However, as we approach 10-gigabit Ethernet interfaces, memory speed continues to be a source of performance issues. The main problem is that over the last few years, memory densities have increased, but not memory speed. Dynamic random access memory (DRAM) is cheap but slow. Static random access memory (SRAM) is fast but expensive. New technologies such as reduced latency DRAM (RLDRAM) show promise, but these gains seem to be dwarfed by the increases in network speeds.
To address this concern, there have been some innovative approaches to increasing speed and reducing network protocol processing latencies in the areas of remote direct memory access (RDMA) and InfiniBand. New startup companies such as Topspin are developing high-speed server interconnect switches based on InfiniBand, along with network cards whose drivers and libraries support RDMA, the Direct Access Programming Library (DAPL), and the Sockets Direct Protocol (SDP). TCP was originally designed for systems where the networks were relatively slow compared to the CPU processing power. As networks grew at a faster rate than CPUs, TCP processing became a bottleneck. RDMA fixes some of the latency.
FIGURE 3-14: Comparison of the traditional TCP/IP stack and the new-generation InfiniBand/RDMA stack. In the TCP/IP stack, network traffic traverses the network interface card (PHY, MAC), the PCI bus, the PCI driver, IP, TCP, and the stream head up to the application, with copies through kernel memory and user memory. In the InfiniBand/RDMA stack, the InfiniBand HCA moves network traffic directly into memory.
FIGURE 3-14 shows the difference between the current network stack and the new-generation stack. The main bottleneck in the traditional TCP stack is the number of memory copies. Memory access for DRAM is approximately 50 ns for setup and then 9 ns for each subsequent write or read cycle. This is orders of magnitude longer than the CPU processing cycle time, so we can neglect the TCP processing time. Saving one memory access on every 64 bits results in huge savings in message transfers. InfiniBand is well suited for data center local networking architectures, as both sides must support the same RDMA technology.
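A rough calculation with the figures quoted above suggests why even one extra copy per frame is untenable at 10-gigabit rates. The read-plus-write model of a copy and the rounding are simplifying assumptions for illustration.

```python
# Back-of-the-envelope cost of one extra memory copy of a full-size
# Ethernet frame, using the figures quoted above: ~50 ns setup, then
# ~9 ns per subsequent 64-bit (8-byte) access.
SETUP_NS, CYCLE_NS, FRAME_BYTES = 50, 9, 1500

words = FRAME_BYTES // 8                       # 187 full 64-bit words
read_ns = SETUP_NS + words * CYCLE_NS          # one burst read of the frame
copy_ns = 2 * read_ns                          # a copy is a read plus a write

frames_per_s = int(10e9 // (FRAME_BYTES * 8))  # full-size frames/s at 10 Gbit/s
busy_s = copy_ns * 1e-9 * frames_per_s         # memory time per second of traffic
print(copy_ns, round(busy_s, 2))               # 3466 2.89
```

Under these assumptions a single copy engine would need roughly 2.9 seconds of memory time per second of 10 Gbit/s traffic, which is why eliminating copies (zero-copy TCP, RDMA) matters far more than shaving TCP processing cycles.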
CHAPTER 4
- Server Load Balancing: a mechanism to distribute load across a group of servers that host identical applications and logically behave as one application
- Layer 7 Switching: packet forwarding decisions based on the packet payload
- Network Address Translation (NAT): rewriting packet source and destination addresses and ports to decouple the external public interface from the internal interfaces of servers, in particular their IP addresses and ports
- Quality of Service (QoS): providing differentiated services to packet flows
- Secure Sockets Layer (SSL): encrypting traffic at the application layer for HTTP-based traffic
This chapter first describes the internal architecture of a basic network switch and then describes more advanced features. It also provides a comprehensive discussion of server load balancing, from a detailed conceptual perspective to practical switch configuration details. Because of the stateless nature of HTTP, server load balancing (SLB) has proven ideal for scaling the Web tier. However, there are many different flavors of SLB, in terms of both fundamental algorithms and deployment strategies, and this chapter describes them in detail. This chapter also answers a question that crops up over and over and is rarely answered: How do we know which is the best SLB algorithm, and what is the proof? The chapter then briefly describes Layer 7 switching, NAT, and variants thereof. This is followed by a detailed look at QoS, showing where and how to use it and how it works. Finally, we look at SSL from a conceptual layer and describe configuring a commercially available SSL appliance.
was due to cut-through mode, which allowed the switch to immediately make a forwarding decision even before the entire packet was read into the memory of the switch. Traditional switches were of the store-and-forward type, which needed to read the entire packet before making a forwarding decision.
FIGURE 4-1 shows the internal architecture of a multi-layer switch, which integrates a significant number of functions. Most of the important repetitive tasks are implemented in ASIC components, in contrast to the early routers described previously, which performed forwarding tasks in software on a general-purpose computer CPU card. Here the CPU mostly runs control plane and background tasks and does very little data forwarding. Modern network switches separate the tasks that must be completed quickly from those that need not be performed in real time, organizing them into layers, or planes, as follows:
- Control Plane: the set of functions that controls how incoming packets should be processed and how the data path is managed. This includes the routing processes and protocols that populate the forwarding tables containing routing (Layer 3) and switching (Layer 2) entries. This is also commonly referred to as the slow path because timing is not as crucial as in the data path.
- Data Plane: the set of functions that operates on the data packets themselves, such as route lookup and rewriting of the destination MAC address. This is also commonly referred to as the fast path. Packets must be forwarded at wire speed, so packet processing has a much higher priority than control processing, and speed is of the essence.

The following section describes various common components and features of a modern network switch.
FIGURE 4-1: Internal architecture of a multi-layer switch. RISC processor(s) with memory run the control plane software (routing protocols, spanning tree BPDU processing, trunking/LACP, flow control, SNMP management software, CLI software, and flow data structures). Addressing tables, QoS queues, and packet/frame buffers with Tx and Rx descriptors support the data plane. Each port provides packet classification, VLAN lookup, FIB lookup, Rx and Tx FIFOs, a MAC, and a PHY transceiver, all linked through the packet scheduler and the switching fabric. The numbered components are described in the sections that follow.
The following numbered sections describe the main functional components of a typical network switch and correlate to the numbers in FIGURE 4-1.

1. PHY Transceiver

FIGURE 4-1 shows that as a packet enters a port, the physical layer (PHY) chip is in Receive Mode (Rx). The data stream arrives in some serialized, encoded format, from which a 4-bit nibble is built and sent to the MAC to build a complete Ethernet frame. The PHY chip implements critical functions such as collision detection (needed only in half-duplex mode), link monitoring to detect the twisted-pair Ethernet link-test pulses, and auto-negotiation to synchronize with the sender.
2. Media Access Control

The Media Access Control (MAC) ASIC takes the 4-bit nibble from the PHY and constructs a complete Ethernet frame. The MAC chip inserts a Start Frame Delimiter and Preamble when in Transmit Mode (Tx) and strips off these bytes when in Rx mode. The MAC implements the 802.3u/z functions, depending on the link speed, as well as functions such as collision backoff and flow control. The flow control feature prevents slower link queues from being overrun. This is an important feature: for example, when a 1 Gbit/sec link is transmitting to a slower 100 Mbit/sec link, only a finite amount of buffer or queue memory is available. By sending PAUSE frames, the switch slows the sender down, so fewer switch resources are needed to accommodate fast-sender, slow-receiver data transmissions. Once a frame is constructed, the MAC first checks whether the destination MAC address is in the range 01-80-C2-00-00-00 to 01-80-C2-00-00-0F. These are special reserved multicast addresses used for MAC functions such as link aggregation, spanning tree, or pause frames for flow control.

3. Flow Control (MAC Pause Frames)

When a flow control frame is received, a timer module is invoked to wait until a certain time elapses before sending out the subsequent frame. For example, a flow control frame is sent out when the queues are being overrun, so that the MAC is free to catch up and allow the switch to process the ingress frames that are queued up.

4. Spanning Tree

When Bridge Protocol Data Units (BPDUs) are received by the MAC, the spanning tree process parses the BPDU, determines the advertised information, and compares it with stored state. This allows the process to compute the spanning tree and control which ports to block or unblock.

5. Trunking

When a Link Aggregation Control Protocol (LACP) frame is received, a link aggregation sublayer parses the LACP frame, processes the information, and configures the collector and distributor functions.
The LACP frame contains information about the peer trunk device, such as aggregation capabilities and state information. This information is used to control the data packets across the trunked ports. The collector is an ingress module that aggregates frames across the ports of a trunk. The distributor spreads the frames out across the trunked ports on egress.

6. Receive FIFO

If the MAC frame is not a control frame, it is stored in the Receive Queue, or Rx FIFO (first in, first out), buffers that are referenced by Rx descriptors. These descriptors are simply pointers, so when packets are moved around for processing, small 16-bit pointers are moved instead of 1500-byte frames.
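The reserved-address check performed in step 2 can be sketched as a simple range test. This is illustrative Python, not switch firmware; the helper name is invented.

```python
def is_mac_control_multicast(mac):
    """True if `mac` falls in the reserved block 01-80-C2-00-00-00 through
    01-80-C2-00-00-0F, used for PAUSE frames, BPDUs, LACP, and so on."""
    octets = [int(b, 16) for b in mac.split("-")]
    return octets[:5] == [0x01, 0x80, 0xC2, 0x00, 0x00] and octets[5] <= 0x0F

print(is_mac_control_multicast("01-80-C2-00-00-01"))  # True: PAUSE frame address
print(is_mac_control_multicast("00-08-3E-04-4C-84"))  # False: ordinary address
```

Frames matching this test are consumed by the MAC-layer control functions (steps 3 through 5) rather than being forwarded as data.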
7. Flow Structures

The first thing that occurs after the Ethernet frame is completely constructed is that a flow structure is looked up. This flow structure has a pointer to an address table that immediately identifies the egress port so that the packet can be quickly stored, queued, and forwarded out the egress port. On the first packet of a flow, this flow data structure does not yet exist, so the lookup returns a failure. The CPU must be interrupted to create the flow structure and return it to the caller. The flow structure has enough information about where to store the packet in a region of memory used for storing entire packets. There are associated data structures called Tx or Rx descriptors; these are handles to the packet itself. As with the FIFO descriptors, the reason for these data structures is speed: instead of moving around large 1500-byte packets for queuing, only 32-bit pointers are moved around.

8. Packet Classification

A switch has many flow-based rules for firewalls, NAT, VPN, and so on. Packet classification performs a quick lookup for all the rules that apply to this packet. Many algorithms and implementations exist, but they basically inspect the IP header and try to find a match in the table that contains all the rules for this packet.

9. VLAN Lookup

The VLAN module needs to identify the VLAN membership of the frame by looking at the VLAN ID (VID) in the tag. If the frame is untagged, then depending on whether the VLAN is port based or MAC address based, the set of output ports needs to be looked up. This is usually implemented by vendors in ASICs due to wire-speed timing requirements.

10. Forwarding Information Base (FIB) Lookup

After a packet has passed through all the Layer 2 processing, the next step is to determine the egress ports to which the packet must be forwarded. The routing tables, which are populated in the control plane, determine the next hop. There are two approaches to implementing this function:
- Centralized: One central database contains all the forwarding entries.
- Distributed: Each port has a local database for quick lookups.
The distributed implementation is much faster. It is discussed further later in this chapter.

11. Routing Protocols

All routing packets are sent to the appropriate routing process, such as RIP, OSPF, or BGP, and this process populates the routing tables. This processing is performed in the control plane, or slow path. The routing tables are used to populate the Forwarding
Information Base (FIB), which can be in a central memory area or downloaded to each port's local memory, providing faster data path performance in the FIB lookup phase. The next step occurs when the packet is ready to be scheduled for transmission: the packet scheduler pulls the descriptor out of the appropriate QoS queue. Finally, the packet is sent out the egress port.

12. Switch Fabric Module (SFM)

Once the FIB lookup is completed, the packet scheduler must queue the packet onto the output queues. The output queues can be implemented as a set of multiple queues, each with a certain priority, to implement different classes of service. The SFM links the ingress processing to the egress processing. The SFM can be implemented using shared memory or crosspoint architectures. In a shared memory approach, packets are written to and read from a shared memory location, with an arbitrator module controlling access to the shared memory. In a crosspoint architecture, there is no storage of packets; instead, there is a direct connection from one port to another. Crosspoint further requires that the packet be broken into fixed-sized cells, and it usually offers very high bandwidth used only for backplanes. The bandwidth must be higher because of the extra overhead and padding required in the construction and destruction of fixed-sized cells. Both approaches suffer from head-of-line (HOL) blocking, but usually use some form of virtual output queue as a workaround to mitigate the effects. HOL blocking occurs when a large packet holds up smaller packets farther down the queue when being scheduled.

13. Packet Scheduler

The packet scheduler simply chooses packets that need to be moved from one set of queues to another based on some algorithm. The packet scheduler is usually implemented in an ASIC. Instead of moving entire frames, sometimes 1500 bytes, only 16-bit or 32-bit descriptors are moved.

14. Transmit FIFO

The transmit queue, or Tx FIFO, is the final store before the frame is sent out the egress port. The same functions are performed as those described on the ingress (Rx FIFO) side, but in the opposite direction.
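The step 7 fast-path/slow-path split can be sketched as a table keyed on the connection 5-tuple. The hash-based egress choice is a placeholder for the real FIB, VLAN, and classification work, and all names here are invented for illustration.

```python
# Hypothetical sketch of the flow lookup: a miss punts to the CPU
# ("slow path") to install the entry; every later packet of the same
# flow resolves with a single table hit ("fast path").
flow_table = {}   # 5-tuple -> egress port (stands in for the flow structure)

def classify_slow_path(five_tuple):
    """CPU work: consult FIB/VLAN/rule tables, then install the flow entry."""
    egress = hash(five_tuple) % 8          # placeholder forwarding decision
    flow_table[five_tuple] = egress
    return egress

def forward(five_tuple):
    egress = flow_table.get(five_tuple)    # fast path: one lookup
    if egress is None:
        egress = classify_slow_path(five_tuple)  # first packet of the flow
    return egress

pkt = ("192.191.3.89", 3201, "120.141.0.19", 80, "tcp")
p1 = forward(pkt)                  # miss: slow path installs the entry
p2 = forward(pkt)                  # hit: fast path
print(p1 == p2, len(flow_table))   # True 1
```

The design point mirrors the text: only the first packet of a flow pays the expensive classification cost; everything after it is a cheap pointer-sized lookup.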
services at the data center edge. Services such as SLB, Web caching, SSL accelerators, NAT, QoS, firewalls, and others are now common in every data center edge. These devices are either deployed adjacent to network switches or integrated as an added service inside the network switch. Often a multitude of vendors can potentially implement a particular set of functions. The following sections describe some of the key IP services you can use in the process of crafting high-quality network designs.
FIGURE 4-2: Server load balancing. The incoming load λ arrives at the SLB device and is spread across N servers, each receiving λ/N.
In FIGURE 4-2, the incoming load is λ. It is spread out evenly across N servers, each having a service capacity rate μ. How does the SLB device determine where to forward the client request? The answer depends on the algorithm. One of the challenges faced by network architects is choosing the right SLB algorithm from the plethora of SLB algorithms and techniques available. The following sections explore the more important SLB derivatives, as well as which technique is best for which problem.
Hash
The hash algorithm pulls certain key fields from the incoming client request packet, usually the source/destination IP addresses and TCP/UDP port numbers, and uses their values as an index into a table that maps to the target server and port. This is a highly efficient operation because the network processor can execute it in very few clock cycles, performing expensive read operations only for the index table lookup. However, the network architect needs to be careful about the following pitfalls:
- Megaproxy architectures, such as those used by some ISPs, remap a dial-in client's source IP address to that of the megaproxy, not the client's actual dynamically allocated IP address, which might not be routable. So be careful not to assume stickiness properties for the hash algorithm.
- Hashing bases its assumption of even load distribution on heuristics, which require careful monitoring. It is entirely possible that, due to the mathematics, the hash values will skew the load distribution, resulting in worse performance than round-robin.
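A minimal sketch of the hash approach follows; zlib.crc32 stands in for the cheap CRC a switch ASIC would compute, and the server table is invented for the example.

```python
import zlib

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]

def pick_server(src_ip, src_port, dst_ip, dst_port):
    """Hash the usual key fields into an index over the server table."""
    key = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode()
    return servers[zlib.crc32(key) % len(servers)]

# The same 4-tuple always maps to the same server ...
a = pick_server("192.191.3.89", 3201, "120.141.0.19", 80)
# ... but the same client reconnecting from a new source port may not,
# which is why stickiness cannot be assumed behind a megaproxy.
b = pick_server("192.191.3.89", 4100, "120.141.0.19", 80)
print(a in servers, b in servers)  # True True
```

The skew pitfall above applies directly: nothing in this computation guarantees that real traffic spreads the hash values evenly over the table.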
Round-Robin
Round-robin (RR), or weighted round-robin (WRR), is the most widely used SLB algorithm because it is simple to implement efficiently. The RR/WRR algorithm looks at the incoming packet and remaps the destination IP address/port combination to a target IP/port taken from a fixed table and a moving pointer. The major flaw with this algorithm is that the servers must be evenly loaded, or the resulting architecture will be unstable, as requests can build up on one server and eventually overload it.

The Least Connections algorithm requires at least one more process to continually monitor the requests sent to or received from each server, thereby estimating each server's queue occupancy. From that information, the target IP/port for the incoming packet can be determined.
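The fixed-table-plus-moving-pointer idea behind WRR can be sketched as follows; the server names and weights are invented for the example.

```python
import itertools

def weighted_round_robin(servers, weights):
    """Endless generator of server picks: a fixed table plus a moving
    pointer, with each server repeated in proportion to its weight."""
    table = [s for s, w in zip(servers, weights) for _ in range(w)]
    return itertools.cycle(table)

rr = weighted_round_robin(["A", "B"], [2, 1])
picks = [next(rr) for _ in range(6)]
print(picks)  # ['A', 'A', 'B', 'A', 'A', 'B']
```

Note that the picks are blind: nothing in this loop reacts to how busy A or B actually is, which is exactly the instability the text describes.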
Data centers often have servers that all perform the same function but vary in processing speed. Even when the servers have identical hardware and software, the actual client requests may exercise different code paths on the servers, injecting a different load on each server. This results in an uneven distribution of load. The shortest queue first (SQF) algorithm determines where to direct the incoming load by looking at the queue occupancies. If server i is more loaded than the other servers, queue i begins to build up. The SQF algorithm automatically adjusts and stops forwarding requests to server i. Because the other SLB variations do not have this crucial property, SQF is the best SLB algorithm. Further analysis shows that SQF has another, more important property: stability. Stability describes the long-term behavior of the system.
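The self-adjusting behavior of SQF can be sketched in a few lines; the server names and queue occupancies are invented for the example.

```python
# Shortest queue first: forward each request to the server whose queue
# occupancy is currently smallest. An overloaded server naturally stops
# receiving new requests until its queue drains.
def sqf_pick(queues):
    """queues maps server -> current queue occupancy."""
    return min(queues, key=queues.get)

queues = {"s1": 4, "s2": 1, "s3": 7}
target = sqf_pick(queues)
queues[target] += 1          # the request joins the chosen queue
print(target, queues)        # s2 {'s1': 4, 's2': 2, 's3': 7}
```

Unlike the hash and round-robin sketches, this decision uses feedback from the servers, which is the property the stability analysis below depends on.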
FIGURE 4-4: Weighted round-robin. Client requests are forwarded blindly to the servers; the weights W1..WN determine the proportion of the incoming load λ each server receives (for example, λ/N·W1 to server 1 and λ/N·W2 to server 2).
The SLB device forwards the initial client request to the least-occupied queue. There are N queues, each with a Poisson arrival process and an exponential service time; hence we can model the system as N M/M/1 queues. To prove that this system is stable, we must show that under all admissible arrival and injected-load conditions the queues never grow without bound. There are two approaches we can take:
- Model the state of the queues as a stochastic process, determine the Markov chain, and then solve for the long-term equilibrium distribution π.
- Craft a Lyapunov function L(t) that accurately models the growth of the queues, and then show that over the long term (that is, after the system has had time to warm up, reach a steady state, and pass a certain threshold) the rate of change of queue size is negative and remains negative for large enough L(t). This is a common and proven technique found in many network analysis research papers. We will show that dL/dt equals some negative value for all values of L(t) greater than some threshold. It turns out that the expected value of the single-step drift is equivalent but much easier to calculate, which is the technique we will use.
FIGURE 4-5: System model. The SLB device receives the aggregate load λ and dispatches each request to one of N M/M/1 queues, each queue i having its own arrival rate λi and service rate μi.
We will perform this analysis by first obtaining the discrete-time model of one particular queue and then generalizing the result to all N queues, as shown in the system model. In the discrete model, the state of one of the queues can be modeled as shown in FIGURE 4-6.
The queue occupancy at time t+1 equals the queue occupancy at time t, plus the number of arrivals in (t, t+1], minus the number of departures (serviced) in (t, t+1]:

Q(t+1) = Q(t) + A(t+1) - D(t+1)

Because the state of the queue depends only on the previous state, this is easily modeled as a valid Markov process, for which there are known, proven methods of analysis to find the steady-state distribution. However, since we have N queues, the actual mathematics is very complex. The Lyapunov function is an extremely powerful and accurate method for obtaining the same results, and it is far simpler. See Appendix A for more information about the Lyapunov analysis.
FIGURE 4-7: Proxy-mode SLB packet flow.
1. Client request to the VIP: src ip 192.191.3.89, src port 3201, dst ip 120.141.0.19, dst port 80, payload GET://www.abc.com/index.html
2. The SLB device (VIP 120.141.0.19) receives the request.
3. Rewritten request to the real server: src ip 120.141.0.19, src port 33, dst ip 10.0.0.1, dst port 80, payload GET://www.abc.com/index.html
4. Server response to the SLB: src ip 10.0.0.1, src port 80, dst ip 120.141.0.19, dst port 33, payload: the requested HTML page
FIGURE 4-7 shows the packet flow from the client to the SLB device, to the server, back to the SLB, and finally back to the client. The following numbered list correlates with the numbers in FIGURE 4-7.

1. The client submits an initial service request targeted to the virtual IP (VIP) address of 120.141.0.19 on port 80. This VIP address is configured as the IP address of the SLB appliance.
2. The SLB receives this packet from the client and recognizes that this incoming packet must be forwarded to a server selected by the SLB algorithm.
3. The SLB algorithm identifies server 10.0.0.1 at port 80 to receive this client request and modifies the packet so that the server replies to the SLB and not to the client. Hence, the source IP address and port are also modified.
4. The server receives the client request.
5. Perceiving that the request has come from the SLB, the server returns the requested Web page to the SLB device.
6. The SLB receives this packet from the server. Based on its state information, it knows that this packet must be sent back to client 192.191.3.89.
7. The SLB device rewrites the packet and sends it out the appropriate egress port.
8. The client receives the response packet.
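The rewrites in steps 3 and 7 can be sketched as follows. The packet dicts, SLB source port 33, and addresses mirror FIGURE 4-7, while the function names and state-table layout are invented for the example.

```python
# Hypothetical sketch of the proxy-mode (NAT) rewrite: the SLB remaps the
# client packet toward the chosen server, records per-connection state,
# and un-remaps the reply so it can be matched back to the client.
VIP_IP, VIP_PORT = "120.141.0.19", 80
state = {}   # SLB source port -> original (client_ip, client_port)

def rewrite_to_server(pkt, server_ip, server_port, slb_port):
    """Step 3: remap dst to the chosen server and src to the SLB itself."""
    state[slb_port] = (pkt["src_ip"], pkt["src_port"])
    return dict(pkt, src_ip=VIP_IP, src_port=slb_port,
                dst_ip=server_ip, dst_port=server_port)

def rewrite_to_client(pkt):
    """Step 7: restore the original client address from recorded state."""
    client_ip, client_port = state[pkt["dst_port"]]
    return dict(pkt, src_ip=VIP_IP, src_port=VIP_PORT,
                dst_ip=client_ip, dst_port=client_port)

request = {"src_ip": "192.191.3.89", "src_port": 3201,
           "dst_ip": VIP_IP, "dst_port": VIP_PORT, "payload": "GET /index.html"}
out = rewrite_to_server(request, "10.0.0.1", 80, slb_port=33)
reply = {"src_ip": "10.0.0.1", "src_port": 80,
         "dst_ip": VIP_IP, "dst_port": 33, "payload": "<html>..."}
back = rewrite_to_client(reply)
print(out["dst_ip"], back["dst_ip"])  # 10.0.0.1 192.191.3.89
```

Because both directions pass through these rewrites, every packet of the connection costs the SLB processing time, which is the throughput limitation noted below.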
This approach has the following advantages:
- Increases security and flexibility by decoupling the client from the back-end servers
- Increases switch manageability because servers can be added and removed dynamically without any modifications to the SLB device configuration after it is initially configured
- Increases server manageability because any IP address can be used
It also has the following drawbacks:
- Limits throughput because the SLB must process packets on ingress as well as the return traffic from server to client
- Increases client delays because each packet requires more processing
Direct Server Return (DSR) requires that the servers support loopback. Every server has a regular unique IP address and a loopback IP address, which is the same as the external VIP address of the SLB. When the SLB forwards a packet to a particular server, the server looks at the MAC address to determine whether the packet should be forwarded up to the IP stack. The IP stack recognizes that the destination IP address of the packet is not the same as that of the physical interface, but that it is identical to the loopback IP address. Hence, the stack forwards the packet to the listening port.
FIGURE 4-8: Direct Server Return (DSR) packet flow. The client request (src ip 192.191.3.89, src port 3201, dst ip 120.141.0.19, dst port 80, payload GET://www.abc.com/index.html) arrives at the SLB device (VIP 120.141.0.19), which consults its SLB table and state information and forwards the packet to the real server (loopback lo0:120.141.0.19, mac 0:8:3e:4:4c:84, real IP 10.0.0.1), rewriting only the destination MAC address. The server replies directly to the client.
FIGURE 4-8 shows the DSR packet flow process. The following numbered list correlates with the numbers in FIGURE 4-8.
1. The client submits an initial service request targeted to the VIP address of 120.141.0.19 port 80. This VIP address is configured as the IP address of the SLB appliance.
2. The SLB receives this packet from the client and forwards this incoming packet to a server selected by the SLB algorithm.
3. The SLB algorithm identifies server 10.0.0.1 port 80 to receive this client request and modifies the packet by changing only the destination MAC address to 0:8:3e:4:4c:84, which is the MAC address of the real server.
Note: Statement 3 implies that the SLB and the servers must be on the same Layer 2 VLAN. Hence, DSR is less secure than the proxy mode approach.

4. The server receives the client request and processes the incoming packet.
5. The server returns the packet directly to the client by swapping the destination/source IP address and TCP port pairs.
6. Because the source IP address of the response is the VIP configured on the loopback, the response goes directly back to the client.
DSR has the following advantages:
- Increases security and flexibility by decoupling the client from the back-end servers.
- Increases switch manageability because servers can be added and removed dynamically without any modifications to the SLB device configuration after it is initially configured.
- Increases performance and scalability. The server load-balancing work is reduced by roughly half because the SLB device handles only the incoming path; return traffic flows directly from server to client. Thus, more cycles are free to process incoming traffic.
It also has the following constraints:
- The SLB must be on the same Layer 2 network as the servers because they share the same IP network number, differing only by MAC address.
- All the servers must be configured with the same loopback address as the SLB VIP. This might be an issue for securing critical servers.
Server Monitoring
All SLB algorithms, except the family of fixed round-robin, require knowledge of the state of the servers. SLB implementations vary enormously from vendor to vendor. Some poor implementations simply monitor link state on the port to which the real server is attached. Some monitor using ping requests at Layer 3. Port-based health checks are superior because the actual target application is verified for availability and response time. In some cases, the Layer 2 state might be fine, but the actual application has failed, and the SLB device mistakenly forwards requests to that failed real server. The features and capabilities of switches are changing rapidly, often with a simple flash update, and you must be aware of the limitations.
Persistence
Often when a client is initially load-balanced to a specific server, it is crucial that subsequent requests are forwarded to the same server within the pool. There are several approaches to accomplishing this:
- Cookie-based persistence: Allow the server to insert a cookie for the client, and configure the SLB to look for a cookie pattern and make a forwarding decision based on the cookie. The first request from the client has no cookie, so the SLB forwards it to the best server based on the algorithm. The server installs a cookie, which is a name-value pair. On the return of the packet, the SLB reads the cookie value and records the client-server pair. Subsequent requests from the same client carry the cookie, which triggers the SLB to forward based on the recorded cookie information, not on the SLB algorithm.
- Hash based on the client's source IP address. This is risky if the client request comes from a megaproxy.
It is best to avoid persistence because HTTP was designed to be stateless. Trying to maintain state across many stateless transactions causes serious issues if there are failures. In many cases, the application software can maintain state. For example, when a servlet receives a request, it can identify the client based on its own cookie value and retrieve state information from the database. However, switch persistence might be required. If so, you should look at the exact capabilities of each vendor and decide which features are most critical.
Resonate provides a Solaris library offering in which a STREAMS module/driver installed on a server accepts all traffic, inspects the ingress packet, and forwards it to another server that actually services the request. As the cost of hardware devices falls and their performance increases, the Resonate product has become less popular. Various companies, such as Cisco, F5, and Foundry (ServerIron), sell hardware appliances that perform only server load balancing. One important factor to examine carefully is the method used to implement the server load-balancing function. The F5 appliance is limited because it is an Intel PC box running BSD UNIX with two or more network interface cards.
Wire-speed performance can be limited because these general-purpose, computer-based appliances are not optimized for packet forwarding. When a packet arrives at a NIC, an interrupt must first be generated and serviced by the CPU. Then the PCI bus arbitration process must grant access to traverse the bus. Finally, the packet is copied into memory. These events cumulatively contribute significant delays. In some newer implementations, wire-speed SLB forwarding can be achieved: data plane Layer 2/Layer 3 forwarding tables are integrated with the server load-balancing updates. Hence, as soon as a packet is received, a packet classifier immediately performs an SLB lookup in the data plane, in hardware, using tables populated and maintained by the SLB process that resides in the control plane, which also monitors the health of the servers.
configured in DSR mode, where the SLB forwards to the servers, which then return directly to the client. Notice that the servers are on the same VLAN as this SLB device on the internal LAN side of the 20.0.0.0 network.
CODE EXAMPLE 4-1
!
ver 07.3.05T12
global-protocol-vlan
!
!
server source-ip 20.20.0.50 255.255.255.0 172.0.0.10
!
server real s1 20.20.0.1
 port http
 port http url "HEAD /"
!
server real s2 20.20.0.2
 port http
 port http url "HEAD /"
!
server virtual vip1 172.0.0.11
 port http
 port http dsr
 bind http s1 http s2 http
!
vlan 1 name DEFAULT-VLAN by port
 no spanning-tree
!
hostname SLB0
ip address 172.0.0.111 255.255.255.0
ip default-gateway 172.0.0.10
web-management allow-no-password
banner motd ^C
Reference Architecture -- Enterprise Engineering^C
Server Load Balancer -- SLB0 129.146.138.12/24^C
!
#
# MSM64 Configuration generated Thu Dec 6 21:27:26 2001
# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/01 11:34:27
..
# Config information for VLAN app.
config vlan "app" tag 40        # VLAN-ID=0x28  Global Tag 8
config vlan "app" protocol "ANY"
config vlan "app" qosprofile "QP1"
config vlan "app" ipaddress 10.40.0.1 255.255.255.0
configure vlan "app" add port 4:1 untagged
..
#
# Config information for VLAN dns.
..
configure vlan "dns" add port 5:3 untagged
configure vlan "dns" add port 5:4 untagged
configure vlan "dns" add port 5:5 untagged
..
configure vlan "dns" add port 8:8 untagged
config vlan "dns" add port 6:1 tagged
#
# Config information for VLAN super.
config vlan "super" tag 1111    # VLAN-ID=0x457  Global Tag 10
config vlan "super" protocol "ANY"
config vlan "super" qosprofile "QP1"
# No IP address is configured for VLAN super.
config vlan "super" add port 1:1 tagged
config vlan "super" add port 1:2 tagged
config vlan "super" add port 1:3 tagged
config vlan "super" add port 1:4 tagged
config vlan "super" add port 1:5 tagged
config vlan "super" add port 1:6 tagged
config vlan "super" add port 1:7 tagged
enable web access-profile none port 80
configure snmp access-profile readonly None
configure snmp access-profile readwrite None
enable snmp access
disable snmp dot1dTpFdbTable
enable snmp trap
configure snmp community readwrite encrypted "r~`|kug"
configure snmp community readonly encrypted "rykfcb"
configure snmp sysName "MLS1"
configure snmp sysLocation ""
configure snmp sysContact "Deepak Kakadia, Enterprise Engineering"
..
# ESRP Interface Configuration
config vlan "edge" esrp priority 0
config vlan "edge" esrp group 0
config vlan "edge" esrp timer 2
config vlan "edge" esrp esrp-election ports-track-priority-mac
..
..
# SLB Configuration
enable slb
config slb global ping-check frequency 1 timeout 2
config vlan "dns" slb-type server
config vlan "app" slb-type server
config vlan "db" slb-type server
config vlan "ds" slb-type server
config vlan "web" slb-type server
config vlan "edge" slb-type client
create slb pool webpool lb-method round-robin
config slb pool webpool add 10.10.0.10 : 0
config slb pool webpool add 10.10.0.11 : 0
create slb pool dspool lb-method least-connection
Chapter 4
config slb pool dspool add 10.20.0.20 : 0
config slb pool dspool add 10.20.0.21 : 0
create slb pool dbpool lb-method least-connection
config slb pool dbpool add 10.30.0.30 : 0
config slb pool dbpool add 10.30.0.31 : 0
create slb pool apppool lb-method least-connection
config slb pool apppool add 10.40.0.40 : 0
config slb pool apppool add 10.40.0.41 : 0
create slb pool dnspool lb-method least-connection
config slb pool dnspool add 10.50.0.50 : 0
config slb pool dnspool add 10.50.0.51 : 0
create slb vip webvip pool webpool mode translation 10.10.0.200 : 0 unit 1
create slb vip dsvip pool dspool mode translation 10.20.0.200 : 0 unit 1
create slb vip dbvip pool dbpool mode translation 10.30.0.200 : 0 unit 1
create slb vip appvip pool apppool mode translation 10.40.0.200 : 1 unit 1
create slb vip dnsvip pool dnspool mode translation 10.50.0.200 : 1 unit 1
..
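The `lb-method` keywords in the configuration above (round-robin for webpool, least-connection for the other pools) correspond to two simple server-selection policies. The following Python sketch illustrates both; the class and server names are illustrative and not part of the switch configuration:

```python
from itertools import cycle

class RoundRobinPool:
    """Hand out servers in fixed rotation, like 'lb-method round-robin'."""
    def __init__(self, servers):
        self._it = cycle(servers)

    def pick(self):
        return next(self._it)

class LeastConnectionPool:
    """Pick the server with the fewest active connections,
    like 'lb-method least-connection'."""
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1      # a connection was opened
        return server

    def release(self, server):
        self.active[server] -= 1      # the connection closed

rr = RoundRobinPool(["10.10.0.10", "10.10.0.11"])
lc = LeastConnectionPool(["10.20.0.20", "10.20.0.21"])
```

Round-robin suits short, uniform requests such as static Web fetches; least-connection adapts better to long-lived, uneven sessions such as directory or database traffic.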
Layer 7 Switching
The recent explosive demand for application hosting and increased security has fueled a new concept called content switching, also known as Layer 7 switching, proxy switching, or URL switching. This switching technology inspects the payload, which is typically an HTTP request for a static or dynamic Web page. The content switch searches for a certain string, and if there is a match, it takes some type of action. For example, the content switch might rewrite the content or redirect it to a pool of servers that specializes in these services or to a caching server for increased performance. The main idea is that a forwarding decision is made based on the application data, not on traditional Layer 2 or Layer 3 destination network addresses. Some major technical challenges arise in performing this type of processing. The first is a tremendous performance impact. In traditional Layer 2 and Layer 3 processing, the destination addresses and corresponding egress port are found by looking at a fixed offset in the packet. This allows for extremely cheap and fast ASICs. Usually, the packet header is read in from the MAC and copied into SRAM, which has an
access time of around five nanoseconds. The variable-size and bulky payload is usually copied into DRAM, which has a higher initial setup time. The forwarding decision requires two SRAM memory accesses, where the header is read, modified, and written, and a quick lookup is performed, usually a Ternary Content-Addressable Memory (TCAM) or Patricia tree lookup in SRAM, which takes a few nanoseconds. However, for Layer 7 forwarding decisions, almost all commercial switches, except the Extreme Px1, must perform this function in a much slower CPU running a real-time operating system such as VxWorks. The payload, which resides in DRAM, must be read, processed, and written, and the string search itself is time intensive. (There have been recent advances in Layer 7 technology, such as that offered by Solidum and PMC-Sierra's ClassiPI, which perform this at wire-speed rates. However, at the time of this writing, we are not aware of any major switch manufacturer using this technology.) This operation takes orders of magnitude more time. NAT can be extended not only to hide internal private IP addresses but also to base packet forwarding decisions on the payload. There are two approaches to accomplishing this function:
- Application Gateway: This approach terminates the socket connection on the client side and creates another connection on the server side, providing complete isolation between the client and the server. This requires more processing time and resources on the switch. However, it allows the switch to make a comprehensive application-layer forwarding decision.
- TCP Splicing: This approach simply rewrites the TCP/IP packet headers, thereby reducing the amount of processing required on the switch. This makes it more difficult for the switch to make application-layer forwarding decisions if the complete payload spans many small TCP packets.
This section describes an application gateway approach to performing NAT and Layer 7 processing.
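The heart of such a gateway is a rule table that maps request URLs to server groups. The sketch below (the rule prefixes and pool names are illustrative, modeled on the figure that follows) parses the request line of an HTTP request, as an application gateway would after terminating the client connection, and selects a server group:

```python
# Illustrative URL-prefix rules: first match wins.
RULES = [
    ("/SMA/stata/",  "servergroup1"),
    ("/SMA/dnsa/",   "servergroup2"),
    ("/SMA/statb/",  "servergroup3"),
    ("/SMA/cacheb/", "servergroup4"),
    ("/SMA/dyna/",   "servergroup5"),
]
DEFAULT_POOL = "servergroup-default"

def select_pool(http_request: bytes) -> str:
    """Inspect the HTTP request line (Layer 7 data) and pick a server group."""
    request_line = http_request.split(b"\r\n", 1)[0]
    try:
        method, path, version = request_line.split()
    except ValueError:
        return DEFAULT_POOL              # malformed request line
    for prefix, pool in RULES:
        if path.decode("latin-1").startswith(prefix):
            return pool
    return DEFAULT_POOL
```

Because the gateway owns both sockets, it can buffer as many payload bytes as the match requires, which is exactly the property that TCP splicing gives up in exchange for speed.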
FIGURE 4-9 shows an overview of the functional content switching model.
FIGURE 4-9  Functional content switching model. The proxy maps request URLs to server groups, for example:

    http://www.a.com/SMA/stata/index.html   ->  servergroup 1 (stata)
    http://www.a.com/SMA/dnsa/index.html    ->  servergroup 2 (dnsa)
    http://www.a.com/SMA/statb/index.html   ->  servergroup 3 (statb)
    http://www.a.com/SMA/cacheb/index.html  ->  servergroup 4 (cacheb)
    http://www.a.com/SMA/dyna/index.html    ->  servergroup 5 (dynab)
Content switching with full network address translation (NAT) serves the following purposes:
- Isolates internal IP addresses from being exposed to the public Internet.
- Allows reuse of a single IP address. For example, clients can send their Web requests to www.a.com or www.b.com, where DNS maps both domains to a single IP address. The proxy switch receives this request with the packet containing an HTTP header in the payload that identifies the target domain, for example a.com or b.com, and determines to which group of servers to redirect this request.
- Allows parallel fetching of different parts of Web pages from servers optimized and tuned for that type of data. For example, a complex Web page might need GIFs, dynamic content, cached content, and so on. With content switching, one set of Web servers can hold the GIFs, while another can hold the dynamic content or cached content. The proxy switch can make parallel fetches and retrieve the entire page at a faster rate than would be possible otherwise.
- Ensures that requests with cookies or SSL session IDs are redirected to the same server to take advantage of persistence.
FIGURE 4-9 shows that the client's socket connection is terminated by the proxy function. The proxy retrieves as much of the URL as is needed to make a forwarding decision. In FIGURE 4-9, various URLs map to various server groups, which are VIP addresses. The proxy determines whether to forward the URL directly or pass it off to a server load-balancing function that is waiting for traffic destined to the server group.
The proxy is configured with a VIP address, so the switch forwards all client requests destined to this VIP address to the proxy function. The proxy function also rewrites the IP header, particularly the source IP and port, so that the server sends back the requested data to the proxy, not to the client directly.
NAT provides the following benefits:

- Security: Prevents exposing internal private IP addresses to the public.
- IP Address Conservation: Requires only one valid exposed IP address to fetch Internet traffic from internal networks with invalid IP addresses.
- Redirection: Intercepts traffic destined to one set of servers and redirects it to another by rewriting the destination IP and MAC addresses. The redirected servers can send the response directly back to the clients with half NAT-translated traffic because the original source IP has not been rewritten.
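The redirection case can be pictured as a half-NAT rewrite: only the destination fields of the frame change, so return traffic can flow straight back to the client. A minimal sketch, with illustrative field names:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str

def half_nat_redirect(frame: Frame, new_ip: str, new_mac: str) -> Frame:
    """Rewrite only the destination IP and MAC addresses (half NAT).
    The source IP is untouched, so the chosen server can reply
    directly to the client without passing back through the device."""
    return replace(frame, dst_ip=new_ip, dst_mac=new_mac)
```

A full-NAT application gateway would also rewrite the source IP and port, forcing replies back through the proxy.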
NAT is configured with a set of filters, usually a 5-tuple Layer 3 rule. If the incoming traffic matches a certain filter rule, the packet IP header is rewritten or another socket connection is initiated to the target server, which itself can be changed, depending on the particular rule. NAT is often combined with other IP services such as SLB and content switching. The basic idea is that the client and servers are
completely decoupled from each other, and the NAT device manages the IP address conversions, while the partner service is responsible for another decision such as determining which server will handle the request based on load or other rules.
Quality of Service
As a result of emerging real-time and mission-critical applications, enterprise customers realize that the traditional Best Effort IP network service model is unsuitable. The main concern is that poorly behaved flows adversely affect other flows that share the same resources. It is difficult to tune resources to meet the requirements of all deployed applications. Quality of Service (QoS) measures the ability of network and computing systems to provide different levels of services to selected applications and associated network flows. Customers that deploy mission-critical applications and real-time applications have an economic incentive to invest in QoS capabilities so that acceptable response times are guaranteed within certain tolerances.
Classes of Applications
There are five classes of applications, having different network and computing requirements. They are:
- Data transfers
- Video and voice streaming
- Interactive video and voice
- Mission-critical
- Web-based
These classes are important in classifying, prioritizing, and implementing QoS. The following sections detail these five classes.
Data Transfers
Data transfers include applications such as FTP, email, and database backup. Data transfers tend to have zero tolerance for packet loss and high tolerance for delay and jitter. Typical acceptable response times range from a few seconds for FTP transfers to hours for email. Bandwidth requirements on the order of Kbyte/sec are acceptable, depending on the file size, which keeps response times to a few seconds. Depending on the characteristics of the application (for example, the size of a file), disk I/O transfer times can contribute cumulatively to delays along with network bottlenecks.
requirements range from 250 to 500 milliseconds. This response time is compounded by the bandwidth requirements, with each stream requiring a few Mbit/sec. In a conference of five participants, each participant pumps out a voice and video stream while at the same time receiving streams from the other participants.
Mission-Critical Applications
Mission-critical applications vary in bandwidth requirements, but they tend to have zero tolerance for packet loss. Depending on the application, bandwidth requirements are about one Kbyte/sec. Response times range from 500 ms to a few seconds. Server resource requirements (CPU, disk, and memory) vary, depending on the application.
Web-Based Applications
Web-based applications tend to have low bandwidth requirements (unless large image files are associated with the requested Web page) but grow in CPU and disk requirements due to dynamically generated Web pages and Web transaction-based applications. Response time requirements range from 500 milliseconds to one second.

Different classes of applications have different network and computing requirements. The challenge is to align the network and computing services to the applications' service requirements from a performance perspective.
Overprovisioning allows overallocation of resources to meet or exceed peak load requirements. Depending on the deployment, overprovisioning can be viable if it is a simple matter of upgrading to faster LAN switches and NICs or adding memory, CPUs, or disks. However, overprovisioning might not be viable in certain cases, for example when dealing with relatively expensive long-haul WAN links, resources that on average are underutilized, or resources that are busy only during short peak periods. Managing and controlling allows allocation of network and computing resources. Better management of existing resources attempts to optimize utilization of existing resources such as limited bandwidth, CPU cycles, and network switch buffer memory.
Networking Concepts and Technology: A Designer's Resource
QoS Components
To give you enough background on the fundamentals and an implementation perspective, this section describes the overall network and systems architecture and identifies the sources of delays. It also explains why QoS is essentially about controlling network and system resources in order to achieve more predictable delays for preferred applications.
Implementation Functions
Three necessary implementation functions are:
- Traffic Rate Limiting and Traffic Shaping (token/leaky bucket algorithms): Network traffic is always bursty; the level of burstiness observed depends on the time resolution of the measurements. Rate limiting controls the burstiness of the traffic coming into a switch or server. Shaping refers to the smoothing of the egress traffic. Although these two functions are opposites, the same class of algorithms is used to implement both.
- Packet Classification: Individual flows must be identified and classified at line rate. Fast packet classification algorithms are crucial, as every packet must be inspected and matched against a set of rules that determine the class of service the specific packet should receive. Packet classification has serious scalability issues; as the number of rules increases, it takes longer to classify a packet.
- Packet Scheduling: To provide differentiated services, the packet scheduler must decide quickly which packet to schedule and when. The simplest packet scheduling algorithm is strict priority. However, this often does not work well because low-priority packets are starved and might never get scheduled.
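A token bucket, the usual building block for both rate limiting and shaping, can be sketched as follows. The parameter values are illustrative; a limiter drops non-conforming packets, while a shaper delays them until tokens accumulate:

```python
class TokenBucket:
    """Tokens accumulate at `rate` bytes/sec up to a depth of `burst` bytes.
    A packet conforms if enough tokens are available when it arrives."""
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst                # start with a full bucket
        self.last = 0.0

    def conforms(self, size: int, now: float) -> bool:
        # Refill tokens for the time elapsed since the last packet.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= size:
            self.tokens -= size            # spend tokens on this packet
            return True
        return False                       # non-conforming: drop or delay
```

The `burst` parameter bounds how far above the average rate a flow may momentarily go, which is exactly the burst-tolerance question raised in the QoS metrics discussion below.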
QoS Metrics
QoS is defined by a multitude of metrics. The simplest is bandwidth, which can be conceptually viewed as a logical pipe within a larger pipe. However, actual network traffic is bursty, so a fixed bandwidth would be wasteful: at one instant in time one flow might use 1 percent of this pipe while another might need 110 percent of the allocated pipe. To reduce waste, certain burst metrics are used to determine how much of a burst and how long a burst can be tolerated. Other important metrics that directly impact the quality of service include packet loss rate, delay, and jitter (variation in delay). The network and computing components that control these metrics are described later in this chapter.
FIGURE 4-10  End-to-end network path. Remote access users (dial-up over a voice circuit, DSL modem, cable headend, and mobile wireless IP over CDMA/GPRS/UMTS) reach the Internet through Tier 2 ISP/access networks and ATM metropolitan area networks, cross Tier 1 backbones joined at peering points (MAEs), and terminate at an enterprise network behind a CSU/DSU and firewall. Annotated delay sources: switch delay (queueing, scheduling, packet classification lookup time, route lookup time, congestion, backplane), link delay (propagation, line rate), and server delay (CPU, memory, disk).
Path A-H is a typical scenario, where the client and server are connected to different local ISPs and must traverse different ISP networks. Multiple Tier 1 ISPs can be traversed, connected together by public peering points such as MAE-East or private peering points such as Sprint's NAP. Path 1-4 shows an example of the client and server connected to the same local Tier 2 ISP, when both client and server are physically located in the same geographical area. In either case, the majority of the delays are attributed to the switches. In the Tier 2 ISPs, the links from the end-user customers to the Tier 2 ISP tend to be slow links, but the Tier 2 ISP aggregates many links, hoping that not all subscribers will use the links at the same time. If they do, packets are buffered and eventually dropped.
Implementing QoS
You can implement QoS in many different ways. Each domain has control over its resources and can implement QoS on its portion of the end-to-end path using different technologies. Two domains of implementation are enterprises and network service providers.
- Enterprise: Enterprises control their own networks and systems. From a local Ethernet or Token Ring LAN perspective, IEEE 802.1p can be used to mark frames according to priorities. These marks allow the switch to offer preferential treatment to certain flows across VLANs. For computing devices, there are facilities that allow processes to run at higher priorities, thus obtaining differentiated services from a process computing perspective.
- Network Service Provider (NSP): The NSP aggregates traffic and either forwards it within its own network or hands it off to another NSP. The NSP can use technologies such as DiffServ or IntServ to prioritize the handling of traffic within its networks. Service Level Agreements (SLAs) are required between NSPs to obtain a certain level of QoS for transit traffic.
ATM networks, for example, define several service categories:

- Constant Bit Rate (CBR): Provides constant bandwidth, delay, and jitter throughout the life of the ATM connection.
- Variable Bit Rate-Real Time (VBR-rt): Provides constant delay and jitter, but variations in bandwidth.
- Variable Bit Rate-Non Real Time (VBR-nrt): Provides variable bandwidth, delay, and jitter, but has a low cell loss rate.
- Unspecified Bit Rate (UBR): Provides Best Effort service with no guarantees.
- Available Bit Rate (ABR): Provides no guarantees and expects the applications to adapt according to network availability.
- Guaranteed Frame Rate (GFR): Provides some minimum frame rate, delivers the entire frame or none of it, and is used for ATM Adaptation Layer 5 (AAL5).
One of the main difficulties in providing an end-to-end QoS solution is that so many private networks must be traversed, and each network has its own QoS implementations and business objectives. The Internet is constructed so that networks interconnect or peer with other networks. One network might need to forward traffic of other networks. Depending on the arrangements, competitors might not forward the traffic in the most optimal manner. This is what is meant by business objectives.
FIGURE 4-11  End-to-end packet time delays (not to scale). The path from a dial-up client (56 Kbps POTS) through Tier 2 ISP/access networks (OC-3 access switch, ATM, T1-OC-3 links) to the enterprise Ethernet is divided into 16 numbered segments; the odd-numbered segments are links and the even-numbered segments are switches.
Three factors determine these delays:

- Propagation delay, which depends on the media and distance
- Line rate, which primarily depends on the link rate and loss rate, or Bit Error Rate (BER)
- Node transit delay, the time it takes a packet to traverse an intermediate network switch or router
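These per-link components can be combined in a back-of-the-envelope delay estimate. The figures below are illustrative (a propagation speed of roughly two thirds the speed of light in copper or fiber is a common rule of thumb):

```python
def link_delay(distance_m: float, bits: int, line_rate_bps: float,
               propagation_mps: float = 2.0e8) -> float:
    """Propagation delay plus serialization (line-rate) delay for one
    packet over one link, in seconds. Node transit delay is extra."""
    propagation = distance_m / propagation_mps
    serialization = bits / line_rate_bps
    return propagation + serialization

# A 1500-byte packet on a 1000-km T1 (1.544 Mbit/s) link:
d = link_delay(distance_m=1_000_000, bits=1500 * 8, line_rate_bps=1.544e6)
```

On the slow T1 link the serialization term (about 7.8 ms) dominates the 5 ms of propagation; on an OC-48 core link the same packet serializes in microseconds and propagation dominates instead.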
The odd-numbered links of FIGURE 4-11 represent the link delays. Note that segment and link are used interchangeably.
- Link 1, in a typical deployment, is the copper wire, or "last mile" connection, from the home or Small Office/Home Office (SOHO) to the Regional Bell Operating Company (RBOC). This is how a large portion of consumer clients connect to the Internet.
- Link 3 is an ATM link inside the carrier's internal network, usually a Metropolitan Area Network link.
- Link 5 connects the Tier 2 ISP to the Tier 1 ISP, providing a backbone network. This link is a larger pipe, which can range from T1 to OC-3 and growing.
- Link 7 is the core network of the backbone Tier 1 provider. Typically, this core is extremely fast, consisting of DS3 links (such as those used by IDT), more modern OC-48 links (such as those used by vBNS), and links that are beta testing OC-192 while running Packet over SONET and eliminating the inefficiencies of ATM altogether.
- Links 9 and 11 are a reflection of links 5 and 3.
- Link 13 is a typical leased-line T1 link to the enterprise. This is how most enterprises connect to the Internet. However, after the 1996 Telecommunications Act, competitive local exchange carriers (CLECs) emerged, providing superior service offerings at lower prices. Providers such as Qwest and Telseon provide Gigabit Ethernet connectivity at prices that are often below OC-3 costs.
- Link 15 is the enterprise's internal network. A CSU/DSU device, with a channel service unit on the time-division multiplexing (TDM) side and a data service unit on the data side, terminates the T1 line and converts it to Ethernet.
The even-numbered links of FIGURE 4-11 represent the delays experienced in switches. These delays are composed of switching delays, route lookups, packet classification, queueing, packet scheduling, and internal switch forwarding delays, such as sending a packet from the ingress unit through the backplane to the egress unit.

As FIGURE 4-11 illustrates, QoS is needed to control access to shared resources during episodes of congestion. The shared resources are servers and specific links. For example, Link 1 is a dedicated point-to-point link, where a dedicated voice channel is set up at call time with a fixed bandwidth and delay. Link 13 is a permanent circuit, as opposed to a switched dedicated circuit; however, it is a digital line.

QoS is usually implemented in front of a congestion point, where it restricts the traffic that is injected into that point. Enterprises have QoS functions that restrict the traffic being injected into their service provider. The ISP has QoS functions that restrict the traffic injected into its core. Tier 2 ISPs oversubscribe their bandwidth capacities, hoping that not all their customers will need bandwidth at the same time. During episodes of congestion, switches buffer packets until they can be transmitted. Links 5 and 9 are boundary links that connect two untrusted parties: the Tier 2 ISP must control the traffic injected into the network that must be handled by the Tier 1 ISP's core network, and the Tier 1 ISP polices the traffic that customers inject into the network at Links 5 and 9. At the enterprise, many clients need to access the servers.
QoS-Capable Devices
This section describes the internals of QoS-capable devices. One of the difficulties of describing QoS implementations is the number of different perspectives that can be used to describe all the features. The scope of this section is limited to the priority-based model and the related functional components to implement this model. The priority-based model is the most common implementation approach because of its scalability advantage.
Implementation Approaches
There are two completely different approaches to implementing a QoS-capable IP switch or server:

- The Reservation Model, also known as Integrated Services/RSVP or ATM, is the original approach, requiring applications to signal their traffic handling requirements. After signaling, each switch in the path from source to destination reserves resources, such as bandwidth and buffer space, that either guarantee the desired QoS service or ensure that the desired service is provided. This model is not widely deployed because of scalability limitations: each switch has to keep track of this information for every flow, and as the number of flows increases, the amount of memory and processing increases, hence limiting scalability.
- The Precedence Priority Model, also known as Differentiated Services, IP Precedence/TOS, or IEEE 802.1p/Q, takes aggregated traffic, segregates the traffic flows into classes, and provides preferential treatment of classes. Only during episodes of congestion are noticeable differentiated-services effects realized. Packets are marked or tagged according to priority. Switches then read these markings and treat the packets according to their priority. The interpretation of the markings must be consistent within the autonomous domain.
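On an end system, Differentiated Services marking amounts to setting the TOS/DS byte of outgoing packets, for example through a socket option. A minimal sketch, assuming a Linux-style sockets API (the EF code point value 46 is standard; everything else is illustrative):

```python
import socket

EF_DSCP = 46                  # Expedited Forwarding code point
TOS_BYTE = EF_DSCP << 2       # the DSCP occupies the upper six bits

# Mark all traffic sent on this UDP socket as EF.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, TOS_BYTE)
marked = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
sock.close()
```

The mark only has effect if the switches along the path are configured to honor it; within an autonomous domain the interpretation must be consistent, as noted above.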
The high-level QoS functions are as follows:

- Admission Control: Accepts or rejects access to a shared resource. This is a key component for Integrated Services and ATM networks. Admission control ensures that resources are not oversubscribed; because of the per-flow state this requires, it is more expensive and less scalable than the other components.
- Congestion Management: Prioritizes and queues traffic access to a shared resource during congestion periods.
- Congestion Avoidance: Prevents congestion early, using preventive measures. Algorithms such as Weighted Random Early Detection (WRED) exploit TCP's congestion avoidance algorithms to reduce traffic injected into the network, preventing congestion.
- Traffic Shaping: Reduces the burstiness of egress network traffic by smoothing the traffic and then forwarding it out to the egress link.
- Traffic Rate Limiting: Controls the ingress traffic by dropping packets that exceed burst thresholds, thereby reducing device resource consumption such as buffer memory.
- Packet Scheduling: Schedules packets out the egress port so that differentiated services are effectively achieved.
The next section describes the modules that implement these high-level functions in more detail.
QoS Profile
The QoS profile contains information, entered by the network or systems administrator, that defines classes of traffic flows and how these flows should be treated in terms of QoS. For example, a QoS profile might specify that Web traffic from the CEO should be given the EF DiffServ marking, a Committed Information Rate (CIR) of 1 Mbit/sec, a Peak Information Rate (PIR) of 5 Mbit/sec, an Excess Burst Size (EBS) of 100 Kbyte, and a Committed Burst Size (CBS) of 50 Kbyte. This profile defines the flow and the level of QoS the Web traffic from the CEO should receive. The profile is compared against the actual measured traffic flow. Depending on how the actual traffic compares against the profile, the type of service (TOS) field of the IP header is re-marked or an internal tag is attached to the packet header, which controls how the packet is handled inside the device.
FIGURE 4-12 shows the main functional components involved in delivering prioritized differentiated services, applicable to a switch or a server. These include the packet classification engine, the metering and marker functions, policing/shaping, the IP forwarding module, queuing, congestion control management, and the packet scheduling function.
FIGURE 4-12  QoS functional components. In the data plane, incoming flows pass through packet classification, then per-flow meter, marker, and policer/shaper functions, then IP forwarding, queues, and the packet scheduler, all under a control and management plane.
These components can be implemented in the network protocol stack, either in the IP module, adjacent to the IP module, or possibly on the network interface card, the last offering superior performance due to the ASIC/FPGA implementation. There are two planes:

- The Data Plane contains the functional components that actually read and write the IP header.
- The Control Plane contains the functional components that control how the functional units read information from the network administrator, directly or indirectly.
Packet Classifier
The Packet Classifier is the functional component responsible for identifying a flow and matching it with a filter. The filter is composed of the source and destination IP address, port, protocol, and type of service field, all in the IP header. The filter is also associated with information that describes the treatment of matching packets. Aggregate ingress traffic flows are compared against these filters. Once a packet header is matched with a filter, the associated QoS profile is used by the meter, marker, policing, and shaping functions.
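A first-match filter table over these header fields can be sketched as follows. The rule contents and profile names are illustrative, and real classifiers use specialized data structures (TCAMs, tries) rather than a linear scan, which is exactly the scalability issue noted earlier:

```python
WILDCARD = None  # matches any value in that field

# Each rule: ((src_ip, dst_ip, protocol, src_port, dst_port), qos_profile)
FILTERS = [
    (("10.0.0.5", WILDCARD, "tcp", WILDCARD, 80), "QP1"),  # CEO Web traffic
    ((WILDCARD, WILDCARD, "udp", WILDCARD, 53),  "QP2"),   # DNS traffic
]
BEST_EFFORT = "QP-default"

def classify(pkt: tuple) -> str:
    """Return the QoS profile of the first filter matching the packet's
    (src_ip, dst_ip, protocol, src_port, dst_port) header fields."""
    for fields, profile in FILTERS:
        if all(f is WILDCARD or f == p for f, p in zip(fields, pkt)):
            return profile
    return BEST_EFFORT   # no filter matched: best-effort treatment
```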
Metering
The metering function compares the actual traffic flow against the QoS profile definition. FIGURE 4-13 illustrates the different measurement points. On average, the input traffic arrives at 100 Kbyte/sec. However, for a short period, the switch or server allows the input flow rate to reach 200 Kbyte/sec for one second, which computes to a buffer occupancy of 200 Kbyte. From t=3 to t=5, the buffer drains at a rate of 50 Kbyte/sec as long as the input packets arrive at 50 Kbyte/sec, keeping the output constant. Another, more aggressive burst arrives at the rate of 400 Kbyte/sec for 0.5 sec, filling up the 200-Kbyte buffer. From t=5.0 to t=5.5, however, 50 Kbyte are drained, leaving 150 Kbyte at t=5.5 sec. This buffer then drains for 1.5 sec at a rate of 100 Kbyte/sec. This example is simplified; the real figures need to be adjusted to account for the fact that the buffer is not completely filled at t=5.5 sec because of the concurrent draining. Notice that the area under the graph, or the integral, represents the approximate number of bytes in the buffer; bursts appear as the steeply sloped lines above the dotted line, which represents the average rate, or the CIR.
FIGURE 4-13  Metering measurement points: offered traffic (Kbyte/sec) versus time (sec), with CIR = 100 Kbyte/sec and CBS = 200 Kbyte marked.
Marking
Marking is tied in with metering: when the metering function compares the actual measured traffic against the agreed QoS profile, the traffic is handled appropriately. The metering function measures the actual burst rate and the number of packets in the buffer against the CIR, PIR, CBS, and EBS. The Two Rate Three Color Marker (trTCM) is a common algorithm that marks packets green if the actual traffic is within the agreed-upon CIR. If the actual traffic is between CIR and PIR, the packets are marked yellow. Finally, if the actual metered traffic is at PIR or above, the packets are marked red. The device then uses these markings in the policing and shaping functions to determine how the packets are treated (for example, whether the packets should be dropped, shaped, or queued in a lower-priority queue).
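The two-rate marker can be sketched with two token buckets, one refilled at PIR (depth PBS) and one at CIR (depth CBS). This follows the RFC 2698 color-blind mode; the parameter values in the test are illustrative:

```python
class TrTCM:
    """Two-rate three-color marker: green within CIR/CBS,
    yellow within PIR/PBS, red above the peak rate."""
    def __init__(self, cir: float, cbs: float, pir: float, pbs: float):
        self.cir, self.cbs, self.pir, self.pbs = cir, cbs, pir, pbs
        self.tc, self.tp = cbs, pbs        # both buckets start full
        self.last = 0.0

    def color(self, size: int, now: float) -> str:
        # Refill both buckets for the elapsed time.
        elapsed = now - self.last
        self.last = now
        self.tc = min(self.cbs, self.tc + elapsed * self.cir)
        self.tp = min(self.pbs, self.tp + elapsed * self.pir)
        if self.tp < size:
            return "red"                   # exceeds even the peak rate
        if self.tc < size:
            self.tp -= size
            return "yellow"                # between CIR and PIR
        self.tc -= size
        self.tp -= size
        return "green"                     # conforms to the committed rate
```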
IP Forwarding Module
The IP forwarding module inspects the destination IP address and determines the next hop using the forwarding information base. The forwarding information base is a set of tables populated by routing protocols and/or static routes. The packet is then forwarded internally to the egress board, which places the packet in the appropriate queue.
Queuing
Queuing encompasses two dimensions, or functions. The first function is congestion control, which controls the number of packets queued up in a particular queue (see the following section). The second function is differentiated services: the packet scheduler services packets in certain queues more often than others, providing preferential treatment to preselected flows.
Congestion Control
There is a finite amount of buffer space or memory, so the number of packets that can be buffered within a queue must be controlled. The switch or server forwards packets at line rate. However, when a burst occurs, or if the switch is oversubscribed and congestion occurs, packets are buffered. There are several packet discard algorithms. The simplest is Tail Drop: once the queue fills up, any new packets are dropped. This works well for UDP packets but causes severe disadvantages for TCP traffic. Tail Drop causes TCP traffic in already-established flows to quickly go into congestion avoidance mode, exponentially dropping the rate at which packets are sent. The result is called global synchronization: all TCP flows simultaneously decrease and then increase their flow rates together. What is needed is for some of the flows to slow down so that the other flows can take advantage of the freed-up buffer space. Random Early Detection (RED) is an active queue management algorithm that randomly drops packets before buffers fill up, reducing global synchronization.
FIGURE 4-14 describes the RED algorithm. Looking at line C on the far right, when the average queue occupancy goes from empty up to 75 percent full, no packets are dropped. However, as the queue grows past 75 percent, the probability that random packets are discarded quickly increases until the queue is full, where the probability reaches certainty. Weighted Random Early Detection (WRED) takes RED one step further by giving some of the packets different thresholds at which the probability of discard starts to rise. As illustrated in FIGURE 4-14, packets on line A start to be randomly dropped at only 25 percent average queue occupancy, making room for higher-priority flows B and C.
FIGURE 4-14  RED and WRED drop probability (0 to 1.0) versus average queue occupancy. Line A begins random discard at 25 percent occupancy, line C at 75 percent.
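The RED curve just described is a linear ramp between two thresholds. A minimal sketch, with illustrative threshold and probability values in the usage below:

```python
def red_drop_probability(avg_queue: float, min_th: float,
                         max_th: float, max_p: float) -> float:
    """Probability of discarding an arriving packet under RED, as a
    function of the average queue occupancy (all values in 0.0-1.0)."""
    if avg_queue < min_th:
        return 0.0                 # queue short: never drop
    if avg_queue >= max_th:
        return 1.0                 # queue effectively full: always drop
    # Linear ramp from 0 up to max_p between the two thresholds.
    return max_p * (avg_queue - min_th) / (max_th - min_th)

# Line C of FIGURE 4-14: drops begin at 75 percent occupancy.
p = red_drop_probability(avg_queue=0.9, min_th=0.75, max_th=1.0, max_p=0.1)
```

WRED simply runs this curve with per-class `min_th` values, so lower-priority classes (line A) start losing packets while higher-priority classes still queue freely.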
Packet Scheduler
The packet scheduler is one of the most important QoS functional components. It pulls packets from the queues and sends them out the egress port or forwards them to the adjacent STREAMS module, depending on implementation. Several packet scheduling algorithms service the queues in different manners.

Weighted Round-Robin (WRR) scans each queue and, depending on the weight assigned to that queue, allows a certain number of packets to be pulled from it and sent out. The weights represent a certain percentage of the bandwidth. In actual practice, unpredictable delays are still experienced because a large packet at the front of a queue can hold up smaller packets behind it.

Weighted Fair Queuing (WFQ) is a more sophisticated packet scheduling algorithm that computes the time each packet arrives and the time required to actually send out the entire packet. WFQ is thus able to handle varying-sized packets and optimally select packets for scheduling. WFQ is work-conserving, meaning that no packets wait idly when the scheduler is free. WFQ can also put a bound on the delay, as long as the input flows are policed and the lengths of the queues are bounded.

In Class-Based Queuing (CBQ), used in many commercial products, each queue is associated with a class, where higher classes are assigned a higher weight, translating to relatively more service time from the scheduler than the lower-priority queues.

Competitive product offerings by Packeteer and Allot are hardware solutions that sit between the clients and servers. These products offer pure QoS solutions, but they use the term policy to mean a specific QoS rule. These products are limited in their flexibility and integration with policy servers.
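Weighted round-robin can be sketched as one pass over the queues, draining up to `weight` packets from each. The queue names and weights below are illustrative:

```python
from collections import deque

def wrr_round(queues, weights):
    """One weighted round-robin service round: pull at most
    weights[name] packets from each queue, in queue order.
    Weights approximate each queue's share of the bandwidth
    (assuming roughly equal packet sizes)."""
    sent = []
    for name, q in queues.items():
        for _ in range(weights[name]):
            if not q:
                break              # queue empty, move to the next one
            sent.append(q.popleft())
    return sent
```

Note the packet-size caveat from the text: WRR counts packets, not bytes, so one large packet consumes more than its weighted share, which is what WFQ corrects.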
Chapter 4
109
Downstream

The record layer receives clear messages from the handshake layer. The record layer fragments and compresses the messages, applies a Message Authentication Code (MAC), and encrypts the records before sending them downstream to the TCP layer.

Upstream

The record layer receives TCP packets from the TCP layer and decrypts them, runs a MAC verification, decompresses and reassembles the messages, and decapsulates them before sending them to the higher layers.
Handshake Layer

This layer exchanges messages between client and server in order to exchange public keys, negotiate and advertise capabilities, and agree on:

- SSL version
- Cryptographic algorithm
- Cipher suite

The cipher suite contains the key exchange method, the data transfer cipher, and the message digest for the Message Authentication Code (MAC). SSL 3.0 supports a variety of key exchange algorithms.
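The cipher-suite structure described above (key exchange, bulk cipher, digest) is still visible in what a modern TLS library advertises. This sketch only lists the suites the local Python `ssl` library would offer; SSL 3.0 itself is long obsolete, so the listed suites are TLS successors of the mechanism described here.

```python
# Inspect the cipher suites a modern TLS stack would advertise during the
# handshake. Each suite name encodes key exchange, bulk cipher, and MAC/digest.
import ssl

ctx = ssl.create_default_context()
for suite in ctx.get_ciphers()[:3]:
    # 'name' is the suite (e.g. key exchange + cipher + digest); 'protocol'
    # is the minimum protocol version the suite applies to
    print(suite["name"], suite["protocol"])
```

Running this shows the same negotiation ingredients the handshake layer must agree on, just carried by a current protocol version.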
FIGURE 4-15 SSL handshake message exchange between client and server, from ClientHello through ChangeCipherSpec and Finished, followed by HTTP application data
Once the first set of messages is successfully completed, an encrypted communication channel is established. The following sections describe the differences between using a pure software solution and an SSL accelerator appliance in terms of packet processing and throughput.
We will not be discussing SSL in depth. The purpose of this section is to describe the different network architectural deployment scenarios you can apply to SSL processing. The following sections describe various approaches to scaling SSL processing capabilities from a network architecture perspective.
Software SSL Libraries. This approach uses the bundled SSL libraries and offers the most cost-effective option for processing SSL transactions.

Crypto Accelerator Board. This approach can offer a massive improvement in SSL processing performance for certain types of SSL traffic. Conclusions Drawn from the Tests on page 121 suggests when best to use the Sun Crypto Accelerator 1000 board, for example.

SSL Accelerator Appliance. This solution might have a high initial cost, but it proves to be very effective and manageable for large-scale SSL Web server farms. Conclusions Drawn from the Tests on page 121 suggests when best to deploy an appliance such as Netscaler or ArrayNetworks.
There are several deployment options for SSL acceleration. This section describes where it makes sense to deploy different SSL acceleration options. It is important to consider certain characteristics, including:
- The level or degree of security
- The number of client SSL transactions
- The volume of bulk encrypted data to be transferred in the secure channel
- Cost
- The number of horizontally scaled SSL Web servers
FIGURE 4-16 Software SSL: packets traverse the full stack (TCP, IP, NIC driver, PCI bridge, PCI bus, NIC)
an ASIC that can compute the cryptographic algorithms in very few clock cycles. However, there is overhead in transferring data to the card, as the PCI bus must first be arbitrated and traversed. Note that in the case of small data transfers, the benefit of the cryptographic computation acceleration offered by the card might not outweigh the overhead of the PCI transfers. Further, it is important to make sure the PCI slot used is 64-bit and 66 MHz. Using a 32-bit slot could have a performance impact.
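The small-transfer caveat above is a fixed-cost versus per-byte trade-off, which a rough break-even model makes concrete. All the numbers below are invented for illustration; real overheads depend on the bus, the card, and the CPU.

```python
# Rough break-even model for crypto-card offload: the fixed PCI
# arbitration/transfer overhead must be recovered by the card's faster
# per-byte processing. All costs are hypothetical.
def offload_wins(nbytes, pci_overhead_us=50.0,
                 cpu_us_per_kb=20.0, card_us_per_kb=2.0):
    cpu_time = nbytes / 1024 * cpu_us_per_kb            # software-only cost
    card_time = pci_overhead_us + nbytes / 1024 * card_us_per_kb
    return card_time < cpu_time

print(offload_wins(1024))       # 1 KB: fixed overhead dominates -> False
print(offload_wins(64 * 1024))  # 64 KB: acceleration dominates -> True
```

Whatever the real constants are, the shape is the same: below some crossover size the card loses, above it the card wins, which is why bulk transfers benefit most.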
FIGURE 4-17 Crypto accelerator board: the driver hands cryptographic operations across the PCI bus to the SSL accelerator card (TCP, IP, NIC driver, PCI bridge, SSL accelerator, NIC)
FIGURE 4-18 SSL accelerator appliance deployed between the clients and multiple backend servers
FIGURE 4-19 Software SSL library test setup: benchmark client deepak2 (129.146.138.98) driving the Sun ONE Web Server on deepak (129.146.138.99) through an Alteon switch
- Hardware SSL implementation, including a hardware coprocessor for the mathematically intensive computations of cryptographic algorithms
- Reuse of the backend SSL tunnel: by keeping one SSL tunnel alive and reusing it, the appliance achieves massive server SSL offload
We ran the benchmark load generator on client (deepak2). The client points to the VIP on the Netscaler, which terminates one side of the SSL connection. The Netscaler then reuses the backend SSL connection. This is also more secure because the client is unaware of the backend servers and hence can do less damage:
#abc -n 100 -c 10 -v 4 http://129.146.138.52:443/100m.file1 >./netscaler100mfel1n100c10.softwareonly
612 packets were transferred to complete 100 SSL handshakes in less than one second!
FIGURE 4-21
FIGURE 4-22 shows the SSL appliance tests. More client capacity was required to saturate the servers: we used two Sun Fire 3800 servers in addition to the Enterprise 450 server. The reason was that the SSL appliance terminated the SSL connection, performed all SSL processing, and maintained very few socket connections to the backend servers, thereby reducing the load on the servers.
FIGURE 4-22
FIGURE 4-23 suggests that there is a sweet spot for the number of threads used by the client load generator. After a certain point, performance drops. This suggests that software-only SSL processing benefits from increased threads only up to a certain maximum. These are initial tests and not comprehensive by any means. Our intent is to show that this is one potentially important configuration consideration, which might be beyond the scope of pure design.
FIGURE 4-23 SSL performance versus the number of client load-generator threads (10 to 50)
FIGURE 4-24 shows the impact of file size on SSL performance. Note that these are SSL-encrypted bulk files. The SSL appliance has a dramatic impact on increasing SSL throughput for large files. However, the number of transactions decreases in direct proportion to the file size. The link was a 1-gigabit pipe, which can support 125 MByte/sec of throughput. The results show that the limiting factor is actually not the network pipe.

Networking Concepts and Technology: A Designer's Resource
FIGURE 4-24 SSL performance and file size: fetches per second and kilobytes transferred per second for file sizes from 1 KB to 1 MB
The accelerator device can be installed in an existing infrastructure and can offer very good performance. The servers do not need to be modified, and only one device must be managed for SSL acceleration. Another benefit is that the appliance exploits the fact that not every server will be loaded with SSL at the same time. Hence, from a utilization standpoint, an appliance is more economically feasible.
CHAPTER 5

- Token Ring Networks
- Fiber Distributed Data Interface (FDDI) Networking
- Ethernet Networking
sending station checks to see that the destination station copied the data. If there is no more data to be sent, the sending station alters the frame's bit configuration so that it now functions as a free token available to another station on the ring. If a station fails, it is dynamically switched out of the ring, and the ring is automatically reconfigured. When the station has been repaired, the ring is again automatically reconfigured to include the repaired station.
FIGURE 5-1 Token ring operation: (1) Station A, waiting to send data to Station C, waits for the free token. (2) Station A changes the free token to busy and sends its data; Stations D and B repeat the data, and Station C receives it. (3) Station A receives its own busy token and generates a new free token.
The IEEE standard specifies the lower two layers of the OSI 7-layer model: the Physical layer (Layer 1) and the Data Link layer (Layer 2). The Data Link layer is further divided into the Logical Link Control (LLC) sublayer and the Media Access Control (MAC) sublayer.

The token ring driver is a multithreaded, loadable, clonable, STREAMS hardware driver that supports the connectionless Data Link Provider Interface, dlpi(7P), over a token ring controller. Multiple token ring controllers installed within the system are supported by the driver. SunTRI/S software can support different protocol architectures concurrently via the SNAP encapsulation technique of RFC 1042. With this SNAP encapsulation, high-level applications can communicate through their different protocols over the same SunTRI/S interface. Support also exists for adding different protocol packages (not included with SunTRI/S), including OSI and other protocols available directly from Sun or through third-party vendors. TCP/IP is implicit with the Solaris operating system.

The software driver also provides source routing, which enables the workstation to access multiple ring networks connected by source-route bridges. Locally administered addressing is also supported and aids in the management of certain user-specific and vendor-specific network configurations.
Support for IBM LAN Manager is provided by the TMS380 MAC-level firmware that complies with the IEEE 802.5 standard.
Chapter 5
125
parameters of the tr device that can be altered in the driver.conf file and global parameters that can be altered using /etc/system. TABLE 5-1 describes the tr.conf parameters.
TABLE 5-1  tr.conf Parameters

Parameter  Description
mtu        Maximum transfer unit index
sr         Source routing enable
ari        Disabling ARI/FCI soft error reporting
MTU Sizes

The mtu parameter is an index, 0 through 6, that selects the MTU size in bytes.
By default, source routing is enabled. To disable source routing, set the sr value in the tr.conf file to 0.
TABLE 5-3  Parameters: sr, ari
By default, the adapter is set to classic mode (half duplex). If the mode is set to DTR, the adapter will come up in full duplex mode. If the mode is set to auto, the adapter will automatically choose between classic and DTR mode, depending on the capabilities of the switch or media access unit (MAU).
TABLE 5-5  Parameter: mode

The operation mode can be set to:
0 = Classic mode
1 = Auto mode
2 = DTR mode
xxx is the number of 2-kilobyte buffers desired. Do not set a value less than the default of 64. Proper setting of this parameter requires tuning; values between 400 and 500 should be reasonable for a medium load. You must reboot the system after updating the /etc/system file for the changes to take effect.
page 234. The rest of this section describes the configuration of individual parameters of the trp device that can be altered in the driver.conf file and global parameters that can be altered using /etc/system.
TABLE 5-6  trp.conf Parameters

Parameter  Description
mtu        Maximum transfer unit
sr         Source routing enable
ari        Disabling ARI/FCI soft error reporting
mtu
The maximum MTU sizes supported are 4472 for 4 Mbit/sec operation and 17800 for 16 Mbit/sec operation. The default MTU size is 4472 bytes.
Ring Speed

Parameter: trp<instance>_ring_speed

The ring speed setting applied to the node:
0 = auto-detect (default)
4 = 4 Mbit/sec
16 = 16 Mbit/sec
To change the value of the ring speed on trp0 to 4 Mbit/sec and the ring speed on trp1 to 16 Mbit/sec, change the following settings in the trp.conf file:
trp0_ring_speed = 4
trp1_ring_speed = 16
For example, to use an LAA of 04:00:ab:cd:11:12 on the tr0 interface, use the following command:
The least significant bit of the most significant byte of the address used in the above command must never be 1. That bit is the individual/group bit, which is used for multicasting. For example, the address 09:00:ab:cd:11:12 would be invalid and would cause unexpected networking problems.
FIGURE 5-2 An FDDI ring of stations
FDDI Stations
An FDDI station is any device that can be attached to a fiber FDDI network through an FDDI interface. The FDDI protocols define two types of FDDI stations:
- Single-Attached Station (SAS)
- Dual-Attached Station (DAS)
Single-Attached Station
A SAS is attached to the FDDI network through a single connector, called the S-port. The S-port has a primary input (Pin) and a primary output (Pout). Data from an upstream station enters through Pin and exits from Pout to a downstream station, as shown in FIGURE 5-3. Single-attached stations are normally attached to single- and dual-attached concentrators as described in FDDI Concentrators on page 134.
FIGURE 5-3 A single-attached station: data enters the S-port at Pin, passes through the PHY, and exits at Pout
Dual-Attached Station
A DAS is attached to the FDDI network through two connectors, called the A-port and the B-port, respectively. The A-port has a primary input (Pin) and a secondary output (Sout); the B-port has a primary output (Pout) and a secondary input (Sin). The primary input/output is attached to the primary ring and the secondary input/output is attached to the secondary ring. The flow of data during normal operation is shown in FIGURE 5-4. To complete the ring, you must ensure that the B-port of an upstream station is always connected to the A-port of a downstream station. For this reason, most FDDI DAS connectors are keyed to prevent connections between two ports of the same type.
FIGURE 5-4 A dual-attached station: the B-port (Pout, Sin) and the A-port (Sout, Pin) connect the MAC and the two PHYs to the primary and secondary rings
FDDI Concentrators
FDDI concentrators are multiplexers that attach multiple single-attached stations to the FDDI ring. An FDDI concentrator is analogous to an Ethernet hub. The FDDI protocols define two types of concentrator:
- Single-Attached Concentrator (SAC)
- Dual-Attached Concentrator (DAC)
Single-Attached Concentrator
A SAC is attached to the FDDI network through a single connector, which is identical to the S-port on a single-attached station. It has multiple M-ports to which single-attached stations are connected, as shown in FIGURE 5-5.
FIGURE 5-5 A single-attached concentrator: multiple M-ports for attached stations and a single S-port (Pin, Pout)
Dual-Attached Concentrator
A DAC is attached to the FDDI network through two ports, the A-port and the B-port, which are identical to the ports on a dual-attached station. A DAC has multiple M-ports, to which single-attached stations are connected as shown in FIGURE 5-6. Dual-attached concentrators and FDDI stations are often arranged in a flexible network topology called the ring of trees. Additionally, many failover capabilities are built into the FDDI network to ensure that it is robust.
FIGURE 5-6 A dual-attached concentrator: multiple M-ports for attached stations, with the A-port and B-port (Pout, Sin, Sout, Pin) attaching to the dual ring
FDDI Interfaces
Sun supports two FDDI drivers for its range of SPARC platforms: the SBus driver, known as SunFDDI/S, and the PCI driver, known as SunFDDI/P. The SBus-based and PCI FDDI interfaces provide access to 100 Mbit/sec FDDI local area networks.
nf.conf Parameters

Parameter  Description
nf_mtu     Maximum transfer unit
nf_treq    Requested Target Token Rotation Time (T_req)

nf_mtu
If the network load is irregular (bursty traffic), the TTRT should be set as high as possible to avoid lengthy queueing at any one station. If the network is used for the bulk transfer of large data files, the TTRT should be set relatively high to obtain maximum throughput without allowing any one station to monopolize the network resources.
If the network is used for voice, video, or real-time control applications, the TTRT should be set low to decrease access delay.
The TTRT is established during the claim process. Each station on the ring bids a value (T_req) for the operating value of the TTRT (T_opr) and the station with the lowest bid wins the claim. Setting the value of T_req on a single station does not guarantee that this bid will win the claim process.
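The claim process above reduces to a simple rule: the lowest T_req bid on the ring becomes the operating TTRT. A tiny sketch (station names and bid values are invented) makes the point that one station's bid only sets T_opr if it happens to be the lowest:

```python
# Sketch of the FDDI claim process: every station bids its requested TTRT
# (T_req) and the lowest bid wins, becoming the operating TTRT (T_opr).
def claim_ttrt(bids_ms):
    """bids_ms: dict mapping station name -> requested TTRT in milliseconds."""
    winner = min(bids_ms, key=bids_ms.get)
    return winner, bids_ms[winner]

station, t_opr = claim_ttrt({"A": 8.0, "B": 4.0, "C": 16.5})
print(station, t_opr)  # B 4.0 -- station B's bid sets T_opr for the ring
```

This is why setting T_req on a single station does not guarantee the value will be used: any other station bidding lower wins the claim.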
TABLE 5-11  Parameter: nf_treq
pf.conf Parameters

Parameter  Description
pf_mtu     Maximum transfer unit
pf_treq    Requested Target Token Rotation Time (T_req)
pf_mtu
pf_treq
Ethernet Technology
This section discusses low-level network interface controller (NIC) architecture features. It explains the elements that make up a NIC adapter, breaking it down into the transmit (Tx) data path and the receive (Rx) data path, followed by the acceleration features available with more modern NICs. The components are broken down in this manner to provide the high-level understanding required to discuss the finer details of the Sun NIC devices available. These broad concepts, plus the finer details included in this explanation, will help you understand the operation of these devices and how to tune them for maximum benefit in throughput, request/response performance, and CPU utilization.

These concepts are also useful in explaining the development path that Sun took for its NIC technology. Each concept is retained from one Sun NIC to the next as each new product builds on the strengths of the last.

With the NIC architecture concepts in place, the next area of discussion is by far the largest source of customer discomfort with Ethernet technology: the physical layer. The original ubiquitous Ethernet technology was 10 Mbit/sec. Ethernet technology has been improved continuously over the years, going from 10 Mbit/sec to 100 Mbit/sec and most recently to 1 Gbit/sec. Along the way, Ethernet technology always promised to be backward compatible, accomplishing this with a technology called auto-negotiation, which allows new Ethernet arrivals to connect to the existing infrastructure and establish the correct speed at which to operate within that infrastructure. On the whole, the technology works very well, but there are some difficulties in understanding the Ethernet physical layer. Our explanation of this layer should facilitate better use of this feature.

The last addition to Ethernet technology is network congestion control using pause flow control. This is a useful but under-utilized feature of Ethernet that we hope to demystify.
Chapter 5 Server Network Interface Cards: Datalink and Physical Layer 139
FIGURE 5-7 NIC overview: the DLPI interface and the software domain (transmit and receive paths) above the hardware domain (NIC device)
Transmit
The Transmit portion of the software device driver level is the simpler of the two and basically is made up of a Media Access Control module (MAC), a direct memory access (DMA) engine, and a descriptor ring and buffers. FIGURE 5-8 illustrates these items in relation to the computer system.
FIGURE 5-8 Transmit Architecture: the Tx DMA engine moves packet data from system memory through the MAC to the PHY device
The key element to this transmit architecture is the descriptor ring. This is the part of the architecture where the transmit hardware and the device driver transmit software share information required to move data from the system memory to the Ethernet network connection. The transmit descriptor ring is a circular array of descriptor elements that are constantly being used by the hardware to find data to be transmitted from main memory to the Ethernet media. At a minimum, the transmit descriptor element contains the length of the Ethernet packet data to be transmitted and a physical pointer to a location in system physical memory to find the data.

The transmit descriptor element is created by the NIC device driver as a result of a request at the DLPI interface layer to transmit a packet. That element is placed on the descriptor ring at the next available free location in the array. Then the hardware is notified that a new element is available. The hardware fetches the new descriptor and, using the pointer to the packet data physical memory, moves the data from the physical memory to the Ethernet media for the given length of the packet provided in the Tx descriptor. Note that requests for more packets to be transmitted by the DLPI interface continue while the hardware is transmitting the packets already posted on the descriptor ring.

Sometimes the arrival rate of the transmit packets at the DLPI interface exceeds the rate of transmission of the packets by the hardware to the media. In that case, the descriptor ring fills up and further attempts to transmit must be postponed until previously posted transmissions are completed by the hardware and more descriptor elements are made available by the device driver software. This is a typical producer-consumer effect, where the producer is the DLPI interface producing requests for the transmit descriptor ring and the hardware is the consumer consuming those requests and moving data to the media.
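The producer-consumer behavior of the descriptor ring can be modeled in a few lines. This is a toy sketch, not driver code; the ring size, descriptor fields, and method names are invented for illustration.

```python
# Toy model of the transmit descriptor ring: the driver (producer) posts
# descriptors, the NIC (consumer) drains them in FIFO order; when the ring
# is full the driver must postpone transmission.
class TxRing:
    def __init__(self, size):
        self.size = size
        self.ring = [None] * size
        self.head = 0          # next slot the driver fills
        self.tail = 0          # next slot the hardware consumes
        self.count = 0

    def post(self, addr, length):
        """Driver side: post a (buffer address, length) descriptor."""
        if self.count == self.size:
            return False       # ring full: postpone this packet
        self.ring[self.head] = (addr, length)
        self.head = (self.head + 1) % self.size
        self.count += 1
        return True            # hardware would now be notified

    def hw_consume(self):
        """Hardware side: fetch the oldest descriptor and transmit it."""
        if self.count == 0:
            return None
        desc = self.ring[self.tail]
        self.tail = (self.tail + 1) % self.size
        self.count -= 1
        return desc

ring = TxRing(2)
print(ring.post(0x1000, 64), ring.post(0x2000, 1500), ring.post(0x3000, 64))
# -> True True False: the third post must wait for ring space
print(ring.hw_consume())  # (4096, 64): the hardware drains in FIFO order
```

Enlarging `size` in this model is exactly the tuning lever discussed next: a deeper ring absorbs a longer transmission latency before posts start failing.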
This producer-consumer effect can be reduced by increasing the size of the transmit descriptor ring to accommodate the delay that the hardware or the underlying media imposes on the movement of the data. This delay is also known as transmission latency. Later sections describe how many of the device drivers give a measurement of how often the transmission latency becomes so large that data transmission is postponed while awaiting transmit descriptor ring space. The aim is to avoid this situation. In some cases, NIC hardware allows you to increase the size of the descriptor ring, allowing a larger transmit latency. In other cases, the hardware has a fixed upper limit for the size of the transmit descriptor ring. In those cases, there's a hard limit to how much latency the transmit path can endure before postponing packets becomes inevitable.
Consistent mode uses a consistency protocol in hardware, which is common in both x86 platforms and SPARC platforms. Streaming mode uses a software synchronization method.
The trade-offs between consistent and streaming modes are largely due to the pre-fetch capability of the DMA transaction. In consistent mode there's no pre-fetch, so when a DMA transaction is started by the device, each cache line of data is requested individually. In streaming mode a few extra cache lines can be pre-fetched in anticipation of being required by the hardware, reducing per-cache-line re-arbitration costs. All of these trade-offs lead to the following rules for using ddi_dma, fast dvma, and consistent versus streaming mode:
- If the packets are small, avoid setting up a mapping on a per-packet basis. Small packets passed down from the upper layer are copied out of the message into a pre-mapped buffer. That pre-mapped buffer is usually a consistent-mode buffer, as the benefits of streaming mode are difficult to realize for small packets.
- Large packets should use the fast dvma mapping interface. Streaming mode is assumed in this mode. On x86 platforms, streaming mode is not available.
- Mid-range packets should use the ddi_dma mapping interface. This range applies to all cases where fast dvma is not available. The mid-range can be split further, since you can explicitly control whether the DMA transaction uses consistent mode or streaming mode. Given that streaming-mode pre-fetch works best for larger transactions, the upper half should use streaming mode while the lower half uses consistent mode.
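These per-size rules can be expressed as a selector. The thresholds below are invented placeholders; as the next paragraph notes, the real values depend on system memory latency and must be tuned experimentally.

```python
# The small/mid/large mapping rules expressed as a selector.
# Thresholds are hypothetical and must be tuned per system.
def pick_tx_strategy(pkt_len, fast_dvma_available=True, x86=False):
    SMALL, LARGE, MID_SPLIT = 256, 1024, 512   # illustrative thresholds
    if pkt_len < SMALL:
        # small: copy into a pre-mapped consistent-mode buffer
        return "copy into pre-mapped consistent-mode buffer"
    if pkt_len >= LARGE and fast_dvma_available and not x86:
        # large: fast dvma, streaming mode assumed (not available on x86)
        return "fast dvma mapping (streaming mode)"
    if pkt_len >= MID_SPLIT and not x86:
        # upper mid-range: ddi_dma with streaming mode for pre-fetch benefit
        return "ddi_dma mapping, streaming mode"
    # lower mid-range, or any x86 case: ddi_dma with consistent mode
    return "ddi_dma mapping, consistent mode"

print(pick_tx_strategy(64))
print(pick_tx_strategy(1500))
print(pick_tx_strategy(600, fast_dvma_available=False))
```

The structure, not the constants, is the point: one copy-based path for small packets, one pre-mapped path for large ones, and a split mid-range.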
Setting the thresholds for these rules requires a clear understanding of the memory latencies of the system and the distance between the I/O expander card and the CPU card. The rule of thumb is: the larger the system, the larger the memory latency. Once the coarse-grained tuning is applied, more fine-grained tuning is required. The best tuning is established by experimentation. A good way to do this is to run FTP or NFS transfers of large files and measure the throughput.
This feature requires a new interface to the driver, so only the most recent devices have implemented it. Furthermore, it can be enabled only if TCP/IP is configured to allow it. Even then, TCP/IP will only attempt to build an MDT transaction to the driver if the TCP connection is operating in a bulk transfer mode such as FTP or NFS. The Multi-Data Transmit capability is also included as part of the performance enhancements provided in the ce driver. The feature is negotiated with the upper layer protocol, so it must be enabled in the ce driver as well as in the upper layer protocol. If there's no negotiation, the feature is disabled. The TCP/IP protocol began supporting the Multi-Data Transmit capability in the Solaris 9 8/03 operating system, but by default it will not negotiate with the driver to enable it. The first step to making this capability available is to enable the negotiations through an /etc/system tunable parameter.
TABLE 5-15  Multi-Data Transmit Tunable Parameter

Parameter      Values  Description
ip_use_dl_cap  0-1     Enables negotiation of special hardware accelerations with a lower layer. 1 = Enable, 0 = Disable. Default: 0
To enable the multi-data transmit capability, add the following line to the /etc/system file:
set ip:ip_use_dl_cap = 1
Receive
The receive side of the interface looks much like the transmission side, but it requires more from the device driver to ensure that packets are passed to the correct stream. There are also multithreading techniques to ensure that the best advantage is made of multiprocessor environments. FIGURE 5-9 shows the basic Rx architecture.
FIGURE 5-9 Receive architecture: the Rx DMA engine moves incoming packets from the PHY device and MAC into system memory
The receive descriptor plays a key role in the process of receiving packets. Unlike transmission, receive packets originate from remote systems, so the Rx descriptor ring refers to buffers where those incoming packets can be placed. At a minimum, the receive descriptor element provides a buffer length and a pointer to an available buffer.

When a packet arrives, it's received first by the PHY device and then passed to the MAC, which notifies the Rx DMA engine of an incoming packet. The Rx DMA takes that notification and uses it to initiate an Rx descriptor element fetch. The descriptor is then used by the Rx DMA to post the data from the MAC device's internal FIFOs to system main memory. The length provided by the descriptor ensures that the Rx DMA doesn't exceed the buffer space provided for the incoming packet. The Rx DMA continues to move data until the packet is complete. Then it places in the current descriptor location a new completion descriptor containing the size of the packet that was just received. In some cases, depending on the hardware capability, the completion descriptor might carry more information about the incoming packet (for example, a TCP/IP partial checksum).

When the completion descriptor is placed back onto the Rx descriptor ring, the hardware advances its pointer to the next free Rx descriptor and interrupts the CPU to notify the device driver that a packet needs to be passed to the DLPI layer. Once the device driver receives the packet, it is responsible for replenishing the Rx descriptor ring. That process requires the driver to allocate and map a new buffer for DMA and post it to the ring; the hardware is then notified that this new descriptor is available. Once the buffer is replenished, the current packet can be passed up for classification to the stream expecting its arrival.
It is possible that the allocation and mapping can fail. In that case, the current packet cannot be received, as its buffer is reposted to the ring to allow the hardware to continue to receive packets. This condition is not very likely, but it is an example of an overflow condition. Other overflow conditions can occur on the Rx path starting from the DLPI layer:
- Overflow can be caused when the DLPI layer cannot receive the incoming packet. In that case, packets are typically dropped even though they were successfully received by the hardware.
- Overflow can be caused when the device driver software is unable to replenish the Rx descriptor elements faster than the NIC hardware consumes them. This usually occurs because the system doesn't have enough CPU performance to keep up with the network traffic.
Overflow also occurs within the NIC device between the MAC and the Rx DMA interface. This is known as MAC overflow. It is caused when the descriptor ring overflows and the backlog backs up into the MAC, or when a high-latency system bus makes the MAC overflow its internal buffer while it waits for the Rx DMA to get access to the bus and move the data from the MAC buffer to main memory. While a MAC overflow condition exists, any arriving packet cannot be received; such a packet is considered missed.

In some cases, overflow conditions can be avoided by careful tuning of the device driver software. The extent of available tuning depends on the NIC hardware. In cases where the Rx descriptor ring is overflowed, many devices allow increases in the number of descriptor elements available. This will be discussed further with respect to example NIC cards in later sections.

You can avoid the MAC overflow condition by careful system configuration, which can require more memory, faster CPUs, or more CPUs. It might also require that NIC cards not share the system bus with other devices. Newer devices can adjust the priority of the Rx DMA versus the Tx DMA, giving one more favorable access to the system bus than the other. Therefore, if the MAC overflow condition occurs, it might be possible to raise the Rx DMA priority, making Rx accesses to the system bus more favorable than Tx accesses and reducing the likelihood of MAC overflow.

The overflow condition from the DLPI layer is caused by an overwhelmed CPU. A few newer hardware features help reduce this effect: hardware checksumming, interrupt blanking, and CPU load balancing.
Checksumming
The hardware checksumming feature accelerates the one's complement checksum applied to TCP/IP packets. The TCP/IP checksum is applied to each packet sent by the TCP/IP protocol. It is made up of a one's complement addition of the bytes in the pseudo-header plus all the bytes in the payload. The pseudo-header is built from the source and destination IP addresses, the protocol number, and the TCP segment length.

The hardware checksumming feature is merely an acceleration. Most hardware designs don't implement the TCP/IP checksum directly. Instead, the hardware does the bulk of the one's complement additions over the data and allows the software to take that result and mathematically adjust it so that it matches the complete checksum. On transmission, the TCP checksum field is filled with an adjustment value that is treated as just another two bytes of data during the hardware's one's complement addition over all the bytes of the packet. The end result is a mathematically correct checksum that the MAC can place in the TCP header as the packet is transmitted to the network.
FIGURE 5-10 Transmit checksumming: the hardware starts its checksum after the Ethernet and IP headers and places the result in the TCP header checksum field
On the Rx path, the hardware computes the one's complement checksum from a starting point in the packet. That starting point is passed to TCP/IP along with the one's complement sum of the bytes in the incoming packet. The TCP/IP software again does a mathematical fix-up using this information before it finally compares the result with the TCP/IP checksum bytes that arrived as part of the packet. The main advantage of hardware checksumming is that it relieves the system CPU of calculating the checksum for large packets, allowing the majority of the checksum calculation to be completed by the NIC hardware. Because the hardware does not do the complete TCP/IP checksum calculation, this form of TCP/IP checksum acceleration is called partial checksumming.
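The one's complement arithmetic the hardware performs is simple to sketch: sum 16-bit words with end-around carry, then complement. This is the standard Internet checksum computation (RFC 1071 style), shown here over a raw byte string rather than a real packet.

```python
# One's complement sum as the checksumming hardware performs it:
# add 16-bit words with end-around carry, then take the complement.
def ones_complement_sum(data: bytes, start: int = 0) -> int:
    """Sum 16-bit big-endian words of data[start:], folding carries back in."""
    if (len(data) - start) % 2:
        data = data + b"\x00"                      # pad odd-length data
    total = 0
    for i in range(start, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)   # end-around carry
    return total

def checksum(data: bytes) -> int:
    """Final complement, as placed in the TCP header checksum field."""
    return ~ones_complement_sum(data) & 0xFFFF

print(hex(checksum(b"\x00\x01\xf2\x03")))  # 0xdfb
```

The "partial" aspect in the text corresponds to the `start` offset: the hardware sums from a given point onward, and software adjusts the result for the pseudo-header before comparing it against the checksum carried in the packet.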
FIGURE 5-11 Receive checksumming: the hardware's one's complement sum starts at a given offset past the Ethernet and IP headers and is passed up with the packet for the software fix-up
Interrupt Blanking
Interrupt blanking is another hardware acceleration. With regular NIC devices, the CPU is typically interrupted when a receive packet arrives, so the CPU is interrupted on a per-packet basis. While this is reasonable for transactional requests, where you would expect an immediate response to a request, it is not always required, especially in large bulk data transfers. In the single-interrupt-per-packet case, each packet arrival adds the cost of processing that individual packet to the overhead of the interrupt processing.

The interrupt blanking technique allows a set number of packets to arrive before the next receive interrupt is generated. This allows the overhead of the interrupt processing to be distributed, or amortized, across that number of received packets. If the packet count is not reached, the packets that have arrived so far do not generate an interrupt; a timeout ensures that the receive interrupt is eventually generated and those packets are processed.

The best setting for interrupt blanking depends on the type of traffic (transactional versus bulk data transfer) and the speed of the system. These parameters are best tuned empirically, when the traffic characteristics are well known and the interrupt blanking can be adjusted dynamically to match. This will be discussed further in the context of individual NICs that provide this feature.
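The amortization argument is back-of-envelope arithmetic. The costs below are invented for illustration; only the shape of the curve matters.

```python
# Hypothetical per-packet CPU cost under interrupt blanking: one interrupt
# covers N packets, so the fixed interrupt overhead is spread across them.
def cpu_us_per_packet(pkts_per_intr, intr_overhead_us=20.0, per_pkt_us=2.0):
    return per_pkt_us + intr_overhead_us / pkts_per_intr

print(cpu_us_per_packet(1))   # 22.0 us/packet: no blanking
print(cpu_us_per_packet(32))  # 2.625 us/packet: overhead amortized over 32
```

The model also shows the trade-off the text describes: larger blanking values cut CPU cost for bulk transfers but add latency for transactional traffic, which is why the timeout and dynamic tuning matter.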
Software load balancing can be enhanced with hardware support, but it can also be implemented without it. Essentially, it requires the ability to separate the workload of different connections from the same protocol stack into flows that can then be processed on different CPUs. The interrupt thread is now required only to replenish buffers for the descriptor ring, allowing more packets to arrive. Packets taken off the receive rings are then load balanced into packet flows based on connection information from the packet. A packet flow has a circular array that is updated with receive packets from the interrupt service routine while packets posted earlier are being removed and post-processed in the protocol stack by the kernel worker thread. Usually more than one flow, each made up of a circular array and a corresponding kernel worker thread, is set up within a system. The more CPUs available, the more flows can be allowed. The kernel worker threads are available to run whenever packet data is available on a flow's array. The system scheduler participates using its own CPU load-balancing technique to ensure a fair distribution of the workload for incoming data.
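Picking the flow for a packet amounts to hashing its connection information so that one connection always lands in the same circular array. A minimal sketch, in which the function name and hash are assumptions rather than the actual implementation:

```c
#include <stdint.h>

/* Illustrative flow selection (not Sun driver code): map a connection's
 * 4-tuple to one of `nflows` circular arrays, so each kernel worker thread
 * post-processes its own flow and ordering within a connection is preserved. */
static unsigned flow_select(uint32_t saddr, uint32_t daddr,
                            uint16_t sport, uint16_t dport, unsigned nflows)
{
    uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);
    h ^= h >> 16;          /* mix high bits into the low bits */
    return h % nflows;     /* same connection always maps to the same flow */
}
```

Determinism is the key property: packets of one TCP connection must never be split across worker threads, or they could be post-processed out of order.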
FIGURE 5-12 demonstrates the architecture of software load balancing.
[FIGURE 5-12: software load balancing. The interrupt service routine distributes received packets into several circular arrays; each is drained by a kernel worker thread feeding the protocol stack.]
Hardware load balancing requires that the hardware provide built-in load balancing capability. The PCI bus enables receive hardware load balancing by using its four available interrupt lines together with the ability of UltraSPARC III systems to have each of those four interrupt lines serviced by a different CPU within the system. The advantage of having the four receive interrupt lines running on different CPUs is that it allows not only the protocol post-processing to happen in parallel, as in the case of software load balancing, but also the processing of the descriptor rings in the interrupt service routines to run in parallel, as shown in FIGURE 5-13.
[FIGURE 5-13: hardware load balancing. Four PCI interrupt lines, each serviced by a different CPU, feed the protocol stack in parallel.]
It is possible to combine the concept of software load balancing with the concept of hardware load balancing if enough CPUs are available to allow all the parallel Rx processing to happen. However, there is a gotcha with this load balancing capability: to realize its benefit, you must have multiple connections providing load to balance in the first place.
The Streams Service Queue model also requires the driver to decouple interrupt processing from protocol stack processing. In this model, there's no requirement to provide a hint because there is only one protocol processing thread per queue open to the driver; with respect to TCP/IP, that's only one stream. This method works best on systems with a small number of CPUs, but more than one. Like CPU load balancing, it compromises on latency. The most common received-packet delivery method is to do all the interrupt processing and protocol processing in the interrupt thread. This is a widely accepted method, but it is restricted by the available bandwidth of the CPU taking all the NIC driver interrupts. This is really the only option on a single-CPU system. In a multi-CPU system, you can choose one of the other two methods if it's established that the CPU taking the NIC interrupts is being overwhelmed. That situation becomes apparent when the system starts to become unresponsive.
Two overflow conditions can occur. In the first, the internal device memory is full and the adapter is unable to get timely access to the system bus to move data from that device memory to system memory. In the second, the system is so busy servicing packets that the descriptor rings fill up with inbound packets and no further packets can be received. The second condition is very likely to also trigger the first at the same time.
When these overflow conditions occur, the upper layer connections effectively stop receiving packets and a connection appears to have stalled, at least for the duration of the overflow condition. With TCP/IP in particular, this leads to many packets being lost. The connection state is modified to assume a less reliable connection, and in some cases the connections might be lost completely. The impact of a lost connection is obvious, but if the TCP/IP protocol assumes a less reliable connection, it further contributes to the congestion on the network by reducing the number of packets outstanding without an ACK from the regular eight to a smaller value. A technique that avoids this scenario takes advantage of TCP/IP's ability to tolerate the occasional single packet loss on a connection while still maintaining the same number of packets outstanding without an ACK. The lost packet is simply requested again, and the transmitting end of the connection performs a retry. Completely avoiding the overflow scenario is impossible, but you can reduce its likelihood by dropping random packets already received in the device memory, avoiding propagating them further into the system and adding to the workload already piled up for the system. This technique, known as Random Early Discard (RED), has the desired effect of avoiding overwhelming the system while having minimal negative effect on the TCP/IP connections. The rate of random discard is set relative to how many bytes of packet data occupy the device's internal memory. The internal memory is split into regions. As one region fills up with packet data, it spills into the next until all regions of memory are filled and overflow occurs. When the packet data spills from one region to the next, that's the trigger to randomly discard. The number of packets discarded is based on the number of regions filled; the more regions filled, the more you need to discard, as you're getting closer to the overflow state.
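A drop decision keyed to the number of filled regions can be sketched as follows. The linear drop curve and the names are illustrative assumptions, not the actual hardware behavior:

```c
#include <stdlib.h>

/* Illustrative Random Early Discard sketch: the more internal memory regions
 * are occupied, the higher the probability that a newly received packet is
 * dropped before being propagated into the system. */
static int red_drop(unsigned regions_filled, unsigned regions_total,
                    unsigned rand_val /* uniform in 0..RAND_MAX */)
{
    if (regions_filled == 0)
        return 0;                              /* plenty of room: never drop */
    if (regions_filled >= regions_total)
        return 1;                              /* overflow imminent: always drop */
    /* drop with probability regions_filled / regions_total */
    return (unsigned long long)rand_val * regions_total
         < (unsigned long long)RAND_MAX * regions_filled;
}
```

In use, `red_drop(filled, total, rand())` would be evaluated per arriving packet; TCP's retransmission absorbs the occasional single-packet loss far more gracefully than a wholesale overflow.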
Jumbo Frames
Jumbo frames technology allows the size of an Ethernet data packet to be extended past the 1514-byte standard limit, which is the norm for Ethernet networks. The typical size of jumbo frames has been set to 9000 bytes when viewed from the IP layer. Once the Ethernet header is applied, that grows by 14 bytes for the regular case or 18 bytes for VLAN packets. When jumbo frames are enabled on a subnet or VLAN, every member of that subnet or VLAN should be enabled to support jumbo frames. To ensure that this is the case, configure each node for jumbo frames. The details of how to set and check a node for jumbo frames capability tend to be NIC device/driver-specific and are discussed below for the interfaces that support them. If any one node in the subnet is not enabled for jumbo frames, no members in the subnet can operate in jumbo frame mode, regardless of their preconfiguration to support jumbo frames. The big advantage of jumbo frames is similar to that provided by MDT: a huge improvement in bulk data transfer throughput with a corresponding reduction in CPU utilization, with the addition that the same level of improvement is also available in the receive direction. Therefore, the best bulk transfer results can be achieved using this mode. The jumbo frames mode should be used with care because not all switches or networking infrastructure elements are jumbo frames capable. When you enable jumbo frames, make sure that they're contained within a subnet or VLAN where all the components in that subnet or VLAN are jumbo frames capable.
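The throughput benefit comes largely from amortizing per-packet costs over fewer, larger packets. A back-of-envelope sketch, assuming an illustrative 40 bytes of TCP/IP headers per packet:

```c
/* Rough packet-count comparison for a bulk transfer at the standard
 * 1500-byte IP MTU versus a 9000-byte jumbo MTU. The 40-byte header
 * overhead is an assumption for illustration. */
static unsigned long packets_needed(unsigned long bytes, unsigned long mtu)
{
    unsigned long payload = mtu - 40;        /* assumed TCP/IP header bytes */
    return (bytes + payload - 1) / payload;  /* round up */
}
```

Moving 1 MB takes roughly six times fewer packets at a 9000-byte MTU, and with them roughly six times fewer per-packet interrupts, checksums, and descriptor operations.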
protocol, a partition was made between the media-specific portion and the Ethernet protocol portion of the overall Ethernet technology. At that partition was placed the Media Independent Interface (MII). The MII allowed Ethernet to operate over fiber-optic cables to switches built to support fiber. It also allowed the introduction of a new twisted-pair copper technology, 100BASE-T4. These differing technologies for supporting 100 Mbit/sec ultimately did not survive the test of time, leaving 100BASE-T as the standard 100 Mbit/sec media type. The existing widespread adoption of 10 Mbit/sec Ethernet brought with it a requirement that Ethernet media for 100 Mbit/sec should allow for backward compatibility with existing 10 Mbit/sec networks. Therefore, the MII was required to support 10 Mbit/sec operation as well as 100 Mbit/sec and allow the speed to be user selectable or automatically detected or negotiated. Those requirements led to the ability to force a particular speed setting for a link, known as Forced mode; to set a particular speed based on link speed signaling, known as auto-sensing; and to have both sides of the link share information about link speed and duplex capabilities and negotiate the best speed and duplex for the link, known as Auto-negotiation.
[FIGURE 5-14: Basic Mode Control Register bits (Reset, Speed Select, Auto-Negotiation Enable, Restart Auto-Negotiation, Duplex Mode).]
The Reset bit is a self-clearing bit that allows the software to reset the physical layer. This is usually the first bit touched by the software in order to begin the process of synchronizing the software state with the hardware link state. Speed Selection is a single bit, and it is only meaningful in Forced mode. Forced mode of operation is available when auto-negotiation is disabled. If this bit is set to 0, then the speed selected is 10 Mbit/sec. If set to 1, then the speed selected is 100 Mbit/sec.
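These control bits sit at fixed positions in the 16-bit register. The masks below follow the IEEE 802.3 clause 22 layout (the same values appear in common MII headers); they are shown for illustration rather than taken from this book:

```c
#include <stdint.h>

/* MII Basic Mode Control Register bit masks, per IEEE 802.3 clause 22. */
#define BMCR_RESET      0x8000  /* self-clearing: reset the physical layer   */
#define BMCR_SPEED100   0x2000  /* Speed Select: 1 = 100 Mbit/sec (Forced)   */
#define BMCR_ANENABLE   0x1000  /* Auto-Negotiation Enable                   */
#define BMCR_ANRESTART  0x0200  /* Restart Auto-Negotiation                  */
#define BMCR_FULLDPLX   0x0100  /* Duplex Mode: 1 = full duplex (Forced)     */

/* Example: compose a BMCR write that forces 100 Mbit/sec full duplex.
 * Auto-negotiation must be disabled for these bits to be meaningful. */
static uint16_t bmcr_force_100fdx(void)
{
    return BMCR_SPEED100 | BMCR_FULLDPLX;   /* BMCR_ANENABLE deliberately clear */
}
```

Restarting negotiation after updating the Advertisement register would similarly be a write of `BMCR_ANENABLE | BMCR_ANRESTART`.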
When the Auto-negotiation Enable bit is set to 1, auto-negotiation is enabled, and the Speed Selection and Duplex Mode bits are no longer meaningful. The speed and duplex mode of the link are established based on auto-sensing or the auto-negotiation advertisement register exchange. The Restart Auto-negotiation bit is used to restart auto-negotiation. This is required during the transition from Forced mode to Auto-negotiation mode or when the Advertisement register has been updated to a different set of auto-negotiation parameters. The Duplex Mode bit is only meaningful in Forced mode. When set to 1, the link is set up for full-duplex mode. When set to 0, the link operates in half-duplex mode.
[FIGURE 5-15: Basic Mode Status Register bits (100BASE-T4, Auto-negotiation Complete, Link Status, Auto-negotiation Capable).]
When the 100BASE-T4 bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T4 networking. When set to 0, it is not. When the 100BASE-T Full-duplex bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T full-duplex networking. When set to 0, it is not capable. When the 100BASE-T Half-duplex bit is set to 1, it indicates that the physical layer device is capable of 100BASE-T half-duplex networking. When set to 0, it is not capable. When the 10BASE-T Full-duplex bit is set to 1, it indicates that the physical layer device is capable of 10BASE-T full-duplex networking. When set to 0, it is not. When the 10BASE-T Half-duplex bit is set to 1, it indicates that the physical layer device is capable of 10BASE-T half-duplex networking. When set to 0, it is not.
The Auto-negotiation Complete bit is only meaningful when the physical layer device is capable of auto-negotiation and it is enabled. Auto-negotiation Complete indicates that the auto-negotiation process has completed and that the information in the link partner auto-negotiation advertisement register accurately reflects the link capabilities of the link partner. When the Link Status bit is set to 1, it indicates that the physical link is up. When set to 0, the link is down. When used in conjunction with auto-negotiation, this bit must be set together with the Auto-negotiation Complete bit before the software can establish that the link is actually up. In Forced mode, as soon as this bit is set to 1, the software can assume the link is up. When the Auto-negotiation Capable bit is set to 1, it indicates that the physical layer device is capable of auto-negotiation. When set to 0, the physical layer device is not capable. This bit is used by the software to establish whether any further auto-negotiation processing should occur.
[FIGURE 5-16: Auto-negotiation Advertisement Register (ANAR) and Link Partner ANAR (LPANAR) bits, including 100BASE-T4.]
When the 100BASE-T4 bit is set to 1 in the ANAR, it advertises the intention of the local physical layer device to use 100BASE-T4. When set to 0, this capability is not advertised. The same bit, when set to 1 in the LPANAR, indicates that the link partner physical layer device has advertised 100BASE-T4 capability. When set to 0, the link partner is not advertising this capability.
The 100BASE-T Full-duplex, 100BASE-T Half-duplex, 10BASE-T Full-duplex, and 10BASE-T Half-duplex bits all have the same functionality as 100BASE-T4 and provide the ability to decide what link capabilities should be shared for the link. The decision is made by the physical layer hardware based on priority, as shown in FIGURE 5-17, and is the result of logically ANDing the ANAR and the LPANAR on completion of auto-negotiation.
FIGURE 5-17: Priority for the hardware decision process, highest to lowest:
1. 100BASE-T Full Duplex
2. 100BASE-T4
3. 100BASE-T Half Duplex
4. 10BASE-T Full Duplex
5. 10BASE-T Half Duplex
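The decision process reduces to ANDing the two advertisement registers and scanning the common bits in priority order. A sketch using the standard IEEE 802.3 clause 28 ANAR bit positions (the masks are standard; the function itself is illustrative):

```c
#include <stdint.h>

/* ANAR/LPANAR technology ability bits, per IEEE 802.3 clause 28. */
#define AN_100FDX  0x0100  /* 100BASE-TX full duplex */
#define AN_100T4   0x0200  /* 100BASE-T4             */
#define AN_100HDX  0x0080  /* 100BASE-TX half duplex */
#define AN_10FDX   0x0040  /* 10BASE-T full duplex   */
#define AN_10HDX   0x0020  /* 10BASE-T half duplex   */

/* AND the local and partner advertisements, then take the common
 * capability of highest priority (the FIGURE 5-17 ordering). */
static uint16_t an_resolve(uint16_t anar, uint16_t lpanar)
{
    static const uint16_t prio[] = { AN_100FDX, AN_100T4, AN_100HDX,
                                     AN_10FDX, AN_10HDX };
    uint16_t common = anar & lpanar;
    unsigned i;
    for (i = 0; i < sizeof prio / sizeof prio[0]; i++)
        if (common & prio[i])
            return prio[i];
    return 0;   /* no common technology: link cannot be negotiated */
}
```

For example, a partner advertising only the 10BASE-T modes against a fully capable local device resolves to 10BASE-T full duplex.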
Auto-negotiation in the purest sense requires that both sides participate in the exchange of ANARs. This allows both sides to complete loading of the LPANAR and establish a link that operates at the best negotiated value. It is possible that one side, or even both sides, of the link might be operating in Forced mode instead of Auto-negotiation mode. This can happen because a new device is connected to an existing 10/100 Mbit/sec link that was never designed to support auto-negotiation, or because auto-negotiation is switched off on one or both sides. If both sides are in Forced mode, you need to set the correct speed and duplex for both sides. If the speed is not matched, the link will not come up, so speed mismatches can be easily tracked down once the physical connection is checked and considered good. If the duplex is not matched yet the speed is matched, the link will come up, but there's an often-unnoticed gotcha in that. If one side is set to half duplex while the other is set to full duplex, then the half-duplex side will operate with the Ethernet protocol Carrier Sense Multiple Access with Collision Detection (CSMA/CD) while the full-duplex side will not. To the physical layer, this means that the full-duplex side is not adhering to the half-duplex CSMA/CD protocol and will not back off if someone is currently transmitting. To the half-duplex side of the connection, this appears as a collision, and its transmission is stopped. These collisions will occur frequently, preventing the link from operating at its best capacity. If one side of the connection is running Auto-negotiation mode and the other is running Forced mode, and the auto-negotiating side is capable of and advertising all available MII speeds and duplex settings, the link speed will always be negotiated successfully by the auto-sensing mechanism provided as part of the auto-negotiation protocol. Auto-sensing uses physical layer signaling to establish the operating speed
of the Forced side of the link. This allows the link to at least come up at the correct speed. The link duplex, on the other hand, needs the Advertisement register exchange and cannot be established by auto-sensing. Therefore, if the link duplex setting on the Forced mode side of the link is full duplex, then the best guess the auto-negotiating side of the link can make is half duplex. This gives rise to the same effect discussed when both sides are in Forced mode and there's a duplex mismatch. The only solution to the issue of duplex mismatch is to be aware that it can happen and make every attempt to configure both sides of the link to avoid it. In most cases, enabling auto-negotiation on both sides wherever possible will eliminate the duplex mismatch issue. The alternative is Forced mode, which should only be employed in infrastructures that have full-duplex configurations. Where possible, those configurations should be replaced with an auto-negotiation configuration. There's one more MII register worthy of note. The Auto-negotiation Expansion Register (ANER) can be useful in establishing whether a link partner is capable of auto-negotiation and in providing information about the auto-negotiation algorithm.
[FIGURE 5-18: Auto-negotiation Expansion Register (ANER).]
The Parallel Detection Fault bit indicates that the auto-sensing part of the auto-negotiation protocol was unable to establish the link speed and that the regular ANAR exchange was also unsuccessful in establishing a common link parameter; therefore auto-negotiation failed. If this condition occurs, the best course of action is to check each side of the link manually and ensure that the settings are mutually compatible.
The GMII was first implemented using a fiber-optic physical layer known as 1000BASE-X and was later extended to support twisted-pair copper, known as 1000BASE-T. Those extensions led to additional bits in registers from the MII specification and some completely new registers, giving a GMII register set definition. The first register to be extended was the BMCR, because it can be used to force speed, and the ability to force 1-gigabit operation was added. All existing bit definitions were maintained, with one bit taken from the existing reserved bits to allow the enumeration of the different speeds that can now be forced with GMII devices.
[FIGURE 5-19: GMII Basic Mode Control Register bits (Reset, Auto-Negotiation Enable, Restart Auto-Negotiation, Duplex Mode).]
The next register of interest was the BMSR. This register was extended to indicate to the driver software that there are more registers that apply to 1-gigabit operation.
[FIGURE 5-20: GMII Basic Mode Status Register bits (Auto-Negotiation Complete, Link Status, Auto-Negotiation Capable).]
When the 1000BASE-T Extended Status bit is set, that's the indication to the driver software to look at the new 1-gigabit operating registers, whose function is similar to that of the Basic Mode Status Register and the ANAR. The Gigabit Extended Status Register (GESR) is the first of the gigabit operating registers. Like the BMSR, it gives an indication of the types of gigabit operation the physical layer device is capable of.
[FIGURE 5-21: Gigabit Extended Status Register (GESR) bits.]
The 1000BASE-X full duplex indicates that the physical layer device is capable of operating with 1000BASE-X fiber media with full-duplex operation. The 1000BASE-X half duplex indicates that the physical layer device is capable of operating with 1000BASE-X fiber media with half-duplex operation. The 1000BASE-T full duplex indicates that the physical layer device is capable of operating with 1000BASE-T twisted-pair copper media with full-duplex operation. The 1000BASE-T half duplex indicates that the physical layer device is capable of operating with 1000BASE-T twisted-pair copper media with half-duplex operation. The information provided by the GESR gives the possible 1-gigabit capabilities of the physical layer device. From that information you can choose the gigabit capabilities that will be advertised through the Gigabit Control Register (GCR). In the case of twisted-pair copper physical layer, there is also the ability to advertise the Clock Mastership.
Clock mastership is a new concept that only applies to copper media running at 1 gigabit. At such high signaling frequencies, it becomes increasingly difficult to continue to have separate clocking for the remote and the local physical layer devices. Hence, a single clocking domain was introduced, which the remote and the local physical layer devices share while a link is established. To achieve the single clocking domain, only one end of the connection provides the clock (the link master), and the other (the link slave) simply uses it. The Gigabit Control Register (GCR) bits Master/Slave Manual Config Enable and Master/Slave Config Value control how your local physical layer device will behave in this master/slave relationship.
When the Master/Slave Manual Config Enable bit is set, the master/slave configuration is controlled by the Master/Slave Config Value. When it is cleared, the master/slave configuration is established during auto-negotiation by a clock-learning sequence, which automatically establishes a clock master and slave for the link. Typically in a network, the master is the switch port and the slave is the end port or NIC. The Master/Slave Config Value setting is only meaningful when the Master/Slave Manual Config Enable bit is set. If set to 1, it forces the local clock mastership setting to be Master. If set to 0, the local clock becomes the Slave. When using the Master/Slave manual configuration, take care to ensure that the link partner is set accordingly. For example, if 1-gigabit Ethernet switches are set up to operate as link masters, then the computer systems attached to the switches should be set up as slaves.
When the driver fills in the bits in the GCR, it's equivalent to filling in the ANAR in MII: it controls the 1-gigabit capabilities that are advertised. Likewise, the Gigabit Status Register (GSR) is like the LPANAR, providing the capabilities of the link partner. The register definition for the GSR is similar to the GCR. With GMII operation, once auto-negotiation is complete, the contents of the GCR are compared with those of the GSR, and the highest-priority shared capability is used to decide the gigabit speed and duplex. It is possible to disable 1-gigabit operation. In that case, the shared capabilities must be found in the MII registers as described above. In GMII mode, at the end of auto-negotiation, once the GCR and GSR are compared and the ANAR and LPANAR are compared, the choice of operating speed and duplex is established by the hardware based on the following descending priority:
FIGURE 5-24: Priority, highest to lowest:
1. 1000BASE-T Full Duplex
2. 1000BASE-T Half Duplex
3. 100BASE-T Full Duplex
4. 100BASE-T4
5. 100BASE-T Half Duplex
6. 10BASE-T Full Duplex
Once the correct setting is established, the device software makes that setting known to the user through kernel statistics. It is also possible to manipulate the configuration using the ndd utility.
The key to this feature is the use of MAC control frames known as pause frames, which have the following format:

[FIGURE 5-25: Pause frame format (Destination Address, Source Address, Protocol Type, MAC Control Pause Opcode, MAC Control Pause Parameter, PAD to 42 bytes, Frame CRC).]
The Destination Address is a 6-byte address defined for Ethernet flow control as the multicast address 01:80:C2:00:00:01. The Source Address is a 6-byte address that is the same as the Ethernet station address of the producer of the pause frame. The Protocol Type field is a 2-byte value set to the MAC control protocol type, 0x8808. Pause capability is one example of the usage of the MAC control protocol.
Chapter 5
161
The MAC Control Pause Opcode is a 2-byte value, 0x0001, that indicates the type of MAC control feature to be used, in this case pause. The MAC Control Pause Parameter is a 2-byte value that indicates whether flow control is started, referred to as XOFF, or stopped, referred to as XON. When the MAC Control Pause Parameter is nonzero, you have an XOFF pause frame. When the value is 0, you have an XON pause frame. The value of the parameter is in units of slot time.
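Putting the fields together, a pause frame can be sketched in C as below. This is a hedged illustration of the format just described, not driver code; a real MAC appends the CRC itself:

```c
#include <stdint.h>
#include <string.h>

/* Build a 64-byte MAC control pause frame: multicast DA 01:80:C2:00:00:01,
 * type 0x8808, opcode 0x0001, and a pause parameter in slot times
 * (nonzero = XOFF, zero = XON). CRC bytes are left zeroed here. */
static void build_pause_frame(uint8_t frame[64], const uint8_t src[6],
                              uint16_t pause_slots)
{
    static const uint8_t dst[6] = { 0x01, 0x80, 0xC2, 0x00, 0x00, 0x01 };

    memset(frame, 0, 64);                 /* PAD to 42 bytes (and CRC) zeroed */
    memcpy(frame, dst, 6);                /* Destination Address              */
    memcpy(frame + 6, src, 6);            /* Source Address (station address) */
    frame[12] = 0x88; frame[13] = 0x08;   /* MAC control protocol type        */
    frame[14] = 0x00; frame[15] = 0x01;   /* MAC Control Pause Opcode         */
    frame[16] = (uint8_t)(pause_slots >> 8);   /* pause parameter, big-endian */
    frame[17] = (uint8_t)(pause_slots & 0xff); /* nonzero = XOFF, 0 = XON     */
}
```

The 18 bytes of headers and opcode plus the 42-byte pad bring the frame to the 60-byte minimum, with the 4-byte CRC completing the 64-byte minimum Ethernet frame.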
To understand the flow control capability, consider symmetric flow control first. With symmetric flow control, a network node can both generate flow control frames and react to them. Generating a flow control frame is known as Transmit Pause capability and is triggered by congestion on the Rx side. The Transmit Pause sends an XOFF flow control message to the link partner, which should react to pause frames (Receive Pause capability). By reacting to pause frames, the link partner uses the transmitted pause parameter as a duration during which the link partner's transmitter should remain silent while the Rx congestion clears. If the Rx congestion clears within that pause parameter period, an XON flow control message can be transmitted, telling the link partner that the congestion has cleared and transmission can continue as normal. In many cases, flow control capability is available in only one direction. This is known as Asymmetric Flow Control. It might be a configuration choice or simply a result of a hardware design. Therefore, the MII/GMII specification was altered to allow flow control capability to be advertised to a link partner along with the best type of flow control to be used for the shared link. The changes were applied to the ANAR in the form of two new bits: Pause Capability and Asymmetric Pause Capability. FIGURE 5-26 shows the updated register.
[FIGURE 5-26: ANAR updated with the Pause Capability and Asymmetric Pause Capability bits.]
Starting with Asymmetric Pause Capability: if this bit is set to 0, then the ability to pause is managed by Pause Capability alone, and setting Pause Capability to 1 indicates the local ability to pause in both the Rx and Tx directions. If Asymmetric Pause Capability is set to 1, it indicates the local ability to pause in either the Rx or the Tx direction, with Pause Capability selecting which. When Pause Capability is set to 1, it indicates that the local setting is Receive flow control; in other words, reception of XOFF can stop transmitting, and XON can restart it. When set to 0, it indicates Transmit flow control, which means that when Rx becomes congested, the local device will transmit XOFF, and once the congestion clears, it can transmit XON.
[FIGURE 5-27: Flow control in action. On the local machine, when receive data in the Rx FIFO exceeds the pause threshold, it sends XOFF. On reception of the XOFF, the remote machine stops transmitting until the pause time has elapsed or an XON arrives. Once enough packets have been serviced to reduce the Rx FIFO occupancy to below the threshold, the local machine sends XON.]
Now that the Pause Capability and Asymmetric Pause Capability are established, these parameters must be advertised to the link partner so that the pause setting to be used for the link can be negotiated.
TABLE 5-16 enumerates all the possibilities for resolving the pause capabilities for a link (X = don't care):

TABLE 5-16
cap_pause  cap_asmpause  lp_cap_pause  lp_cap_asmpause  link_pause  link_asmpause
0          0             X             X                0           X
0          1             0             X                0           0
0          1             1             0                0           0
1          0             0             X                0           0
1          0             1             X                1           0
1          1             0             0                0           0
1          1             1             X                1           0
The link_pause and link_asmpause parameters have the same meanings as the cap_pause and cap_asmpause parameters and enumerate meaningful information for a link given the pause capabilities available for both sides of the link.
Example 1

cap_asmpause = 1, cap_pause = 0: The local device is capable of asymmetric pause. It will send pauses if its receive side becomes congested.

lp_cap_asmpause = 0, lp_cap_pause = 1: The link partner is capable of symmetric pause. It will send pauses if its receive side becomes congested, and it will respond to pause by disabling transmit.

link_asmpause = 0, link_pause = 0: Because both the local and remote partner are set to send a pause on congestion but only the remote partner will respond to that pause, this is equivalent to no flow control, as it requires both ends to stop transmitting to alleviate the Rx congestion. The link values are further indication that no meaningful flow control is happening on the link.
Example 2

cap_asmpause = 1, cap_pause = 1: The local device is capable of asymmetric pause. It will send pauses if its receive side becomes congested.

lp_cap_asmpause = 0, lp_cap_pause = 1: The link partner is capable of symmetric pause. It will send pauses if its receive side becomes congested, and it will respond to pause by disabling transmit.

link_asmpause = 1, link_pause = 1: Because the local setting is to stop sending on arrival of a flow control message and the remote end is set to send flow control messages when it gets congested, there is flow control in the receive direction of the link; hence it's asymmetric. The direction of the pauses is incoming.
There are more examples of flow control from the table that can be discussed in terms of flow control in action. We'll return to this topic when discussing individual devices that support this feature. This concludes the options for controlling the configuration of the Ethernet physical layer MII/GMII. The preceding information should come in useful when configuring your network and making sure that each Ethernet link is coming up as required by the configuration. Finally, there is a perception that auto-negotiation has difficulties, but most of these were cleared up with the introduction of Gigabit Ethernet technology. Therefore, it is no longer required to disable auto-negotiation to achieve reliable operation with available Gigabit switches.
hme   Fast Ethernet
qfe   Quad Fast Ethernet
eri   Fast Ethernet
dmfe  Fast Ethernet
MII Connectors
A Media Attachment Unit allows a connection to Ethernet via an alternative physical layer media type.
[FIGURE 5-28: MII connector.]
At the MAC level, the interface stages packets for transmission using a single 256-element descriptor ring array. Once that ring is exhausted, further packets are queued on the streams queue. When the hardware completes transmission of packets currently occupying space on the ring, the packets waiting on the streams queue are moved to the descriptor ring. The Rx side of the descriptor ring is again a maximum of 256 elements. Once those elements are exhausted, no further buffering is available for incoming packets and overflows begin to occur. When hme was introduced to the market, 256 descriptors for Tx and Rx were reasonable because CPU frequencies at that time were around 100 MHz to 300 MHz, so the arrival rate of transmission packets posted to the descriptor rings closely matched the transmission capability of the physical media. As time progressed, CPUs became faster and this number of descriptors became inadequate. Often the interface began to exhaust the elements in the transmit ring and incurred more scheduling overhead for transmission. On the Rx side, as CPUs became faster, servicing each received packet took less time, so the occupancy of packets needing to be serviced on the Rx ring diminished. The hme interface is limited in tuning capability. If you experience low performance because of overflows or the transmit ring being constantly full, no corrective action is possible.
The physical layer of hme is fully configurable using the driver.conf file and ndd command.
TABLE 5-17
Parameter          Access          Description
instance           Read and Write  Current device instance in view for ndd
adv_autoneg_cap    Read and Write  Operational mode parameter
adv_100T4_cap      Read and Write  Operational mode parameter
adv_100fdx_cap     Read and Write  Operational mode parameter
adv_100hdx_cap     Read and Write  Operational mode parameter
adv_10fdx_cap      Read and Write  Operational mode parameter
adv_10hdx_cap      Read and Write  Operational mode parameter
use_int_xcvr       Read and Write  Transceiver control parameter
lance_mode         Read and Write  Inter-packet gap parameter
ipg0               Read and Write  Inter-packet gap parameter
ipg1               Read and Write  Inter-packet gap parameter
ipg2               Read and Write  Inter-packet gap parameter
autoneg_cap        Read only       Local transceiver auto-negotiation capability
100T4_cap          Read only       Local transceiver auto-negotiation capability
100fdx_cap         Read only       Local transceiver auto-negotiation capability
100hdx_cap         Read only       Local transceiver auto-negotiation capability
10fdx_cap          Read only       Local transceiver auto-negotiation capability
10hdx_cap          Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap     Read only       Link partner capability
lp_100T4_cap       Read only       Link partner capability
lp_100fdx_cap      Read only       Link partner capability
lp_100hdx_cap      Read only       Link partner capability
lp_10fdx_cap       Read only       Link partner capability
lp_10hdx_cap       Read only       Link partner capability
transceiver_inuse  Read only       Current physical layer status
link_status        Read only       Current physical layer status
link_speed         Read only       Current physical layer status
link_mode          Read only       Current physical layer status
Instance Parameter

Parameter: instance
Values: 0-1000
Description: Current device instance in view for the rest of the ndd configuration variables
Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.
Parameter        Values
adv_autoneg_cap  0-1
adv_100T4_cap    0-1
adv_100fdx_cap   0-1
TABLE 5-19
Parameter adv_100hdx_cap
adv_10fdx_cap
0-1
adv_10hdx_cap
0-1
If you use the interactive mode of ndd with this device to alter any of the parameters adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
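For example, a session forcing 100 Mbit/sec full duplex might look like this sketch (hypothetical; assumes a Solaris host with an hme device). Note the final toggle of adv_autoneg_cap, which is what actually pushes the advertised capabilities to the hardware:

```shell
#!/bin/sh
# Force hme instance 0 to advertise only 100 Mbit/sec full duplex.
DEV=/dev/hme
run() {
    # Run ndd where it exists; elsewhere, print the command instead.
    if command -v ndd >/dev/null 2>&1; then ndd "$@"; else echo "ndd $*"; fi
}
run -set "$DEV" instance 0
run -set "$DEV" adv_100fdx_cap 1
run -set "$DEV" adv_100hdx_cap 0
run -set "$DEV" adv_10fdx_cap 0
run -set "$DEV" adv_10hdx_cap 0
# Toggle adv_autoneg_cap to its alternative value and back so the adv_*
# changes above are applied to the hardware (ending in forced mode).
run -set "$DEV" adv_autoneg_cap 1
run -set "$DEV" adv_autoneg_cap 0
```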
In some cases it might be necessary to override that policy. Therefore, the ndd parameter use_int_xcvr is provided.

Transceiver Control Parameter

TABLE 5-20   Transceiver Control Parameter

Parameter     Values   Description
use_int_xcvr  0-1      Override for the policy that the external XCVR takes
                       priority over the internal transceiver.
                       0 = If an external transceiver is present, use it
                       instead of the internal one (default).
                       1 = If an external transceiver is present, ignore it
                       and continue to use the internal one.
Note: IPG is sometimes increased on older systems using slower NICs, where newer NICs and systems are hogging the network. When a server dominates a half-duplex network, this is known as the server capture effect.
For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
TABLE 5-21   Inter-Packet Gap Parameters

Parameter   Values   Description
lance_mode  0-1      0 = lance_mode disabled
                     1 = lance_mode enabled (default)
ipg0        0-31     Additional IPG before transmitting a packet.
                     Default = 4
ipg1                 First inter-packet gap parameter. Default = 8
ipg2                 Second inter-packet gap parameter. Default = 8
All of the IPG parameters can be set using ndd or can be hard-coded into the hme.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.
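For illustration, a hypothetical hme.conf entry hard-coding the IPG settings might look like the following. The parent and unit-address values below are placeholders and must match the actual device path on your system:

```
# /kernel/drv/hme.conf -- illustrative only; parent and unit-address
# must be taken from the real device tree on your system.
name="hme" parent="/pci@1f,4000/pci@1" unit-address="1"
    lance_mode=1 ipg0=16 ipg1=8 ipg2=8;
```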
external MII port. Therefore the capabilities presented in these statistics might vary according to the capabilities of the external MII physical layer device that is attached.

Local Transceiver Auto-negotiation Capability Parameters
Parameter    Values   Description
autoneg_cap  0-1      Local interface is capable of auto-negotiation
                      signaling.
                      0 = Can only operate in forced mode;
                      1 = Capable of auto-negotiation
100T4_cap    0-1      Local interface is capable of 100-T4 operation.
                      0 = Not 100 Mbit/sec T4 capable;
                      1 = 100 Mbit/sec T4 capable
100fdx_cap   0-1      Local interface is capable of 100 Mbit/sec full-duplex
                      operation.
                      0 = Not 100 Mbit/sec full-duplex capable;
                      1 = 100 Mbit/sec full-duplex capable
100hdx_cap   0-1      Local interface is capable of 100 Mbit/sec half-duplex
                      operation.
                      0 = Not 100 Mbit/sec half-duplex capable;
                      1 = 100 Mbit/sec half-duplex capable
10fdx_cap    0-1      Local interface is capable of 10 Mbit/sec full-duplex
                      operation.
                      0 = Not 10 Mbit/sec full-duplex capable;
                      1 = 10 Mbit/sec full-duplex capable
10hdx_cap    0-1      Local interface is capable of 10 Mbit/sec half-duplex
                      operation.
                      0 = Not 10 Mbit/sec half-duplex capable;
                      1 = 10 Mbit/sec half-duplex capable
TABLE 5-23   Link Partner Capability Parameters

Parameter       Values   Description
lp_autoneg_cap  0-1      Link partner interface is capable of
                         auto-negotiation signaling.
                         0 = Can only operate in forced mode;
                         1 = Capable of auto-negotiation
lp_100T4_cap    0-1      Link partner interface is capable of 100-T4
                         operation.
                         0 = Not 100 Mbit/sec T4 capable;
                         1 = 100 Mbit/sec T4 capable
lp_100fdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         full-duplex operation.
                         0 = Not capable; 1 = Capable
lp_100hdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         half-duplex operation.
                         0 = Not capable; 1 = Capable
lp_10fdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         full-duplex operation.
                         0 = Not capable; 1 = Capable
lp_10hdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         half-duplex operation.
                         0 = Not capable; 1 = Capable
Parameter          Values   Description
transceiver_inuse  0-1      Indicates which transceiver is currently in use.
                            0 = Internal transceiver; 1 = External transceiver
link_status        0-1      Current link status.
                            0 = Link down; 1 = Link up
link_speed         0-1      This parameter provides the link speed and is only
                            valid if the link is up.
                            0 = 10 Mbit/sec; 1 = 100 Mbit/sec
link_mode          0-1      This parameter provides the link duplex and is
                            only valid if the link is up.
                            0 = Half duplex; 1 = Full duplex
Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or while the interface being viewed has already been initialized by virtue of the presence of open streams, such as those created by snoop -d hme0 or ifconfig hme0 plumb inet up. If these streams don't exist, the device is uninitialized, and the state is set up only when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes these parameters unreliable unless an existing stream is associated with an instance before checking. A good rule to follow is to trust these parameters only if the interface is configured up using the ifconfig command.
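A session following that rule might look like this sketch (hypothetical; assumes a Solaris host where hme0 exists and may legitimately be brought up):

```shell
#!/bin/sh
# Bring hme0 up first so the physical layer status parameters are stable,
# then read them.
DEV=/dev/hme
PARAMS="transceiver_inuse link_status link_speed link_mode"
if command -v ndd >/dev/null 2>&1 && ifconfig hme0 >/dev/null 2>&1; then
    ifconfig hme0 up
    ndd -set "$DEV" instance 0
    for p in $PARAMS; do
        printf '%s = ' "$p"
        ndd -get "$DEV" "$p"
    done
else
    echo "hme0 or ndd not available on this system"
fi
```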
FIGURE 5-29
With the introduction of qfe came the introduction of trunking technology, which will be discussed later. The physical layer of qfe is fully configurable using the driver.conf file and ndd command.
TABLE 5-25   qfe Driver Parameters, Status, and Descriptions

Parameter          Status          Description
instance           Read and Write  Current device instance in view for ndd
adv_autoneg_cap    Read and Write  Operational mode parameter
adv_100T4_cap      Read and Write  Operational mode parameter
adv_100fdx_cap     Read and Write  Operational mode parameter
adv_100hdx_cap     Read and Write  Operational mode parameter
adv_10fdx_cap      Read and Write  Operational mode parameter
adv_10hdx_cap      Read and Write  Operational mode parameter
use_int_xcvr       Read and Write  Transceiver control parameter
lance_mode         Read and Write  Inter-packet gap parameter
ipg0               Read and Write  Inter-packet gap parameter
ipg1               Read and Write  Inter-packet gap parameter
ipg2               Read and Write  Inter-packet gap parameter
autoneg_cap        Read only       Local transceiver auto-negotiation capability
100T4_cap          Read only       Local transceiver auto-negotiation capability
100fdx_cap         Read only       Local transceiver auto-negotiation capability
100hdx_cap         Read only       Local transceiver auto-negotiation capability
10fdx_cap          Read only       Local transceiver auto-negotiation capability
10hdx_cap          Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap     Read only       Link partner capability
lp_100T4_cap       Read only       Link partner capability
lp_100fdx_cap      Read only       Link partner capability
lp_100hdx_cap      Read only       Link partner capability
lp_10fdx_cap       Read only       Link partner capability
lp_10hdx_cap       Read only       Link partner capability
transceiver_inuse  Read only       Current physical layer status
link_status        Read only       Current physical layer status
link_speed         Read only       Current physical layer status
link_mode          Read only       Current physical layer status
Instance Parameter

Parameter   Values   Description
instance    0-1000   Current device instance in view for the rest of the ndd
                     configuration variables
Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.
Parameter        Values   Description
adv_autoneg_cap  0-1      Local interface capability of auto-negotiation
                          signaling, as advertised by the hardware.
                          0 = Forced mode; 1 = Auto-negotiation
                          Default is set to the autoneg_cap parameter.
adv_100T4_cap    0-1      Local interface capability of 100-T4, as advertised
                          by the hardware.
                          0 = Not 100 Mbit/sec T4 capable;
                          1 = 100 Mbit/sec T4 capable
                          Default is set to the 100T4_cap parameter.
adv_100fdx_cap   0-1      Local interface capability of 100 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 100 Mbit/sec full-duplex capable;
                          1 = 100 Mbit/sec full-duplex capable
                          Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap   0-1      Local interface capability of 100 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 100 Mbit/sec half-duplex capable;
                          1 = 100 Mbit/sec half-duplex capable
                          Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap    0-1      Local interface capability of 10 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 10 Mbit/sec full-duplex capable;
                          1 = 10 Mbit/sec full-duplex capable
                          Default is set based on the 10fdx_cap parameter.
adv_10hdx_cap    0-1      Local interface capability of 10 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 10 Mbit/sec half-duplex capable;
                          1 = 10 Mbit/sec half-duplex capable
                          Default is set based on the 10hdx_cap parameter.
Chapter 5
177
If you use the interactive mode of ndd with this device to alter any of the parameters adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
Parameter   Values   Description
lance_mode  0-1      0 = lance_mode disabled
                     1 = lance_mode enabled (default)
ipg0        0-31     Additional IPG before transmitting a packet.
                     Default = 4
ipg1                 First IPG parameter. Default = 8
ipg2                 Second IPG parameter. Default = 8
All of the IPG parameters can be set using ndd or can be hard-coded into the qfe.conf files. Details about setting these parameters are provided in Reboot Persistence Using driver.conf on page 242.
Parameter    Values   Description
autoneg_cap  0-1      Local interface is capable of auto-negotiation
                      signaling.
                      0 = Can only operate in forced mode;
                      1 = Capable of auto-negotiation
100T4_cap    0-1      Local interface is capable of 100-T4 operation.
                      0 = Not 100 Mbit/sec T4 capable;
                      1 = 100 Mbit/sec T4 capable
100fdx_cap   0-1      Local interface is capable of 100 Mbit/sec full-duplex
                      operation.
                      0 = Not 100 Mbit/sec full-duplex capable;
                      1 = 100 Mbit/sec full-duplex capable
100hdx_cap   0-1      Local interface is capable of 100 Mbit/sec half-duplex
                      operation.
                      0 = Not 100 Mbit/sec half-duplex capable;
                      1 = 100 Mbit/sec half-duplex capable
10fdx_cap    0-1      Local interface is capable of 10 Mbit/sec full-duplex
                      operation.
                      0 = Not 10 Mbit/sec full-duplex capable;
                      1 = 10 Mbit/sec full-duplex capable
10hdx_cap    0-1      Local interface is capable of 10 Mbit/sec half-duplex
                      operation.
                      0 = Not 10 Mbit/sec half-duplex capable;
                      1 = 10 Mbit/sec half-duplex capable
Parameter       Values   Description
lp_autoneg_cap  0-1      Link partner interface is capable of
                         auto-negotiation signaling.
                         0 = Can only operate in forced mode;
                         1 = Capable of auto-negotiation
lp_100T4_cap    0-1      Link partner interface is capable of 100-T4
                         operation.
                         0 = Not 100 Mbit/sec T4 capable;
                         1 = 100 Mbit/sec T4 capable
lp_100fdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         full-duplex operation.
                         0 = Not 100 Mbit/sec full-duplex capable;
                         1 = 100 Mbit/sec full-duplex capable
lp_100hdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         half-duplex operation.
                         0 = Not 100 Mbit/sec half-duplex capable;
                         1 = 100 Mbit/sec half-duplex capable
lp_10fdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         full-duplex operation.
                         0 = Not 10 Mbit/sec full-duplex capable;
                         1 = 10 Mbit/sec full-duplex capable
lp_10hdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         half-duplex operation.
                         0 = Not 10 Mbit/sec half-duplex capable;
                         1 = 10 Mbit/sec half-duplex capable
Parameter          Values   Description
transceiver_inuse  0-1      Indicates which transceiver is currently in use.
                            0 = Internal transceiver; 1 = External transceiver
link_status        0-1      Current link status.
                            0 = Link down; 1 = Link up
link_speed         0-1      This parameter provides the link speed and is only
                            valid if the link is up.
                            0 = 10 Mbit/sec; 1 = 100 Mbit/sec
link_mode          0-1      This parameter provides the link duplex and is
                            only valid if the link is up.
                            0 = Half duplex; 1 = Full duplex
Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or while the interface being viewed has already been initialized by virtue of the presence of open streams, such as those created by snoop -d qfe0 or ifconfig qfe0 plumb inet up. If these streams don't exist, the device is uninitialized, and the state is set up only when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes these parameters unreliable unless an existing stream is associated with an instance before checking. A good rule to follow is to trust these parameters only if the interface is configured up using the ifconfig command.
Parameter          Status          Description
instance           Read and Write  Current device instance in view for ndd
adv_autoneg_cap    Read and Write  Operational mode parameter
adv_100T4_cap      Read and Write  Operational mode parameter
adv_100fdx_cap     Read and Write  Operational mode parameter
adv_100hdx_cap     Read and Write  Operational mode parameter
adv_10fdx_cap      Read and Write  Operational mode parameter
adv_10hdx_cap      Read and Write  Operational mode parameter
use_int_xcvr       Read and Write  Transceiver control parameter
lance_mode         Read and Write  Inter-packet gap parameter
ipg0               Read and Write  Inter-packet gap parameter
ipg1               Read and Write  Inter-packet gap parameter
ipg2               Read and Write  Inter-packet gap parameter
autoneg_cap        Read only       Local transceiver auto-negotiation capability
100T4_cap          Read only       Local transceiver auto-negotiation capability
100fdx_cap         Read only       Local transceiver auto-negotiation capability
100hdx_cap         Read only       Local transceiver auto-negotiation capability
10fdx_cap          Read only       Local transceiver auto-negotiation capability
10hdx_cap          Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap     Read only       Link partner capability
lp_100T4_cap       Read only       Link partner capability
lp_100fdx_cap      Read only       Link partner capability
lp_100hdx_cap      Read only       Link partner capability
lp_10fdx_cap       Read only       Link partner capability
lp_10hdx_cap       Read only       Link partner capability
transceiver_inuse  Read only       Current physical layer status
link_status        Read only       Current physical layer status
link_speed         Read only       Current physical layer status
link_mode          Read only       Current physical layer status
Instance Parameter

Parameter   Values   Description
instance    0-1000   Current device instance in view for the rest of the ndd
                     configuration variables
Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.
TABLE 5-34   Operational Mode Parameters

Parameter        Values   Description
adv_autoneg_cap  0-1      Local interface capability of auto-negotiation
                          signaling, as advertised by the hardware.
                          0 = Forced mode; 1 = Auto-negotiation
                          Default is set to the autoneg_cap parameter.
adv_100T4_cap    0-1      Local interface capability of 100-T4, as advertised
                          by the hardware.
                          0 = Not 100 Mbit/sec T4 capable;
                          1 = 100 Mbit/sec T4 capable
                          Default is set to the 100T4_cap parameter.
adv_100fdx_cap   0-1      Local interface capability of 100 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not capable; 1 = Capable
                          Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap   0-1      Local interface capability of 100 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not capable; 1 = Capable
                          Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap    0-1      Local interface capability of 10 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not capable; 1 = Capable
                          Default is set based on the 10fdx_cap parameter.
adv_10hdx_cap    0-1      Local interface capability of 10 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not capable; 1 = Capable
                          Default is set based on the 10hdx_cap parameter.
If you use the interactive mode of ndd with this device to alter any of the parameters adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
The additional delay set by ipg0 helps to reduce collisions. Systems that have lance_mode enabled might not get enough time on the network. If lance_mode is disabled, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keep sending a large number of back-to-back packets. You can add the additional delay by setting the ipg0 parameter, which is the nibble-time delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
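The nibble-time arithmetic above can be checked with plain shell arithmetic:

```shell
#!/bin/sh
# ipg0 delay in nanoseconds = ipg0 (nibble times) x nibble time (ns).
# Nibble time is 400 ns at 10 Mbit/sec and 40 ns at 100 Mbit/sec.
delay_10=$((20 * 400))    # ipg0=20 at 10 Mbit/sec
delay_100=$((30 * 40))    # ipg0=30 at 100 Mbit/sec
echo "10 Mbit/sec, ipg0=20: ${delay_10} ns"     # 8000 ns
echo "100 Mbit/sec, ipg0=30: ${delay_100} ns"   # 1200 ns
```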
TABLE 5-35   Inter-Packet Gap Parameters

Parameter   Values   Description
lance_mode  0-1      0 = lance_mode disabled
                     1 = lance_mode enabled (default)
ipg0        0-31     Additional IPG before transmitting a packet.
                     Default = 4
ipg1                 First inter-packet gap parameter. Default = 8
ipg2                 Second inter-packet gap parameter. Default = 8
All of the IPG parameters can be set using ndd or can be hard-coded into the eri.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.
Parameter           Values   Description
intr_blank_time     0-127    Interrupt after this number of clock cycles has
                             passed, if the packets pending have not reached
                             intr_blank_packets. One clock cycle equals 2048
                             PCI clock cycles. Default = 6
intr_blank_packets  0-255    Interrupt after this number of packets has
                             arrived since the last packet was serviced. A
                             value of zero indicates no packet blanking.
                             Default = 8
TABLE 5-37   Local Transceiver Auto-negotiation Capability Parameters

Parameter    Values   Description
autoneg_cap  0-1      Local interface is capable of auto-negotiation
                      signaling.
                      0 = Can only operate in forced mode;
                      1 = Capable of auto-negotiation
100T4_cap    0-1      Local interface is capable of 100-T4 operation.
                      0 = Not 100 Mbit/sec T4 capable;
                      1 = 100 Mbit/sec T4 capable
100fdx_cap   0-1      Local interface is capable of 100 Mbit/sec full-duplex
                      operation.
                      0 = Not capable; 1 = Capable
100hdx_cap   0-1      Local interface is capable of 100 Mbit/sec half-duplex
                      operation.
                      0 = Not capable; 1 = Capable
10fdx_cap    0-1      Local interface is capable of 10 Mbit/sec full-duplex
                      operation.
                      0 = Not capable; 1 = Capable
10hdx_cap    0-1      Local interface is capable of 10 Mbit/sec half-duplex
                      operation.
                      0 = Not capable; 1 = Capable
Parameter       Values   Description
lp_autoneg_cap  0-1      Link partner interface is capable of
                         auto-negotiation signaling.
                         0 = Can only operate in forced mode;
                         1 = Capable of auto-negotiation
lp_100T4_cap    0-1      Link partner interface is capable of 100-T4
                         operation.
                         0 = Not 100 Mbit/sec T4 capable;
                         1 = 100 Mbit/sec T4 capable
lp_100fdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         full-duplex operation.
                         0 = Not 100 Mbit/sec full-duplex capable;
                         1 = 100 Mbit/sec full-duplex capable
lp_100hdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         half-duplex operation.
                         0 = Not 100 Mbit/sec half-duplex capable;
                         1 = 100 Mbit/sec half-duplex capable
lp_10fdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         full-duplex operation.
                         0 = Not 10 Mbit/sec full-duplex capable;
                         1 = 10 Mbit/sec full-duplex capable
lp_10hdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         half-duplex operation.
                         0 = Not 10 Mbit/sec half-duplex capable;
                         1 = 10 Mbit/sec half-duplex capable
Parameter          Values   Description
transceiver_inuse  0-1      Indicates which transceiver is currently in use.
                            0 = Internal transceiver; 1 = External transceiver
link_status        0-1      Current link status.
                            0 = Link down; 1 = Link up
link_speed         0-1      This parameter provides the link speed and is only
                            valid if the link is up.
                            0 = 10 Mbit/sec; 1 = 100 Mbit/sec
link_mode          0-1      This parameter provides the link duplex and is
                            only valid if the link is up.
                            0 = Half duplex; 1 = Full duplex
Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or while the interface being viewed has already been initialized by virtue of the presence of open streams, such as those created by snoop -d eri0 or ifconfig eri0 plumb inet up. If these streams don't exist, the device is uninitialized, and the state is set up only when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes these parameters unreliable unless an existing stream is associated with an instance before checking. A good rule to follow is to trust these parameters only if the interface is configured up using the ifconfig command.
Parameter        Status          Description
adv_autoneg_cap  Read and Write  Operational mode parameter
adv_100T4_cap    Read and Write  Operational mode parameter
adv_100fdx_cap   Read and Write  Operational mode parameter
adv_100hdx_cap   Read and Write  Operational mode parameter
adv_10fdx_cap    Read and Write  Operational mode parameter
adv_10hdx_cap    Read and Write  Operational mode parameter
autoneg_cap      Read only       Local transceiver auto-negotiation capability
100T4_cap        Read only       Local transceiver auto-negotiation capability
100fdx_cap       Read only       Local transceiver auto-negotiation capability
100hdx_cap       Read only       Local transceiver auto-negotiation capability
10fdx_cap        Read only       Local transceiver auto-negotiation capability
10hdx_cap        Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap   Read only       Link partner capability
lp_100T4_cap     Read only       Link partner capability
lp_100fdx_cap    Read only       Link partner capability
lp_100hdx_cap    Read only       Link partner capability
lp_10fdx_cap     Read only       Link partner capability
lp_10hdx_cap     Read only       Link partner capability
link_status      Read only       Current physical layer status
link_speed       Read only       Current physical layer status
link_mode        Read only       Current physical layer status
Parameter        Values   Description
adv_autoneg_cap  0-1      Local interface capability of auto-negotiation
                          signaling, as advertised by the hardware.
                          0 = Forced mode; 1 = Auto-negotiation
                          Default is set to the autoneg_cap parameter.
adv_100T4_cap    0-1      Local interface capability of 100-T4, as advertised
                          by the hardware.
                          0 = Not 100 Mbit/sec T4 capable;
                          1 = 100 Mbit/sec T4 capable
                          Default is set to the 100T4_cap parameter.
adv_100fdx_cap   0-1      Local interface capability of 100 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 100 Mbit/sec full-duplex capable;
                          1 = 100 Mbit/sec full-duplex capable
                          Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap   0-1      Local interface capability of 100 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 100 Mbit/sec half-duplex capable;
                          1 = 100 Mbit/sec half-duplex capable
                          Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap    0-1      Local interface capability of 10 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 10 Mbit/sec full-duplex capable;
                          1 = 10 Mbit/sec full-duplex capable
                          Default is set based on the 10fdx_cap parameter.
adv_10hdx_cap    0-1      Local interface capability of 10 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 10 Mbit/sec half-duplex capable;
                          1 = 10 Mbit/sec half-duplex capable
                          Default is set based on the 10hdx_cap parameter.
If you use the interactive mode of ndd with this device to alter any of the parameters adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
the external MII port. Therefore, the capabilities presented in these statistics might vary according to the capabilities of the external MII physical layer device that is attached.

Local Transceiver Auto-negotiation Capability Parameters
TABLE 5-42   Local Transceiver Auto-negotiation Capability Parameters

Parameter    Values   Description
autoneg_cap  0-1      Local interface is capable of auto-negotiation
                      signaling.
                      0 = Can only operate in forced mode;
                      1 = Capable of auto-negotiation
100T4_cap    0-1      Local interface is capable of 100-T4 operation.
                      0 = Not 100 Mbit/sec T4 capable;
                      1 = 100 Mbit/sec T4 capable
100fdx_cap   0-1      Local interface is capable of 100 Mbit/sec full-duplex
                      operation.
                      0 = Not 100 Mbit/sec full-duplex capable;
                      1 = 100 Mbit/sec full-duplex capable
100hdx_cap   0-1      Local interface is capable of 100 Mbit/sec half-duplex
                      operation.
                      0 = Not 100 Mbit/sec half-duplex capable;
                      1 = 100 Mbit/sec half-duplex capable
10fdx_cap    0-1      Local interface is capable of 10 Mbit/sec full-duplex
                      operation.
                      0 = Not 10 Mbit/sec full-duplex capable;
                      1 = 10 Mbit/sec full-duplex capable
10hdx_cap    0-1      Local interface is capable of 10 Mbit/sec half-duplex
                      operation.
                      0 = Not 10 Mbit/sec half-duplex capable;
                      1 = 10 Mbit/sec half-duplex capable
TABLE 5-43   Link Partner Capability Parameters

Parameter       Values   Description
lp_autoneg_cap  0-1      Link partner interface is capable of
                         auto-negotiation signaling.
                         0 = Can only operate in forced mode;
                         1 = Capable of auto-negotiation
lp_100T4_cap    0-1      Link partner interface is capable of 100-T4
                         operation.
                         0 = Not capable; 1 = Capable
lp_100fdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         full-duplex operation.
                         0 = Not capable; 1 = Capable
lp_100hdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         half-duplex operation.
                         0 = Not capable; 1 = Capable
lp_10fdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         full-duplex operation.
                         0 = Not capable; 1 = Capable
lp_10hdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         half-duplex operation.
                         0 = Not capable; 1 = Capable
Parameter    Values   Description
link_status  0-1      Current link status.
                      0 = Link down; 1 = Link up
link_speed   0-1      This parameter provides the link speed and is only
                      valid if the link is up.
                      0 = 10 Mbit/sec; 1 = 100 Mbit/sec
link_mode    0-1      This parameter provides the link duplex and is only
                      valid if the link is up.
                      0 = Half duplex; 1 = Full duplex
FIGURE 5-30
The ge interface also provides Layer 2 flow control capability. The physical layer and performance features of ge are fully configurable using the driver.conf file and ndd command.
TABLE 5-45   ge Driver Parameters, Status, and Descriptions

Parameter           Status          Description
instance            Read and Write  Current device instance in view for ndd
adv_autoneg_cap     Read and Write  Operational mode parameter
adv_1000fdx_cap     Read and Write  Operational mode parameter
adv_1000hdx_cap     Read and Write  Operational mode parameter
adv_100T4_cap       Read and Write  Operational mode parameter
adv_100fdx_cap      Read and Write  Operational mode parameter
adv_100hdx_cap      Read and Write  Operational mode parameter
adv_10fdx_cap       Read and Write  Operational mode parameter
adv_10hdx_cap       Read and Write  Operational mode parameter
use_int_xcvr        Read and Write  Transceiver control parameter
lance_mode          Read and Write  Inter-packet gap parameter
ipg0                Read and Write  Inter-packet gap parameter
ipg1                Read and Write  Inter-packet gap parameter
ipg2                Read and Write  Inter-packet gap parameter
intr_blank_time     Read and Write  Receive interrupt blanking parameter
intr_blank_packets  Read and Write  Receive interrupt blanking parameter
autoneg_cap         Read only       Local transceiver auto-negotiation capability
1000fdx_cap         Read only       Local transceiver auto-negotiation capability
1000hdx_cap         Read only       Local transceiver auto-negotiation capability
100T4_cap           Read only       Local transceiver auto-negotiation capability
100fdx_cap          Read only       Local transceiver auto-negotiation capability
100hdx_cap          Read only       Local transceiver auto-negotiation capability
10fdx_cap           Read only       Local transceiver auto-negotiation capability
10hdx_cap           Read only       Local transceiver auto-negotiation capability
lp_autoneg_cap      Read only       Link partner capability
lp_1000fdx_cap      Read only       Link partner capability
lp_1000hdx_cap      Read only       Link partner capability
lp_100T4_cap        Read only       Link partner capability
lp_100fdx_cap       Read only       Link partner capability
lp_100hdx_cap       Read only       Link partner capability
lp_10fdx_cap        Read only       Link partner capability
lp_10hdx_cap        Read only       Link partner capability
transceiver_inuse   Read only       Current physical layer status
link_status         Read only       Current physical layer status
link_speed          Read only       Current physical layer status
link_mode           Read only       Current physical layer status
Instance Parameter

Parameter   Values   Description
instance    0-1000   Current device instance in view for the rest of the ndd
                     configuration variables
Before you view or alter any of the other parameters, make a quick check of the value of instance to ensure that it is actually pointing to the device you want to view or alter.
Parameter        Values   Description
adv_autoneg_cap  0-1      Local interface capability of auto-negotiation
                          signaling, as advertised by the hardware.
                          0 = Forced mode; 1 = Auto-negotiation
                          Default is set to the autoneg_cap parameter.
adv_1000fdx_cap  0-1      Local interface capability of 1000 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 1000 Mbit/sec full-duplex capable;
                          1 = 1000 Mbit/sec full-duplex capable
                          Default is set to the 1000fdx_cap parameter.
adv_1000hdx_cap  0-1      Local interface capability of 1000 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 1000 Mbit/sec half-duplex capable;
                          1 = 1000 Mbit/sec half-duplex capable
                          Default is set to the 1000hdx_cap parameter.
adv_100T4_cap    0-1      Local interface capability of 100-T4, as advertised
                          by the hardware.
                          0 = Not 100 Mbit/sec T4 capable;
                          1 = 100 Mbit/sec T4 capable
                          Default is set to the 100T4_cap parameter.
adv_100fdx_cap   0-1      Local interface capability of 100 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 100 Mbit/sec full-duplex capable;
                          1 = 100 Mbit/sec full-duplex capable
                          Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap   0-1      Local interface capability of 100 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 100 Mbit/sec half-duplex capable;
                          1 = 100 Mbit/sec half-duplex capable
                          Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap    0-1      Local interface capability of 10 Mbit/sec full
                          duplex, as advertised by the hardware.
                          0 = Not 10 Mbit/sec full-duplex capable;
                          1 = 10 Mbit/sec full-duplex capable
                          Default is set based on the 10fdx_cap parameter.
adv_10hdx_cap    0-1      Local interface capability of 10 Mbit/sec half
                          duplex, as advertised by the hardware.
                          0 = Not 10 Mbit/sec half-duplex capable;
                          1 = 10 Mbit/sec half-duplex capable
                          Default is set based on the 10hdx_cap parameter.
If you use the interactive mode of ndd with this device to alter any of the parameters adv_100fdx_cap through adv_10hdx_cap, the changes applied to those parameters are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
The additional delay set by ipg0 helps to reduce collisions. Systems that have lance_mode enabled might not get enough time on the network. If lance_mode is disabled, the value of ipg0 is ignored and no additional delay is set; only the delays set by ipg1 and ipg2 are used. Disable lance_mode if other systems keep sending a large number of back-to-back packets. You can add the additional delay by setting the ipg0 parameter, which is the nibble-time delay, from 0 to 31. Note that nibble time is the time it takes to transfer four bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns. If the link speed is 1000 Mbit/sec, nibble time is equal to 4 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns.
TABLE 5-48   Inter-Packet Gap Parameters

Parameter   Values   Description
lance_mode  0-1      0 = lance_mode disabled
                     1 = lance_mode enabled (default)
ipg0        0-31     Additional IPG before transmitting a packet.
                     Default = 4
ipg1                 First inter-packet gap parameter. Default = 8
ipg2                 Second inter-packet gap parameter. Default = 8
All of the IPG parameters can be set using ndd or can be hard-coded into the ge.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.
Parameter           Values   Description
intr_blank_time     0-127    Interrupt after this number of clock cycles has
                             passed, if the packets pending have not reached
                             intr_blank_packets. One clock cycle equals 2048
                             PCI clock cycles. Note: because this time is tied
                             to the PCI clock, an adapter plugged into a
                             66-MHz PCI slot has a blanking time half that of
                             the same adapter in a 33-MHz slot. Default = 6
intr_blank_packets  0-255    Interrupt after this number of packets has
                             arrived since the last packet was serviced. A
                             value of zero indicates no packet blanking.
                             Default = 8
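To see what the default corresponds to in wall-clock time, the blanking interval can be estimated from the PCI clock (this sketch assumes a 33-MHz slot, roughly 30 ns per PCI cycle; halve the result for a 66-MHz slot):

```shell
#!/bin/sh
# Estimate the receive interrupt blanking interval for the default
# intr_blank_time. One blanking unit is 2048 PCI clock cycles.
intr_blank_time=6      # driver default
pci_cycle_ns=30        # ~1 / 33 MHz, rounded
interval_ns=$((intr_blank_time * 2048 * pci_cycle_ns))
echo "blanking interval: ${interval_ns} ns"    # 368640 ns, roughly 0.37 ms
```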
Note: ge and ce fiber devices do not support 100 Mbit/sec capabilities. They support 1000 Mbit/sec only.
TABLE 5-50   Local Transceiver Auto-negotiation Capability Parameters

Parameter    Values   Description
autoneg_cap  0-1      Local interface is capable of auto-negotiation
                      signaling.
                      0 = Can only operate in forced mode;
                      1 = Capable of auto-negotiation
1000fdx_cap  0-1      Local interface is capable of 1000 Mbit/sec full-duplex
                      operation.
                      0 = Not capable; 1 = Capable
1000hdx_cap  0-1      Local interface is capable of 1000 Mbit/sec half-duplex
                      operation.
                      0 = Not capable; 1 = Capable
100T4_cap    0-1      Local interface is capable of 100-T4 operation.
                      0 = Not capable; 1 = Capable
100fdx_cap   0-1      Local interface is capable of 100 Mbit/sec full-duplex
                      operation.
                      0 = Not capable; 1 = Capable
100hdx_cap   0-1      Local interface is capable of 100 Mbit/sec half-duplex
                      operation.
                      0 = Not capable; 1 = Capable
10fdx_cap    0-1      Local interface is capable of 10 Mbit/sec full-duplex
                      operation.
                      0 = Not capable; 1 = Capable
10hdx_cap    0-1      Local interface is capable of 10 Mbit/sec half-duplex
                      operation.
                      0 = Not capable; 1 = Capable
Parameter       Values   Description
lp_autoneg_cap  0-1      Link partner interface is capable of
                         auto-negotiation signaling.
                         0 = Can only operate in forced mode;
                         1 = Capable of auto-negotiation
lp_1000fdx_cap  0-1      Link partner interface is capable of 1000 Mbit/sec
                         full-duplex operation.
                         0 = Not 1000 Mbit/sec full-duplex capable;
                         1 = 1000 Mbit/sec full-duplex capable
lp_1000hdx_cap  0-1      Link partner interface is capable of 1000 Mbit/sec
                         half-duplex operation.
                         0 = Not 1000 Mbit/sec half-duplex capable;
                         1 = 1000 Mbit/sec half-duplex capable
lp_100T4_cap    0-1      Link partner interface is capable of 100-T4
                         operation.
                         0 = Not 100 Mbit/sec T4 capable;
                         1 = 100 Mbit/sec T4 capable
lp_100fdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         full-duplex operation.
                         0 = Not 100 Mbit/sec full-duplex capable;
                         1 = 100 Mbit/sec full-duplex capable
lp_100hdx_cap   0-1      Link partner interface is capable of 100 Mbit/sec
                         half-duplex operation.
                         0 = Not 100 Mbit/sec half-duplex capable;
                         1 = 100 Mbit/sec half-duplex capable
lp_10fdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         full-duplex operation.
                         0 = Not 10 Mbit/sec full-duplex capable;
                         1 = 10 Mbit/sec full-duplex capable
lp_10hdx_cap    0-1      Link partner interface is capable of 10 Mbit/sec
                         half-duplex operation.
                         0 = Not 10 Mbit/sec half-duplex capable;
                         1 = 10 Mbit/sec half-duplex capable
Parameter          Values   Description
transceiver_inuse  0-1      This parameter indicates which transceiver is
                            currently in use.
                            0 = Internal transceiver; 1 = External transceiver
link_status        0-1      Current link status.
                            0 = Link down; 1 = Link up
link_speed         0-1      This parameter provides the link speed and is only
                            valid if the link is up.
                            0 = Not operating at 1000 Mbit/sec;
                            1 = Operating at 1000 Mbit/sec
link_mode          0-1      This parameter provides the link duplex and is
                            only valid if the link is up.
                            0 = Half duplex; 1 = Full duplex
Note that the physical layer status parameters are only meaningful while ndd is running in interactive mode, or while the interface being viewed has already been initialized by virtue of the presence of open streams, such as those created by snoop -d ge0 or ifconfig ge0 plumb inet up. If these streams don't exist, the device is uninitialized, and the state is set up only when you probe these parameters with ndd. As a result, the parameters are subject to a race between the user viewing them and the link reaching a steady state. This makes these parameters unreliable unless an existing stream is associated with an instance before checking. A good rule to follow is to trust these parameters only if the interface is configured up using the ifconfig command.
Note that just as the tunables can be used to enhance performance, they can also degrade performance.
TABLE 5-53   ge Driver Tunable Parameters

Parameter           Values    Description
ge_intr_mode        0-1       Enables the ge driver to send packets directly
                              to the upper communication layers rather than
                              queueing them.
                              0 = Packets are not passed in the interrupt
                              service routine; they are placed in a streams
                              service queue and passed to the protocol stack
                              later, when the streams service routine runs
                              (default).
                              1 = Packets are passed directly to the protocol
                              stack in the interrupt context.
ge_dmaburst_mode    0-1       Enables infinite burst mode for PCI DMA
                              transactions rather than cache-line-size PCI
                              DMA transfers. This feature is supported only
                              on Sun platforms with the UltraSPARC III CPU.
                              0 = Disabled (default); 1 = Enabled
ge_tx_fastdvma_min  59-1500   Minimum packet size that uses fast dvma
                              interfaces rather than standard DMA interfaces.
                              Default = 1024
ge_nos_tmd          32-8192   Number of transmit descriptors used by the
                              driver. Default = 512
ge_tx_bcopy_max     60-256    Maximum packet size that is copied into a
                              premapped DMA buffer rather than remapped.
                              Default = 256
ge_nos_txdvma       0-8192    Number of dvma buffers (for transmit) used in
                              the driver. Default = 256
ge_tx_onemblk       1-100     Number of fragments that must exist in any one
                              packet before ge_tx_onemblk coalesces them into
                              a fresh mblk. Default = 2
ge_tx_stream_min    256-1000  For DMA, this parameter determines whether to
                              use DDI_DMA_CONSISTENT or DDI_DMA_STREAMING. If
                              the packet length is less than
                              ge_tx_stream_min, DDI_DMA_CONSISTENT is used.
                              Default = 512
The ge tunable parameters require that the /etc/system file be modified and the system rebooted to apply the changes. See Using /etc/system to Tune Parameters on page 244.

The tuning variables ge_use_rx_dvma and ge_do_fastdvma are of particular interest because they control whether the ge driver uses fast dvma or the regular ddi_dma interface. Currently the setting applied is fast dvma, but with every new operating system release the ddi_dma interface is being improved, and the performance difference between the two interfaces might eventually be eliminated.

The ge_nos_tmd parameter can be used to adjust the size of the transmit descriptor ring. This might be required if the driver is experiencing a large number of notmd events, which indicate that the arrival rate of packets for the descriptor ring exceeds the rate at which the hardware can transmit. In that case, increasing the descriptor ring size might be a remedy.

The ge_put_cfg parameter, in conjunction with ge_intr_mode, controls the receive packet delivery model. When ge_intr_mode is 1, the interface passes packets to the protocol stack in the interrupt context. When it is set to 0, the delivery model is controlled by ge_put_cfg: when ge_put_cfg is set to 0, the ge driver provides a special-case software load balancing where there is only one worker thread; when it is set to 1, the driver uses the regular streams service routine.

The transmit control tunables ge_tx_bcopy_max, ge_tx_stream_min, and ge_tx_fastdvma_min define the thresholds for the transmit buffer method. The ge_tx_onemblk parameter controls coalescing of multiple message blocks that make up a single packet into one message block. In many cases where system memory latency is high, it makes sense to avoid individually mapping packet fragments. Instead,
you can have the driver create a new buffer, bring all the fragments together, and use only one DMA buffer. This feature is especially useful for HTTP server applications. The ge_nos_txdvma parameter controls the pool of fast dvma resources associated with a driver. Since fast dvma resources are finite within a system, it is possible for one device to monopolize all of them. The tunable is designed to avoid this scenario by allowing the ge driver to allocate a limited number of resources that can be shared at runtime among instances transmitting packets using the dvma interface. A clearer description of this will be presented later based on kstat information feedback.
FIGURE 5-31
The Sun GigaSwift Ethernet UTP adapter is a single-port gigabit Ethernet copper-based PCI bus card. It can be configured to operate in 10 Mbit/sec, 100 Mbit/sec, or 1000 Mbit/sec Ethernet networks.
FIGURE 5-32
There is also a Dual Fast Ethernet/Dual SCSI PCI adapter card that is supported by the GigaSwift Ethernet device driver yet is limited to 100BASE-TX capability.
The ce interface employs the hardware checksumming capability described above to reduce the cost of the TCP/IP checksum calculation. The ce interface also provides Layer 2 flow control capability, RED, and Infinite Burst. The physical layer and performance features of ce are configurable using the driver.conf file and ndd command.
TABLE 5-54

Parameter          Status          Description
instance           Read and Write  Current device instance in view for ndd
adv-autoneg-cap    Read and Write  Operational mode parameter
adv-1000fdx-cap    Read and Write  Operational mode parameter
adv-1000hdx-cap    Read and Write  Operational mode parameter
adv-100T4-cap      Read and Write  Operational mode parameter
adv-100fdx-cap     Read and Write  Operational mode parameter
adv-100hdx-cap     Read and Write  Operational mode parameter
adv-10fdx-cap      Read and Write  Operational mode parameter
adv-10hdx-cap      Read and Write  Operational mode parameter
adv-asmpause-cap   Read and Write  Flow control parameter
adv-pause-cap      Read and Write  Flow control parameter
master-cfg-enable  Read and Write  Gigabit link clock mastership control
master-cfg-value   Read and Write  Gigabit link clock mastership control
use-int-xcvr       Read and Write  Transceiver control parameter
enable-ipg0        Read and Write  Inter-packet gap parameter
ipg0               Read and Write  Inter-packet gap parameter
ipg1               Read and Write  Inter-packet gap parameter
ipg2               Read and Write  Inter-packet gap parameter
rx-intr-pkts       Read and Write  Receive interrupt blanking parameter
rx-intr-time       Read and Write  Receive interrupt blanking parameter
red-dv4to6k        Read and Write  Random early detection and packet drop vector
red-dv6to8k        Read and Write  Random early detection and packet drop vector
red-dv8to10k       Read and Write  Random early detection and packet drop vector
red-dv10to12k      Read and Write  Random early detection and packet drop vector
tx-dma-weight      Read and Write  PCI interface parameter
rx-dma-weight      Read and Write  PCI interface parameter
infinite-burst     Read and Write  PCI interface parameter
disable-64bit      Read and Write  PCI interface parameter
accept-jumbo       Read and Write  Jumbo frames enable parameter
With the ce driver, any changes applied to the above parameters take effect immediately.
Parameter  Values  Description
instance   0-1000  Current device instance in view for the rest of the ndd configuration variables
Before viewing or altering any of the other parameters, be sure to check the value of instance to ensure that it actually points to the device you want to configure.
Parameter        Values  Description
adv-autoneg-cap  0-1     Local interface capability advertised by the hardware. 0 = Forced mode, 1 = Auto-negotiation (default)
adv-1000fdx-cap  0-1     Local interface capability advertised by the hardware. 0 = Not 1000 Mbit/sec full-duplex capable, 1 = 1000 Mbit/sec full-duplex capable (default)
adv-1000hdx-cap  0-1     Local interface capability advertised by the hardware. 0 = Not 1000 Mbit/sec half-duplex capable, 1 = 1000 Mbit/sec half-duplex capable (default)
adv-100T4-cap    0-1     Local interface capability advertised by the hardware. 0 = Not 100-T4 capable (default), 1 = 100-T4 capable
adv-100fdx-cap   0-1     Local interface capability advertised by the hardware. 0 = Not 100 Mbit/sec full-duplex capable, 1 = 100 Mbit/sec full-duplex capable (default)
adv-100hdx-cap   0-1     Local interface capability advertised by the hardware. 0 = Not 100 Mbit/sec half-duplex capable, 1 = 100 Mbit/sec half-duplex capable (default)
adv-10fdx-cap    0-1     Local interface capability advertised by the hardware. 0 = Not 10 Mbit/sec full-duplex capable, 1 = 10 Mbit/sec full-duplex capable (default)
adv-10hdx-cap    0-1     Local interface capability advertised by the hardware. 0 = Not 10 Mbit/sec half-duplex capable, 1 = 10 Mbit/sec half-duplex capable (default)
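As a sketch of how these advertised-capability parameters combine in practice (the instance number and the choice of 100 Mbit/sec full duplex are illustrative), forcing a link out of auto-negotiation typically means clearing every advertised capability except the one you want, then switching off auto-negotiation:

```shell
# ndd /dev/ce -set instance 0
# ndd /dev/ce -set adv-1000fdx-cap 0
# ndd /dev/ce -set adv-1000hdx-cap 0
# ndd /dev/ce -set adv-100hdx-cap 0
# ndd /dev/ce -set adv-10fdx-cap 0
# ndd /dev/ce -set adv-10hdx-cap 0
# ndd /dev/ce -set adv-100fdx-cap 1
# ndd /dev/ce -set adv-autoneg-cap 0
```

The link partner must be configured to the same forced mode, since with auto-negotiation off nothing is advertised.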
Parameter         Values  Description
adv-asmpause-cap  0-1     The adapter supports asymmetric pause, which means it can pause in one direction only. 0 = Off (default), 1 = On
adv-pause-cap     0-1     This parameter has two meanings depending on the value of adv-asmpause-cap. (Default = 0) If adv-asmpause-cap = 1 while adv-pause-cap = 1, pauses are received. If adv-asmpause-cap = 1 while adv-pause-cap = 0, pauses are transmitted. If adv-asmpause-cap = 0 while adv-pause-cap = 1, pauses are sent and received. If adv-asmpause-cap = 0, then adv-pause-cap determines whether pause capability is on or off.
The physical layer parameters control whether a side is the master or the slave, or whether mastership is negotiated with the link partner. Those parameters are as follows.
TABLE 5-58

Parameter          Description
master-cfg-enable  Determines whether the link clock mastership is set up automatically during the auto-negotiation process. If master-cfg-enable is set, the mastership is not set up automatically but is dependent on the value of master-cfg-value. If auto-negotiation is not enabled, the value of master-cfg-enable is ignored and the value of master-cfg-value alone determines the link clock mastership.
master-cfg-value   If master-cfg-value is set, the physical layer expects the local device to be the link master. If it is not set, the physical layer expects the link partner to be the master.
You can add the additional delay by setting the ipg0 parameter, which is the media byte time delay, from 0 to 255. Note that nibble time is the time it takes to transfer four bits on the link. If the link speed is 10 Mbit/sec, nibble time is equal to 400 ns. If the link speed is 100 Mbit/sec, nibble time is equal to 40 ns. If the link speed is 1000 Mbit/sec, nibble time is equal to 4 ns. For example, if the link speed is 10 Mbit/sec and you set ipg0 to 20 nibble times, multiply 20 by 400 ns to get 8000 ns. If the link speed is 100 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 40 ns to get 1200 ns. If the link speed is 1000 Mbit/sec and you set ipg0 to 30 nibble times, multiply 30 by 4 ns to get 120 ns.
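The nibble-time arithmetic can be sketched as a one-line calculation: nibble time in nanoseconds is 4000 divided by the link rate in Mbit/sec (four bits at that rate), and the ipg0 delay is the nibble count times that.

```shell
# ipg0 delay in ns = nibble_times * (4000 / rate_in_Mbit_per_sec)
ipg0_delay_ns() {
  # $1 = ipg0 value in nibble times, $2 = link speed in Mbit/sec
  echo $(( $1 * 4000 / $2 ))
}

ipg0_delay_ns 20 10      # 10 Mbit/sec,   20 nibble times -> 8000 ns
ipg0_delay_ns 30 100     # 100 Mbit/sec,  30 nibble times -> 1200 ns
ipg0_delay_ns 30 1000    # 1000 Mbit/sec, 30 nibble times -> 120 ns
```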
TABLE 5-59

Parameter    Values  Description
enable-ipg0  0-1     Enables ipg0. 0 = ipg0 disabled, 1 = ipg0 enabled. Default = 1
ipg0         0-255   Additional IPG before transmitting a packet. Default = 8
ipg1                 First inter-packet gap parameter. Default = 8
ipg2                 Second inter-packet gap parameter. Default = 4
All of the IPG parameters can be set using ndd or can be hard-coded into the ce.conf files. Details of the methods of setting these parameters are provided in Configuring Driver Parameters on page 238.
Parameter     Values       Description
rx-intr-pkts  0 to 511     Interrupt after this number of packets have arrived since the last packet was serviced. A value of zero indicates no packet blanking. (Default = 8)
rx-intr-time  0 to 524287  Interrupt after this number of 4.5-microsecond ticks have elapsed since the last packet was serviced. A value of zero indicates no time blanking. (Default = 3)
These parameters enable random early drop (RED) thresholds. When received packets reach the RED range, packets are dropped according to the preset probability. The probability should increase as the FIFO level increases. Control packets are never dropped and are not counted in the statistics.
TABLE 5-61

Field Name     Values    Description
red-dv4to6k    0 to 255  Random early detection and packet drop vectors when FIFO threshold is greater than 4096 bytes and less than 6144 bytes. Probability of drop can be programmed on a 12.5 percent granularity. For example, if bit 0 is set, the first packet out of every eight will be dropped in this region. (Default = 0)
red-dv6to8k    0 to 255  Random early detection and packet drop vectors when FIFO threshold is greater than 6144 bytes and less than 8192 bytes. Probability of drop can be programmed on a 12.5 percent granularity. For example, if bit 0 is set, the first packet out of every eight will be dropped in this region. (Default = 0)
red-dv8to10k   0 to 255  Random early detection and packet drop vectors when FIFO threshold is greater than 8192 bytes and less than 10,240 bytes. Probability of drop can be programmed on a 12.5 percent granularity. For example, if bits 1 and 6 are set, the second and seventh packets out of every eight will be dropped in this region. (Default = 0)
red-dv10to12k  0 to 255  Random early detection and packet drop vectors when FIFO threshold is greater than 10,240 bytes and less than 12,288 bytes. Probability of drop can be programmed on a 12.5 percent granularity. If bits 2, 4, and 6 are set, then the third, fifth, and seventh packets out of every eight will be dropped in this region. (Default = 0)
Parameter       Values  Description
tx-dma-weight   0-3     Determines the multiplication factor for granting credit to the Tx side during the weighted round-robin arbitration. Zero means no extra weighting; the other values give powers-of-2 extra weighting to that traffic. For example, if tx-dma-weight = 0 and rx-dma-weight = 3, then as long as Rx traffic is continuously arriving, its priority to access the PCI bus will be eight times greater than that of Tx. (Default = 0)
rx-dma-weight   0-3     Determines the multiplication factor for granting credit to the Rx side during the weighted round-robin arbitration. (Default = 0)
infinite-burst  0-1     Allows the infinite burst capability to be used. When this is in effect and the system supports infinite burst, the adapter does not free the bus until complete packets are transferred across the bus. (Default = 0)
disable-64bit   0-1     Switches off the 64-bit capability of the adapter. In some cases it is useful to switch off this feature. (Default = 0, which enables 64-bit capability)
accept-jumbo    0-1     Jumbo frames enable parameter. (Default = 0)
Once jumbo frames capability is enabled, the MTU can be controlled using ifconfig. The MTU can be raised to 9000 or reduced to the regular 1500-byte frames.
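As a sketch (the instance number and interface name are illustrative), enabling jumbo frames and then raising the MTU might look like:

```shell
# ndd /dev/ce -set instance 0
# ndd /dev/ce -set accept-jumbo 1
# ifconfig ce0 mtu 9000
```

Every device on the path, including switches and the peer, must also be configured for jumbo frames before raising the MTU is useful.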
Performance Tunables
GigaSwift Ethernet pushes systems even further than ge did. Many lessons were learned from ge, leading to a collection of special system tunables that assist in tuning the ce card for a specific system or application. Note that just as the tunables can be used to enhance performance, they can also degrade performance. Handle with great care.
TABLE 5-64

Parameter          Values     Description
ce_taskq_disable   0-1        Disables the use of task queues and forces all packets to go up to Layer 3 in the interrupt context. The default depends on whether the number of CPUs in the system exceeds ce_cpu_threshold.
ce_inst_taskqs     0-64       Controls the number of taskqs set up per ce device instance. This value is meaningful only if ce_taskq_disable is false. Any value less than 64 is meaningful. (Default = 4)
ce_srv_fifo_depth  30-100000  The size of the service FIFO, in number of elements. (Default = 2048)
ce_cpu_threshold   1-1000     The threshold for the number of CPUs that must be present and online in the system before the taskqs are used to receive packets. (Default = 4)
ce_start_cfg       0-1        An enumerated type. 0 = Transmit algorithm does not do serialization, 1 = Transmit algorithm does serialization. (Default = 0)
ce_put_cfg         0-2        An enumerated type. 0 = Receive processing occurs in the interrupt context, 1 = Receive processing occurs in the worker threads, 2 = Receive processing occurs in the streams service queue routine. (Default = 0)
Parameter           Values   Description
ce_reclaim_pending  1-4094   The threshold at which reclaims start happening. Currently 32 for both the ge and ce drivers. Keep it less than ce_tx_ring_size/3. (Default = 32)
ce_ring_size        32-8216  The size of the Rx buffer ring, a ring of buffer descriptors for Rx. One buffer = 8K. This value must be Modulo 2, and its maximum value is 8K. (Default = 256)
ce_comp_ring_size   0-8216   The size of each Rx completion descriptor ring. It also is Modulo 2. (Default = 2048)
ce_tx_ring_size     0-8216   The size of each Tx descriptor ring. It also is Modulo 2. (Default = 2048)
ce_tx_ring_mask     0-3      A mask to control which Tx rings are used. (Default = 3)
ce_no_tx_lb         0-1      Disables Tx load balancing and forces all transmission to be posted to a single descriptor ring. 0 = Tx load balancing is enabled, 1 = Tx load balancing is disabled. (Default = 1)
ce_bcopy_thresh     0-8216   The mblk size threshold used to decide when to copy an mblk into a pre-mapped buffer instead of using DMA or other methods. (Default = 256)
Parameter             Values      Description
ce_dvma_thresh        0-8216      The mblk size threshold used to decide when to use the fast path DVMA interface to transmit an mblk. (Default = 1024)
ce_dma_stream_thresh  0-8216      This global variable splits the ddi_dma mapping method further by providing Consistent mapping and Streaming mapping. In the Tx direction, Streaming is better for larger transmissions than Consistent mapping. If the mblk size is greater than 256 bytes but less than 1024 bytes, the mblk fragment is transmitted using ddi_dma methods. (Default = 512)
ce_max_rx_pkts        32-1000000  The number of receive packets that can be processed in one interrupt before it must exit. (Default = 512)
The performance tunables require an understanding of some key kernel statistics from the ce driver to be used successfully. There might also be an opportunity to use the RED features and interrupt blanking both configurable using the ndd commands. A clearer description of this will be presented later based on kstat information feedback.
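Like the ge tunables, these are global kernel variables, so a sketch of setting them in /etc/system might look like the following (the values shown are simply the defaults quoted in the table; whether changing any of them helps your workload is the judgment call described above):

```
* ce performance tunables (illustrative; values are the quoted defaults)
set ce:ce_inst_taskqs = 4
set ce:ce_srv_fifo_depth = 2048
set ce:ce_cpu_threshold = 4
set ce:ce_bcopy_thresh = 256
```

As with the ge driver, a reboot is required before the new values take effect.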
The physical layer of bge is fully configurable using the bge.conf file and ndd commands.
TABLE 5-65

Parameter          Status          Description
adv_autoneg_cap    Read and Write  Operational mode parameter
adv_1000fdx_cap    Read and Write  Operational mode parameter
adv_1000hdx_cap    Read and Write  Operational mode parameter
adv_100T4_cap      Read and Write  Operational mode parameter
adv_100fdx_cap     Read and Write  Operational mode parameter
adv_100hdx_cap     Read and Write  Operational mode parameter
adv_10fdx_cap      Read and Write  Operational mode parameter
adv_10hdx_cap      Read and Write  Operational mode parameter
adv_asm_pause_cap  Read and Write  Operational mode parameter
adv_pause_cap      Read and Write  Operational mode parameter
autoneg_cap        Read only       Local transceiver auto-negotiation capability
100T4_cap          Read only       Local transceiver auto-negotiation capability
100fdx_cap         Read only       Local transceiver auto-negotiation capability
100hdx_cap         Read only       Local transceiver auto-negotiation capability
10fdx_cap          Read only       Local transceiver auto-negotiation capability
10hdx_cap          Read only       Local transceiver auto-negotiation capability
asm_pause_cap      Read and Write  Local transceiver auto-negotiation capability
pause_cap          Read and Write  Local transceiver auto-negotiation capability
lp_autoneg_cap     Read only       Link partner capability
lp_100T4_cap       Read only       Link partner capability
lp_100fdx_cap      Read only       Link partner capability
lp_100hdx_cap      Read only       Link partner capability
lp_10fdx_cap       Read only       Link partner capability
lp_10hdx_cap       Read only       Link partner capability
lp_asm_pause_cap   Read only       Link partner capability
lp_pause_cap       Read only       Link partner capability
link_status        Read only       Current physical layer status
link_speed         Read only       Current physical layer status
link_mode          Read only       Current physical layer status
Parameter        Values  Description
adv_autoneg_cap  0-1     Local interface capability of auto-negotiation signaling advertised by the hardware. 0 = Forced mode, 1 = Auto-negotiation. Default is set to the autoneg_cap parameter.
adv_100T4_cap    0-1     Local interface capability of 100-T4 advertised by the hardware. 0 = Not 100 Mbit/sec T4 capable, 1 = 100 Mbit/sec T4 capable. Default is set to the 100T4_cap parameter.
adv_100fdx_cap   0-1     Local interface capability of 100 full duplex advertised by the hardware. 0 = Not 100 Mbit/sec full-duplex capable, 1 = 100 Mbit/sec full-duplex capable. Default is set based on the 100fdx_cap parameter.
adv_100hdx_cap   0-1     Local interface capability of 100 half duplex advertised by the hardware. 0 = Not 100 Mbit/sec half-duplex capable, 1 = 100 Mbit/sec half-duplex capable. Default is set based on the 100hdx_cap parameter.
adv_10fdx_cap    0-1     Local interface capability of 10 full duplex advertised by the hardware. 0 = Not 10 Mbit/sec full-duplex capable, 1 = 10 Mbit/sec full-duplex capable. Default is set based on the 10fdx_cap parameter.
Parameter          Values  Description
adv_10hdx_cap      0-1     Local interface capability of 10 half duplex advertised by the hardware. 0 = Not 10 Mbit/sec half-duplex capable, 1 = 10 Mbit/sec half-duplex capable. Default is set based on the 10hdx_cap parameter.
adv_asm_pause_cap  0-1     The adapter supports asymmetric pause, which means it can pause in one direction only. 0 = Off, 1 = On. (Default = 1)
adv_pause_cap      0-1     This parameter has two meanings, depending on the value of adv_asm_pause_cap. If adv_asm_pause_cap = 1 while adv_pause_cap = 1, pauses are received and Transmit is limited. If adv_asm_pause_cap = 1 while adv_pause_cap = 0, pauses are transmitted. If adv_asm_pause_cap = 0 while adv_pause_cap = 1, pauses are sent and received. If adv_asm_pause_cap = 0, adv_pause_cap determines whether pause capability is on or off. (Default = 0)
If you are using the interactive mode of ndd with this device to alter any of the parameters from adv_100fdx_cap through adv_10hdx_cap, the changes are not actually applied to the hardware until adv_autoneg_cap is changed to its alternative value and then back again.
Parameter    Values  Description
autoneg_cap  0-1     Local interface is capable of auto-negotiation signaling. 0 = Can only operate in Forced mode, 1 = Capable of auto-negotiation
100T4_cap    0-1     Local interface is capable of 100-T4 operation. 0 = Not 100 Mbit/sec T4 capable, 1 = 100 Mbit/sec T4 capable
100fdx_cap   0-1     Local interface is capable of 100 full-duplex operation. 0 = Not 100 Mbit/sec full-duplex capable, 1 = 100 Mbit/sec full-duplex capable
100hdx_cap   0-1     Local interface is capable of 100 half-duplex operation. 0 = Not 100 Mbit/sec half-duplex capable, 1 = 100 Mbit/sec half-duplex capable
10fdx_cap    0-1     Local interface is capable of 10 full-duplex operation. 0 = Not 10 Mbit/sec full-duplex capable, 1 = 10 Mbit/sec full-duplex capable
Parameter      Values  Description
10hdx_cap      0-1     Local interface is capable of 10 half-duplex operation. 0 = Not 10 Mbit/sec half-duplex capable, 1 = 10 Mbit/sec half-duplex capable
asm_pause_cap  0-1     The adapter supports asymmetric pause, which means it can pause in one direction only. 0 = Off, 1 = On. (Default = 1)
pause_cap      0-1     This parameter has two meanings depending on the value of asm_pause_cap. If asm_pause_cap = 1 while pause_cap = 1, pauses are received and transmit is limited. If asm_pause_cap = 1 while pause_cap = 0, pauses are transmitted. If asm_pause_cap = 0 while pause_cap = 1, pauses are sent and received. If asm_pause_cap = 0, pause_cap determines whether pause capability is on or off. (Default = 0)
Parameter       Values  Description
lp_autoneg_cap  0-1     Link partner interface is capable of auto-negotiation signaling. 0 = Can only operate in Forced mode, 1 = Capable of auto-negotiation
lp_100T4_cap    0-1     Link partner interface is capable of 100-T4 operation. 0 = Not 100 Mbit/sec T4 capable, 1 = 100 Mbit/sec T4 capable
lp_100fdx_cap   0-1     Link partner interface is capable of 100 full-duplex operation. 0 = Not 100 Mbit/sec full-duplex capable, 1 = 100 Mbit/sec full-duplex capable
lp_100hdx_cap   0-1     Link partner interface is capable of 100 half-duplex operation. 0 = Not 100 Mbit/sec half-duplex capable, 1 = 100 Mbit/sec half-duplex capable
lp_10fdx_cap    0-1     Link partner interface is capable of 10 full-duplex operation. 0 = Not 10 Mbit/sec full-duplex capable, 1 = 10 Mbit/sec full-duplex capable
Parameter         Values  Description
lp_10hdx_cap      0-1     Link partner interface is capable of 10 half-duplex operation. 0 = Not 10 Mbit/sec half-duplex capable, 1 = 10 Mbit/sec half-duplex capable
lp_asm_pause_cap  0-1     The adapter supports asymmetric pause, which means it can pause in one direction only. 0 = Off, 1 = On. (Default = 1)
lp_pause_cap      0-1     This parameter has two meanings depending on the value of lp_asm_pause_cap. If lp_asm_pause_cap = 1 while lp_pause_cap = 1, pauses are received and Transmit is limited. If lp_asm_pause_cap = 1 while lp_pause_cap = 0, pauses are transmitted. If lp_asm_pause_cap = 0 while lp_pause_cap = 1, pauses are sent and received. If lp_asm_pause_cap = 0, then lp_pause_cap determines whether pause capability is on or off. (Default = 0)
Parameter    Values  Description
link_status  0-1     Current link status. 0 = Link down, 1 = Link up
link_speed   0-1     This parameter provides the link speed and is only valid if the link is up.
link_mode    0-1     This parameter provides the link duplex and is only valid if the link is up. 0 = Half duplex, 1 = Full duplex
Although VLANs are commonly used to create individual broadcast domains and/or separate IP subnets, it is sometimes useful for a server to have a presence on more than one VLAN simultaneously. Several Sun products support multiple VLANs on a per-port or per-interface basis, allowing very flexible network configurations.
FIGURE 5-33 shows an example network that uses VLANs.
FIGURE 5-33  Example VLAN network: Software PC 1 and PC 2 on VLAN 2, Engineering PC 3 on VLAN 1, Accounting PC 4 on VLAN 3
VLAN Configuration
VLANs can be created according to various criteria, but each VLAN must be assigned a VLAN tag or VLAN ID (VID). The VID is a 12-bit identifier between 1 and 4094 that identifies a unique VLAN. For each network interface (ce0, ce1, ce2, and so on), 4094 possible VLAN IDs can be selected over an individual ce instance. Once the VLAN tag is chosen, a VLAN can be configured on a subnet using a ce interface with the ifconfig command. The VLAN tag is multiplied by 1000, and the instance number of the device (also known as the device Primary Point of Attachment, or PPA) is added to give a VLAN PPA. For a VLAN with VID 123 that needs to be configured over ce0, the new VLAN PPA would be 123000. With this new PPA you can proceed to configure the ce interface within the VLAN.
# ifconfig ce123000 plumb inet up
You can also set up a configuration that persists through a reboot by creating a hostname file for the VLAN interface, in this example /etc/hostname.ce123000.
In summary, the VLAN PPA is calculated using the simple formula: VLAN PPA = VID * 1000 + Device PPA
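The formula can be checked with a quick shell calculation (the VID and instance values reuse the ce0 example above):

```shell
# VLAN PPA = VID * 1000 + device PPA (the instance number)
vlan_ppa() {
  # $1 = VLAN ID (1-4094), $2 = device instance (PPA)
  echo $(( $1 * 1000 + $2 ))
}

vlan_ppa 123 0    # VID 123 over ce0 -> 123000
vlan_ppa 123 1    # VID 123 over ce1 -> 123001
```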
Note - Only GigaSwift NICs using the ce driver and the Solaris 8 VLAN packages have VLAN tagging capabilities. Other NICs do not.
Note - Sun Trunking is not included with the Solaris operating system. It is an unbundled software product. Sun Trunking provides trunking support for the following network interface cards:
- Quad FastEthernet adapter, qfe
- GigabitEthernet adapter, ge
- GigaSwift Ethernet UTP or MMF adapter, ce
- Dual FastEthernet and Dual SCSI/P adapter, ce
- Quad GigaSwift Ethernet adapter, ce
The key to enabling the trunking capability is the nettr command. This command can be used to trunk devices of the same technology together. Once trunked, a trunk head interface is established, and that interface is used by ifconfig to complete the configuration. For example, if the two qfe instances (qfe0 and qfe1) need to be trunked, once the nettr command is complete, the trunk head would be assigned to qfe0. Then you could proceed to ifconfig to make the trunk operate under the TCP/IP protocol stack.
Trunking Configuration
The nettr(1M) utility is used to configure trunking. nettr(1M) can be used to:

- Set up a trunk
- Release a trunk
- Display a trunk configuration
- Display statistics of trunked interfaces
Following is the command syntax for nettr for setting up a trunk or modifying the configuration of the trunk members. The items in the square brackets are optional.
nettr -setup head-instance device=<qfe | ce | ge> members=<instance,instance,...> [ policy=<number> ]
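For the qfe0/qfe1 example mentioned earlier, the sequence might look like the following sketch (the instance numbers, policy value, and IP address are illustrative; consult the nettr(1M) man page shipped with Sun Trunking for the exact options on your release):

```shell
# nettr -setup 0 device=qfe members=0,1 policy=1
# ifconfig qfe0 plumb 192.168.10.1 up
```

After the nettr command completes, qfe0 is the trunk head, so only qfe0 is plumbed with ifconfig; qfe1 carries traffic as a trunk member.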
Trunking Policies

MAC

- Is the default policy used by the Sun Trunking software. MAC is the preferred policy to use with switches. Most trunking-capable switches require use of the MAC hashing policy, but check your switch documentation.
- Uses the last three bits of the MAC address of both the source and destination. For two ports, the last bits of the source and destination MAC addresses are first XORed; the result (0 or 1) selects the port.
- Favors a large population of clients. For example, using MAC ensures that 50 percent of the client connections will go through one of two ports in a two-port trunk.

Round-Robin

- Is the preferred policy with a back-to-back connection used between the output of a transmitting device and the input of an associated receiving device.
- Uses each network interface of the trunk in turn as a method of distributing packets over the assigned number of trunking interfaces.
- Could have an impact on performance because the temporal ordering of packets is not observed.

IP Destination Address

- Uses the four bytes of the IP destination address to determine the transmission path.
- If a trunking interface host has one IP source address and it is necessary to communicate with multiple IP clients connected to the same router, then the IP Destination Address policy is the preferred policy to use.

IP Source Address/IP Destination Address

- Connects the source server to the destination based on where the connection originated or terminated.
- Uses the four bytes of the source and destination IP addresses to determine the transmission path.
- The primary use of the IP Source Address/IP Destination Address policy occurs where you use the IP virtual address feature to give multiple IP addresses to a single physical interface.
For example, you might have a cluster of servers providing network services in
which each service is associated with a virtual IP address over a given interface. If a service associated with an interface fails, the virtual IP address migrates to a physical interface on a different machine in the cluster. In such an arrangement, the IP Source Address/IP Destination Address policy gives you a greater chance of using more different links within the trunk than would the IP Destination Address policy.
Network Configuration
This section describes how to edit the network host files after any of the Sun adapters have been installed on your system. The section contains the following topics:
- Configuring the System to Use the Embedded MAC Address on page 233
- Configuring the Network Host Files on page 234
- Setting Up a GigaSwift Ethernet Network on a Diskless Client System on page 235
# eeprom local-mac-address\?=true
In the previous example, the device instance is from a Sun GigaSwift Ethernet adapter installed in slot 1. For clarity, the instance number is shown in bold italics. Be sure to write down your device path and instance, which in the example are /pci@1f,0/pci@1/network@4 and instance 0. While your device path and instance might be different, they will be similar. You will need this information to make changes to the ce.conf file. See Setting Network Driver Parameters Using the ndd Utility on page 238. 2. Use the ifconfig command to set up the adapter's ce interface. 3. Use the ifconfig command to assign an IP address to the network interface. Type the following at the command line, replacing ip_address with the adapter's IP address:
# ifconfig ce0 plumb ip_address up
Refer to the ifconfig(1M) man page and the Solaris documentation for more information. If you want a setup that will remain the same after you reboot, create an /etc/hostname.ceinstance file, where instance corresponds to the instance number of the ce interface you plan to use. To use the adapter's ce interface in the Step 1 example, create an /etc/hostname.ce0 file, where 0 is the instance number of the ce interface. If the instance number were 1, the filename would be /etc/hostname.ce1.
Do not create an /etc/hostname.ceinstance file for a Sun GigaSwift Ethernet adapter interface you plan to leave unused.
The /etc/hostname.ceinstance file must contain the host name for the appropriate ce interface.
The host name should have an IP address and should be listed in the /etc/hosts file. The host name should be different from any other host name of any other interface; for example: /etc/hostname.ce0 and /etc/hostname.ce1 cannot share the same hostname.
The following example shows the /etc/hostname.ceinstance files required for a system called zardoz that has a Sun GigaSwift Ethernet adapter (zardoz-11). # cat /etc/hostname.hme0
zardoz
# cat /etc/hostname.ce0
zardoz-11
4. Create an appropriate entry in the /etc/hosts file for each active ce interface. For example:
# cat /etc/hosts # # Internet host table # 127.0.0.1 localhost 129.144.10.57 zardoz loghost 129.144.11.83 zardoz-11
2. Use the pkgadd -R command to install the network device driver software packages to the diskless clients root directory on the server.
# pkgadd -R root_directory/Solaris_2.7/Tools/Boot -d . SUNWced
3. Create a hostname.ceinstance file in the diskless clients root directory. You will need to create an /export/root/client_name/etc/hostname.deviceinstance file for the network interface. See Configuring the Network Host Files on page 234 for instructions. 4. Edit the hosts file in the diskless clients root directory. You will need to edit the /export/root/client_name/etc/hosts file to include the IP address of the Network interface. See Configuring the Network Host Files on page 234 for instructions.
Be sure to set the MAC address on the server side and rebuild the device tree if you want to boot from the GigaSwift Ethernet port.
5. To boot the diskless client from the Network interface port, type the following boot command:
ok boot path-to-device:link-param, -v
1. Prepare the install server and client system to install the Solaris operating system over the network. 2. Find the root directory of the client system. The client system's root directory can be found in the install server's /etc/bootparams file. Use the grep command to search this file for the root directory.
# grep client_name /etc/bootparams client_name root=server_name:/netinstall/Solaris_2.7/Tools/Boot install=server_name:/netinstall boottype=:in rootopts=:rsize=32768
In the previous example, the root directory for the Solaris 7 client is /netinstall. In Step 4, you would replace root_directory with /netinstall. 3. Use the pkgadd -R command to install the network device driver software packages to the client's root directory on the server.
# pkgadd -R root_directory/Solaris_2.7/Tools/Boot -d . SUNWced
4. Shut down and halt the client system. 5. At the ok prompt, boot the client system using the full device path of the network device. 6. Proceed with the Solaris operating system installation. 7. After installing the Solaris operating system, install the network interface software on the client system. This step is required because the software installed in Step 3 was used only to boot the client system over the network interface. Often network interface cards are not a bundled option with Solaris. Therefore, after installation is complete, you will need to install the software in order for the operating system to use the client's network interfaces in normal operation. 8. Confirm that the network host files were configured correctly during the Solaris installation. Although the Solaris software installation creates the client's network configuration files, you might need to edit these files to match your specific networking environment. See Configuring the Network Host Files on page 234 for more information about editing these files.
Chapter 5
This section describes two topics:
- Setting networking driver parameters using the ndd utility
- Reboot persistence with driver.conf
Style 1 drivers have a /dev/<name><instance> symbolic link for each physical network device instance (for example, /dev/bge0). Style 2 drivers have a single /dev/<name> symbolic link that points to the driver rather than to a particular device instance (for example, /dev/hme).
Once the style is established, adjust the way you use the ndd command accordingly, because the way of getting exclusive access to a device instance with ndd differs between the two styles.

1. Determine the style of driver you're using.
a. If a Style 1 node /dev/<name><instance> exists, you can use the Style 1 command form.
# ndd /dev/bge0 adv_autoneg_cap
1
# ndd -set /dev/bge0 adv_autoneg_cap 0
# ndd /dev/bge0 adv_autoneg_cap
0
b. If a Style 2 node /dev/<name> exists instead, you cannot use the Style 1 form; you must use the Style 2 form. This requires an initial step: setting the configuration context to the desired instance.
# ndd -set /dev/hme instance 0
2. Once you are pointing to the correct instance, you can alter as many parameters as required for that instance.
# ndd -set /dev/hme adv_autoneg_cap 0
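Determining the style in Step 1 can also be probed programmatically. The sketch below is an illustration, not part of the original procedure: it only prints the ndd command line it would use (so it needs no privileges and touches no hardware), and the driver name, instance, and parameter are placeholders.

```shell
#!/bin/sh
# Sketch: print the appropriate ndd invocation for a driver instance,
# depending on whether a Style 1 node /dev/<name><instance> exists.
ndd_cmd() {
    name=$1; inst=$2; param=$3
    if [ -e "/dev/${name}${inst}" ]; then
        # Style 1: the node itself selects the instance.
        echo "ndd /dev/${name}${inst} ${param}"
    else
        # Style 2: set the instance context first, then read the parameter.
        echo "ndd -set /dev/${name} instance ${inst}; ndd /dev/${name} ${param}"
    fi
}

ndd_cmd hme 0 adv_autoneg_cap
```

On a system with no /dev/hme0 node, this falls through to the two-step Style 2 form.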
In all networking drivers, the instance number is allocated at enumeration time, when the adapter is installed. The instance number is recorded permanently in the /etc/path_to_inst file. Take note of the instance numbers in /etc/path_to_inst so that you can configure the correct instance using ndd.
# grep ce /etc/path_to_inst
"/pci@1f,2000/pci@1/network@0" 2 "ce"
"/pci@1f,2000/pci@2/network@0" 1 "ce"
"/pci@1f,2000/pci@4/network@0" 0 "ce"
The instance number is the second field on each line and can be used in both configuration styles. The preceding examples show the ndd utility being used in non-interactive mode, where only one parameter can be modified per command line. There is also an interactive mode that allows you to enter an ndd session in which you can read and write the ndd parameters associated with a device.
This mode assumes that you remember all the parameter options of the network interface.
If ndd is pointed to a device node that can only be a Style 1 device, then ndd is already pointing to a device instance.
# ndd /dev/bge0
If the device node can be a Style 2 device, then ndd is pointing to a driver and not necessarily a device instance. Therefore, you must always first set the instance variable to ensure that ndd is pointing to the right device instance before configuration begins.
# ndd /dev/ce
name to get/set ? instance
value ? 0
A very useful feature of ndd is the ? query, which you can use to get a list of the parameters that a particular driver supports.
# ndd /dev/ce
name to get/set ? ?
?                        (read only)
instance                 (read and write)
adv_autoneg_cap          (read and write)
adv_1000fdx_cap          (read and write)
adv_1000hdx_cap          (read and write)
adv_100T4_cap            (read and write)
adv_100fdx_cap           (read and write)
adv_100hdx_cap           (read and write)
adv_10fdx_cap            (read and write)
adv_10hdx_cap            (read and write)
adv_asmpause_cap         (read and write)
adv_pause_cap            (read and write)
master_cfg_enable        (read and write)
master_cfg_value         (read and write)
use_int_xcvr             (read and write)
enable_ipg0              (read and write)
ipg0                     (read and write)
ipg1                     (read and write)
ipg2                     (read and write)
rx_intr_pkts             (read and write)
rx_intr_time             (read and write)
red_dv4to6k              (read and write)
red_dv6to8k              (read and write)
red_dv8to10k             (read and write)
red_dv10to12k            (read and write)
tx_dma_weight            (read and write)
rx_dma_weight            (read and write)
infinite_burst           (read and write)
disable_64bit            (read and write)
name to get/set ?
#
Once you have set the desired parameters, they persist until the next reboot. To make them persist across a reboot, you must set those parameters in the driver.conf file.
In both cases, the driver.conf file resides in the same directory as the device driver. For example, the ge driver resides in /kernel/drv, so the ge.conf file also resides in /kernel/drv. Note that even when a system is booted in 64-bit mode, the driver.conf file is still located in the same directory as the 32-bit driver.
There are also older examples of global driver.conf parameters, which are a design choice of the driver developer. Those configuration parameters embed the device name and instance in the property name.
trp.conf:

trp0_ring_speed = 4;
A more common method is to take advantage of the driver.conf framework to identify the particular instance you want to configure.
The name is simply the driver name; in the previous example, hme is the name. The parent and unit address are found using the /etc/path_to_inst file. It is assumed that when you write a driver.conf file and apply instance properties, you know the instance to which you are applying the parameters. The instance becomes the key to finding the parent and unit address in the /etc/path_to_inst file.
# grep hme /etc/path_to_inst
"/pci@1f,2000/network@2" 2 "hme"
"/pci@1f,2000/network@1" 1 "hme"
"/pci@1f,2000/network@0" 0 "hme"
In the example above, the instance number being configured is 1; the instance number is the second field on each line. Taking the second line as the line associated with hme1, you can begin to extract the information required to write the driver.conf file. The unit address and parent are part of the leaf node information, which is the first quoted string.
"/pci@1f,2000/network@1"    1        "hme"
       leaf node         instance    name
The leaf node can be thought of as a file in a directory structure, so you can address it relative to the root or relative to a parent. Relative to a parent, the leaf node is the string to the right of the last /, and the string remaining to the left of that / is the parent.
"/pci@1f,2000/network@1"
  parent:    /pci@1f,2000
  leaf node: network@1
Therefore, in the above example the parent is /pci@1f,2000. The unit address is the number or byte sequence to the right of the @ in the remaining leaf node, so the unit address is 1. The resulting driver.conf file to disable auto-negotiation for instance 1 is as follows:
hme.conf:

name = "hme" parent = "/pci@1f,2000" \
unit-address = "1" adv_autoneg_cap = 0;
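The parent and unit-address extraction described above is plain string splitting, which the shell can do directly. A small sketch using the example leaf node from the text (illustrative only):

```shell
#!/bin/sh
# Sketch: split a /etc/path_to_inst leaf node into the parent and
# unit-address properties needed for a driver.conf entry.
node="/pci@1f,2000/network@1"

parent=${node%/*}   # everything left of the last '/'  -> /pci@1f,2000
leaf=${node##*/}    # the leaf node itself             -> network@1
unit=${leaf#*@}     # everything right of the '@'      -> 1

echo "parent = \"$parent\" unit-address = \"$unit\""
```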
/etc/system:

set ge:ge_dmaburst_mode = 1

In this entry, ge is the driver module name, ge_dmaburst_mode is the parameter, and 1 is the value.
Once this file has been modified, the system must be rebooted for the changes to take effect.
ipackets     The number of packets received by the interface.
ipackets64   A 64-bit version of ipackets, so a larger count can be kept.
rbytes       The number of bytes received by the interface.
rbytes64     A 64-bit version of rbytes, so a larger count can be kept.
multircv     The number of multicast packets received by the interface.
brdcstrcv    The number of broadcast packets received by the interface.
unknowns     The number of packets received by an interface that cannot be classified to any Layer 3 or above protocol available in the system.
ierrors      The number of receive packet errors that led to a packet being discarded.
norcvbuf     The number of receive packets that could not be received because the NIC had no buffers available.
opackets     The number of packets transmitted by the interface.
opackets64   A 64-bit version of opackets, so a larger count can be kept.
obytes       The number of bytes transmitted by the interface.
obytes64     A 64-bit version of obytes, so a larger count can be kept.
multixmt     The number of multicast packets transmitted by the interface.
brdcstxmt    The number of broadcast packets transmitted by the interface.
oerrors      The number of packets that encountered an error on transmission, causing the packets to be dropped.
noxmtbuf     The number of transmit packets that were stalled for transmission because the NIC had no buffers available.
collisions   The number of collisions encountered while transmitting packets.
ifspeed      The current speed of the network connection in megabits per second.
-            The current MTU allowed by the driver, including the Ethernet header and 4-byte CRC.
xcvr_addr    Provides the MII address of the transceiver currently in use.
xcvr_id      Provides the specific vendor/device ID of the transceiver currently in use.
xcvr_inuse   Indicates the type of transceiver currently in use.
cap_1000fdx        state  Indicates the device is 1 Gbit/sec full-duplex capable.
cap_1000hdx        state  Indicates the device is 1 Gbit/sec half-duplex capable.
cap_100fdx         state  Indicates the device is 100 Mbit/sec full-duplex capable.
cap_100hdx         state  Indicates the device is 100 Mbit/sec half-duplex capable.
cap_10fdx          state  Indicates the device is 10 Mbit/sec full-duplex capable.
cap_10hdx          state  Indicates the device is 10 Mbit/sec half-duplex capable.
cap_asmpause       state  Indicates the device is capable of asymmetric pause Ethernet flow control.
cap_pause          state  Indicates the device is capable of symmetric pause Ethernet flow control when set to 1 and cap_asmpause is 0. If cap_asmpause = 1 while cap_pause = 0, transmit pauses based on receive congestion; if cap_pause = 1, receive pauses and slows down transmit to avoid congestion.
cap_rem_fault      state  Indicates the device is capable of remote fault indication.
cap_autoneg        state  Indicates the device is capable of auto-negotiation.
adv_cap_1000fdx    state  Indicates the device is advertising 1 Gbit/sec full-duplex capability.
adv_cap_1000hdx    state  Indicates the device is advertising 1 Gbit/sec half-duplex capability.
adv_cap_100fdx     state  Indicates the device is advertising 100 Mbit/sec full-duplex capability.
adv_cap_100hdx     state  Indicates the device is advertising 100 Mbit/sec half-duplex capability.
adv_cap_10fdx      state  Indicates the device is advertising 10 Mbit/sec full-duplex capability.
adv_cap_10hdx      state  Indicates the device is advertising 10 Mbit/sec half-duplex capability.
adv_cap_asmpause   state  Indicates the device is advertising the capability of asymmetric pause Ethernet flow control.
adv_cap_pause      state  Indicates the device is advertising the capability of symmetric pause Ethernet flow control when adv_cap_pause = 1 and adv_cap_asmpause = 0. If adv_cap_asmpause = 1 while adv_cap_pause = 0, transmit pauses based on receive congestion; if adv_cap_pause = 1, receive pauses and slows down transmit to avoid congestion.
adv_cap_rem_fault  state  Indicates the device is experiencing a fault that it is going to forward to the link partner.
adv_cap_autoneg    state  Indicates the device is advertising the capability of auto-negotiation.
lp_cap_1000fdx     state  Indicates the link partner device is 1 Gbit/sec full-duplex capable.
lp_cap_1000hdx     state  Indicates the link partner device is 1 Gbit/sec half-duplex capable.
lp_cap_100fdx      state  Indicates the link partner device is 100 Mbit/sec full-duplex capable.
lp_cap_100hdx      state  Indicates the link partner device is 100 Mbit/sec half-duplex capable.
lp_cap_10fdx       state  Indicates the link partner device is 10 Mbit/sec full-duplex capable.
lp_cap_10hdx       state  Indicates the link partner device is 10 Mbit/sec half-duplex capable.
lp_cap_asmpause    state  Indicates the link partner device is capable of asymmetric pause Ethernet flow control.
lp_cap_pause       state  Indicates the link partner device is capable of symmetric pause Ethernet flow control when set to 1 and lp_cap_asmpause is 0. If lp_cap_asmpause = 1 while lp_cap_pause = 0, transmit pauses based on receive congestion; if lp_cap_pause = 1, receive pauses and slows down transmit to avoid congestion.
lp_cap_rem_fault   state  Indicates the link partner is experiencing a fault with the link.
lp_cap_autoneg     state  Indicates the link partner device is capable of auto-negotiation.
link_asmpause  state  Indicates the shared link asymmetric pause setting. The value is based on the local resolution column of Table 37-4 of the IEEE 802.3 specification. link_asmpause = 0: the link is symmetric pause. link_asmpause = 1: the link is asymmetric pause.
link_pause     state  Indicates the shared link pause setting. The value is based on the local resolution described above. If link_asmpause = 0 while link_pause = 0, the link has no flow control; if link_pause = 1, the link can flow control in both directions. If link_asmpause = 1 while link_pause = 0, the local flow control setting can limit the link partner; if link_pause = 1, the link will flow control the local Tx.
link_speed     state  The current speed of the network connection in megabits per second.
link_duplex    state  Indicates the link duplex. link_duplex = 0: the link is down and the duplex is unknown. link_duplex = 1: the link is up in half-duplex mode. link_duplex = 2: the link is up in full-duplex mode.
link_up        state  Indicates whether the link is up or down. link_up = 1: the link is up. link_up = 0: the link is down.
The starting point for this discussion is the physical layer, because that layer is the most important with respect to creating the link between two systems. At the physical layer, failures can prevent the link from coming up. Worse, the link can come up with mismatched duplex, giving rise to less visible problems. The discussion then moves to the data link layer, where most problems are performance related. There, the architectural features described above can be used to address many of these performance problems.
kstat ce:0 | grep link_
link_asmpause    0
link_duplex      2
link_pause       0
link_speed       1000
link_up          1
If the link_up variable is set, a physical connection is present. But also check that the speed matches your expectation. For example, if the interface is a 1000BASE-T interface and you expect it to run at 1000 Mbit/sec, the link_speed parameter should indicate 1000. If this is not the case, a check of the link partner capabilities might be required to establish whether they are the limiting factor. The following kstat command line shows the link partner capabilities:
kstat ce:0 | grep lp_cap
lp_cap_1000fdx     1
lp_cap_1000hdx     1
lp_cap_100T4       1
lp_cap_100fdx      1
lp_cap_100hdx      1
lp_cap_10fdx       1
lp_cap_10hdx       1
lp_cap_asmpause    0
lp_cap_autoneg     1
lp_cap_pause       0
If the link partner appears to be capable of the desired speed, the problem might be local. There are two possibilities: the NIC itself is not capable of the desired speed, or the configuration has no shared capabilities that can be agreed on, in which case the link will not come up. You can check this using the following kstat command line:
kstat ce:0 | grep cap_
cap_1000fdx        1
cap_1000hdx        1
cap_100T4          1
cap_100fdx         1
cap_100hdx         1
cap_10fdx          1
cap_10hdx          1
cap_asmpause       0
cap_autoneg        1
cap_pause          0
...
If all the required capabilities are available for the desired speed and duplex, yet there is still a problem achieving that speed, the only remaining possibility is an incorrect configuration. You can check this by looking at the individual ndd adv_cap_* parameters, or you can use the kstat command:
kstat ce:0 | grep adv_cap_
adv_cap_1000fdx    1
adv_cap_1000hdx    1
adv_cap_100T4      1
adv_cap_100fdx     1
adv_cap_100hdx     1
adv_cap_10fdx      1
adv_cap_10hdx      1
adv_cap_asmpause   0
adv_cap_autoneg    1
adv_cap_pause      0
Configuration issues are where most problems lie. All of them can be addressed by using the kstat commands above to establish the local and remote configuration, then adjusting the adv_cap_* parameters with ndd to correct the problem.
The most common configuration problem is duplex mismatch, which is induced when one side of a link is enabled for auto-negotiation and the other is not. Disabling auto-negotiation is known as Forced mode, and it can only be guaranteed for 10/100 Mode operation. For 1000BASE-T UTP Mode operation, Forced mode (auto-negotiation disabled) is not guaranteed to be available, because not all vendors support it.

If auto-negotiation is turned off, you must ensure that both ends of the connection are in Forced mode and that the speed and duplex are matched perfectly. If you fail to match Forced mode in gigabit operation, the link will not come up at all. Note that this result is quite different from the 10/100 Mode case. In 10/100 Mode operation, if only one end of the connection is auto-negotiating (with full capabilities advertised), the link will come up at the correct speed, but the auto-negotiating end will always fall back to half duplex, creating the potential for a duplex mismatch if the forced end is set to full duplex. If both sides are set to Forced mode and you fail to match speeds, the link will never come up. If both sides are set to Forced mode and you fail to match duplex, the link will come up, but with a duplex mismatch.

Duplex mismatch is a silent failure that manifests itself, from an upper-layer point of view, as very poor performance: many packets are lost to collisions and late collisions occurring on the half-duplex end of the connection, due to violations of the Ethernet protocol induced by the full-duplex end. The half-duplex end experiences collisions and late collisions, while the full-duplex end experiences all manner of smashed packets, incrementing the MIB counters that measure CRC errors, runts, giants, and alignment errors.

If the node experiencing poor performance is the half-duplex end of the connection, look at the kstat values for collisions and late_collisions.
kstat ce:0 | grep collisions
collisions         22332
late_collisions    15432

If the node experiencing poor performance is the full-duplex end of the connection, look at the packet corruption counters, for example, crc_err and alignment_err.

kstat ce:0 | grep crc_err
crc_err            22332
kstat ce:0 | grep alignment_err
alignment_err      224532
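As a rough illustration of the diagnosis above, the classifier below encodes the rule of thumb that late collisions point to the half-duplex end and CRC/alignment errors to the full-duplex end. The function and the sample counter values are a sketch, not a tool from the text.

```shell
#!/bin/sh
# Sketch: crude duplex-mismatch classifier from already-collected counters.
classify() {
    # $1 = late_collisions, $2 = crc_err (stand-ins for kstat values)
    if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
        echo "half-duplex end"
    elif [ "$2" -gt 0 ]; then
        echo "full-duplex end"
    else
        echo "no mismatch signature"
    fi
}

# Values modeled on the sample output above.
classify 15432 0
```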
Depending on the capability of the switch or remote end of the connection, it might be possible to make similar measurements there. Forced mode, besides creating the potential for a duplex mismatch, also has the drawback of hiding the link partner capabilities from the local station: in Forced mode, you cannot view the lp_cap_* values to determine the capabilities of the remote link partner locally. Where possible, use the default of auto-negotiation with all capabilities advertised, and avoid tuning the physical link parameters. Given the maturity of the auto-negotiation protocol and its requirement in the 802.3z specification for one-gigabit UTP physical implementations, ensure that auto-negotiation is enabled.
Alternatively, you can use the interactive mode, described previously. The mechanism for enabling Ethernet flow control on the ge interface is also different, using the parameters in the table below.
TABLE 34

Statistic    Values  Description
adv_pauseTX  0-1     Transmit a pause if the Rx buffer is full.
adv_pauseRX  0-1     When a pause is received, slow down Tx.
There's also a deviation in ge for adjusting ndd parameters. For example, when you modify ndd parameters such as adv_1000fdx_cap, the changes do not take effect until the adv_autoneg_cap parameter is toggled to change state (from 0 to 1 or from 1 to 0). This is a deviation from the general Ethernet MII/GMII convention that ndd changes take effect immediately.
kstat to view device-specific statistics
mpstat to view system utilization information
lockstat to show areas of contention
You can use the information from these tools to tune specific parameters. The tuning examples that follow describe where this information is most useful. You have two options for tuning: the /etc/system file or the ndd utility. Using the /etc/system file to modify the initial value of driver variables requires a system reboot for the changes to take effect. If you use the ndd utility, the changes take effect immediately; however, any modifications made with ndd are lost when the system goes down. If you want the ndd tuning properties to persist across a reboot, add them to the respective driver.conf file. Parameters that have kernel statistics but no capability to tune for improvement are omitted from this discussion, because they offer no troubleshooting leverage.
ge Gigabit Ethernet
The ge interface provides some kstats that can be used to measure performance bottlenecks in the driver on the Tx or Rx side. These kstats let you decide what corrective tuning to apply, based on the tuning parameters described previously. The useful statistics are shown in TABLE 5-72.
TABLE 5-72

kstat name       Type     Description
rx_overflow      counter  Number of times the hardware is unable to receive a packet because the internal FIFOs are full.
no_free_rx_desc  counter  Number of times the hardware is unable to post a packet because no more Rx descriptors are available.
no_tmds          counter  Number of times transmit packets are posted on the driver streams queue, for processing some time later by the queue's service routine.
nocanput         counter  Number of times a packet is simply dropped by the driver because the module above the driver cannot accept the packet.
pci_bus_speed    value    The PCI bus speed that is driving the card.
When rx_overflow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_overflow is incrementing and no_free_rx_desc is not, the PCI or SBus bus is presenting an obstacle to the flow of packets through the device. This could be because the ge card is plugged into a slower I/O bus. You can confirm the bus speed by looking at the pci_bus_speed statistic. An SBus speed of 40 MHz or a PCI bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic.

Another scenario that can lead to rx_overflow incrementing on its own is sharing the I/O bus with another device that has bandwidth requirements similar to those of the ge card.

These scenarios are hardware limitations. There is no solution for SBus. For PCI, a first step in addressing them is to enable the infinite burst capability on the PCI bus, using the /etc/system tuning parameter ge_dmaburst_mode. Alternatively, you can reorganize the system to give the ge interface a 66-MHz PCI slot, or you can separate devices that contend for a shared bus segment by giving each of them its own bus segment.
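The ge decision procedure above can be summarized as a small classifier. This is a sketch with stand-in counter values; on a live system the inputs would come from the kstats in TABLE 5-72, and the suggested remedies simply restate the tunables discussed in the text.

```shell
#!/bin/sh
# Sketch: map the ge Rx kstat pattern onto the likely bottleneck.
diagnose_ge_rx() {
    # $1 = rx_overflow, $2 = no_free_rx_desc, $3 = nocanput
    if [ "$1" -gt 0 ] && [ "$2" -eq 0 ]; then
        echo "I/O bus bottleneck: check pci_bus_speed, consider ge_dmaburst_mode"
    elif [ "$2" -gt 0 ]; then
        echo "CPU bottleneck: consider Rx load balancing (ge_intr_mode, ge_put_cfg)"
    elif [ "$3" -gt 0 ]; then
        echo "upper-layer bottleneck: tune the protocol stack"
    else
        echo "no Rx bottleneck signature"
    fi
}

# Stand-in values: overflow with free descriptors points at the bus.
diagnose_ge_rx 120 0 0
```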
The probability that rx_overflow incrementing is the only problem is small. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind, leading to the Rx descriptor ring being exhausted of free elements with which to receive more packets. If this happens, the kstat no_free_rx_desc parameter will begin to increment, meaning that (in the single-CPU case) the CPU cannot absorb the incoming packets.

If more than one CPU is available, it is still possible to overwhelm a single CPU. But given that the Rx processing can be split using the alternative Rx data delivery models provided by ge, it might be possible to distribute the processing of incoming packets to more than one CPU. To do this, first ensure that ge_intr_mode is not set to 1, and tune ge_put_cfg to enable the load-balancing worker thread or streams service routine.

Another possible scenario is that the ge device is adequately handling the rate of incoming packets, but the upper layer is unable to deal with packets at that rate. In this case, the kstat nocanput parameter will be incrementing. The tuning for this condition lies in the upper-layer protocols. If you are running the Solaris 8 operating system or an earlier version, upgrading to Solaris 9 might help your application experience fewer nocanput errors, due to the improved multithreading and IP scalability in the Solaris 9 operating system.

The Tx side is also subject to an overwhelmed condition, although this is less likely than any Rx-side condition. An overwhelmed Tx side is visible when the no_tmds parameter begins to increment. If the Tx descriptor ring size needs to be increased, the /etc/system tunable parameter ge_nos_tmd provides that capability.
ce Gigabit Ethernet
The ce interface provides a far more extensive list of kstats that can be used to measure performance bottlenecks in the driver on the Tx or Rx side. These kstats let you decide what corrective tuning to apply, based on the tuning parameters described previously. The useful statistics are shown in TABLE 5-73.
TABLE 5-73

kstat name        Type     Description
rx_ov_flow        counter  Number of times the hardware is unable to receive a packet because the internal FIFOs are full.
rx_no_buf         counter  Number of times the hardware is unable to receive a packet because no Rx buffers are available.
rx_no_comp_wb     counter  Number of times the hardware is unable to receive a packet because there is no space in the completion ring to post a received packet descriptor.
ipackets_cpuXX    counter  Number of packets directed to load-balancing thread XX.
-                 counter  Number of packets sent using the Multidata interface.
rx_hdr_pkts       counter  Number of packets arriving that are less than 252 bytes in length.
rx_mtu_pkts       counter  Number of packets arriving that are greater than 252 bytes in length.
rx_jumbo_pkts     counter  Number of packets arriving that are greater than 1522 bytes in length.
rx_nocanput       counter  Number of times a packet is simply dropped by the driver because the module above the driver cannot accept the packet.
rx_pkts_dropped   counter  Number of packets dropped because the service FIFO queue is full.
tx_hdr_pkts       counter  Number of packets sent with the small-packet transmission method, in which packets are copied into a premapped DMA buffer.
tx_ddi_pkts       counter  Number of packets sent with the mid-range DDI DMA transmission method.
tx_dvma_pkts      counter  Number of packets sent with the top-range DVMA fast-path DMA transmission method.
tx_jumbo_pkts     counter  Number of packets being sent that are greater than 1522 bytes in length.
tx_max_pending    counter  Measure of the maximum number of packets ever queued on a Tx ring.
-                 counter  Number of times a packet transmit was attempted and no Tx descriptor elements were available; the packet is postponed until later.
tx_queueX         counter  Number of packets transmitted on a particular queue.
mac_mtu           value    The maximum packet size allowed past the MAC.
pci_bus_speed     value    The PCI bus speed that is driving the card.
When rx_ov_flow is incrementing, packet processing is not keeping up with the packet arrival rate. If rx_ov_flow is incrementing while rx_no_buf or rx_no_comp_wb is not, the PCI bus is presenting an obstacle to the flow of packets through the device. This could be because the ce card is plugged into a slower PCI bus, which you can establish by looking at the pci_bus_speed statistic. A bus speed of 33 MHz might not be sufficient to sustain full bidirectional one-gigabit Ethernet traffic. Another scenario that can lead to rx_ov_flow incrementing on its own is sharing the PCI bus with another device that has bandwidth requirements similar to those of the ce card.

These scenarios are hardware limitations. A first step in addressing them is to enable the infinite burst capability on the PCI bus, using the ndd tuning parameter infinite-burst. Infinite burst gives ce more bandwidth, but the Tx and Rx sides of the ce device still compete for that PCI bandwidth. Therefore, if the traffic profile shows a bias toward Rx traffic and this condition is leading to rx_ov_flow, you can bias PCI transactions in favor of the Rx DMA channel relative to the Tx DMA channel, using the ndd parameters rx-dma-weight and tx-dma-weight. Alternatively, you can reorganize the system by giving the ce interface a 66-MHz PCI slot, or you can separate devices that contend for a shared bus segment by giving each of them its own bus segment.

If this doesn't do much to reduce the problem, consider using Random Early Detection (RED) to minimize the impact of dropped packets, keeping alive connections that would otherwise be terminated by regular overflow. The parameters that enable RED are configurable using ndd: red-dv4to6k, red-dv6to8k, red-dv8to10k, and red-dv10to12k.

The probability that rx_ov_flow incrementing is the only problem is small. Typically, Sun systems have a fast PCI bus and memory subsystem, so delays are seldom induced at that level. It is more likely that the protocol stack software falls behind, leading to the Rx buffers or completion descriptor ring being exhausted of free elements with which to receive more packets.
If this happens, the kstat parameters rx_no_buf and rx_no_comp_wb will begin to increment. This can mean there is not enough CPU power to absorb the packets, but it can also be due to a bad balance of buffer ring size versus completion ring size, leading to rx_no_comp_wb incrementing without rx_no_buf incrementing. The default configuration is one buffer to four completion elements. This works well provided that the arriving packets are larger than 256 bytes. If they are not, and that traffic dominates, up to 32 packets are packed into a single buffer, making a configuration imbalance much more likely. In that case, more completion elements need to be made available. This can be addressed using the /etc/system tunables ce_ring_size, to adjust the number of available Rx buffers, and ce_comp_ring_size, to adjust the number of Rx packet completion elements. To understand the traffic profile of the Rx side so you can tune these parameters, use kstat to look at the distribution of Rx packets across rx_hdr_pkts and rx_mtu_pkts.
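To judge whether the small-packet case above applies, compare the two profile counters. This is a sketch: the values are stand-ins for the rx_hdr_pkts and rx_mtu_pkts kstats, and the simple-majority threshold is an illustrative choice, not from the text.

```shell
#!/bin/sh
# Sketch: decide whether the Rx profile is small-packet dominated, in
# which case completion elements run out before buffers do.
rx_hdr_pkts=900000    # stand-in: packets < 252 bytes
rx_mtu_pkts=100000    # stand-in: packets > 252 bytes

if [ "$rx_hdr_pkts" -gt "$rx_mtu_pkts" ]; then
    echo "small-packet dominated: consider raising ce_comp_ring_size"
else
    echo "MTU-packet dominated: the default buffer/completion ratio is fine"
fi
```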
If ce is run on a single-CPU system and rx_no_buf and rx_no_comp_wb are incrementing, you will have to resort again to RED or enable Ethernet flow control. If more than one CPU is available, it is still possible to overwhelm a single CPU. Given that the Rx processing can be split using the alternative Rx data delivery models provided by ce, it might be possible to distribute the processing of incoming packets to more than one CPU, described earlier as Rx load balancing. This happens by default if the system has four or more CPUs, which enables four load-balancing worker threads. The CPU threshold and the number of load-balancing worker threads enabled can be managed using the /etc/system tunables ce_cpu_threshold and ce_inst_taskqs.

The number of load-balancing worker threads, and how evenly the Rx load is distributed among them, can be viewed with the ipackets_cpuXX kstats. The highest XX tells you how many load-balancing worker threads are running, while the values of these parameters give you the spread of work across the instantiated threads. This in turn indicates whether the load balancing is yielding a benefit. For example, if all the ipackets_cpuXX kstats count approximately even numbers of packets, the load balancing is optimal. If only one is incrementing and the others are not, the benefit of Rx load balancing is nullified. You can also use mpstat to measure whether the system is experiencing an even spread of CPU activity. In the ideal case, good load balancing shown in the ipackets_cpuXX kstats should also be visible in mpstat as a workload evenly distributed across multiple CPUs. If none of this benefit is visible, disable the load balancing capability completely, using the /etc/system variable ce_taskq_disable.
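The evenness check described above can be sketched numerically. The four counts are invented stand-ins for ipackets_cpuXX kstat values; a small max-min spread relative to the totals suggests near-optimal balancing.

```shell
#!/bin/sh
# Sketch: gauge how evenly Rx load balancing spreads packets across the
# worker threads.
counts="250100 249800 250500 249600"   # stand-ins for ipackets_cpu00..03

max=0
min=999999999
for c in $counts; do
    if [ "$c" -gt "$max" ]; then max=$c; fi
    if [ "$c" -lt "$min" ]; then min=$c; fi
done

spread=$((max - min))
echo "spread across worker threads: $spread packets"
```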
The Rx load balancing provides packet queues, also known as service FIFOs, between the interrupt threads that fan out the workload and the service FIFO worker threads that drain the FIFOs and complete the workload. These service FIFOs are of fixed size, controlled by the /etc/system variable ce_srv_fifo_depth. The service FIFOs can themselves overflow and drop packets when the packet arrival rate exceeds the rate at which the draining thread can complete the post-processing. These drops are measured by the rx_pkts_dropped kstat. If drops are occurring, you can increase the size of the service FIFOs, or you can increase their number, allowing more Rx load balancing. In some cases it is possible to eliminate increments in rx_pkts_dropped only to have the problem move to rx_nocanput, which is generally addressable only by tuning the upper-layer protocols. If you are running the Solaris 8 operating system or an earlier version, upgrading to Solaris 9 might help your application experience fewer nocanput errors, due to the improved multithreading and IP scalability in the Solaris 9 operating system.
There is a difficulty in maximizing the Rx load balancing: it is contingent on the Tx ring processing. This is measurable with the lockstat command, which will show the ce_start routine at the top as the most contended driver function. This contention cannot be eliminated, but it is possible to employ a Tx method known as transmit serialization, which keeps contention to a minimum by forcing the Tx processing onto a fixed set of CPUs. Keeping the Tx process on a fixed CPU reduces the risk of CPUs spinning while waiting for other CPUs to complete their Tx activity, ensuring that CPUs are always kept busy doing useful work. This transmission method is enabled by setting the /etc/system variable ce_start_cfg to 1. When you enable transmit serialization, you trade increased transmit latency for the avoidance of mutex spins induced by contention.

The Tx side is also subject to an overwhelmed condition, although this is less likely than any Rx-side condition. It becomes visible when the tx_max_pending value matches the size of the /etc/system variable ce_tx_ring_size. If this occurs, packets are being postponed because Tx descriptors are being exhausted, and ce_tx_ring_size should be increased.

The tx_hdr_pkts, tx_ddi_pkts, and tx_dvma_pkts kstats are useful for establishing the traffic profile of an application and matching it with the capabilities of a system. For example, many small systems have very fast memory access times, making the cost of setting up DMA transactions higher than transmitting directly from a premapped DMA buffer. In that case you can adjust the DMA thresholds, programmable via /etc/system, to push more packets into the premapped DMA buffer rather than per-packet DMA programming. Once the tuning is complete, view these statistics again to see whether the tuning took effect. The tx_queueX kstats give a good indication of whether the Tx load balancing matches the Rx side.
If no load balancing is visible, meaning all the packets appear to be getting counted by only one tx_queue, then it may make sense to switch this feature off. The /etc/system variable that does that is ce_no_tx_lb. The mac_mtu statistic indicates the maximum size of packet that will make it through the ce device. It is useful for knowing whether jumbo frames are enabled at the DLPI layer below TCP/IP. If jumbo frames are enabled, the MTU indicated by mac_mtu will be 9216. This is helpful because it will show whether there is a mismatch between the DLPI layer MTU and the IP layer MTU, allowing troubleshooting to occur in a layered manner. Once jumbo frames are successfully configured at both the driver layer and the TCP/IP layer, you should use the rx_jumbo_pkts and tx_jumbo_pkts statistics to ensure that jumbo frame packets are actually being received and transmitted correctly.
CHAPTER 6

Network Availability Design Strategies
A flat architecture is composed of a multilayer switch that performs multiple switching functions in one physical network device. This implies that a packet will traverse fewer network switching devices when communicating from the client to the server, which results in higher availability. A multi-level architecture is composed of multiple small switches, where each switch performs one or two switching functions. This implies that a packet will traverse more network switching devices when communicating from the client to the server, which results in lower availability.
Serial components reduce availability and parallel components increase availability. A serial design requires that every component is functioning at the same time. If any one component fails, the entire system fails. A parallel design offers multiple paths in case one path fails. In a parallel design, if any one component fails, the entire system still survives by using the backup path. Three network architecture aspects impact network availability:
- Component failure. This aspect is the probability of the device failing. It is measured statistically: the average time the device works divided by the average working time plus the failed time gives the availability, which is derived from the mean time between failures (MTBF) and the repair time. In this calculation, components that are connected serially dramatically reduce the overall result, while components that are in parallel increase it.
- System failure. This aspect captures failures that are caused by external factors, such as a technician accidentally pulling out a cable. The number of components that are potential candidates for failure is directly proportional to the complexity of the system. Design B in FIGURE 6-1 has more components that can go wrong, which contributes to the increased probability of failure.
- Single points of failure. This aspect captures the number of devices that can fail with the system still functioning. Neither Design A nor Design B shown in FIGURE 6-1 has a single point of failure (SPOF), so they are equal in this regard. However, Design B is somewhat more resilient because if a network interface card (NIC) fails, that failure is isolated by the Layer 2 switch and does not impact the rest of the architecture. This is a trade-off to consider: some availability is sacrificed for increased resiliency and isolation of failures.
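To make the serial-versus-parallel arithmetic concrete, here is a small sketch using assumed figures (a per-component availability of 0.9999 and three serial components per path; both numbers are hypothetical):

```shell
# Availability of one path of three serial components, each 99.99%
# available, versus two such paths in parallel.
A=0.9999
serial=$(awk -v a="$A" 'BEGIN { printf "%.6f", a ^ 3 }')
parallel=$(awk -v s="$serial" 'BEGIN { printf "%.8f", 1 - (1 - s) ^ 2 }')
echo "one serial path:    $serial"    # every component must be up
echo "two parallel paths: $parallel"  # survives the loss of either path
```

Serial components multiply their availabilities, driving the product down, while parallel paths multiply their unavailabilities, driving the probability of total failure down; this is the arithmetic behind all three aspects above.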
FIGURE 6-1 shows two network designs. In both designs, Layer 2 switches provide physical connectivity for one virtual local area network (VLAN) domain. Layer 2-7 switches are multilayer devices providing routing, load balancing, and other IP services in addition to physical connectivity.
Design A shows a flat architecture, often seen with multilayer chassis-based switches such as the Extreme Networks Black Diamond, Foundry Networks BigIron, or Cisco switches. The switch can be partitioned into VLANs, isolating traffic from one segment to another while still being managed as one device. In this approach, availability will be relatively high because there are two parallel paths from the ingress to each server and only two serial components that a packet must traverse to reach the target server. In Design B, the architecture provides the same functionality, but across many small switches. From an availability perspective, this solution will have a relatively lower mean time between failures (MTBF) because there are more serial components that a packet must traverse to reach a target server. Other disadvantages of this approach include manageability, scalability, and performance. However, one can argue that there might be increased security using this approach, which for some customers outweighs all other factors: in Design B, multiple switches must be hacked to control the network, whereas in Design A, only one switch needs to be hacked to bring down the entire network.
Networking Concepts and Technology: A Designer's Resource
FIGURE 6-1  Two network designs: (A) a flat architecture with redundant multilayer (Layer 2-7) switches and a higher MTBF, hosting the web, directory, application, database, and integration services; (B) a multi-tier architecture of Layer 2 and Layer 3 switches with service modules and a distribution layer.
Layer 2 Strategies
There are several Layer 2 availability design options. Layer 2 availability designs are desirable because any fault detection and recovery is transparent to the IP layer. Further, the fault detection and recovery can be relatively fast if the correct approach is taken. In this section, we explain the operation and recovery times for three approaches:
- Trunking and variants based on IEEE 802.3ad
- SMLT and DMLT, a relatively new and promising approach available from Nortel Networks
- Spanning Tree, a time-tested and proven Layer 2 availability strategy, originally designed for bridged networks by the brilliant Dr. Radia Perlman, then of DEC and now with Sun Microsystems
FIGURE 6-2  Link aggregation model: above the physical MACs (PMAC, PHY1, PHY2), a logical MAC (LMAC) containing the LACP entity, frame collector, frame distributor, and aggregator parser/mux presents a single interface to the MAC client (IP).
Theory of Operation
The Link Aggregation Control Protocol (LACP) allows both ends of the trunk to communicate trunking, or link aggregation, information. The first command sent is the Query command, with which each link partner discovers the link aggregation capabilities of the other. If both partners are willing and capable, a Start Group command is sent, indicating that a link aggregation group is to be created, followed by commands adding segments to this group that include link identifiers tied to the ports participating in the aggregation. The LACP can also delete a link, which might be due to the detection of a failed link. Instead of rebalancing the load across the remaining ports, the algorithm simply places the failed link's traffic onto one of the remaining links. The collector reassembles traffic coming from the different links. The distributor takes an input stream and spreads the traffic across the ports belonging to a trunk group or link aggregation group.
Availability Issues
To understand its suitability for network availability, Sun Trunking 1.2 software was installed on several quad fast Ethernet cards. The client has four trunked links connected to the switch, and the server likewise has four links connected to the switch. This setup allows the load to be distributed across the four links, as shown in FIGURE 6-3.
FIGURE 6-3  Test setup: client and server each connect four point-to-point trunked links to a trunking-capable switch.
The highlighted line (in bold italic) in the CODE EXAMPLE 6-1 output shows the traffic from the client's qfe0 moved to qfe1 under load balancing.
CODE EXAMPLE 6-1  Output Showing Traffic from Client qfe0 to Server qfe1
Jan 10 14:22:05 2002
Name  Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0    210      0    130      0       0    0  100.00   25.00
qfe1      0      0    130      0       0    0    0.00   25.00
qfe2      0      0    130      0       0    0    0.00   25.00
qfe3      0      0    130      0       0    0    0.00   25.00
5.73 (New Peak)    31.51 (Past Peak)
Jan 10 14:22:06 2002
Name  Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0      0      0      0      0       0    0    0.00    0.00
qfe1      0      0      0      0       0    0    0.00    0.00
qfe2      0      0      0      0       0    0    0.00    0.00
qfe3      0      0      0      0       0    0    0.00    0.00
0.00 (New Peak)    31.51 (Past Peak)
Jan 10 14:22:07 2002
Name  Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0      0      0      0      0       0    0    0.00    0.00
qfe1      0      0      0      0       0    0    0.00    0.00
qfe2      0      0      0      0       0    0    0.00    0.00
qfe3      0      0      0      0       0    0    0.00    0.00
0.00 (New Peak)    31.51 (Past Peak)
Jan 10 14:22:08 2002
Name  Ipkts  Ierrs  Opkts  Oerrs  Collis  Crc  %Ipkts  %Opkts
qfe0      0      0      0      0       0    0    0.00    0.00
qfe1   1028      0   1105      0       0    0  100.00   51.52
qfe2      0      0    520      0       0    0    0.00   24.24
qfe3      0      0    520      0       0    0    0.00   24.24
23.70 (New Peak)    31.51 (Past Peak)
Several test TCP (TTCP) streams were pumped from one host to the other. When all links were up, the load was balanced evenly and each port carried a 25 percent share. When one link was cut, the traffic of the failed link (qfe0) was transferred onto one of the remaining links (qfe1), which then showed a 51 percent share. The failover took three seconds. However, if all links were heavily loaded, the algorithm might force one link to be saturated with its original load in addition to the failed link's traffic. For example, if all links were running at 55 percent capacity and one link failed, one link would be offered 55 percent + 55 percent = 110 percent of its capacity. Link aggregation is suitable for point-to-point links for increased availability, where nodes are on the same segment. However, there is a trade-off of port cost on the switch side as well as the host side.
Load-Sharing Principles
The trunking layer splits traffic on frame boundaries. This means that as long as the server and switch know that a trunk spans certain physical ports, neither side needs to know which algorithm the other uses to distribute the load across the trunked ports. What is important is to understand the traffic characteristics in order to distribute the load as evenly as possible across the trunked ports. The following figures describe how load sharing across trunks should be configured based on the nature of the traffic, which is often asymmetric.
FIGURE 6-5 shows an incorrect trunking policy on a switch. In this case, the ingress traffic has a single target IP address and a single target MAC address, so the switch should not use a trunking policy based solely on the destination IP address or on the destination or source MAC address.
FIGURE 6-7
FIGURE 6-7 shows an incorrect trunking policy on the server. In this example, the egress traffic has distributed target IP addresses but a single target MAC address: that of the default router. The trunking policy should not be based on the destination MAC address, because the destination MAC always points to the default router (:0:0:8:8:1), never to the actual client MACs. Nor should the policy use the source IP address or the source MAC address, which are the same for all egress packets. The trunking policy should use the target IP addresses, because that will spread the load evenly across the physical interfaces.
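To see why hashing on the distributed target IP addresses spreads the egress load while hashing on the single router MAC cannot, consider this toy distributor (the addresses and the four-port qfe trunk are hypothetical; real switches and the Sun Trunking software implement their own hash policies):

```shell
# Toy trunk distributor: pick the egress port from the last octet of the
# destination IP address. Distinct client IPs spread across the four
# trunked ports; a policy keyed on the single default-router MAC would
# always select the same port.
nports=4
for dst in 192.168.1.10 192.168.1.11 192.168.1.12 192.168.1.13; do
  last_octet=${dst##*.}
  port=$(( last_octet % nports ))
  echo "$dst -> qfe$port"
done
```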
FIGURE 6-8
Business Policy Switch 2000. These configurations illustrate how network high availability can be achieved without encountering the scalability issues that have plagued IPMP and VRRP deployments. SMLT is a Layer 2 trunking redundancy mechanism. It is similar to plain trunking except that it spans two physical devices. FIGURE 6-9 shows a typical SMLT deployment using two Nortel Networks Passport 8600 switches and a Sun server with dual GigaSwift cards. The trunk spans both cards, but each card is connected to a separate switch. SMLT technology, in effect, exposes one logical trunk to the Sun server, when actually there are two physically separate devices.
FIGURE 6-9  SMLT deployment: the server's ce0 and ce1 form one trunk, with each link terminating on a different switch; the switch pairs SW1/SW2 and SW3/SW4 are joined by inter-switch trunks IST12 and IST34.
FIGURE 6-10 shows another integration point where workgroup servers connect to the corporate network at an edge point. In this case, instead of integrating directly into the enterprise core, the servers connect to a smaller Layer 2 switch, which runs DMLT, a scaled-down version of SMLT that is similar in functionality. DMLT has fewer features and a smaller binary image than SMLT, so it can run on smaller network devices. The switches are viewed as one logical trunking device even though packets are load shared across the links, with the switches ensuring that packets arrive in order at the remote destination. FIGURE 6-10 illustrates a server-to-edge integration of a Layer 2 high-availability design using Sun Trunking 1.3 and Nortel Networks Business Policy 2000 Wiring Closet Edge Switches.
FIGURE 6-10  Server-to-edge integration: the server's ce0 and ce1 trunk to DMLT edge switches, with the switch pairs SW1/SW2 and SW3/SW4 joined by inter-switch trunks IST12 and IST34.
#
mlt 1 create
mlt 1 add ports 1/1,1/8
mlt 1 name "IST Trunk"
mlt 1 perform-tagging enable
mlt 1 ist create ip 10.19.10.2 vlan-id 10
mlt 1 ist enable
mlt 2 create
mlt 2 add ports 1/6
mlt 2 name "SMLT-1"
mlt 2 perform-tagging enable
mlt 2 smlt create smlt-id 1
#
Availability Issues
To better understand failure detection and recovery, a testbed was created, as shown in FIGURE 6-11.
FIGURE 6-11  STP testbed: the server (11.0.0.51) and client (16.0.0.51) connected through switches s48t, sw2, sw3, and sw4 on ports 7 and 8, with redundant links forming a loop controlled by STP.
The switches sw1, sw2, sw3, and sw4 were configured in a Layer 2 network with an obvious loop, which was controlled by running the STP among these switches. On the client, we ran the traceroute server command, resulting in the following output, which shows that the client sees only two Layer 3 networks: the 11.0.0.0 and the 16.0.0.0 network.
client># traceroute server
traceroute: Warning: Multiple interfaces found; using 16.0.0.51 @ hme0
traceroute to server (11.0.0.51), 30 hops max, 40 byte packets
 1 16.0.0.1 (16.0.0.1) 1.177 ms 0.524 ms 0.512 ms
 2 16.0.0.1 (16.0.0.1) 0.534 ms !N 0.535 ms !N 0.529 ms !N
Similarly, the server sees only two Layer 3 networks. We ran the traceroute client command on the server and got the following output:
server># traceroute client
traceroute: Warning: Multiple interfaces found; using 11.0.0.51 @ hme0
traceroute to client (16.0.0.51), 30 hops max, 40 byte packets
 1 11.0.0.1 (11.0.0.1) 0.756 ms 0.527 ms 0.514 ms
 2 11.0.0.1 (11.0.0.1) 0.557 ms !N 0.546 ms !N 0.531 ms !N
The following outputs show the STP configuration and port status of the participating switches, including the bridge MAC address of the designated root switch.
* sw1:17 # sh s0 ports 7-8
Stpd: s0   Port: 7   PortId: 4007   Stp: ENABLED   Path Cost: 4
Port State: FORWARDING   Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00   Designated Cost: 0
Designated Bridge: 80:00:00:01:30:92:3f:00   Designated Port Id: 4007

Stpd: s0   Port: 8   PortId: 4008   Stp: ENABLED   Path Cost: 4
Port State: FORWARDING   Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00   Designated Cost: 0
Designated Bridge: 80:00:00:01:30:92:3f:00   Designated Port Id: 4008
* sw2:12 # sh s0 ports 7-8
Port  Mode    State       Cost  Flags  Priority  Port ID  Designated Bridge
7     802.1D  FORWARDING  4     e-R--  16        16391    80:00:00:01:30:92:3f:00
8     802.1D  FORWARDING  4     e-D--  16        16392    80:00:00:01:30:92:3f:00
Total Ports: 8
Flags: e=Enable, d=Disable, T=Topology Change Ack
       R=Root Port, D=Designated Port, A=Alternative Port
* sw3:5 # sh s0 ports 7-8
Stpd: s0   Port: 7   PortId: 4007   Stp: ENABLED   Path Cost: 4
Port State: FORWARDING   Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00   Designated Cost: 0
Designated Bridge: 80:00:00:01:30:92:3f:00   Designated Port Id: 4001

Stpd: s0   Port: 8   PortId: 4008   Stp: ENABLED   Path Cost: 4
Port State: FORWARDING   Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00   Designated Cost: 4
Designated Bridge: 80:00:00:e0:2b:98:96:00   Designated Port Id: 4008
The following output shows that STP has blocked Port 8 on sw4.
* sw4:10 # sh s0 ports 7-8
Stpd: s0   Port: 7   PortId: 4007   Stp: ENABLED   Path Cost: 4
Port State: FORWARDING   Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00   Designated Cost: 4
Designated Bridge: 80:00:00:01:30:f4:16:a0   Designated Port Id: 4008

Stpd: s0   Port: 8   PortId: 4008   Stp: ENABLED   Path Cost: 4
Port State: BLOCKING   Topology Change Ack: FALSE
Port Priority: 16
Designated Root: 80:00:00:01:30:92:3f:00   Designated Cost: 4
Designated Bridge: 80:00:00:e0:2b:98:96:00   Designated Port Id: 4008
To get a better understanding of failure detection and fault recovery, we conducted a test where the client continually sent a ping to the server, and we pulled a cable on the spanning tree path.
The following output shows that it took approximately 58 seconds for failure detection and recovery, which is not acceptable in most mission-critical environments. (Each ping takes about one second. The following output shows that from icmp_seq=16 to icmp_seq=74, the pings did not succeed.)
on client
--------
64 bytes from server (11.0.0.51): icmp_seq=12. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=13. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=14. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=15. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=16. time=1. ms
ICMP Net Unreachable from gateway 16.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
...
...
ICMP Net Unreachable from gateway 16.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
ICMP Net Unreachable from gateway 16.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
64 bytes from server (11.0.0.51): icmp_seq=74. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=75. time=1. ms
64 bytes from server (11.0.0.51): icmp_seq=76.
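The roughly one-minute outage is consistent with classic 802.1D spanning tree behavior: after a failure, an alternate port must age out the stored topology information and then pass through the listening and learning states before it can forward. A quick estimate with the common default timers (assumed here, not read from these switches):

```shell
# Classic 802.1D reconvergence estimate using common default timers.
max_age=20         # seconds before stored BPDU information expires
forward_delay=15   # seconds spent in each of listening and learning
reconverge=$(( max_age + 2 * forward_delay ))
echo "worst-case 802.1D reconvergence: ${reconverge}s"
```

The additional seconds observed in the test come from detecting the failure and repopulating the forwarding tables.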
Layer 3 Strategies
There are several Layer 3 availability design options. Layer 3 availability designs are desirable because there could be a fault at the IP layer but not at the lower layers. By implementing a Layer 3 availability strategy, we can infer the status of the network at all layers below, though not at the layers above. Fault detection and recovery can be slower than with Layer 2 strategies, depending on the approach. In this section we explain the operation and recovery times for three approaches:
- VRRP and IPMP, proven to be very useful at the server-to-default-router connectivity segment of the data center network
- OSPF, a proven and effective link-state routing protocol, suitable for inter-switch connectivity
- RIP, a time-tested distance vector routing protocol, suitable for inter-switch connectivity
We describe how these network design strategies work and present actually tested configurations.
recovery (MTTR), it has an availability of 0.999989958. With two cards in parallel, availability rises to roughly nine 9s, at .9999999996. This small incremental cost has a big impact on the overall availability computation.
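The jump from five 9s to nine or so 9s falls directly out of the parallel-availability formula; a quick check using the single-card figure quoted above (the exact ninth-to-tenth digit depends on the underlying MTBF and MTTR values):

```shell
# Two independent cards: the system fails only if both fail at once.
single=0.999989958   # single-card availability, from the text
dual=$(awk -v a="$single" 'BEGIN { printf "%.10f", 1 - (1 - a) ^ 2 }')
echo "dual-card availability: $dual"
```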
FIGURE 6-12 shows the Sun server redundant NIC model using IPMP. The server has two NICs, ge0 and ge1, with fixed IP addresses of a.b.c.d and e.f.g.h. The virtual IP address w.x.y.z is the IP address of the service; client requests use this IP address as the destination. This IP address floats between the two interfaces ge0 and ge1, and only one interface can be associated with it at any one time. If the ge0 interface owns the virtual IP address, data traffic follows the P1 path. If the ge0 interface fails, the ge1 interface takes over the virtual IP address and data traffic follows the P2 path. Failures can be detected within two seconds, depending on the configuration.
FIGURE 6-12  Sun server redundant NIC model using IPMP: the virtual service address w.x.y.z floats between interfaces ge0 (path P1) and ge1 (path P2).
verify that a particular service is up and running. If it detects that the service has failed, then VRRP can be configured, on some switches, to take this into consideration and tie the failure into the election algorithm through the priority of the VRRP router. Simultaneously, the server also monitors links. Currently, IPMP consists of a daemon, in.mpathd, that constantly sends pings to the default router. As long as the default router can receive a ping, the master interface (ge0) retains ownership of the IP address. If the in.mpathd daemon detects that the default router is not reachable, an automatic failover occurs, which brings down the link and floats the IP address of the server to the surviving interface (ge1). In the lab, we were able to tune IPMP and the Extreme Standby Routing Protocol (ESRP) to achieve failure detection and recovery within one second. Because ESRP is a CPU-intensive task and its control packets share the production network, the trade-off is that if the switches, networks, or servers become overloaded, false failures can occur: a device can take longer than the strict timeout to respond to its peer's heartbeat.
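On the IPMP side, the probe-based failure detection target is configured in /etc/default/mpathd. A sketch of the relevant entries for an aggressive setting (the 1000-millisecond value is illustrative; very low values invite exactly the false-failure risk described above):

```shell
# /etc/default/mpathd fragment (illustrative)
FAILURE_DETECTION_TIME=1000   # target detection time, in milliseconds
FAILBACK=yes                  # return traffic when the failed NIC recovers
```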
FIGURE 6-13  Failure detection model: VRRP runs between the redundant routers, while the in.mpathd daemon on the server probes the default router through interfaces ge0 and ge1.
authentication, hierarchy, and load balancing; and checksum information. From this information, each node can reliably determine whether an LSP is the most recent by comparing sequence numbers, and it can compute the shortest path to every node by collecting the LSPs from all nodes and comparing costs using Dijkstra's shortest path algorithm. To prevent continuous flooding, the sender never receives the same LSP packet that it sent out. To better understand the suitability of OSPF from an availability perspective, the following lab network was set up, consisting of Extreme Networks switches and Sun servers. FIGURE 6-14 describes the actual setup used to demonstrate the availability characteristics of the interior routing protocol OSPF.
FIGURE 6-14  OSPF lab setup: switches s48t, sw1, and sw2 interconnected across the 12.0.0.0, 13.0.0.0, 17.0.0.0, and 18.0.0.0 networks.
To confirm correct configuration, traceroute commands were issued from client to server. In the following output, the highlighted lines show the path through sw2:
client># traceroute server
traceroute: Warning: Multiple interfaces found; using 16.0.0.51 @ hme0
traceroute to server (11.0.0.51), 30 hops max, 40 byte packets
 1 16.0.0.1 (16.0.0.1) 1.168 ms 0.661 ms 0.523 ms
 2 15.0.0.1 (15.0.0.1) 1.619 ms 1.104 ms 1.041 ms
 3 17.0.0.1 (17.0.0.1) 1.527 ms 1.197 ms 1.043 ms
 4 18.0.0.1 (18.0.0.1) 1.444 ms 1.208 ms 1.106 ms
 5 12.0.0.1 (12.0.0.1) 1.237 ms 1.274 ms 1.083 ms
 6 server (11.0.0.51) 0.390 ms 0.349 ms 0.340 ms
The following tables show the initial routing tables of the core routers. The first two highlighted lines in CODE EXAMPLE 6-3 show the route to the client through sw2. The second two highlighted lines show the sw2 path.
CODE EXAMPLE 6-3  Router sw1 Routing Table

Destination     Gateway    Mtr  Flags      Use   M-Use  VLAN     Acct-1
10.100.0.0/24   12.0.0.1   1    UG---S-um  63    0      net12    0
11.0.0.0/8      12.0.0.1   5    UG-----um  98    0      net12    0
12.0.0.0/8      12.0.0.2   1    U------u-  1057  0      net12    0
13.0.0.0/8      13.0.0.1   1    U------u-  40    0      net13    0
14.0.0.0/8      13.0.0.2   8    UG-----um  4     0      net13    0
15.0.0.0/8      18.0.0.2   12   UG-----um  0     0      net18    0
15.0.0.0/8      13.0.0.2   12   UG-----um  0     0      net13    0
16.0.0.0/8      18.0.0.2   13   UG-----um  0     0      net18    0
16.0.0.0/8      13.0.0.2   13   UG-----um  0     0      net13    0
17.0.0.0/8      18.0.0.2   8    UG-----um  0     0      net18    0
18.0.0.0/8      18.0.0.1   1    U------u-  495   0      net18    0
127.0.0.1/8     127.0.0.1  0    U-H----um  0     0      Default  0
Origin(OR): b - BlackHole, bg - BGP, be - EBGP, bi - IBGP, bo - BOOTP, ct - CBT,
  d - Direct, df - DownIF, dv - DVMRP, h - Hardcoded, i - ICMP,
  mo - MOSPF, o - OSPF, oa - OSPFIntra, or - OSPFInter, oe - OSPFAsExt,
  o1 - OSPFExt1, o2 - OSPFExt2, pd - PIM-DM, ps - PIM-SM, r - RIP,
  ra - RtAdvrt, s - Static, sv - SLB_VIP, un - UnKnown.
Flags: U - Up, G - Gateway, H - Host Route, D - Dynamic, R - Modified,
  S - Static, B - BlackHole, u - Unicast, m - Multicast.
Total number of routes = 12.
sw2:8 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   18.0.0.1   1    UG---S-um  27   0      net18    0
*oa  11.0.0.0/8      18.0.0.1   9    UG-----um  98   0      net18    0
*oa  12.0.0.0/8      18.0.0.1   8    UG-----um  0    0      net18    0
*oa  13.0.0.0/8      18.0.0.1   8    UG-----um  0    0      net18    0
*oa  14.0.0.0/8      17.0.0.2   8    UG-----um  0    0      net17    0
*oa  15.0.0.0/8      17.0.0.2   8    UG-----um  9    0      net17    0
*oa  16.0.0.0/8      17.0.0.2   9    UG-----um  0    0      net17    0
*d   17.0.0.0/8      17.0.0.1   1    U------u-  10   0      net17    0
*d   18.0.0.0/8      18.0.0.2   1    U------u-  403  0      net18    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
#
#
sw3:5 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   13.0.0.1   1    UG---S-um  26   0      net13    0
*oa  11.0.0.0/8      13.0.0.1   9    UG-----um  0    0      net13    0
*oa  12.0.0.0/8      13.0.0.1   8    UG-----um  121  0      net13    0
*d   13.0.0.0/8      13.0.0.2   1    U------u-  28   0      net13    0
*d   14.0.0.0/8      14.0.0.1   1    U------u-  20   0      net14    0
*oa  15.0.0.0/8      14.0.0.2   8    UG-----um  0    0      net14    0
*oa  16.0.0.0/8      14.0.0.2   9    UG-----um  0    0      net14    0
*oa  17.0.0.0/8      14.0.0.2   8    UG-----um  0    0      net14    0
*oa  18.0.0.0/8      13.0.0.1   8    UG-----um  0    0      net13    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
The first two highlighted lines in CODE EXAMPLE 6-6 show the route back to the server through sw4. The second two highlighted lines show the sw2 path.
CODE EXAMPLE 6-6
sw4:8 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   14.0.0.1   1    UG---S-um  29   0      net14    0
*oa  11.0.0.0/8      17.0.0.1   13   UG-----um  0    0      net17    0
*oa  11.0.0.0/8      14.0.0.1   13   UG-----um  0    0      net14    0
*oa  12.0.0.0/8      17.0.0.1   12   UG-----um  0    0      net17    0
*oa  12.0.0.0/8      14.0.0.1   12   UG-----um  0    0      net14    0
*oa  13.0.0.0/8      14.0.0.1   8    UG-----um  0    0      net14    0
*d   14.0.0.0/8      14.0.0.2   1    U------u-  12   0      net14    0
*d   15.0.0.0/8      15.0.0.1   1    U------u-  204  0      net15    0
*oa  16.0.0.0/8      15.0.0.2   5    UG-----um  0    0      net15    0
*d   17.0.0.0/8      17.0.0.2   1    U------u-  11   0      net17    0
*oa  18.0.0.0/8      17.0.0.1   8    UG-----um  0    0      net17    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
To check the failover capabilities of OSPF, the interface on switch sw2 was brought down to create a failure while a constant ping command was run from the client to the server. The measurement of failover is shown in the following output. The first highlighted line shows when the sw2 interface fails. The second highlighted line shows that the new route through switch sw3 is established within two seconds.
client reading:
64 bytes from server (11.0.0.51): icmp_seq=11. time=2. ms
64 bytes from server (11.0.0.51): icmp_seq=12. time=2. ms
ICMP Net Unreachable from gateway 17.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
ICMP Net Unreachable from gateway 17.0.0.1
 for icmp from client (16.0.0.51) to server (11.0.0.51)
64 bytes from server (11.0.0.51): icmp_seq=15. time=2. ms
64 bytes from server (11.
OSPF took approximately two seconds to detect and recover from the failed node.
The highlighted lines in the following output from the traceroute server command show the new path from the client to the server through the switch sw3.
client># traceroute server
traceroute: Warning: Multiple interfaces found; using 16.0.0.51 @ hme0
traceroute to server (11.0.0.51), 30 hops max, 40 byte packets
 1 16.0.0.1 (16.0.0.1) 0.699 ms 0.535 ms 0.581 ms
 2 15.0.0.1 (15.0.0.1) 1.481 ms 0.990 ms 0.986 ms
 3 14.0.0.1 (14.0.0.1) 1.214 ms 1.021 ms 1.002 ms
 4 13.0.0.1 (13.0.0.1) 1.322 ms 1.088 ms 1.100 ms
 5 12.0.0.1 (12.0.0.1) 1.245 ms 1.131 ms 1.220 ms
 6 server (11.0.0.51) 1.631 ms 1.200 ms 1.314 ms
The following code examples show the routing tables after the node failure. The first highlighted line in CODE EXAMPLE 6-7 shows the new route to the server through the switch sw3. The second highlighted line shows that the switch sw2 link is down.
CODE EXAMPLE 6-7
sw1:27 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use   M-Use  VLAN      Acct-1
*s   10.100.0.0/24   12.0.0.1   1    UG---S-um  63    0      net12     0
*oa  11.0.0.0/8      12.0.0.1   5    UG-----um  168   0      net12     0
*d   12.0.0.0/8      12.0.0.2   1    U------u-  1083  0      net12     0
*d   13.0.0.0/8      13.0.0.1   1    U------u-  41    0      net13     0
*oa  14.0.0.0/8      13.0.0.2   8    UG-----um  4     0      net13     0
*oa  15.0.0.0/8      13.0.0.2   12   UG-----um  0     0      net13     0
*oa  16.0.0.0/8      13.0.0.2   13   UG-----um  22    0      net13     0
*oa  17.0.0.0/8      13.0.0.2   12   UG-----um  0     0      net13     0
d    18.0.0.0/8      18.0.0.1   1    ---------  515   0      --------  0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0     0      Default   0
Switch sw2 Routing Table After Node Failure

sw1:4 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use   M-Use  VLAN      Acct-1
*s   10.100.0.0/24   12.0.0.1   1    UG---S-um  63    0      net12     0
*oa  11.0.0.0/8      12.0.0.1   5    UG-----um  168   0      net12     0
*d   12.0.0.0/8      12.0.0.2   1    U------u-  1102  0      net12     0
*d   13.0.0.0/8      13.0.0.1   1    U------u-  41    0      net13     0
*oa  14.0.0.0/8      13.0.0.2   8    UG-----um  4     0      net13     0
*oa  15.0.0.0/8      13.0.0.2   12   UG-----um  0     0      net13     0
*oa  16.0.0.0/8      13.0.0.2   13   UG-----um  22    0      net13     0
*oa  17.0.0.0/8      13.0.0.2   12   UG-----um  0     0      net13     0
d    18.0.0.0/8      18.0.0.1   1    ---------  515   0      --------  0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0     0      Default   0
sw3:6 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   13.0.0.1   1    UG---S-um  26   0      net13    0
*oa  11.0.0.0/8      13.0.0.1   9    UG-----um  24   0      net13    0
*oa  12.0.0.0/8      13.0.0.1   8    UG-----um  134  0      net13    0
*d   13.0.0.0/8      13.0.0.2   1    U------u-  29   0      net13    0
*d   14.0.0.0/8      14.0.0.1   1    U------u-  20   0      net14    0
*oa  15.0.0.0/8      14.0.0.2   8    UG-----um  0    0      net14    0
*oa  16.0.0.0/8      14.0.0.2   9    UG-----um  25   0      net14    0
*oa  17.0.0.0/8      14.0.0.2   8    UG-----um  0    0      net14    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
The highlighted line in CODE EXAMPLE 6-10 shows the new route back to the client through sw3.
CODE EXAMPLE 6-10
sw4:9 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   14.0.0.1   1    UG---S-um  29   0      net14    0
*oa  11.0.0.0/8      14.0.0.1   13   UG-----um  21   0      net14    0
*oa  12.0.0.0/8      14.0.0.1   12   UG-----um  0    0      net14    0
*oa  13.0.0.0/8      14.0.0.1   8    UG-----um  0    0      net14    0
*d   14.0.0.0/8      14.0.0.2   1    U------u-  12   0      net14    0
*d   15.0.0.0/8      15.0.0.1   1    U------u-  216  0      net15    0
*oa  16.0.0.0/8      15.0.0.2   5    UG-----um  70   0      net15    0
*d   17.0.0.0/8      17.0.0.2   1    U------u-  12   0      net17    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
OSPF is a good routing protocol for enterprise networks, offering fast failure detection and recovery.
FIGURE 6-15  RIP test network: server and client connected through s48t, sw1, sw3, and s48b across the 12.0.0.0 through 16.0.0.0 networks; if sw2 fails, the backup path through sw3 (14.0.0.0, 15.0.0.0) becomes the active route.
The following output shows the server-to-client path before node failure. The highlighted lines show the path through the switch sw3.
server># traceroute client
traceroute: Warning: Multiple interfaces found; using 11.0.0.51 @ hme0
traceroute to client (16.0.0.51), 30 hops max, 40 byte packets
 1 11.0.0.1 (11.0.0.1) 0.711 ms 0.524 ms 0.507 ms
 2 12.0.0.2 (12.0.0.2) 1.448 ms 0.919 ms 0.875 ms
 3 13.0.0.2 (13.0.0.2) 1.304 ms 0.977 ms 0.964 ms
 4 14.0.0.2 (14.0.0.2) 1.963 ms 1.091 ms 1.151 ms
 5 15.0.0.2 (15.0.0.2) 1.158 ms 1.059 ms 1.037 ms
 6 client (16.0.0.51) 1.560 ms 1.170 ms 1.107 ms
The following code examples show the initial routing tables. The highlighted line in CODE EXAMPLE 6-11 shows the path to the client through the switch sw3.
CODE EXAMPLE 6-11  Switch sw1 Initial Routing Table

OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   12.0.0.1   1    UG---S-um  32   0      net12    0
*r   11.0.0.0/8      12.0.0.1   2    UG-----um  15   0      net12    0
*d   12.0.0.0/8      12.0.0.2   1    U------u-  184  0      net12    0
*d   13.0.0.0/8      13.0.0.1   1    U------u-  52   0      net13    0
*r   14.0.0.0/8      13.0.0.2   2    UG-----um  1    0      net13    0
*r   15.0.0.0/8      18.0.0.2   3    UG-----um  0    0      net18    0
*r   16.0.0.0/8      13.0.0.2   4    UG-----um  10   0      net13    0
*r   17.0.0.0/8      18.0.0.2   2    UG-----um  0    0      net18    0
*d   18.0.0.0/8      18.0.0.1   1    U------u-  12   0      net18    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
sw2:3 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   18.0.0.1   1    UG---S-um  81   0      net18    0
*r   11.0.0.0/8      18.0.0.1   3    UG-----um  9    0      net18    0
*r   12.0.0.0/8      18.0.0.1   2    UG-----um  44   0      net18    0
*r   13.0.0.0/8      18.0.0.1   2    UG-----um  0    0      net18    0
*r   14.0.0.0/8      17.0.0.2   2    UG-----um  0    0      net17    0
*r   15.0.0.0/8      17.0.0.2   2    UG-----um  0    0      net17    0
*r   16.0.0.0/8      17.0.0.2   3    UG-----um  3    0      net17    0
*d   17.0.0.0/8      17.0.0.1   1    U------u-  17   0      net17    0
*d   18.0.0.0/8      18.0.0.2   1    U------u-  478  0      net18    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
#
#
Switch sw3 Initial Routing Table

sw3:3 # sh ipr
OR   Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s   10.100.0.0/24   13.0.0.1   1    UG---S-um  79   0      net13    0
*r   11.0.0.0/8      13.0.0.1   3    UG-----um  3    0      net13    0
*r   12.0.0.0/8      13.0.0.1   2    UG-----um  44   0      net13    0
*d   13.0.0.0/8      13.0.0.2   1    U------u-  85   0      net13    0
*d   14.0.0.0/8      14.0.0.1   1    U------u-  33   0      net14    0
*r   15.0.0.0/8      14.0.0.2   2    UG-----um  0    0      net14    0
*r   16.0.0.0/8      14.0.0.2   3    UG-----um  10   0      net14    0
*r   17.0.0.0/8      14.0.0.2   2    UG-----um  0    0      net14    0
*r   18.0.0.0/8      13.0.0.1   2    UG-----um  0    0      net13    0
*d   127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
The highlighted line in CODE EXAMPLE 6-14 shows the path to the server through the switch sw3.
CODE EXAMPLE 6-14  Switch sw4 Initial Routing Table

sw4:7 # sh ipr
OR  Destination     Gateway    Mtr  Flags      Use  M-Use  VLAN     Acct-1
*s  10.100.0.0/24   14.0.0.1   1    UG---S-um  29   0      net14    0
*r  11.0.0.0/8      14.0.0.1   4    UG-----um  9    0      net14    0
*r  12.0.0.0/8      14.0.0.1   3    UG-----um  0    0      net14    0
*r  13.0.0.0/8      14.0.0.1   2    UG-----um  0    0      net14    0
*d  14.0.0.0/8      14.0.0.2   1    U------u   13   0      net14    0
*d  15.0.0.0/8      15.0.0.1   1    U------u   310  0      net15    0
*r  16.0.0.0/8      15.0.0.2   2    UG-----um  16   0      net15    0
*d  17.0.0.0/8      17.0.0.2   1    U------u   3    0      net17    0
*r  18.0.0.0/8      17.0.0.1   2    UG-----um  0    0      net17    0
*d  127.0.0.1/8     127.0.0.1  0    U-H----um  0    0      Default  0
The highlighted lines in the following output from running the traceroute client command show the new path from the server to the client through the switch sw2 after the switch sw3 fails.
server># traceroute client
traceroute: Warning: Multiple interfaces found; using 11.0.0.51 @ hme0
traceroute to client (16.0.0.51), 30 hops max, 40 byte packets
 1  11.0.0.1 (11.0.0.1)  0.678 ms  0.479 ms  0.465 ms
 2  12.0.0.2 (12.0.0.2)  1.331 ms  0.899 ms  0.833 ms
 3  18.0.0.2 (18.0.0.2)  1.183 ms  0.966 ms  0.953 ms
 4  17.0.0.2 (17.0.0.2)  1.379 ms  1.082 ms  1.062 ms
 5  15.0.0.2 (15.0.0.2)  1.101 ms  1.024 ms  0.993 ms
 6  client (16.0.0.51)   1.209 ms  1.086 ms  1.074 ms
The following output shows the result of the server ping commands.
64 bytes from client (16.0.0.51): icmp_seq=18. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=19. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=20. time=2. ms
ICMP Net Unreachable from gateway 12.0.0.2
 for icmp from server (11.0.0.51) to client (16.0.0.51)
ICMP Net Unreachable from gateway 12.0.0.2
..
..
 for icmp from server (11.0.0.51) to client (16.0.0.51)
ICMP Net Unreachable from gateway 12.0.0.2
 for icmp from server (11.0.0.51) to client (16.0.0.51)
ICMP Net Unreachable from gateway 12.0.0.2
 for icmp from server (11.0.0.51) to client (16.0.0.51)
64 bytes from client (16.0.0.51): icmp_seq=41. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=42. time=2. ms
64 bytes from client (16.0.0.51): icmp_seq=43. time=2. ms
Fault detection and recovery took more than 21 seconds. RIPv2 is widely available; however, its failure detection and recovery times are not optimal.
Link aggregation is suitable for increasing bandwidth capacity and availability on point-to-point links only. Layer 2 availability designs using Sun Trunking 1.3 and Split MultiLink Trunking, available on Nortel Networks Passport 8600 switches, were configured and tested. Distributed MultiLink Trunking was also configured and tested using Nortel's smaller Layer 2 Business Policy switches. Both switches provided rapid failure detection and failover recovery within two to five seconds. A further benefit of this approach is that failure and recovery events were transparent to the IP layer.
Spanning Tree Protocol is not suitable because its failure detection and recovery are slow. A recent improvement, IEEE 802.1w Rapid Spanning Tree, designed to address these limitations, might be worth considering in the future. Layer 3 availability designs using VRRP and IPMP offer an alternative availability strategy for the server-to-network connection. This approach provides rapid failure detection and recovery and is economically feasible when the increased MTBF is taken into account. Be sure to investigate the processing capabilities of the switch's control processor and consult the vendor about the impact of the additional load from the ICMP echo requests that IPMP probes generate.
CHAPTER 7

Reference Design Implementations

- Server Load Balancing: how to achieve increased availability and performance through redundancy of stateless applications
- Layer 7 Switching: how to decouple internal applications from external references
- Network Address Translation: how to decouple internal IP addresses from external references
- Cookie Persistence: how to achieve stateful transactions over a stateless protocol
- Secure Sockets Layer (SSL): how to achieve secure transactions over a public network
- IPMP: how to achieve network interface redundancy on servers that is transparent to applications
- VRRP: how to achieve router redundancy
The chapter then describes the logical network architecture and various physical realizations. Most important, it describes actual tested network reference implementations. It first describes the original secure multi-tier architecture and its limitations. Then it describes a second architecture based on many small multi-layer and simple Layer 2 switches and their limitations. Finally, it describes in detail a collapsed network architecture based on large chassis-based switches. It is important to note that these designs are vendor independent and could have been realized by Cisco, Nortel, and other similar vendors or combinations thereof. Network Equipment Providers usually implement standard Layer 2 and Layer 3 functions using ASICs and there are few differences in their basic implementations. However, additional features such as load balancing can differentiate vendors significantly in how their products actually impact the network architecture. We explore two vendors and describe reference implementations that were configured
and tested. We then describe where it makes sense to use each design. We also discuss how to create virtual firewalls between tiers to increase the level of security without sacrificing wirespeed performance. In particular, we describe the tested configuration of Netscreen firewall and show how one box can be configured to create virtual firewalls, segregating and filtering inter-tier network traffic.
FIGURE 7-1  Logical network architecture: external network (192.168.10.0); production networks for Web services (10.10.0.0), naming services (10.20.0.0), and application services (10.30.0.0); and a management network.
Chapter 7
IP Services
The following subsections provide a description of some emerging IP services that are often an important component in a complete network design for a Sun ONE deployment. The IP services are divided into two categories:
- Stateful Session Based: This class of IP services requires that the switch maintain session state information so that a particular client's session state is preserved across all packets. This requirement has severe implications for highly available solutions and limits scalability and performance.
- Stateless Session Based: This class of IP services does not require that the switch maintain any state information associated with a particular flow.
Many functions can be implemented either by network switches and appliances or by the Sun ONE software stack. This section describes how these new IP services work and the benefits they provide. It then discusses availability strategies. Later sections describe similar functions that are included in the Sun ONE integrated stack. Modern multilayer network switches perform many Layer 3 IP services in addition to vanilla routing. These services are implemented as functions that operate on a packet by modifying the packet headers and controlling the rate at which the packet is forwarded. IP services include functions such as QoS, server load balancing, application redirection, network address translation, and others. This section begins our discussion with an important service for data centers, server load balancing, and then describes adjacent services that can be cascaded.
FIGURE 7-2  Cascaded IP service functions. Packet A from the client is destined for VIP1 (a.b.c.d, port 123), requesting http://www.a.com/index.html. Functions such as URL string match, HTTP header inspection, SSL session ID, cookie, and cache can intercept it before the SLB function, which selects a server by round robin, least connections, or a custom algorithm. The first function rewrites the source IP and port to the switch's own e.f.g.h:456 (so that the real server replies to the switch) and sets the destination to VIP2, the SLB function, yielding packet A1 (srcIP e.f.g.h, srcPort 456, dstIP VIP2). The SLB function then finds the best server and rewrites the destination IP to the chosen real server's address, yielding packet A2 (srcIP e.f.g.h, srcPort 456, dstIP RealIP).
FIGURE 7-2 shows that a typical client request is destined for an external VIP with IP address a.b.c.d and port 123. Various functions, as shown, can intercept this request and rewrite it according to the provisioned configuration rules. The SLB algorithm eventually intercepts the packet and rewrites the destination IP address to that of the real server chosen by the configured algorithm. The reply is then returned as indicated by the source IP address.
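The server-selection and rewrite step performed by the SLB function can be sketched as follows. This is a minimal illustration, not a vendor implementation; the class name, server addresses, and packet representation are all assumed for the example:

```python
from itertools import cycle

class LoadBalancer:
    """Sketch of an SLB selection step: pick a real server for a packet
    addressed to the VIP, then rewrite only the destination IP."""

    def __init__(self, real_servers):
        self.real_servers = list(real_servers)
        self._rr = cycle(self.real_servers)          # round-robin rotation
        self.active = {ip: 0 for ip in self.real_servers}  # open connections

    def pick_round_robin(self):
        # Each new flow goes to the next server in a fixed rotation.
        return next(self._rr)

    def pick_least_connections(self):
        # Each new flow goes to the server with the fewest active connections.
        return min(self.real_servers, key=lambda ip: self.active[ip])

    def forward(self, packet, algorithm="round-robin"):
        real_ip = (self.pick_round_robin() if algorithm == "round-robin"
                   else self.pick_least_connections())
        self.active[real_ip] += 1
        # Rewrite only the destination; the source was already rewritten
        # upstream so the server replies to the switch.
        return dict(packet, dstIP=real_ip)

lb = LoadBalancer(["10.10.0.100", "10.10.0.101"])
pkt = {"srcIP": "e.f.g.h", "srcPort": 456, "dstIP": "VIP2"}
first = lb.forward(pkt)    # goes to 10.10.0.100
second = lb.forward(pkt)   # goes to 10.10.0.101
```

A custom algorithm would simply replace the `pick_*` policy while leaving the header rewrite unchanged.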
Application redirection provides these capabilities:
- Reduces the load on one set of Web servers by redirecting it to another set, usually cache servers for specific content
- Intercepts client requests and redirects them to another destination, for control of certain types of traffic based on filter criteria
FIGURE 7-3 illustrates the functional model of application redirection, which only rewrites the IP header.
FIGURE 7-3  Application redirection: traffic destined to server group 1 (DEST = A) is redirected to server group 2 (DEST = B).
FIGURE 7-4  URL-based content switching. URLs map to server groups as follows:
http://www.a.com/SMA/stata/index.html   -> servergroup1 (stata)
http://www.a.com/SMA/dnsa/index.html    -> servergroup2 (dnsa)
http://www.a.com/SMB/statb/index.html   -> servergroup3 (statb)
http://www.a.com/SMB/CACHEB/index.html  -> servergroup4 (cacheb)
http://www.a.com/SMB/DYNA/index.html    -> servergroup5 (dynab)
Content switching provides these benefits:
- Isolates internal IP addresses from exposure to the public Internet.
- Allows reuse of a single IP address. For example, clients can send their Web requests to www.a.com or www.b.com, where DNS maps both domains to a single IP address. The proxy switch receives the request with a packet whose HTTP header contains the target domain, for example a.com or b.com, and decides which group of servers should receive the request.
- Allows parallel fetching of different parts of Web pages from servers optimized and tuned for that type of data. For example, a complex Web page might need GIFs, dynamic content, or cached content. With content switching, one set of Web servers can hold the GIFs and another can hold the dynamic content. The proxy switch can make parallel fetches and retrieve the entire page faster than would otherwise be possible.
- Ensures that requests with cookies or SSL session IDs are redirected to the same server to take advantage of persistence.
FIGURE 7-4 shows that the client's socket connection is terminated by the proxy function. The proxy retrieves as much of the URL as needed to make a decision based on the retrieved URL. FIGURE 7-4 shows various URLs mapped to various
server groups, which are VIP addresses. The next step is to forward the URL directly or pass it off to the SLB function that is waiting for traffic destined to the server group. The proxy is configured with a VIP, so the switch forwards all client requests destined to this VIP to the proxy function. The proxy function rewrites the IP header, particularly the source IP and port, so that the server sends back the requested data to the proxy, not the client directly.
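The proxy's URL-to-server-group decision can be sketched as a prefix match over provisioned rules. The rule table below is hypothetical, loosely modeled on the FIGURE 7-4 mappings:

```python
# Hypothetical provisioned rules: URL prefix -> server group (VIP).
RULES = [
    ("/SMA/stata/", "servergroup1"),
    ("/SMA/dnsa/", "servergroup2"),
    ("/SMB/statb/", "servergroup3"),
    ("/SMB/cacheb/", "servergroup4"),
    ("/SMB/dyna/", "servergroup5"),
]

def select_server_group(url_path, default="servergroup1"):
    """First matching prefix wins; unmatched URLs fall through to a default
    group. Matching is case-insensitive, as URLs in the figure vary in case."""
    for prefix, group in RULES:
        if url_path.lower().startswith(prefix.lower()):
            return group
    return default

group = select_server_group("/SMA/dnsa/index.html")   # servergroup2
```

A real switch applies such rules only after terminating the TCP connection, since the URL is not visible until the HTTP request arrives.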
NAT provides these benefits:
- Security: Prevents exposing internal private IP addresses to the public.
- IP Address Conservation: Requires only one valid exposed IP address to fetch Internet traffic for internal networks with non-valid IP addresses.
- Redirection: Intercepts traffic destined to one set of servers and redirects it to another by rewriting the destination IP and MAC addresses. With half-NAT translated traffic, the redirected servers can send replies directly back to the clients because the original source IP address has not been rewritten.
NAT is configured with a set of filters, usually a 5-tuple Layer 3 rule. If the incoming traffic matches a certain filter rule, the packet IP header is rewritten or another socket connection is initiated to the target server, which itself can be changed, depending on the rule.
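The filter match and half-NAT rewrite described above can be sketched as follows, assuming a simple packet record; field names and addresses are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str

def matches(rule, pkt):
    """A 5-tuple rule (src_ip, src_port, dst_ip, dst_port, proto);
    None acts as a wildcard for that field."""
    fields = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.proto)
    return all(want is None or want == got for want, got in zip(rule, fields))

def half_nat(pkt, new_dst_ip):
    # Half NAT: only the destination is rewritten, so the real server can
    # reply straight to the original client (the source is untouched).
    return Packet(pkt.src_ip, pkt.src_port, new_dst_ip, pkt.dst_port, pkt.proto)

rule = (None, None, "192.168.0.100", 80, "tcp")          # illustrative rule
pkt = Packet("11.0.0.51", 40000, "192.168.0.100", 80, "tcp")
out = half_nat(pkt, "10.10.0.100") if matches(rule, pkt) else pkt
```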
the network switch. These advanced products are just emerging from startups such as Wincom Systems. This section discusses the switch and appliance interactions. A later section covers the server SSL implementation.
FIGURE 7-5 shows that once a client makes initial contact with a particular server, which may have been selected by SLB, the switch ensures that subsequent requests are forwarded to the same SSL server based on the SSL session ID that the switch stored during the initial SSL handshake. The switch keeps state information about the client's initial request, based on HTTPS and port 443, which contains a hello message. This first request is forwarded to the server selected by the SLB algorithm or by another function. The server responds to the client's hello message with an SSL session ID. The switch intercepts this SSL session ID and stores it in a table. The switch then forwards all of the client's subsequent requests to the same server as long as each request carries that SSL session ID. FIGURE 7-5 also shows that several different TCP socket connections may span the same SSL session; state is maintained by the SSL session ID in each request sent by the same client.
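The switch's persistence behavior described above can be sketched as a table keyed by SSL session ID; the class and server names are assumptions for illustration:

```python
class SSLPersistenceTable:
    """Sketch of switch-side SSL persistence: the first request for a session
    is placed by the SLB pick; once the session ID is known, every later
    request carrying that ID is pinned to the same server."""

    def __init__(self, servers):
        self.servers = servers
        self.table = {}      # SSL session ID -> real server
        self._next = 0       # simple round-robin SLB stand-in

    def _slb_pick(self):
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        return server

    def forward(self, ssl_session_id=None):
        if ssl_session_id in self.table:
            return self.table[ssl_session_id]   # sticky: same server
        server = self._slb_pick()
        if ssl_session_id is not None:
            self.table[ssl_session_id] = server
        return server

sw = SSLPersistenceTable(["ssl1", "ssl2"])
first = sw.forward("abc123")   # new session: SLB picks ssl1 and records it
again = sw.forward("abc123")   # later TCP connection, same session: ssl1
```

In a real switch the session ID is learned from the server's handshake reply rather than supplied with the first request; the table lookup itself behaves the same way.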
FIGURE 7-5  SSL session ID persistence: the switch stores the SSL session ID and switches the client to the same SSL server (SSL Server 1 or SSL Server 2).
An appliance can be added for increased performance in terms of SSL handshakes and bulk encryption throughput. FIGURE 7-6 illustrates how an SSL appliance might be deployed. Client requests first come in on a specific URL with the HTTPS protocol on port 443. The switch recognizes that these requests must be directed to the appliance, which is configured to provide that SSL service. A typical appliance such as the NetScaler can also be configured, in addition to SSL acceleration, to provide content switching and load balancing. The appliance then reads or inserts cookies and resubmits the HTTP request to an appropriate server, which can maintain state based on the cookie in the HTTP header.
FIGURE 7-6  SSL accelerator appliance. HTTPS traffic from clients on the Internet reaches a multilayer switch, which directs it to the SSL accelerator appliance for key exchange and bulk encryption; decrypted HTTP traffic then passes to a load balancer that maintains session persistence based on cookies.
FIGURE 7-7  Network tier Layer 3 redundancy between switches SW1 and SW2: session-based services require session sharing, while stateless services fail over with no problem.
each tier. Also shown are the availability strategies for the Network and Web tier. External tier availability strategies are outside the scope of this book. We will limit our discussion to the services tiers, which include Web, Application Services, Naming, and so on. Designing network architectures for optimal availability requires maximizing two orthogonal components:
- Intra Availability: Refers to maximizing the function that estimates availability from the failure probabilities of the components themselves (the FAvailability function); only failures of the components themselves are considered.
- Inter Availability: Refers to minimizing the impact of failures caused by factors external to the system, such as single points of failure (SPOFs), power outages, or a technician accidentally pulling out a cable.
It is not sufficient to simply maximize the FAvailability function. The SPOF and environmental factors also must be considered. The networks designed in this chapter describe a highly available architecture that conforms to these design principles and is described in further detail later.
FIGURE 7-8  Logical network architecture (repeated from FIGURE 7-1): external network (192.168.10.0); Web service network (10.10.0.0); naming services network (10.20.0.0); application services network (10.30.0.0); and a management network.
FIGURE 7-8 is repeated here to simplify a detailed discussion. The diagram shows an overview of the logical network architecture, showing how the tiers map to the different networks, which are also mapped to segregated VLANs. This segregation allows inter-tier traffic to be controlled by filters on the switch or a firewall, which is the only bridge point between VLANs. The following describes each subnetwork:
- External network: The external-facing network that directly connects to the Internet. All IP addresses must be registered and should be secured with a firewall. The following networks are assigned non-routable IP addresses based on RFC 1918, which can be drawn from these ranges:
  10.0.0.0 - 10.255.255.255 (10/8 prefix)
  172.16.0.0 - 172.31.255.255 (172.16/12 prefix)
  192.168.0.0 - 192.168.255.255 (192.168/16 prefix)
- Web services network: A dedicated network that contains Web servers. Typical configurations include a load-balancing switch, which can be configured either to allow the Web server to answer the client's HTTP request directly or to require the load-balancing device to return the reply on behalf of the Web server.
- Naming services network: A dedicated network of servers that provide LDAP, DNS, NIS, and other naming services. The services are for internal use only and should be highly secure. Internal infrastructure support services must ensure that requests originate from and are destined to internal servers. Most requests tend to be read intensive, hence the potential for caching strategies to increase performance.
- Management network: A dedicated service network that provides management and configuration of all servers, including JumpStart installation of new systems.
- Backup network: A dedicated service network that provides backup and restore operations, pivotal to minimizing disturbances to other production service networks during backup and other network bandwidth-intensive operations.
- Device network: A dedicated network that attaches IP storage devices and other devices.
- Application services network: A dedicated network that typically consists of large multi-CPU servers hosting multiple instances of the Sun ONE Application Server software image. These requests tend to require little network bandwidth but may span multiple protocols, including HTTP, CORBA, proprietary TCP, and UDP.
The network traffic can also be significant when Sun ONE Application server clustering is enabled. Every update to a stateful session bean triggers a multicast update to all servers on this dedicated network so that participating cluster nodes update the appropriate stateful session bean. Network utilization increases in direct proportion to the intensity of session bean updates.
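The stated proportionality between update intensity and network utilization can be expressed as a simple estimate. The function and all figures below are illustrative assumptions, not measurements:

```python
def replication_bandwidth(updates_per_sec, bean_size_bytes):
    """Rough estimate of multicast session-bean replication traffic on the
    dedicated application-services network, in bits per second. With
    multicast, each update is transmitted once regardless of cluster size,
    so utilization grows linearly with the update rate."""
    return updates_per_sec * bean_size_bytes * 8

# Hypothetical workload: 1000 stateful-session-bean updates/s of 2 KB each.
load = replication_bandwidth(1000, 2048)
```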
- Database network: A dedicated network that typically consists of one or two multi-CPU database servers. The network traffic typically consists of Java DataBase Connectivity (JDBC) traffic between the database servers and the application server or the Web server.
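The RFC 1918 ranges quoted above can be checked programmatically; a minimal sketch using Python's standard ipaddress module:

```python
import ipaddress

# The three RFC 1918 private (non-routable) ranges listed in the text.
RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_private(addr):
    """Return True if addr falls in an RFC 1918 range and therefore must not
    appear on the registered external network."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in RFC1918)

assert is_private("10.10.0.100")         # Web services network
assert is_private("192.168.10.3")        # edge network in the figures
assert not is_private("129.146.138.10")  # registered, externally routable
```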
FIGURE 7-9  Redundant server connection: the client network connects through an edge switch to a master (Layer 3) switch and a standby (Layer 3) switch, each feeding a Layer 2 switch, with the Sun server attached through its network interfaces.
The design shown in FIGURE 7-10 provides the same network functionality but eliminates the need for two Layer 2 devices. This is accomplished with a tagged VLAN interconnect between the two core switches. Collapsing the Layer 2 functionality reduces the number of network devices, providing fewer units that might fail, lower cost, and fewer manageability issues.
FIGURE 7-10  Collapsed design: the client network connects through an edge switch to the two core switches, with the Layer 2 function carried on a tagged VLAN interconnect between them.
FIGURE 7-11  Logical network between clients and servers: router interfaces 172.16.0.1 and 10.50.0.1 face the clients, with 10.10.0.1, 10.20.0.1, 10.30.0.1, 10.40.0.1, and 10.50.0.1 each configured on both core switches facing the servers.
TABLE 7-1 summarizes the eight separate networks and associated VLANs.
TABLE 7-1 networks:
- Client load generation
- Connection of the client network to the data center
- Web services
- Directory services
- Database services
- Application services
- DNS services
- Management and administration
The edge network connects to the internal network in a redundant manner. One of the core switches owns the 192.168.0.2 IP address, which means that switch is the master and the other is in slave mode. While a switch is in slave mode, it does not respond to any traffic, including ARP requests. The master also assumes ownership of the MAC address that floats along with the virtual IP address of 192.168.0.2.
Note - If you have multiple NICs, make sure each NIC uses its own unique MAC address.

Each switch is configured with the identical networks and associated VLANs, as shown in TABLE 7-1. An interconnect between the switches extends each VLAN but is tagged to allow multiple VLANs' traffic to share a physical link (this requires a network interface, such as the Sun ge, that supports tagged VLANs). The Sun servers connect to both switches in the appropriate slot, where only one of the two interfaces is active. Although most switches support Routing Information Protocol (RIP and RIPv2), Open Shortest Path First (OSPF), and Border Gateway Protocol v4 (BGP4), static routes provide a more secure environment. A redundancy protocol based on the Virtual Router Redundancy Protocol (VRRP, RFC 2338) runs between the virtual routers. The MAC address of the virtual router floats among the active virtual routers so that the ARP caches of the servers do not need any updates when a failover occurs.
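The master-election rule that VRRP applies can be sketched as follows; the router names, priorities, and addresses are illustrative, and the tie-break mimics RFC 2338's higher-primary-address rule:

```python
def elect_master(routers):
    """Sketch of VRRP master election (RFC 2338): among reachable routers,
    the one with the highest priority becomes master and answers for the
    virtual IP and its floating MAC address."""
    alive = [r for r in routers if r["up"]]
    # Highest priority wins; equal priorities are broken by the higher
    # primary IP address, mimicked here by comparing the address strings.
    return max(alive, key=lambda r: (r["priority"], r["ip"]))

routers = [
    {"name": "mls1", "ip": "172.0.0.70", "priority": 200, "up": True},
    {"name": "mls2", "ip": "172.0.0.71", "priority": 100, "up": True},
]
master = elect_master(routers)   # mls1 owns the virtual IP
routers[0]["up"] = False         # mls1 fails
backup = elect_master(routers)   # mls2 takes over; server ARP caches need
                                 # no update because the MAC floats with it
```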
FIGURE 7-12  Logical Network: the numbered request flow among the clients, switching services, Web services, directory services, application services, and database services tiers.
The request flow through the tiers proceeds as follows:
- Client initiates a Web request. Client communication can be HTTP or HTTPS (HTTP with Secure Sockets Layer). HTTPS can be terminated at the switch or at the Web server.
- Switch redirects the client request to the appropriate Web server.
- The Web server redirects the request to the application server for processing. Communication passes through a Web server plug-in over a proprietary TCP-based protocol.
- The Java 2 Enterprise Edition (J2EE) application hosted by the application server identifies the requested process as requiring specific authorization. It sends a request to the directory server to verify that the user has valid authorization.
- The directory server verifies the authorization through the user's LDAP role. The validated response is returned to the application server.
- The application server then processes the business logic represented in the J2EE application. The business logic requests data from a database as input for processing. The requests may come from servlets, Java Data Objects, or Enterprise JavaBeans (EJBs) that in turn use Java DataBase Connectivity (JDBC) to access the database. The JDBC request can contain any valid SQL statement.
- The database processes the request natively and returns the appropriate result through JDBC to the application server.
- The J2EE application completes the business logic processing, packages the data for display (usually through a JSP that renders HTML), and returns the response to the Web server.
- Switch receives the reply from the Web server.
- Switch rewrites the IP header and returns the reply to the client.
Secure Multi-Tier
FIGURE 7-13 shows the overall structure of a classic multi-tier design.
FIGURE 7-13  Secure Multi-Tier: redundant pairs of Web, App, and DB servers, with each tier reachable only through the tier above it.
The advantages of this approach are simplicity and security. Clearly the only way to access the Data tier is through the application servers. There are no other possible network paths to access the Data tier. The drawbacks are limited flexibility and manageability. If an application running on the Web server needs to connect to an
LDAP server or a database through a JDBC connection, a fundamental change to the architecture will be needed. As the number of tiers increases, so does the number of switches, which becomes a management issue.
FIGURE 7-14  Multi-tier design built from many small multilayer switches, each fronting pairs of Layer 2 switches.
This approach has few advantages and many disadvantages. One advantage is that the entry cost is low. One can start with a very small deployment, procuring small eight-port multilayer switches and Layer 2 switches, and increase the tiers and servers until the ingress links become a bottleneck or the port density of the small multilayer switches becomes an issue. The actual tested configurations used Alteon 180 switches as the multilayer switches and Extreme Networks Summit 48i switches, which had Gigabit uplinks and 10/100 ports for server connections, as the Layer 2 switches. This architecture has the following disadvantages:
- Lower Availability: Because of the number of links and devices, more things can go wrong. In particular, the serial connections drastically reduce the MTBF. The links are often prone to accidents and should be kept to a minimum. Because of the layered architecture, link failure detection and recovery are much slower.
- Waste: In any network architecture, stateless functionality should be deployed toward the center of the network and complex processing at the outermost edge. Having two layers of multilayer switches is a tremendous waste in terms of packet processing and equipment cost. When a packet undergoes Layer 7 processing, especially in software, it is extremely slow. A multilayer switch also costs much more than a plain Layer 2 or Layer 3 device.
- Manageability: As the number of switches increases, so does the manageability workload.
FIGURE 7-15  Tested configuration: clients connect through an L2-L3 edge switch to redundant Extreme core switches (192.168.10.3) routing 10.10.0.1, 10.20.0.1, 10.30.0.1, 10.40.0.1, and 10.50.0.1. The Web service tier and the directory service tier each consist of four Sun Fire 280R servers (10.10.0.100 through 10.10.0.103 and 10.20.0.100 through 10.20.0.103), backed by T3 storage.
FIGURE 7-16  Configuration with server load-balancer switches: clients connect through 192.168.10.1 to an active core and a standby core (192.168.10.3) routing 10.10.0.1 through 10.50.0.1; the Web service tier (Sun Fire 280R servers 10.10.0.100 through 10.10.0.103) and the directory service tier (Sun Fire 280R servers 10.20.0.100 through 10.20.0.103) are backed by T3 storage.
Physical Network Connectivity

The physical wiring of the architecture is shown in FIGURE 7-17 and described in TABLE 7-3.
TABLE 7-3  Physical Network Connectivity
- edge: ports 1,2,3,4 (ge), 172.16.0.1, netmask 255.255.255.0
- edge: ports 5,6 (ge), 192.168.10.1, netmask 255.255.255.0
- mls1: ge ports for the external network, the Web/app service router, the directory service router, and the database services router
- mls2: ge ports for the external network, the Web/app service router, the directory services router, and the database services router
FIGURE 7-17  Physical wiring. Clients client1 (ge0: 172.16.0.101/24, hme0: 10.100.16.101) and client2 (ge0: 172.16.0.102/24, hme0: 10.100.16.102) attach to the edge switch (172.16.0.1/24). Core switches mls1 and mls2 each carry 192.168.0.1/24 or 192.168.0.2/24, 10.10.0.1/24, 10.20.0.1/24, and 10.30.0.1/24 interfaces. Application servers app1 and app2 dual-home on 10.40.0.101 through 10.40.0.106/24, directory servers ds1 and ds2 on 10.20.0.101 through 10.20.0.104/24, and database servers db1 and db2 on 10.30.0.101 through 10.30.0.104/24, each with an hme0 management interface on the 10.100.x.x networks.
Switch Configuration
A high-level overview of the switch configuration is shown in FIGURE 7-18.
FIGURE 7-18  Switch configuration overview: both core switches present the edge address 192.168.0.2.
Note - Network equipment from Foundry Networks can be used instead. See "Configuring the Foundry Networks Switches" on page 324.
2. Configure the edge switch. The following example shows an excerpt of the switch configuration file.
#
# Summit7i Configuration generated Mon Dec 10 14:39:46 2001
# Software Version 6.1.9 (Build 11) By Release_Master on 08/30/01 11:34:27
configure dot1q ethertype 8100
configure dot1p type dot1p_priority 0 qosprofile QP1
....................................................
enable system-watchdog
config qosprofile QP1 minbw 0% maxbw 100% priority Low minbuf 0% maxbuf 0 K
....................................................
delete protocol ip
delete protocol ipx
delete protocol netbios
delete protocol decnet
delete protocol appletalk
....................................................
# Config information for VLAN Default.
config vlan Default tag 1 # VLAN-ID=0x1 Global Tag 1
config vlan Default protocol ANY
config vlan Default qosprofile QP1
enable bootp vlan Default
....................................................
FIGURE 7-19  Modular configuration: clients connect to Web service, directory service, application service, and database service modules, each attached to its own groups of servers.
module 1 bi-jc-8-port-gig-m4-management-module
module 3 bi-jc-48e-port-100-module
!
global-protocol-vlan
!
vlan 1 name DEFAULT-VLAN by port
vlan 10 name refarch by port
 untagged ethe 1/1 ethe 3/1 to 3/16
 router-interface ve 10
vlan 99 name mgmt by port
 untagged ethe 3/47 to 3/48
 router-interface ve 99
!
hostname MLS0
ip default-network 129.146.138.0/16
ip route 192.168.0.0 255.255.255.0 172.0.0.1
ip route 129.148.181.0 255.255.255.0 129.146.138.1
ip route 0.0.0.0 0.0.0.0 129.146.138.1
!
router vrrp-extended
interface ve 10
 ip address 20.20.0.102 255.255.255.0
 ip address 172.0.0.70 255.255.255.0
 ip vrrp-extended vrid 1
  backup priority 100 track-priority 20
  advertise backup
  ip-address 172.0.0.10
  dead-interval 1
  track-port e 3/1
  enable
 ip vrrp-extended vrid 2
  backup priority 100 track-priority 20
  advertise backup
  ip-address 20.20.0.100
  dead-interval 1
  track-port e 3/13
  enable
!
interface ve 99
 ip address 129.146.138.10 255.255.255.0
end
ver 07.5.05cT53
!
module 1 bi-jc-8-port-gig-m4-management-module
module 3 bi-jc-48e-port-100-module
!
global-protocol-vlan
!
vlan 1 name DEFAULT-VLAN by port
!
vlan 99 name swan by port
 untagged ethe 1/6 to 1/8
 router-interface ve 99
!
vlan 10 name refarch by port
 untagged ethe 3/1 to 3/16
 router-interface ve 10
!
hostname MLS1
ip default-network 129.146.138.0/16
ip route 192.168.0.0 255.255.255.0 172.0.0.1
ip route 0.0.0.0 0.0.0.0 129.146.138.1
!
router vrrp-extended
interface ve 10
 ip address 20.20.0.102 255.255.255.0
 ip address 172.0.0.71 255.255.255.0
 ip vrrp-extended vrid 1
  backup priority 100 track-priority 20
  advertise backup
  ip-address 172.0.0.10
  dead-interval 1
  track-port e 3/1
  enable
 ip vrrp-extended vrid 2
  backup priority 100 track-priority 20
  advertise backup
  ip-address 20.20.0.100
  dead-interval 1
  track-port e 3/13
  enable
!
interface ve 99
 ip address 129.146.138.11 255.255.255.0
!
sflow sample 512
sflow source ethernet 3/1
sflow enable
!
end
ver 07.3.05T12
global-protocol-vlan
!
server source-ip 20.20.0.50 255.255.255.0 172.0.0.10
!
server real web1 10.20.0.1
 port http
 port http url "HEAD /"
!
server real web2 10.20.0.2
 port http
 port http url "HEAD /"
!
server virtual WebVip1 192.168.0.100
 port http
 port http dsr
 bind http web1 http web2 http
!
vlan 1 name DEFAULT-VLAN by port
328 Networking Concepts and Technology: A Designers Resource
ver 07.3.05T12
no spanning-tree
!
hostname SLB0
ip address 192.168.0.111 255.255.255.0
ip default-gateway 192.168.0.10
web-management allow-no-password
banner motd ^C
Reference Architecture -- Enterprise Engineering^C
Server Load Balancer -- SLB0 129.146.138.12/24^C
!
end
ver 07.3.05T12
global-protocol-vlan
!
server source-ip 20.20.0.51 255.255.255.0 172.0.0.10
!
server real s1 20.20.0.1
 port http
 port http url "HEAD /"
!
server real s2 20.20.0.2
 port http
 port http url "HEAD /"
!
server virtual vip1 172.0.0.11
 port http
 port http dsr
 bind http s1 http s2 http
!
vlan 1 name DEFAULT-VLAN by port
Chapter 7 Reference Design Implementations 329
ver 07.3.05T12
!
hostname SLB1
ip address 172.0.0.112 255.255.255.0
ip default-gateway 172.0.0.10
web-management allow-no-password
banner motd ^C
Reference Architecture - Enterprise Engineering^C
Server Load Balancer - SLB1 - 129.146.138.13/24^C
!
Network Security
For the Sun ONE network configuration, firewalls were deployed between the service modules to provide network security. FIGURE 7-20 shows the relationship between the firewalls and the service modules.
FIGURE 7-20 Firewalls between the service modules: client traffic from the intranet/Internet passes through the edge switch, then through a firewall before each of the Web service, application service, and database service tiers.
In the lab, one physical firewall device was used to create multiple virtual firewalls. Network traffic was directed to pass through the firewalls between the service modules, as shown in FIGURE 7-21. The core switch is configured only for Layer 2, with separate port-based VLANs. The connection between the Netscreen and the core switch uses tagged VLANs. Trust zones are created on the Netscreen device, and they map directly to the tagged VLANs. The Netscreen firewall device performs the Layer 3 routing. This configuration directs all traffic through the firewall, resulting in firewall protection between each service module.
FIGURE 7-21 Virtual firewall architecture: client traffic enters the Netscreen device, which routes it through tagged VLANs on the Layer 2 core switch down to the database service tier. Web, application, and database traffic are multiplexed on one VLAN.
Netscreen Firewall
CODE EXAMPLE 7-5 shows part of the configuration file used to configure the Netscreen device.
set auth timeout 10
set clock "timezone" 0
set admin format dos
set admin name "netscreen"
set admin password nKVUM2rwMUzPcrkG5sWIHdCtqkAibn
set admin sys-ip 0.0.0.0
set admin auth timeout 0
set admin auth type Local
set zone id 1000 "DMZ1"
set zone id 1001 "web"
set zone id 1002 "appsrvr"
set zone "Untrust" block
set zone "DMZ" vrouter untrust-vr
set zone "MGT" block
set zone "DMZ1" vrouter trust-vr
set zone "web" vrouter trust-vr
set zone "appsrvr" vrouter trust-vr
set ip tftp retry 10
set ip tftp timeout 2
set interface ethernet1 zone DMZ1
set interface ethernet2 zone web
set interface ethernet3 zone appsrvr
set interface ethernet1 ip 192.168.0.253/24
set interface ethernet1 route
set interface ethernet2 ip 10.10.0.253/24
set interface ethernet2 route
set interface ethernet3 ip 20.20.0.253/24
set interface ethernet3 route
unset interface vlan1 bypass-others-ipsec
unset interface vlan1 bypass-non-ip
set interface ethernet1 manage ping
unset interface ethernet1 manage scs
unset interface ethernet1 manage telnet
unset interface ethernet1 manage snmp
unset interface ethernet1 manage global
unset interface ethernet1 manage global-pro
unset interface ethernet1 manage ssl
set interface ethernet1 manage web
unset interface ethernet1 ident-reset
set interface vlan1 manage ping
set interface vlan1 manage scs
set interface vlan1 manage telnet
set interface vlan1 manage snmp
set interface vlan1 manage global
set interface vlan1 manage global-pro
set interface vlan1 manage ssl
set interface vlan1 manage web
set interface v1-trust manage ping
set interface v1-trust manage scs
set interface v1-trust manage telnet
set interface v1-trust manage snmp
set interface v1-trust manage global
set interface v1-trust manage global-pro
set interface v1-trust manage ssl
set interface v1-trust manage web
unset interface v1-trust ident-reset
unset interface v1-untrust manage ping
unset interface v1-untrust manage scs
unset interface v1-untrust manage telnet
unset interface v1-untrust manage snmp
unset interface v1-untrust manage global
unset interface v1-untrust manage global-pro
unset interface v1-untrust manage ssl
unset interface v1-untrust manage web
unset interface v1-untrust ident-reset
set interface v1-dmz manage ping
unset interface v1-dmz manage scs
unset interface v1-dmz manage telnet
unset interface v1-dmz manage snmp
unset interface v1-dmz manage global
unset interface v1-dmz manage global-pro
unset interface v1-dmz manage ssl
unset interface v1-dmz manage web
unset interface v1-dmz ident-reset
set interface ethernet2 manage ping
unset interface ethernet2 manage scs
unset interface ethernet2 manage telnet
unset interface ethernet2 manage snmp
unset interface ethernet2 manage global
unset interface ethernet2 manage global-pro
unset interface ethernet2 manage ssl
unset interface ethernet2 manage web
unset interface ethernet2 ident-reset
set interface ethernet3 manage ping
unset interface ethernet3 manage scs
unset interface ethernet3 manage telnet
unset interface ethernet3 manage snmp
unset interface ethernet3 manage global
unset interface ethernet3 manage global-pro
unset interface ethernet3 manage ssl
unset interface ethernet3 manage web
unset interface ethernet3 ident-reset
set interface v1-untrust screen tear-drop
set interface v1-untrust screen syn-flood
set interface v1-untrust screen ping-death
set interface v1-untrust screen ip-filter-src
set interface v1-untrust screen land
set flow mac-flooding
set flow check-session
set address DMZ1 "dmznet" 192.168.0.0 255.255.255.0
set address web "webnet" 10.10.0.0 255.255.255.0
set address appsrvr "appnet" 20.20.0.0 255.255.255.0
set snmp name "ns208"
set traffic-shaping ip_precedence 7 6 5 4 3 2 1 0
set ike policy-checking
set ike respond-bad-spi 1
set ike id-mode subnet
set l2tp default auth local
set l2tp default ppp-auth any
set l2tp default radius-port 1645
set policy id 0 from DMZ1 to web "dmznet" "webnet" "ANY" Permit
set policy id 1 from web to DMZ1 "webnet" "dmznet" "ANY" Permit
set policy id 2 from DMZ1 to appsrvr "dmznet" "appnet" "ANY" Permit
set policy id 3 from appsrvr to DMZ1 "appnet" "dmznet" "ANY" Permit
set ha interface ethernet8
set ha track threshold 255
set pki authority default scep mode "auto"
set pki x509 default cert-path partial
APPENDIX A
Lyapunov Analysis
This appendix provides an outline of the mathematical proof that shows why the least connections server load balancing (SLB) algorithm is inherently stable. This means that over a long period of time, the system will ensure that the load is evenly balanced. This analysis can be used to model and verify the stability of any network design, which may be of tremendous value if you are an advanced network architect. Building on what was discussed in Chapter 3, we will extend the model of the single queue to that of the entire system and then show that the entire system is stable. The entire system consists of an aggregate ingress load of λ and N server processes of varying service rates μ1, μ2, . . ., μN, hence we get the following equation: EQN 1: S = λ + μ1 + μ2 + . . . + μN. We will use this equation later. It states that the value S is the sum of the aggregate load and the sum of all the service rates. This means in one time slot:
λ ≤ μ
where λ is the average sum of all incoming loads and μ is the average sum of all server processing capacity.
Since the incoming packets are modeled as Poisson arrivals, which occur in continuous time, we will map the time domain to an index t, which increases whenever the state of the system changes. The state is defined as the queue occupancy: if a packet arrives, it increases the size of one of the queues in the system; if a packet is serviced, the size of one queue decreases. Let Qs(t) = min(Q1(t), Q2(t), . . ., QN(t)). This Qs is the least occupied queue among all N queues. Let Qb(t) = {Q1(t), Q2(t), . . ., QN(t)} − {Qs(t)}, which is all the queues except for the least occupied. Let Qa(t) = Qb(t) + Qs(t), which is all the queues.
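The forwarding decision just defined can be sketched in a few lines. This is an illustrative sketch, not from the book; the function and variable names are ours:

```python
# Sketch of the SQF forwarding decision: the SLB sends each new request to
# Qs(t), the least occupied of the N queues, so every other queue (the set
# Qb) can only drain.  pick_shortest and the sample occupancies are
# hypothetical names/values used for illustration.

def pick_shortest(queues):
    """Return the index of Qs(t), the least occupied of the N queues."""
    return min(range(len(queues)), key=lambda i: queues[i])

occupancies = [4, 1, 3]            # Q1(t), Q2(t), Q3(t)
s = pick_shortest(occupancies)     # index of the shortest queue
occupancies[s] += 1                # an arrival grows only the shortest queue
print(occupancies)                 # prints [4, 2, 3]
```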
337
We know that the next state of any queue in the set Qb(t) can change only due to a Web service, which is a reduction by one request. There can be no increase in these queue sizes because the SLB will not forward any new requests to them. Therefore, these queues cannot grow in the next time slot, so we get: Qib(t+1) = Qib(t) − 1 with probability μib/S. We can also figure out the next possible state of Qs(t), which can change due to a Web service, resulting in a reduction of queue size by 1, or an increase in queue size, due to the SLB forwarding a request to this queue. Hence we get the next state as follows: Qis(t+1) = Qis(t) − 1 with probability μis/S, or Qis(t) + 1 with probability λ/S. We can assign the Lyapunov function to the sum of the occupancies of all N queues. We will use t, representing a particular time slot: L(t) = Q1(t) + Q2(t) + . . . + QN(t) = Qib(t) + Qis(t), so that E[L(t+1)] = (μib/S)(Qib(t) − 1) + (μis/S)(Qis(t) − 1) + (λ/S)(Qis(t) + 1). Now if we look at one particular queue, Qi(t), keeping time discrete, the state of Qi(t) only changes due to events of arrivals and/or departures. We can see how this queue increases and decreases in size or queue occupancy. For stability, we need to show: EΔL = E[L(t+1) − L(t) | L(t)] ≤ −ε‖Q‖ + k. This says that the expected value of the single-step drift (that is, the Lyapunov function at time t+1 minus the Lyapunov function at time t, given the Lyapunov function at time t) must be at most a negative constant times the queue size plus some constant k. The value of EΔL becomes negative when the queue size times ε is larger than k. This is typical of almost all systems: before the system reaches a steady state there is an initial unstable period, but after some time a steady state is reached. This is where we need to look at the system to determine its behavior in steady state.
EΔL = E[L(t+1) − L(t) | L(t)]
    = E[(μib/S)(Qib(t) − 1) + (μis/S)(Qis(t) − 1) + (λ/S)(Qis(t) + 1) − (μib/S)Qib(t) − (μis/S + λ/S)Qis(t) | L(t)]
    = E[(λ/S)(Qis(t) + 1) − (λ/S)Qis(t) + (μis/S)(Qis(t) − 1) − (μis/S)Qis(t) + (μib/S)(Qib(t) − 1) − (μib/S)Qib(t)]
    = λ/S − μis/S − μib/S
    = (λ − μis − μib)/S
All incoming traffic is being redirected by the SLB algorithm to the least occupied queue, so λ = λis.
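The transition probabilities used in this derivation partition each time slot, since S = λ + μ1 + . . . + μN. A quick numeric check (the λ and μ values here are ours, chosen purely for illustration):

```python
# Check that the slot probabilities lam/S, mu_is/S, mu_ib/S sum to 1, and
# that the resulting one-step drift is (lam - mu)/S, which is negative for
# admissible traffic.  The rate values below are illustrative assumptions.
lam = 1.8                        # aggregate ingress load (illustrative)
mu = [1.0, 1.0, 0.1]             # per-server service rates (illustrative)
S = lam + sum(mu)                # EQN 1

mu_is = mu[0]                    # suppose server 1 holds the shortest queue
mu_ib = sum(mu) - mu_is          # service mass of all the other queues

p_arrival = lam / S
p_serve_s = mu_is / S
p_serve_b = mu_ib / S
assert abs(p_arrival + p_serve_s + p_serve_b - 1.0) < 1e-12

drift = p_arrival - p_serve_s - p_serve_b    # E[L(t+1) - L(t)]
print(drift)                     # negative whenever lam < sum(mu)
```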
From this we conclude that the system is stable as long as the incoming traffic is admissible, that is, λ < μ. This proves that the SQF (smallest queue first) algorithm is guaranteed to drain all queues in such a way as to make sure the system is stable. If we had a round-robin SLB algorithm instead, we would not get this mathematical result. In particular, there is no way we can enforce λi < μi for every server, so some Qi(t) can overflow even though the overall average incoming traffic is less than the overall average server capacity, that is, λ < μ. The round-robin SLB blindly forwards incoming traffic to servers, without considering the occupancy of Qi(t). The round-robin scheme can easily have some idle servers and still continue to forward traffic to an overloaded server, resulting in instability. In the SQF algorithm, we know that only the shortest queue is forwarded traffic and that the other queues can only drain. As long as Qis does not overflow, the entire system is stable. We know that λ < μ, hence we know that Qis(t) is stable.
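The contrast can be made concrete with a small simulation of the uniformized discrete-time model used in this appendix. This is an illustrative sketch, not from the book: the λ and μi values are our own, picked so that the total load is admissible but one server is slower than its round-robin share.

```python
# SQF vs. round-robin under the slot model of this appendix: each slot is an
# arrival with probability lam/S or a completion at server i with probability
# mu[i]/S, where S = lam + sum(mu).  The load is admissible (lam < sum(mu)),
# but server 3 is slower than a 1/3 share, so round-robin overloads it while
# SQF keeps every queue bounded.  All rate values are illustrative.
import random

def simulate(policy, lam, mu, steps, seed=1):
    random.seed(seed)
    q = [0] * len(mu)
    total = lam + sum(mu)                 # S, from EQN 1
    nxt = 0                               # round-robin pointer
    for _ in range(steps):
        r = random.random() * total
        if r < lam:                       # an arrival
            if policy == "sqf":           # smallest queue first
                i = min(range(len(q)), key=lambda j: q[j])
            else:                         # blind round-robin
                i, nxt = nxt, (nxt + 1) % len(q)
            q[i] += 1
        else:                             # a service completion
            r -= lam
            for i, m in enumerate(mu):
                if r < m:
                    q[i] = max(0, q[i] - 1)
                    break
                r -= m
    return q

mu = [1.0, 1.0, 0.1]                      # server 3 is far slower
lam = 1.8                                 # admissible: lam < sum(mu) = 2.1
sqf = simulate("sqf", lam, mu, 200_000)
rr = simulate("rr", lam, mu, 200_000)
print("SQF queues:", sqf)                 # every queue typically stays small
print("RR queues: ", rr)                  # queue 3 grows without bound
```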
Glossary
This glossary defines terms and acronyms used in this book.
A
ABR: Available Bit Rate
ACK: Acknowledgement flag, TCP header
ACL: Access Control List
ANAR: Auto-Negotiation Advertisement Register
ANER: Auto-Negotiation Expansion Register
API: Application Programming Interface
Application Server: A host computer that provides access to a software application. In the context of this Reference Architecture, it is used to mean a J2EE Application Server, which essentially serves as an enterprise platform for Java applications. See J2EE.
ARI/FCI: Address Recognition Indicator/Frame Copier Indicator
ARP: Address Resolution Protocol
ASIC: Application Specific Integrated Circuit
ATM: Asynchronous Transfer Mode
B
BER: Bit Error Rate
BGP: Border Gateway Protocol
BMCR: Basic Mode Control Register
BMP: Bean Managed Persistence
BMSR: Basic Mode Status Register
BPDU: Bridge Protocol Data Unit
C
CBQ: Class-Based Queuing
CBR: Constant Bit Rate
CBS: Committed Burst Size
CGI: Common Gateway Interface
CIR: Committed Information Rate
CLEC: Competitive Local Exchange Carrier
CMP: Container Managed Persistence
Congestion Window: A congestion window added by slow start to the sender's TCP, called cwnd. When a new connection is established with a host on another network, the congestion window is initialized to one segment (that is, the segment size announced by the other end).
D
DAC: Dual-Attached Concentrator
DAPL: Direct Access Programming Library
DAS: Dual-Attached Station
DLPI: Data Link Provider Interface
DMA: Direct Memory Access
DMLT: Distributed Multilink Trunking
DNS: Domain Name Service
DoS: Denial of Service
DRAM: Dynamic Random Access Memory
DSR: Direct Server Return
DTR: Dedicated Token Ring
E
EBS: Excess Burst Size
EJB: Enterprise JavaBean
ESRP: Extreme Standby Routing Protocol
Edge data center switch: The integration point to the customer's existing backbone network. This is the switch that connects the data center to the customer's backbone network.
F
Failover: A characteristic of a highly available component or service that describes the ability to switch to another equivalent component or service so that the overall availability is still maintained. See High Availability.
FDDI: Fiber Distributed Data Interface
FIB: Forwarding Information Base
FIFO: First In, First Out
FPGA: Field Programmable Gate Array
G
GCR: Gigabit Control Register
GESR: Gigabit Extended Status Register
GFR: Guaranteed Frame Rate
GMII: Gigabit Media Independent Interface
GSR: Gigabit Status Register
H
High Availability (HA): General term used to describe the ability of a component or service to be running and therefore available.
HOL: Head of Line Blocking
HTTP: (Hypertext Transfer Protocol) The Internet protocol based on TCP/IP that fetches hypertext objects from remote hosts.
HTTPS: HTTP over SSL
I
ias: iPlanet Application Server
ILEC: Incumbent Local Exchange Carrier
Integratable: In the context of an integrated stack, it represents a mixture of third-party software products that support open standards such as Java and Java technologies for SOAP, UDDI, XML, and WSDL. These products can be combined to deliver a customer solution and should work together given their support of these open standards.
Integrated: In the context of an integrated stack, it represents Sun's software products that implement the Sun ONE architecture to deliver a fully optimized, tested, and supported system to maximize value to customers.
IOMMU: input/output memory management unit
IP: Internet Protocol
IPG: Inter-Packet Gap
IPMP: Internet Protocol Multipathing
ISAPI: Microsoft's Internet server application programming interface
ISP: Internet Service Provider
IXC: Inter Exchange Carrier
J
J2EE: (Java 2 Platform, Enterprise Edition) Set of standards that leverages J2SE technology and simplifies Java development by offering standardized, modular components, by providing a complete set of services to those components, and by handling many details of application behavior automatically, without complex programming. This is the standard on which the Sun ONE Application Server is based. See http://java.sun.com/j2ee/.
J2SE: (Java 2 Platform, Standard Edition) Represents the set of technologies that provides the runtime environment and Software Development Kit for Java development. See http://java.sun.com/j2se/.
Java: An object-oriented programming language developed by Sun Microsystems. The Write Once, Run Anywhere programming language.
Java 2 SDK: The software development kit that developers need to build applications for the Java 2 Platform, Standard Edition, v. 1.2. See also JDK.
JavaBeans: A portable, platform-independent reusable component model. See http://java.sun.com/.
Java RMI: (Java Remote Method Invocation) (n.) A distributed object model for Java program to Java program, in which the methods of remote objects written in the Java programming language can be invoked from other Java virtual machines, possibly on different hosts.
JAXM: (Java API for XML Messaging) Enables applications to send and receive document-oriented XML messages using a pure Java API. JAXM implements Simple Object Access Protocol (SOAP) 1.1 with Attachments messaging so that developers can focus on building, sending, receiving, and decomposing messages for their applications instead of programming low-level XML communications routines. See http://java.sun.com/xml/jaxm/index.html.
JAXR: (Java API for XML Registries) Provides a uniform and standard Java API for accessing different kinds of XML registries. See http://java.sun.com/xml/jaxr/index.html.
JAXRPC: (Java API for XML-based RPC) Enables Java technology developers to build Web applications and Web services incorporating XML-based RPC functionality according to the SOAP 1.1 specification. See http://java.sun.com/xml/jaxrpc/index.html.
JDBC: Java DataBase Connectivity
JDK: (Java Development Kit) The software that includes the APIs and tools that developers need to build applications for those versions of the Java platform that preceded the Java 2 Platform. See also Java 2 SDK.
JNDI: Java Naming and Directory Interface
JRE: (Java runtime environment) A subset of the Java Development Kit (JDK) for users and developers who want to redistribute the runtime environment. The Java runtime environment consists of the Java virtual machine (JVM), the Java core classes, and supporting files.
JSP: (JavaServer Pages) Technology that allows Web developers and designers to rapidly develop and easily maintain information-rich, dynamic Web pages that leverage existing business systems. See http://java.sun.com/products/jsp/.
L
LAA: Locally Administered Address
LACP: Link Aggregation Control Protocol
LAN: Local Area Network
LDAP: (Lightweight Directory Access Protocol) The Internet standard for directory lookups.
LLC: Logical Link Control
LPNAR: Link Partner Auto-negotiation Advertisement Register
LSP: Link State Packet
M
MAC: Media Access Control
MAU: Media Access Unit
MDT: Multi-Data Transmission
MII: Media Independent Interface
M/M/1 queue: A single-server queue with Poisson (Markovian) arrivals and exponentially distributed service times.
MSS: Maximum Segment Size
MTBF: Mean Time Between Failures
MTTR: Mean Time Till Recovery
MTU: Maximum Transmission Unit
Multi-tier architecture: An architecture composed of the following tiers: 1. Client Tier, 2. Web Tier, 3. Application Tier, 4. Database Tier. For a given custom application, multiples of any of these tiers may be used, thus n-tier. There is no implied relationship between tiers and machines, but collapsing all the tiers onto a single machine would not be network centric.
N
NAP: Network Access Point
NAT: Network Address Translation
NFS: Network File System
NIC: Network Interface Card (or Controller)
NSAPI: Netscape Application Programming Interface
NSP: Network Service Provider
O
Operating System: A collection of programs that monitor the use of the system and supervise the other programs executed by it.
OSPF: Open Shortest Path First
P
PHY: Physical layer
ping: (1) (n.) (Packet Internet Groper) A small program (ICMP ECHO) that a computer sends to a host and times on its return path. (2) (v.) To test the reach of destinations by sending them an ICMP ECHO: "Ping host X to see if it is up!"
PIR: Peak Information Rate
PPA: Primary Point of Attachment
Presentation Service: Term used to describe a service that presents the data that is returned to the end user. In this context, the presentation service was delivered by a tier of Web servers that served up JSP/servlet traffic for viewing by the client Web browsers.
Protocol Manager: Enables communication between Sun ONE Application Server and a client. Manages and provides services for all active, loaded listeners. Supports HTTP, HTTPS (HTTP over SSL), and IIOP.
Q
QoS: (Quality of Service) Measures the ability of network and computing systems to provide different levels of service to selected applications and associated network flows.
R
RBOC: (n.) Regional Bell Operating Company
RDMA: Remote Direct Memory Access
RED: Random Early Detection
Remote system: (n.) A system other than the one on which you are working.
RIP: (n.) Routing Information Protocol, an IGP supplied with Berkeley UNIX
RJ-45 connector: (n.) A modular cable connector standard used with consumer telecommunications equipment, such as systems equipped for ISDN connectivity.
RLDRAM: reduced latency DRAM
RMI: (n.) Remote Method Invocation. See Java RMI.
root: (n.) In a hierarchy of items, the one item from which all other items are descended. The root item has nothing above it in the hierarchy. See also class, hierarchy, package, root directory, root file system, and root user name.
root directory: (n.) The base directory from which all other directories stem, directly or indirectly.
root disk: (n.) On Sun server systems, the disk drive where the operating system resides. The root disk is located in the SCSI tray behind the front panel.
root file system: (n.) A file system residing on the root device (a device predefined by the system at initialization) that anchors the overall file system.
root user name: (n.) The SunOS user name that grants special privileges to the person who logs in with that ID. The user who can supply the correct password for the root user name is given superuser privileges for the particular machine.
root window: (1) (n.) In the X protocol, a window with no parent window. Each screen has a root window that covers it. (2) (adj.) Characteristic of an input method that uses a pre-editing window that is a child of the root window.
Router: A system that assigns a path for network (or Internet) traffic to follow based on IP address.
RR: Round-robin method of load balancing.
RTT: Round Trip Time
S
SAC: single-attached concentrator
SACK: selective acknowledgement
SAS: single-attached station
SBus: (n.) A 32-bit self-identifying bus used mainly on SPARC workstations; the SBus provides information to the system so that it can identify the device driver that needs to be used. An SBus device might need to use hardware configuration files to augment the information provided by the SBus card. See also PCI bus.
SBus bridge: (n.) A device providing additional SBus slots by connecting two SBuses. Generally, a bus bridge is functionally transparent to devices on the SBus. However, there are cases (for example, bus sizing) in which bus bridges can change the exact way a series of bus cycles is performed. Also called an SBus coupler.
SBus controller: (n.) The hardware responsible for performing arbitration, addressing translation and decoding, driving slave selects and address strobe, and generating timeouts.
SBus device: (n.) A logical device attached to the SBus. This device might be on the motherboard or on an SBus expansion card.
SBus expansion card: (n.) A physical printed circuit assembly that conforms to the single- or double-width mechanical specifications and that contains one or more SBus devices.
SBus expansion slot: (n.) An SBus slot into which you can install an SBus expansion card.
SBus ID: (n.) A special series of bytes at address 0 of each SBus slave that identifies the SBus device.
SDP: Sockets Direct Protocol
SSL: (Secure Sockets Layer) A protocol developed for transmitting private documents via the Internet. SSL works by using a public key to encrypt data that's transferred over the SSL connection.
Services on Demand: Ability to provide information, data, and applications to anyone, anytime, anywhere, on any device. Includes Web services technology, but also includes technology you are using today and could use in the future.
SFM: Switch Fabric Module
SLA: Service Level Agreement
SLB: server load balancing
sliding window: A TCP flow control protocol that allows the sender to transmit multiple packets before it stops and waits for an acknowledgment.
SMLT: Split Multilink Trunking
SNA: Systems Network Architecture
SOHO: Small Office/Home Office
Solaris Operating System: The Sun Microsystems open standards-based UNIX operating system. The Solaris Operating System, the foundation for the Sun ONE software architecture, delivers security, manageability, and performance.
SPOF: single point of failure
SQF: Smallest Queue First
SRAM: static random access memory
STP: Spanning Tree Protocol
Stream: (n.) A kernel aggregate created by connecting STREAMS components, resulting from an application of the STREAMS mechanism. The primary components are the Stream head, the driver, and zero or more pushable modules between the Stream head and driver.
Stream end: (n.) A Stream component that is farthest from the user process and contains a driver.
Stream head: (n.) A Stream component closest to the user process. It provides the interface between the Stream and the user process.
Streaming Service: Handles data streams from the Sun ONE Application Server to the Web server and to the Web browser. A streaming service improves performance by allowing users to begin viewing results of requests sooner rather than waiting until the complete operation has been processed.
STREAMS: (n.) A kernel mechanism that supports development of network services and data communications drivers. STREAMS defines interface standards for character input/output within the kernel and between the kernel and user level. The STREAMS mechanism includes integral functions, utility routines, kernel facilities, and a set of structures.
STREAMS-based pipe: (n.) A mechanism for bidirectional data transfer implemented using STREAMS and sharing properties of STREAMS-based devices.
Sun ONE: (Sun Open Net Environment) The Sun Microsystems software strategy that comprises the vision, architecture, platform, and expertise for developing and deploying Services on Demand today. See http://www.sun.com/sunone.
Switch: Any device or mechanism that moves data from one network to another without any routing tables.
SYN: synchronization
T
TCAM: Telecommunications Access Method
TDM: time division multiplexing
TTCP: Test Transmission Control Protocol
TTRT: Target Token Rotation Time
U
UDDI: Universal Description, Discovery, and Integration. The UDDI Project is an industry initiative that is working to enable businesses to quickly, easily, and dynamically find and transact with one another via Web services. UDDI enables a business to (i) describe its business and its services, (ii) discover other businesses that offer desired services, and (iii) integrate with these other businesses. See http://www.uddi.org. An alternative to UDDI is JAXR, created by Sun Microsystems. See JAXR.
W
Web connectors: Web connectors and listeners manage the passing of requests from the Web server to the Sun ONE Application Server. Listeners distribute and handle requests from the Web connectors. New listeners can be added with the HTTP handler.
Web server: The easy-to-use, extensible, easy-to-administer, secure, platform-independent solution to speed up and simplify the deployment and management of your Internet and intranet Web sites. It provides immediate productivity for full-featured, Java technology-based server applications.
Web service: A fine-grained, component-style service that is advertised and described in a service registry, based on standardized protocols (JAXR, UDDI, JAXRPC, JAXM, SOAP, WSDL, and so on), and accessible programmatically by applications or other Web services.
WSDL: (Web Services Description Language) An XML format for describing network services as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. See http://www.w3.org/TR/.
Index
A
access control lists 296
active switches 311
application redirection function 299
architecture 3
architecture, network security 330
asynchronous transfer mode (ATM) 98
Auto-negotiation 153
Auto-negotiation Advertisement Register 155
Auto-negotiation Expansion register 157

B
Basic Mode Control Register 153
Basic Mode Status Register 154
BlackDiamond switches 323
border gateway protocol v4 (BGP4) 311
bridges 66
business logic 312

C
checksumming 146
cipher suite 110
class C network 308
client requests 312
configuring the
  Extreme Networks switches 323
  Foundry Networks switches 324
congestion control 107
congestion window 51, 54
consistent mode 142
Constant Bit Rate (CBR) 98
content switching 300
Control Plane 67
CPU load balancing 148

D
data flow through service modules 312
Data Plane 67
descriptor ring 141
design 3
disable source routing 127
Dual-attached concentrator 134
Dual-attached station 132

E
Enterprise Java Beans (EJBs) 314
Extreme Networks equipment 317
Extreme Networks switches, configuring 323

F
FDDI concentrators 134
FDDI interfaces 136
FDDI station 132
Fiber Distributed Data Interface network 131
firewall architecture 332
firewalls between service modules (figure) 331
flat architecture 261
flow control keywords 213
Forced mode 153
Foundry Networks equipment 317
Foundry Networks switches, configuring 324
Full NAT 91, 302
functional tiers 17

G
Gigabit Media Independent Interface 157
global synchronization 107

H
Half NAT 91, 302
Handshake Layer 110

I
interface specifications 314
interrupt blanking 148
IP address space (private) 296
IP forwarding module 107
IP header 314

J
J2EE application 314
Java data access objects 314
Java DataBase Connectivity (JDBC) 314
Java Server Pages (JSP) 312
Jumbo frames 152
jumbo frames 217
JumpStart, Solaris 296

L
Layer 3 routing 331
Link-partner Auto-negotiation Advertisement 155
load balancing, built-in 317
local area networks, virtual (VLANs) 296
logical network architecture 296
logical network architecture overview (figure) 297
logical network design 296

M
MAC overflow 146
management network 296
mapping process 21
media access unit 128
multi-data transmission 143
multi-level architecture 261

N
netmask values 320
Netscreen Firewall configuration file 333
network
  configuration (Extreme Networks equipment) 318
  configuration (Foundry Networks equipment) 319
  physical 315
  security architecture 330
Network Address Translation 302
Network Address Translation (NAT) 91
network architecture with virtual routers 310
network design 3
  traditional 308
  using chassis-based switches 309

O
open shortest path first (OSPF) 311

P
Parallel Detection Fault 157
partial checksumming 147
pause frames 161
physical network 315
policing 107
Precedence Priority Model 102
private IP address space 296
proxy switching 300

Q
QoS Profile 103
Quality of Service (QoS) 92
queuing 107

R
random early detection register 216
Random Early Discard 151
receive interrupt blanking values 215
receive window 54
received packet delivery method 150
Reservation Model 102
ring of trees 135
ring speed 129
round-robin 74
router 66
routers 320
routing information protocol (RIP) 311

S
secure socket layer 314
sequence of events (data flow) 314
server load balancing 298
Service Level Agreements (SLAs) 98
Services on Demand architecture 18
shaping 107
Single-attached concentrator 134
Single-attached station 132
sliding windows 54
Startup Phase 51
stateful 25
Stateful Layer 7 switching 300
Stateful Session Based 298
stateless and idempotent 25
Stateless Session Based 298
static routes 296, 311
Steady State Phase 51
streaming mode 142
Streams Service Queue model 151
switch 66
switch configuration 322
switch configuration file (Extreme switch) 323
switch configuration file (Foundry switch) 326
symmetric flow control 162

T
tagged VLAN 309
Tail Drop 107
token ring interfaces 125
token ring network 123
transmission latency 142
Transmit Pause capability 162
Trunking Policies 232
trust zones 331

U
URL switching 300

V
Variable Bit Rate-Real Time (VBR-rt) 98
virtual firewalls 331
virtual local area networks (VLANs) 296
virtual routers 310
VLAN, tagged 309

W
Web-based applications 17
weighted round-robin 74