Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson Huang, Pardis Miri,
Sivasankar Radhakrishnan, Vikram Subramanya, and Amin Vahdat
Department of Computer Science and Engineering
University of California San Diego
{radhika, apambori, farrington, nhuang, smiri, sivasankar, vikram.s3, vahdat}@cs.ucsd.edu
…supporting a single layer 2 fabric for the entire data center. A layer 3 fabric would require configuring each switch with its subnet information and synchronizing DHCP servers to distribute IP addresses based on the host's subnet. Worse, transparent VM migration is not possible at layer 3 (save through techniques designed for IP mobility) because VMs must switch their IP addresses if they migrate to a host on a different subnet. Unfortunately, layer 2 fabrics face scalability and efficiency challenges because of the need to support broadcast. Further, R3 at layer 2 requires MAC forwarding tables with potentially hundreds of thousands or even millions of entries, impractical with today's switch hardware. R4 is difficult for either layer 2 or layer 3 because forwarding loops are possible during routing convergence. A layer 2 protocol may avoid such loops by employing a single spanning tree (inefficient) or tolerate them by introducing an additional header with a TTL (incompatible). R5 requires efficient routing protocols that can disseminate topology changes quickly to all points of interest. Unfortunately, existing layer 2 and layer 3 routing protocols, e.g., ISIS and OSPF, are broadcast based, with every switch update sent to all switches. On the efficiency side, the broadcast overhead of such protocols would likely require configuring the equivalent of routing areas [5], contrary to R2.

Hence, the current assumption is that the vision of a unified plug-and-play large-scale network fabric is unachievable, leaving data center network architects to adopt ad hoc partitioning and configuration to support large-scale deployments. Recent work in SEATTLE [10] makes dramatic advances toward a plug-and-play Ethernet-compatible protocol. However, in SEATTLE, switch state grows with the number of hosts in the data center, forwarding loops remain possible, and routing requires all-to-all broadcast, violating R3, R4, and R5. Section 3.7 presents a detailed discussion of both SEATTLE and TRILL [28].

In this paper, we present PortLand, a set of Ethernet-compatible routing, forwarding, and address resolution protocols with the goal of meeting R1-R5 above. The principal observation behind our work is that data center networks are often physically inter-connected as a multi-rooted tree [1]. Using this observation, PortLand employs a lightweight protocol to enable switches to discover their position in the topology. PortLand further assigns internal Pseudo MAC (PMAC) addresses to all end hosts to encode their position in the topology. PMAC addresses enable efficient, provably loop-free forwarding with small switch state.

We have a complete implementation of PortLand. We provide native fault-tolerant support for ARP, network-layer multicast, and broadcast. PortLand imposes few requirements on the underlying switch software and hardware. We hope that PortLand enables a move towards more flexible, efficient and fault-tolerant data centers where applications may flexibly be mapped to different hosts, i.e., where the data center network may be treated as one unified fabric.

2. BACKGROUND

2.1 Data Center Networks

Topology.
Current data centers consist of thousands to tens of thousands of computers, with emerging mega data centers hosting 100,000+ compute nodes. As one example, consider our interpretation of current best practices [1] for the layout of an 11,520-port data center network. Machines are organized into racks and rows, with a logical hierarchical network tree overlaid on top of the machines. In this example, the data center consists of 24 rows, each with 12 racks. Each rack contains 40 machines interconnected by a top of rack (ToR) switch that delivers non-blocking bandwidth among directly connected hosts. Today, a standard ToR switch contains 48 GigE ports and up to 4 available 10 GigE uplinks.

ToR switches connect to end of row (EoR) switches via 1-4 of the available 10 GigE uplinks. To tolerate individual switch failures, ToR switches may be connected to EoR switches in different rows. An EoR switch is typically a modular 10 GigE switch with a number of ports corresponding to the desired aggregate bandwidth. For maximum bandwidth, each of the 12 ToR switches would connect all 4 available 10 GigE uplinks to a modular 10 GigE switch with up to 96 ports. 48 of these ports would face downward towards the ToR switches and the remainder of the ports would face upward to a core switch layer. Achieving maximum bandwidth for inter-row communication in this example requires connecting 48 upward facing ports from each of 24 EoR switches to a core switching layer consisting of 12 96-port 10 GigE switches.
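The port counts in this example are easy to sanity-check. The following sketch is purely illustrative arithmetic over the rack, row, and uplink numbers quoted above.

# Back-of-the-envelope check of the example 11,520-port layout described above.
rows, racks_per_row, hosts_per_rack = 24, 12, 40

hosts = rows * racks_per_row * hosts_per_rack            # 24 * 12 * 40 = 11,520 hosts
tor_uplinks_per_rack = 4                                 # 10 GigE uplinks per ToR switch
eor_down_ports = racks_per_row * tor_uplinks_per_rack    # 12 * 4 = 48 downward-facing EoR ports
eor_up_ports = 48                                        # upward-facing ports per 96-port EoR switch
core_ports_needed = rows * eor_up_ports                  # 24 * 48 = 1,152 core-facing links
core_switches = core_ports_needed // 96                  # 1,152 / 96 = 12 core switches

print(hosts, eor_down_ports, core_switches)              # 11520 48 12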
Forwarding.
There are a number of available data forwarding techniques in data center networks. The high-level dichotomy is between creating a Layer 2 network or a Layer 3 network, each with associated tradeoffs. A Layer 3 approach assigns IP addresses to hosts hierarchically based on their directly connected switch. In the example topology above, hosts connected to the same ToR could be assigned the same /26 prefix and hosts in the same row may have a /22 prefix. Such careful assignment will enable relatively small forwarding tables across all data center switches.
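A minimal sketch of how the example prefix sizes follow from the rack and row sizes above; the prefix_length helper is hypothetical and simply rounds the address requirement up to a power of two.

import math

def prefix_length(num_addresses: int) -> int:
    """Smallest IPv4 prefix length whose block covers num_addresses."""
    return 32 - math.ceil(math.log2(num_addresses))

hosts_per_rack = 40
racks_per_row = 12

rack_prefix = prefix_length(hosts_per_rack)      # 40 hosts fit in a /26 (64 addresses)
# If each rack is allocated a full /26, a row needs 12 * 64 = 768 addresses,
# which rounds up to a /22 (1,024 addresses), matching the example above.
row_prefix = prefix_length(racks_per_row * 2 ** (32 - rack_prefix))

print(rack_prefix, row_prefix)                   # 26 22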
Standard intra-domain routing protocols such as OSPF [21] may be employed among switches to find shortest paths among hosts. Failures in large-scale network topologies will be commonplace. OSPF can detect such failures and then broadcast the information to all switches to avoid failed links or switches. Transient loops with layer 3 forwarding are less of an issue because the IP-layer TTL limits per-packet resource consumption while forwarding tables are being asynchronously updated.

Unfortunately, Layer 3 forwarding does impose administrative burden as discussed above. In general, the process of adding a new switch requires manual administrator configuration and oversight, an error-prone process. Worse, improperly synchronized state between system components, such as a DHCP server and a configured switch subnet identifier, can lead to unreachable hosts and difficult-to-diagnose errors. Finally, the growing importance of end host virtualization makes Layer 3 solutions less desirable, as described below.

For these reasons, certain data centers deploy a layer 2 network where forwarding is performed based on flat MAC addresses. A layer 2 fabric imposes less administrative overhead. Layer 2 fabrics have their own challenges, of course. Standard Ethernet bridging [23] does not scale to networks with tens of thousands of hosts because of the need to support broadcast across the entire fabric. Worse, the presence of a single forwarding spanning tree (even if optimally designed) would severely limit performance in topologies that consist of multiple available equal cost paths.
A middle ground between a Layer 2 and Layer 3 fabric consists of employing VLANs to allow a single logical Layer 2 fabric to cross multiple switch boundaries. While feasible for smaller-scale topologies, VLANs also suffer from a number of drawbacks. For instance, they require bandwidth resources to be explicitly assigned to each VLAN at each participating switch, limiting flexibility for dynamically changing communication patterns. Next, each switch must maintain state for all hosts in each VLAN that it participates in, limiting scalability. Finally, VLANs also use a single forwarding spanning tree, limiting performance.

End Host Virtualization.
The increasing popularity of end host virtualization in the data center imposes a number of requirements on the underlying network. Commercially available virtual machine monitors allow tens of VMs to run on each physical machine in the data center¹, each with their own fixed IP and MAC addresses. In data centers with hundreds of thousands of hosts, this translates to the need for scalable addressing and forwarding for millions of unique end points. While individual applications may not (yet) run at this scale, application designers and data center administrators alike would still benefit from the ability to arbitrarily map individual applications to an arbitrary subset of available physical resources.

¹One rule of thumb for the degree of VM-multiplexing allocates one VM per thread in the underlying processor hardware. x86 machines today have 2 sockets, 4 cores/processor, and 2 threads/core. Quad socket, eight core machines will be available shortly.

Virtualization also allows the entire VM state to be transmitted across the network to migrate a VM from one physical machine to another [11]. Such migration might take place for a variety of reasons. A cloud computing hosting service may migrate VMs for statistical multiplexing, packing VMs on the smallest physical footprint possible while still maintaining performance guarantees. Further, variable bandwidth to remote nodes in the data center could warrant migration based on dynamically changing communication patterns to achieve high bandwidth for tightly-coupled hosts. Finally, variable heat distribution and power availability in the data center (in steady state or as a result of component cooling or power failure) may necessitate VM migration to avoid hardware failures.

Such an environment currently presents challenges both for Layer 2 and Layer 3 data center networks. In a Layer 3 setting, the IP address of a virtual machine is set by its directly-connected switch subnet number. Migrating the VM to a different switch would require assigning a new IP address based on the subnet number of the new first-hop switch, an operation that would break all open TCP connections to the host and invalidate any session state maintained across the data center. A Layer 2 fabric is agnostic to the IP address of a VM. However, scaling ARP and performing routing/forwarding on millions of flat MAC addresses introduces a separate set of challenges.

2.2 Fat Tree Networks
Recently proposed work [6, 14, 15] suggests alternate topologies for scalable data center networks. In this paper, we consider designing a scalable, fault-tolerant layer 2 domain over one such topology, a fat tree. As will become evident, the fat tree is simply an instance of the traditional data center multi-rooted tree topology (Section 2.1). Hence, the techniques described in this paper generalize to existing data center topologies. We present the fat tree because our available hardware/software evaluation platform (Section 4) is built as a fat tree.

Figure 1 depicts a 16-port switch built as a multi-stage topology from constituent 4-port switches. In general, a three-stage fat tree built from k-port switches can support non-blocking communication among k^3/4 end hosts using 5k^2/4 individual k-port switches. We split the fat tree into three layers, labeled edge, aggregation and core as in Figure 1. The fat tree as a whole is split into k individual pods, with each pod supporting non-blocking operation among k^2/4 hosts. Non-blocking operation requires careful scheduling of packets among all available paths, a challenging problem. While a number of heuristics are possible, for the purposes of this work we assume ECMP-style hashing of flows [16] among the k^2/4 available paths between a given source and destination. While current techniques are less than ideal, we consider the flow scheduling problem to be beyond the scope of this paper.
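The fat tree parameters quoted above can be tabulated in a few lines, together with the ECMP-style flow hashing we assume. This is an illustrative sketch, not PortLand code; the hash used for path selection is a stand-in.

# Illustrative fat-tree sizing and ECMP-style path selection (not PortLand code).

def fat_tree_sizes(k: int) -> dict:
    """Counts for a three-stage fat tree built from k-port switches (k even)."""
    return {
        "pods": k,
        "hosts": k ** 3 // 4,               # k^3/4 end hosts
        "hosts_per_pod": k ** 2 // 4,       # k^2/4 hosts per pod
        "edge_switches": k * (k // 2),      # k/2 edge switches per pod
        "aggregation_switches": k * (k // 2),
        "core_switches": (k // 2) ** 2,
        "total_switches": 5 * k ** 2 // 4,  # 5k^2/4 switches overall
    }

def ecmp_path_index(flow: tuple, k: int) -> int:
    """Hash a flow tuple onto one of the k^2/4 equal-cost paths (stand-in hash)."""
    return hash(flow) % (k ** 2 // 4)

print(fat_tree_sizes(4))   # 16 hosts, 20 switches: the Figure 1 example built from 4-port switches
print(fat_tree_sizes(48))  # 27,648 hosts with 2,880 48-port switches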
2.3 Related Work
Recently, there have been a number of proposals for network architectures specifically targeting the data center. Two recent proposals [14, 6] suggest topologies based on fat trees [18]. As discussed earlier, fat trees are a form of multi-rooted trees that already form the basis for many existing data center topologies. As such, they are fully compatible with our work and in fact our implementation runs on top of a small-scale fat tree. DCell [15] also recently proposed a specialized topology for the data center environment. While not strictly a multi-rooted tree, there is implicit hierarchy in the DCell topology, which should make it compatible with our techniques.

Others have also recently recognized the need for more scalable layer 2 networks. SmartBridge [25] extended the original pioneering work on learning bridges [23] to move beyond single spanning tree networks while maintaining the loop free property of extended LANs. However, SmartBridge still suffers from the scalability challenges characteristic of Ethernet networks. Contemporaneous to our work, MOOSE [27] also suggests the use of hierarchical Ethernet addresses and header rewriting to address some of Ethernet's scalability limitations.

RBridges and TRILL [24], its IETF standardization effort, address some of the routing challenges in Ethernet. RBridges run a layer 2 routing protocol among switches. Essentially switches broadcast information about their local connectivity along with the identity of all directly connected end hosts. Thus, all switches learn the switch topology and the location of all hosts. To limit forwarding table size, ingress switches map destination MAC addresses to the appropriate egress switch (based on global knowledge) and encapsulate the packet in an outer MAC header with the egress switch identifier. In addition, RBridges add a secondary header with a TTL field to protect against loops.

We also take inspiration from CMU Ethernet [22], which also proposed maintaining a distributed directory of all host information. Relative to both approaches, PortLand is able to achieve improved fault tolerance and efficiency by leveraging knowledge about the baseline topology and avoiding broadcast-based routing protocols altogether.
[Figure 1: A multi-rooted fat tree topology; the switch layers are labeled core, aggregation, and edge.]
Failure Carrying Packets (FCP) [17] shows the benefits of assuming some knowledge of baseline topology in routing protocols. Packets are marked with the identity of all failed links encountered between source and destination, enabling routers to calculate new forwarding paths based on the failures encountered thus far. Similar to PortLand, FCP shows the benefits of assuming knowledge of baseline topology to improve scalability and fault tolerance. For example, FCP demonstrates improved routing convergence with fewer network messages and less state.

To reduce the state and communication overhead associated with routing in large-scale networks, recent work [8, 9, 10] explores using DHTs to perform forwarding on flat labels. We achieve similar benefits in per-switch state overhead with lower network overhead and the potential for improved fault tolerance and efficiency, both in forwarding and routing, by once again leveraging knowledge of the baseline topology.

There is an inherent trade-off between protocol simplicity and system robustness when considering a distributed versus centralized realization of particular functionality. In PortLand, we restrict the amount of centralized knowledge and limit it to soft state. In this manner, we eliminate the need for any administrator configuration of the fabric manager (e.g., number of switches, their location, their identifier). In deployment, we expect the fabric manager to be replicated, with a primary asynchronously updating state on one or more backups. Strict consistency among replicas is not necessary as the fabric manager maintains no hard state.

Our approach takes inspiration from other recent large-scale infrastructure deployments. For example, modern storage [13] and data processing systems [12] employ a centralized controller at the scale of tens of thousands of machines. In another setting, the Route Control Platform [7] considers centralized routing in ISP deployments. All the same, the protocols described in this paper are amenable to distributed realizations if the tradeoffs in a particular deployment environment tip against a central fabric manager.
…the port number the host is connected to. We use vmid (16 bits) to multiplex multiple virtual machines on the same physical machine (or physical hosts on the other side of a bridge). Edge switches assign monotonically increasing vmid's to each subsequent new MAC address observed on a given port. PortLand times out vmid's without any traffic and reuses them.

[Figures: PMAC assignment and proxy ARP. The fabric manager maintains IP-to-PMAC mappings, e.g., 10.5.1.2 → 00:00:01:02:00:01 and 10.2.4.5 → 00:02:00:02:00:01.]
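A small sketch of PMAC packing and unpacking consistent with the addresses shown above. The 16/8/8/16-bit pod.position.port.vmid layout is inferred from those examples and from the vmid width given in the text, so treat the exact field widths as an assumption.

# Illustrative PMAC encode/decode, assuming a pod(16).position(8).port(8).vmid(16) layout.

def encode_pmac(pod: int, position: int, port: int, vmid: int) -> str:
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{b:02x}" for b in value.to_bytes(6, "big"))

def decode_pmac(pmac: str) -> tuple:
    value = int.from_bytes(bytes.fromhex(pmac.replace(":", "")), "big")
    return (value >> 32) & 0xFFFF, (value >> 24) & 0xFF, (value >> 16) & 0xFF, value & 0xFFFF

print(encode_pmac(0, 1, 2, 1))            # 00:00:01:02:00:01, as in the figure
print(decode_pmac("00:02:00:02:00:01"))   # (2, 0, 2, 1): pod 2, position 0, port 2, vmid 1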
…change, this may still be a viable option. However, to explore the limits to which PortLand switches may be entirely plug-and-play, we also present a location discovery protocol (LDP) that requires no administrator configuration. PortLand switches do not begin packet forwarding until their location is established.

PortLand switches periodically send a Location Discovery Message (LDM) out all of their ports, both to set their positions and to monitor liveness in steady state. LDMs contain the following information:

• Switch identifier (switch_id): a globally unique identifier for each switch, e.g., the lowest MAC address of all local ports.

• Pod number (pod): a number shared by all switches in the same pod (see Figure 1). Switches in different pods will have different pod numbers. This value is never set for core switches.

• Position (pos): a number assigned to each edge switch, unique within each pod.

• Tree level (level): 0, 1, or 2 depending on whether the switch is an edge, aggregation, or core switch. Our approach generalizes to deeper hierarchies.

• Up/down (dir): up/down is a bit which indicates whether a switch port is facing downward or upward in the multi-rooted tree.

Initially, all values other than the switch identifier and port number are unknown and we assume the fat tree topology depicted in Figure 1. However, LDP also generalizes to multi-rooted trees as well as partially connected fat trees. We assume all switch ports are in one of three states: disconnected, connected to an end host, or connected to another switch.

The key insight behind LDP is that edge switches receive LDMs only on the ports connected to aggregation switches (end hosts do not generate LDMs). We use this observation to bootstrap level assignment in LDP. Edge switches learn their level by determining that some fraction of their ports are host connected. Level assignment then flows up the tree. Aggregation switches set their level once they learn that some of their ports are connected to edge switches. Finally, core switches learn their levels once they confirm that all ports are connected to aggregation switches.

Algorithm 1 LDP listener thread()
 1: While (true)
 2:   For each tp in tentative_pos
 3:     If (curr_time − tp.time) > timeout
 4:       tentative_pos ← tentative_pos − {tp};
 5:   ▹ Case 1: On receipt of LDM P
 6:   Neighbors ← Neighbors ∪ {switch that sent P}
 7:   If (curr_time − start_time > T and |Neighbors| ≤ k/2)
 8:     my_level ← 0; incoming_port ← up;
 9:     Acquire_position_thread();
10:   If (P.level = 0 and P.dir = up)
11:     my_level ← 1; incoming_port ← down;
12:   Else If (P.dir = down)
13:     incoming_port ← up;
14:   If (my_level = −1 and |Neighbors| = k)
15:     is_core ← true;
16:     For each switch in Neighbors
17:       If (switch.level ≠ 1 or switch.dir ≠ −1)
18:         is_core ← false; break;
19:     If (is_core = true)
20:       my_level ← 2; Set dir of all ports to down;
21:   If (P.pos ≠ −1 and P.pos ∉ Pos_used)
22:     Pos_used ← Pos_used ∪ {P.pos};
23:   If (P.pod ≠ −1 and my_level ≠ 2)
24:     my_pod ← P.pod;
25:
26:   ▹ Case 2: On receipt of position proposal P
27:   If (P.proposal ∉ (Pos_used ∪ tentative_pos))
28:     reply ← {"Yes"};
29:     tentative_pos ← tentative_pos ∪ {P.proposal};
30:   Else
31:     reply ← {"No", Pos_used, tentative_pos};

Algorithm 2 Acquire position thread()
 1: taken_pos = {};
 2: While (my_pos = −1)
 3:   proposal ← random() % (k/2), s.t. proposal ∉ taken_pos
 4:   Send proposal on all upward facing ports
 5:   Sleep(T);
 6:   If (more than k/4 + 1 switches confirm proposal)
 7:     my_pos = proposal;
 8:     If (my_pos = 0)
 9:       my_pod = Request from Fabric Manager;
10:   Update taken_pos according to replies;

Algorithm 1 presents the processing performed by each switch in response to LDMs. Lines 2-4 are concerned with position assignment and will be described below. In line 6, the switch updates the set of switch neighbors that it has heard from. In lines 7-8, if a switch is not connected to more than k/2 neighbor switches for sufficiently long, it concludes that it is an edge switch. The premise for this conclusion is that edge switches have at least half of their ports connected to end hosts. Once a switch comes to this conclusion, on any subsequent LDM it receives, it infers that the corresponding incoming port is an upward facing one. While not shown for simplicity, a switch can further confirm its notion of position by sending pings on all ports. Hosts will reply to such pings but will not transmit LDMs. Other PortLand switches will both reply to the pings and transmit LDMs.

In lines 10-11, a switch receiving an LDM from an edge switch on an upward facing port concludes that it must be an aggregation switch and that the corresponding incoming port is a downward facing port. Lines 12-13 handle the case where core/aggregation switches transmit LDMs on downward facing ports to aggregation/edge switches that have not yet set the direction of some of their ports.

Determining the level for core switches is somewhat more complex, as addressed by lines 14-20. A switch that has not yet established its level first verifies that all of its active ports are connected to other PortLand switches (line 14). It then verifies in lines 15-18 that all neighbors are aggregation switches that have not yet set the direction of their links (aggregation switch ports connected to edge switches would have already been determined to be downward facing). If these conditions hold, the switch can conclude that it is a core switch and set all its ports to be downward facing (line 20).

Edge switches must acquire a unique position number in each pod in the range 0..k/2 − 1. This process is depicted in Algorithm 2. Intuitively, each edge switch proposes a randomly chosen number in the appropriate range to all aggregation switches in the same pod. If the proposal is verified by a majority of these switches as unused and not tentatively reserved, the proposal is finalized and this value will be included in future LDMs from the edge switch. As shown in lines 2-4 and 29 of Algorithm 1, aggregation switches will hold a proposed position number for some period of time before timing it out in the case of multiple simultaneous proposals for the same position number.
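For concreteness, the level-inference rules of Algorithm 1 can be restated in a few lines. This is a simplified sketch under the assumptions stated above (LDMs arrive only from other switches, and k is the port count per switch); the LDM structure and names are illustrative rather than PortLand's actual implementation.

# Simplified restatement of Algorithm 1's level inference (illustrative only).
from dataclasses import dataclass

EDGE, AGGREGATION, CORE, UNKNOWN = 0, 1, 2, -1

@dataclass
class LDM:                      # Location Discovery Message fields from the text
    switch_id: int
    pod: int
    pos: int
    level: int
    dir: int                    # -1 = unset, otherwise up/down

class Switch:
    def __init__(self, k: int):
        self.k = k
        self.level = UNKNOWN
        self.neighbors = {}     # switch_id -> last LDM seen from that neighbor

    def on_ldm(self, ldm: LDM, waited_long_enough: bool) -> None:
        self.neighbors[ldm.switch_id] = ldm
        # Edge: after time T, at most k/2 ports ever saw an LDM (the rest face hosts).
        if waited_long_enough and len(self.neighbors) <= self.k // 2:
            self.level = EDGE
        # Aggregation: an LDM from a level-0 (edge) switch arrived on an upward-facing port.
        elif ldm.level == EDGE:
            self.level = AGGREGATION
        # Core: all k ports see switches, and every neighbor is an aggregation
        # switch that has not yet set the direction of its links.
        elif (self.level == UNKNOWN and len(self.neighbors) == self.k and
              all(n.level == AGGREGATION and n.dir == -1 for n in self.neighbors.values())):
            self.level = CORE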
LDP leverages the fabric manager to assign unique pod numbers to all switches in the same pod. In lines 8-9 of Algorithm 2, the edge switch that adopts position 0 requests a pod number from the fabric manager. This pod number spreads to the rest of the pod in lines 21-22 of Algorithm 1.

Due to space constraints, we leave a description of the entire algorithm accounting for a variety of failure and partial connectivity conditions to separate work. We do note one of the interesting failure conditions, miswiring. Even in a data center environment, it may still be possible that two host-facing ports inadvertently become bridged. For example, someone may inadvertently plug an Ethernet cable between two outward-facing ports, introducing a loop and breaking some of the important PortLand forwarding properties. LDP protects against this case as follows. If an uninitialized switch begins receiving LDMs from an edge switch on one of its ports, it must be an aggregation switch or there is an error condition. It can conclude there is an error condition if it receives LDMs from aggregation switches on other ports or if most of its active ports are host-connected (and hence receive no LDMs). In an error condition, the switch disables the suspicious port and signals an administrator exception.

…aggregation switches necessary to ensure multicast packet delivery to edge switches with at least one interested host.

Our forwarding protocol is provably loop free by observing up-down semantics [26] in the forwarding process, as explained in Appendix A. Packets will always be forwarded up to either an aggregation or core switch and then down toward their ultimate destination. We protect against transient loops and broadcast storms by ensuring that once a packet begins to travel down, it is not possible for it to travel back up the topology. There are certain rare simultaneous failure conditions where packets may only be delivered by, essentially, detouring back down to an aggregation switch to get to a core switch capable of reaching a given destination. We err on the side of safety and prefer to lose connectivity in these failure conditions rather than admit the possibility of loops.

3.6 Fault Tolerant Routing

[Figure: unicast fault detection and action. The fabric manager maintains a fault matrix; numbered steps 1-4 show detection, notification, and rerouting.]
[Figure 5: Multicast: Fault detection and action. The fabric manager's multicast state maps group MAC 01:5E:E1:00:00:24 to subscribers 0, 3, 6 with root 16; the affected flow table entry (In Port 0, Dst MAC 01:5E:E1:00:00:24, Src MAC 00:01:00:02:00:01, Type 0800) forwards on out port 3.]

[Figure 6: Multicast: After fault recovery. The group's roots become 16 and 18, and the flow table entry's out ports become 2, 3.]
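Figures 5 and 6 suggest the shape of the recovery step: on a fault, the fabric manager adds a root (core) that avoids the failed link and widens the affected flow table entries. The sketch below is a loose illustration of that bookkeeping; the data structures and the pick_new_root helper are hypothetical, not the fabric manager's actual logic.

# Loose illustration of the multicast recovery step implied by Figures 5 and 6.
multicast_state = {
    "01:5E:E1:00:00:24": {"subscribers": [0, 3, 6], "roots": [16]},
}
flow_tables = {
    # switch name -> {group MAC -> set of output ports}; names are hypothetical
    "agg_2": {"01:5E:E1:00:00:24": {3}},
}

def pick_new_root(group: str, failed_link: tuple) -> int:
    """Choose an additional core switch that avoids the failed link (stubbed)."""
    return 18

def on_fault(group: str, failed_link: tuple, affected_switch: str, detour_port: int) -> None:
    multicast_state[group]["roots"].append(pick_new_root(group, failed_link))
    flow_tables[affected_switch][group].add(detour_port)

on_fault("01:5E:E1:00:00:24", ("core_16", "agg_2"), affected_switch="agg_2", detour_port=2)
print(multicast_state)   # roots now [16, 18], as in Figure 6
print(flow_tables)       # out ports now {2, 3}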
Table 1: System comparison.

TRILL
  Topology: general. Switch state: O(number of global hosts). Addressing: flat; MAC-in-MAC encapsulation. Routing: switch broadcast. ARP: all switches map MAC address to remote switch. Loops: TRILL header with TTL. Multicast: ISIS extensions based on MOSPF.

SEATTLE
  Topology: general. Switch state: O(number of global hosts). Addressing: flat. Routing: switch broadcast. ARP: one-hop DHT. Loops: unicast loops possible. Multicast: new construct: groups.

PortLand
  Topology: multi-rooted tree. Switch state: O(number of local ports). Addressing: hierarchical. Routing: Location Discovery Protocol; fabric manager for faults. ARP: fabric manager. Loops: provably loop free; no additional header. Multicast: broadcast-free routing; multi-rooted spanning trees.

…the TTL and recalculate the CRC for every frame, adding complexity to the common case. SEATTLE admits routing loops for unicast traffic. It proposes a new "group" construct for broadcast/multicast traffic. Groups run over a single spanning tree, eliminating the possibility of loops for such traffic. PortLand's forwarding is provably loop free with no additional headers. It further provides native support for multicast and network-wide broadcast using an efficient fault-tolerance mechanism.

[Figure: implementation architecture. The fabric manager communicates with per-switch Local Switch Modules over the OpenFlow protocol; the modules run in user space and use Netlink IPC.]
Table 2 summarizes the state maintained locally at each switch as well as at the fabric manager. Here

  k = number of ports on the switches,
  m = number of local multicast groups,
  p = number of multicast groups active in the system.

State                  Switch      Fabric Manager
Connectivity Matrix    O(k^3/2)    O(k^3/2)
Multicast Flows        O(m)        O(p)
IP → PMAC mappings     O(k/2)      O(k^3/4)

Table 2: State requirements.
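To give a feel for these bounds, the following sketch instantiates the table's entries for k = 48; this is illustrative arithmetic on the asymptotic expressions, not measured state.

# Rough instantiation of Table 2's bounds for k = 48 (illustrative arithmetic only).
k = 48

connectivity_matrix_entries = k ** 3 // 2   # O(k^3/2): 55,296 entries
local_ip_to_pmac = k // 2                   # O(k/2): 24 mappings at an edge switch
fabric_manager_ip_to_pmac = k ** 3 // 4     # O(k^3/4): 27,648 mappings, one per host

print(connectivity_matrix_entries, local_ip_to_pmac, fabric_manager_ip_to_pmac)
# 55296 24 27648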
5. EVALUATION
In this section, we evaluate the efficiency and scalability of our implementation. We describe the experiments carried out on our system prototype and present measurements to characterize convergence and control overhead for both multicast and unicast communication in the presence of link failures. We ran all experiments on our testbed described in Section 4.

[Figure 9: TCP convergence. TCP sequence number versus time around a link failure and recovery, with RTO_min = 200 ms.]

[Figure: sequence number versus time (4.2-4.9 s), with failure and recovery points marked.]

[Figure: convergence time (ms).]
[Figure: control traffic (Mbps) versus number of hosts (128 to 30,128), for 25, 50, and 100 ARPs/sec/host.]

[Figure: cores required for 25, 50, and 100 ARPs/sec/host.]

[Figure: throughput (Mbps) versus time (0-45 s), showing the TCP flow transfer and the state transfer.]

VM Migration.
Finally, we evaluate PortLand's ability to support virtual machine migration. In this experiment, a sender transmits…
7. REFERENCES
[1] Cisco Data Center Infrastructure 2.5 Design Guide. www.cisco.com/application/pdf/en/us/guest/netsol/ns107/c649/ccmigration_09186a008073377d.pdf.
[2] Configuring IP Unicast Layer 3 Switching on Supervisor Engine 2. www.cisco.com/en/US/docs/routers/7600/ios/12.1E/configuration/guide/cef.html.
[3] Inside Microsoft's $550 Million Mega Data Centers. www.informationweek.com/news/hardware/data_centers/showArticle.jhtml?articleID=208403723.
[4] OpenFlow. www.openflowswitch.org/.
[5] OSPF Design Guide. www.ciscosystems.com/en/US/tech/tk365/technologies_white_paper09186a0080094e9e.shtml.
[6] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, pages 63-74, New York, NY, USA, 2008. ACM.
[7] M. Caesar, D. Caldwell, N. Feamster, J. Rexford, A. Shaikh, and J. van der Merwe. Design and Implementation of a Routing Control Platform. In USENIX Symposium on Networked Systems Design & Implementation, 2005.
[8] M. Caesar, M. Castro, E. B. Nightingale, G. O'Shea, and A. Rowstron. Virtual Ring Routing: Network Routing Inspired by DHTs. In Proceedings of ACM SIGCOMM, 2006.
[9] M. Caesar, T. Condie, J. Kannan, K. Lakshminarayanan, I. Stoica, and S. Shenker. ROFL: Routing on Flat Labels. In Proceedings of ACM SIGCOMM, 2006.
[10] C. Kim, M. Caesar, and J. Rexford. Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises. In SIGCOMM '08: Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, 2008.
[11] C. Clark, K. Fraser, S. Hand, J. G. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield. Live Migration of Virtual Machines. In USENIX Symposium on Networked Systems Design & Implementation, 2005.
[12] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04: Proceedings of the 6th Symposium on Operating Systems Design & Implementation, Berkeley, CA, USA, 2004. USENIX Association.
[13] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google File System. ACM SIGOPS Operating Systems Review, 37(5), 2003.
[14] A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. Towards a Next Generation Data Center Architecture: Scalability and Commoditization. In PRESTO '08: Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow, pages 57-62, New York, NY, USA, 2008. ACM.
[15] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In Proceedings of the ACM SIGCOMM 2008 Conference on Data Communication, pages 75-86, New York, NY, USA, 2008. ACM.
[16] C. Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. RFC 2992, Internet Engineering Task Force, 2000.
[17] K. Lakshminarayanan, M. Caesar, M. Rangan, T. Anderson, S. Shenker, I. Stoica, and H. Luo. Achieving Convergence-Free Routing Using Failure-Carrying Packets. In Proceedings of ACM SIGCOMM, 2007.
[18] C. E. Leiserson. Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing. IEEE Transactions on Computers, 34(10):892-901, 1985.
[19] J. W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman, and J. Luo. NetFPGA: An Open Platform for Gigabit-Rate Network Switching and Routing. In Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, pages 160-161, Washington, DC, USA, 2007. IEEE Computer Society.
[20] R. Moskowitz and P. Nikander. Host Identity Protocol (HIP) Architecture. RFC 4423 (Proposed Standard), 2006.
[21] J. Moy. OSPF Version 2. RFC 2328, Internet Engineering Task Force, 1998.
[22] A. Myers, T. S. E. Ng, and H. Zhang. Rethinking the Service Model: Scaling Ethernet to a Million Nodes. In ACM HotNets-III, 2004.
[23] LAN/MAN Standards Committee of the IEEE Computer Society. IEEE Standard for Local and Metropolitan Area Networks, Common Specifications Part 3: Media Access Control (MAC) Bridges, Amendment 2: Rapid Reconfiguration, June 2001.
[24] R. Perlman, D. Eastlake, D. G. Dutt, S. Gai, and A. Ghanwani. RBridges: Base Protocol Specification. Technical report, Internet Engineering Task Force, 2009.
[25] T. L. Rodeheffer, C. A. Thekkath, and D. C. Anderson. SmartBridge: A Scalable Bridge Architecture. In Proceedings of ACM SIGCOMM, 2001.
[26] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker. Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links. IEEE Journal on Selected Areas in Communications, 1991.
[27] M. Scott and J. Crowcroft. MOOSE: Addressing the Scalability of Ethernet. In EuroSys Poster Session, 2008.
[28] J. Touch and R. Perlman. Transparent Interconnection of Lots of Links (TRILL): Problem and Applicability Statement, 2009.

Appendix A: Loop-Free Proof
A fat-tree network topology has many physical loops, which can easily lead to forwarding loops given some combination of forwarding rules present in the switches. However, physical loops in data center networks are desirable and provide many benefits such as increased network bisection bandwidth and fault tolerance. Traditional Ethernet uses a minimum spanning tree to prevent forwarding loops at the cost of decreased bisection bandwidth and fault tolerance.

Here we show that fat trees can be constrained in such a way as to prevent forwarding loops, without requiring an explicit spanning tree. This constraint is simple, stateless, local to an individual switch, and uniform across all switches in the fat tree.

Constraint 1. A switch must never forward a packet out along an upward-facing port when the ingress port for that packet is also an upward-facing port.

Theorem 1. When all switches satisfy Constraint 1 (C1), a fat tree will contain no forwarding loops.

Proof. C1 prevents traffic from changing direction more than once. It imposes the logical notion of up-packets and down-packets. Up-packets may travel only upward through the tree, whereas down-packets may travel only downward. C1 effectively allows a switch to perform a one-time conversion of an up-packet to a down-packet. There is no provision for converting a down-packet to an up-packet. In order for a switch to receive the same packet from the same ingress port more than once, the packet would have to change its direction at least twice while routed through the tree topology. However, this is not possible since there is no mechanism for converting a down-packet to an up-packet, something that would be required for at least one of these changes in direction.
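Constraint 1 is purely local, so it can be phrased as a one-line admission check at each switch. The sketch below illustrates the constraint as stated; the port-direction map is assumed to come from LDP, and the code is not PortLand's switch implementation.

# Illustrative check of Constraint 1 (C1): never forward from an upward-facing
# ingress port to an upward-facing egress port. Port directions are assumed to
# have been established by LDP; the "up"/"down" labels here are hypothetical.

def allowed_by_c1(port_dir: dict, ingress_port: int, egress_port: int) -> bool:
    return not (port_dir[ingress_port] == "up" and port_dir[egress_port] == "up")

# Example: an aggregation switch with two upward-facing and two downward-facing ports.
port_dir = {0: "up", 1: "up", 2: "down", 3: "down"}

print(allowed_by_c1(port_dir, ingress_port=2, egress_port=0))  # True: one-time up conversion
print(allowed_by_c1(port_dir, ingress_port=0, egress_port=2))  # True: packet turns downward
print(allowed_by_c1(port_dir, ingress_port=0, egress_port=1))  # False: would send a packet back up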