Documente Academic
Documente Profesional
Documente Cultură
Network
rrr
Mem
$ P
Node: processor(s), memory system, plus communication assist: Network interface and communication controller.
Scalable network.
Function of a parallel machine network is to efficiently transfer information from source node to destination node in support of network transactions that realize the programming model. Network performance should scale up as its size is increased. Latency grows slowly with network size. Total available bandwidth scales up with network size. Network cost/complexity should grow slowly in terms of network size.
EECC756 - Shaaban
#1 lec # 9 Spring2003 4-17-2003
CA M P M
CA P
EECC756 - Shaaban
#2 lec # 9 Spring2003 4-17-2003
Cost of Communication
Given amount of comm (inherent or artifactual), goal is to reduce cost
EECC756 - Shaaban
#3 lec # 9 Spring2003 4-17-2003
Network Characteristics
Topology:
Physical interconnection structure of the network graph:
Node Degree: Number of channels per node. Network diameter: Longest minimum routing distance between any two nodes in hops. Average Distance between all pairs of nodes . Bisection width: Minimum number of links whose removal disconnects the graph and cuts it in half. Symmetry: The property that the network looks the same from every node. Homogeneity: Whether all the nodes and links are identical or not.
Type of interconnection:
Static or Direct Interconnects: Nodes connected directly using static pointto-point links. Dynamic or Indirect Interconnects: Switches are usually used to realize dynamic links between nodes: Each node is connected to specific subset of switches. (e.g Multistage Interconnection Networks, MINs). Blocking or non-blocking, permutations realized. Shared-, broadcast-, or bus-based connections. (e.g. Ethernet-based).
EECC756 - Shaaban
#5 lec # 9 Spring2003 4-17-2003
Network Characteristics
Routing Algorithm and Functions:
The set of paths that messages may follow.
Deterministic Routing: The route taken by a message determined by source and destination regardless of other traffic in the network. Adaptive Routing: One of multiple routes from source to destination selected to account for other traffic to reduce node/link contention.
Switching Strategy:
Circuit switching vs. packet switching.
EECC756 - Shaaban
#6 lec # 9 Spring2003 4-17-2003
Network Characteristics
Hardware/software implementation complexity/cost. Network throughput: Total number of messages handled by network per unit time. Aggregate Network bandwidth: Similar to network throughput but given in total bytes/sec. Network hot spots: Form in a network when a small number of network nodes/links handle a very large percentage of total network traffic and become saturated. Network scalability: The feasibility of increasing network size, determined by:
Performance scalability: Relationship between network size in terms of number of nodes and the resulting network performance (average latency, aggregate network bandwidth). Cost scalability: Relationship between network size in terms of number of nodes/links and network cost/complexity.
EECC756 - Shaaban
#7 lec # 9 Spring2003 4-17-2003
Network Latency
Time to transfer n bytes from source to destination: Time(n)s-d = overhead + routing delay + channel occupancy + contention delay
Unloaded Network Latency = routing delay + channel occupancy
channel occupancy = (n + ne) / b b = channel bandwidth, bytes/sec n = payload size ne = packet envelope: header, trailer. Effective link bandwidth = bn / (n + ne) The term for unloaded network latency is refined next by examining
the impact of flow control mechanism used in the network EECC756 - Shaaban
#8 lec # 9 Spring2003 4-17-2003
-T h ro u g h R o u ti n g D es t
Unloaded network latency for n byte packet: h(n/b + () vs n/b + h ( h = distance in hops ( = switch delay
EECC756 - Shaaban
#9 lec # 9 Spring2003 4-17-2003
Network Latency
For an unloaded network (no contention delay) the network latency to transfer an n byte packet (including packet envelope) across the network:
Unloaded Network Latency = routing delay + channel occupancy
Factors affecting effective local link bandwidth available to a single node: Accounting for Packet density b x n/(n + ne) Also Accounting for Routing delay b x n / (n + ne + w() Contention: Routing delay At endpoints. Within the network. Factors affecting throughput or Aggregate bandwidth: Network bisection bandwidth: Sum of bandwidth of smallest set of links when removed partition the network into two unconnected networks of equal size. Total bandwidth of all the C channels: Cb bytes/sec, Cw bits per cycle or C phits per cycle. Suppose N hosts each issue a packet every M cycles with average routing distance h and average distribution: Each message occupies h channels for l = n/w cycles Total network load = Nhl / M phits per cycle. Average Link utilization = Total network load / Total bandwidth Average Link utilization: V = Nhl /MC < 1
Available Bandwidth
EECC756 - Shaaban
#12 lec # 9 Spring2003 4-17-2003
Network Saturation
80 0.
Delivered Bandwidth
70 60
0. 0. 0. 0. 0. 0. 0.
Latency
50 40
Saturation
30 20 10 0 0 0.2 0.4 0.6 0.8 1
0.
0.
0.
0.
Delivered Bandwidth
Offered Bandwidth
Link utilization =1
Saturation
Two packets trying to use the same link at same time. May be caused by limited available buffering. Possible resolutions:
Increased buffer space. Drop one or more packets. Use an alternative route (requires an adaptive routing algorithm or a better static routing to distribute load more evenly).
Most networks used in parallel machines block in place Link-level flow control. Back pressure to the source to slow down flow of data. Closed system: Offered load depends on delivered.
EECC756 - Shaaban
#14 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#15 lec # 9 Spring2003 4-17-2003
Binary Tree
Fully Connected
EECC756 - Shaaban
#16 lec # 9 Spring2003 4-17-2003
Static Point-to-point Connection Network Topologies Point-to Direct point-to-point links are used. Suitable for predictable communication patterns matching topology. Fully Connected Network: Every node is connected to all other nodes using N- 1 direct links
N(N-1)/2 Links -> O(N2) complexity Node Degree: N -1 Diameter = 1 Average Distance = 1 Bisection Width = (N/2)2
Linear Array:
N-1 Links -> O(N) complexity Node Degree: 1-2 Diameter = N -1 Average Distance = 2/3N Bisection Width = 1
Ring:
N Links -> O(N) complexity Node Degree: 2 Diameter = N/2 Average Distance = 1/3N Bisection Width = 2
Examples: Token-Ring, FDDI, SCI (Dolphin interconnects SAN), FiberChannel Arbitrated Loop, KSR1
EECC756 - Shaaban
#17 lec # 9 Spring2003 4-17-2003
1D torus
2D mesh
2D torus
d-dimensional array or mesh: N = kd-1 X ...X kO nodes described by d-vector of coordinates (id-1, ..., iO) Where 0
e ij e kj -1 for 0 e j e d-1
d-dimensional k-ary mesh: N = kd k = dN described by d-vector of radix k coordinate. Diameter = d(k-1) d-dimensional k-ary torus (or k-ary d-cube): Edges wrap around, every node has degree 2d and connected to nodes that differ by one (mod k) in every dimension.
EECC756 - Shaaban
#18 lec # 9 Spring2003 4-17-2003
Average Distance: d x 2k/3 for mesh. dk/2 for cube or torus. Degree: d to 2d for mesh. 2d for cube or torus. Bisection width: k d-1 bi-directional links when k is even.
EECC756 - Shaaban
#19 lec # 9 Spring2003 4-17-2003
2D Mesh
(2-dimensional k-ary mesh)
For an k x k 2D Mesh:
Node Degree: 2-4 Network diameter: 2(k-1) No of links: 2N - 2k Bisection Width: k Where k = N Example: 1824 node Intel Paragon: 16 x 114 2D mesh
EECC756 - Shaaban
#20 lec # 9 Spring2003 4-17-2003
Hypercubes
Also called binary n-cubes (n-ary 2-cube) Dimension = n = log2N Number of nodes = N = 2n Diameter: O(log2N) hops = n= Dimension Good bisection BW: N/2 Complexity:
Number of links: N(log2N)/2 Node degree is n = log2N
5-D
0-D 1-D 2-D 3-D 4-D
EECC756 - Shaaban
#21 lec # 9 Spring2003 4-17-2003
110 010
111 011
100 000
101 001
000
001
010
011
100
101
110
111
000
001
010
011
100
101
110
111
000
001
010
011
100
101
110
111
EECC756 - Shaaban
#22 lec # 9 Spring2003 4-17-2003
Trees
Binary Tree
Diameter and average distance are logarithmic. k-ary tree, height d = logk N Address specified d-vector of radix k coordinates describing path down from root. Fixed degree k. Route up to common ancestor and down: R = B XOR A Let i be position of most significant 1 in R, route up i+1 levels Down in direction given by low i+1 bits of B H-tree space is O(N) with O(N) long wires. Low Bisection BW = 1
EECC756 - Shaaban
#23 lec # 9 Spring2003 4-17-2003
FatFat-Trees
Fat T
Fatter higher bandwidth links (more connections in reality) as you go up, so bisection BW scales with number of nodes N.
6x3x2
Embed multiple logical dimension in one physical dimension using long interconnections.
EECC756 - Shaaban
#25 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#26 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#27 lec # 9 Spring2003 4-17-2003
Single-Stage networks: Crossbar switches are single-stage, strictly nonblocking, and can implement not only the N! permutations, but also the NN combinations of non-overlapping broadcast.
EECC756 - Shaaban
#29 lec # 9 Spring2003 4-17-2003
CrossbarCrossbar-Based Switches
Receiver Input Ports Input Buffer Output Buffer Transmiter Output Ports Cross-bar
EECC756 - Shaaban
#30 lec # 9 Spring2003 4-17-2003
Switch Components
Output ports:
Transmitter (typically drives clock and data).
Input ports:
Synchronizer aligns data signal with local clock domain. FIFO buffer.
Crossbar:
Switch fabric connecting each input to any output. Feasible degree limited by area or pinout, O(n2) complexity.
4 256 16,777,216 nn
2 24 40,320 n!
EECC756 - Shaaban
#32 lec # 9 Spring2003 4-17-2003
Permutations
For n objects there are n! permutations by which the n objects can be reordered. The set of all permutations form a permutation group with respect to a composition operation. One can use cycle notation to specify a permutation function. For Example: The permutation T = ( a, b, c)( d, e) stands for the bijection mapping: a p b, b p c , c p a , d p e , e p d in a circular fashion. The cycle ( a, b, c) has a period of 3 and the cycle (d, e) has a period of 2. Combining the two cycles, the permutation T has a cycle period of 2 x 3 = 6. If one applies the permutation T six times, the identity mapping I = ( a) ( b) ( c) ( d) ( e) is obtained.
EECC756 - Shaaban
#33 lec # 9 Spring2003 4-17-2003
Perfect shuffle is a special permutation function suggested by Harold Stone (1971) for parallel processing applications. Obtained by rotating the binary address of an one position left. The perfect shuffle and its inverse for 8 objects are shown here:
Perfect Shuffle
000 001 010 011 100 101 110 111 000 001 010 011 100 101 110
000 001 010 011 100 101 110 111 Perfect Shuffle
EECC756 - Shaaban
#34 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#35 lec # 9 Spring2003 4-17-2003
MultiMulti-Stage Networks:
EECC756 - Shaaban
#36 lec # 9 Spring2003 4-17-2003
MultiMulti-Stage Networks:
EECC756 - Shaaban
#37 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#38 lec # 9 Spring2003 4-17-2003
4
0 1 0 1
3
0
0
1 0
1
1
2 1 0
Building block
Complexity: N/2 x logN (# of switches in each stage x # of stages) Exactly one route from any source to any destination node. R = A XOR B, at level i use straight edge if ri=0, otherwise cross edge Bisection N/2 Diameter logN
EECC756 - Shaaban
#39 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#41 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#42 lec # 9 Spring2003 4-17-2003
Switches operate at 75 MHz. Framed into 1-word and 8-word read/write request packets. EECC756 - Shaaban
#43 lec # 9 Spring2003 4-17-2003
EECC756 - Shaaban
#44 lec # 9 Spring2003 4-17-2003
150
100
50
EECC756 - Shaaban
#45 lec # 9 Spring2003 4-17-2003
N-node hypercube has N bisection links. 2d torus has 2N 1/2 Fixed bisection => w(d) = N 1/d / 2 = k/2 not shown: 1 M nodes, d=2 has w=512 And avg. 1023 hops.
Dim
sio
n = 40 bytes,
(=2
EECC756 - Shaaban
#46 lec # 9 Spring2003 4-17-2003