Advanced Architectures
* Minimal update, since this part is not used for lectures in ECE 154 at UCSB
[Figure: Evolution of Intel's x86 processor family, from the 8-bit 8008, 8080, and 8085, through the 16-bit 8086/8088, 80186/80188, and 80286, to the 32-bit 80386, 80486, Pentium/MMX, Pentium Pro/II, Pentium III/M, and Celeron.]
[Table (partial): architectural methods for higher performance, with typical improvement factors.
  1. Pipelining (and superpipelining)    3-8    established method, previously discussed (√)
  ...
  6. Multithreading (super-, hyper-)     2-5    newer method, covered in Part VII (?)]
[Plot: peak supercomputer performance (GFLOPS) versus calendar year, 1980-2010, improving roughly 10-fold every 5 years; marked systems include the Cray 2, ASCI White Pacific, and the Earth Simulator.]
Dongarra, J., “Trends in High Performance Computing,”
Computer J., Vol. 47, No. 4, pp. 399-403, 2004. [Dong04]
[Plot: performance per watt of general-purpose processors and DSPs (kIPS to MIPS range) versus calendar year, 1980-2010.]
Figure 25.1 Trend in energy consumption for each MIPS of computational power in general-purpose processors and DSPs.
25.2 Performance-Driven ISA Extensions
[Figure: Examples of subword-parallel (packed) operations provided by multimedia ISA extensions: a parallel add of packed subwords (e.g., adding the fields s, t, u, v element by element), and saturating arithmetic in which an overflowing result clamps at the all-1s maximum (65,535 for 16-bit fields, 255 for 8-bit fields) or at 0 rather than wrapping around.]
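The packed, saturating operations above can be emulated in software; the following is a minimal Python/NumPy sketch (with made-up operand values) of the saturating 8-bit packed add that such extensions provide as a single instruction on subwords packed into one register.

    import numpy as np

    # Sketch: emulate a packed, saturating 8-bit add (one MMX-style instruction)
    def packed_add_saturate_u8(a, b):
        a = np.asarray(a, dtype=np.uint16)        # widen so the sum cannot wrap
        b = np.asarray(b, dtype=np.uint16)
        return np.minimum(a + b, 255).astype(np.uint8)   # clamp at the all-1s maximum

    print(packed_add_saturate_u8([200, 14, 3, 58], [100, 3, 5, 66]))
    # -> [255  17   8 124]: 200 + 100 saturates at 255 instead of wrapping to 44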
[Plots: (a) fraction of cycles (0-20%) versus number of issuable instructions per cycle (0-8); (b) speedup attained (1-2) versus instruction issue width (0-8).]
[Figure: Register and execution resources of an architecture with explicit parallelism support (e.g., Intel/HP IA-64): 128 general registers, 128 floating-point registers, 64 predicate registers, memory, and two banks of multiple execution units.]
[Figure: Hardware memoization: a memo table holds previously computed results of a multiply/divide unit; on a hit, a multiplexer selects the table output, while on a miss the function unit computes the result and signals Done.]
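The memo-table idea is hardware memoization; a minimal software analogue (a sketch, not the hardware design itself) looks like this:

    # Software analogue of the memo table: results of a slow operation are cached,
    # keyed on the operand pair; a hit bypasses the function unit, a miss computes
    # the result and fills the table.
    memo = {}

    def memoized_divide(a, b):
        if (a, b) in memo:           # hit: the mux selects the table output
            return memo[(a, b)]
        result = a / b               # miss: the multiply/divide unit does the work
        memo[(a, b)] = result
        return result

    print(memoized_divide(355, 113))   # computed and stored
    print(memoized_divide(355, 113))   # served from the memo table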
[Figure: An FPGA-like unit on which accelerators (Accel. 1, 2, 3) can be formed via loading of configuration registers; unused resources remain available.]
[Figure: Processing elements (PEs 12-15 shown), each paired with a column memory, with a feedback path and a global memory stream.]
[Figure: Flynn's categories (single vs. multiple instruction and data streams) with Johnson's expansion of the multiple-stream class by memory organization: SISD = uniprocessors, SIMD = array or vector processors, GMSV = shared-memory multiprocessors (global memory, shared variables), GMMP = rarely used; distributed-memory variants complete the classification.]
[Figures: (left) a shared-memory multiprocessor: processors 0 to p-1 connected by a processor-to-processor network and, through a processor-to-memory network, to memory modules 0 to m-1, with parallel I/O and data in/out; (right) a distributed-memory machine: computing nodes 0 to p-1, each with parallel input/output, linked by an interconnection network.]
[Plot: speedup s versus enhancement factor p (0 to 50) for f = 0.01, 0.02, 0.05, and 0.1; each curve saturates near min(p, 1/f).]
Speedup when a fraction f of the task is unaffected and the rest is sped up p times (e.g., with p processors):
    s = 1 / (f + (1 – f)/p) ≤ min(p, 1/f)
Figure 4.4 Amdahl's law: speedup achieved if a fraction f of a task is unaffected and the remaining 1 – f part runs p times as fast.
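As a quick check of the formula, this small Python sketch evaluates the speedup for the fractions plotted in Figure 4.4, at an assumed enhancement factor of p = 50:

    # Amdahl's law as stated above: fraction f is unaffected, the rest speeds up by p.
    def amdahl_speedup(f, p):
        return 1.0 / (f + (1.0 - f) / p)

    for f in (0.01, 0.02, 0.05, 0.1):
        s = amdahl_speedup(f, 50)
        print(f"f = {f:4}: s = {s:5.1f}  (bound min(p, 1/f) = {min(50, 1/f):.0f})")
    # Even with p = 50, f = 0.1 caps the speedup below 1/f = 10 (s is about 8.5).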
Vectorizable loop:
    for i = 0 to 63 do
        P[i] := W[i] × D[i]
    endfor
Vector-instruction equivalent: load W; load D; P := W × D; store P

Unparallelizable loop (each iteration depends on the previous one):
    for i = 0 to 63 do
        X[i+1] := X[i] + Z[i]
        Y[i+1] := X[i+1] + Y[i]
    endfor
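A sketch of the two loops in Python/NumPy (with hypothetical data): the first collapses into a single vector expression, mirroring the load/multiply/store vector instruction sequence; the second must remain a serial loop because of its loop-carried dependence.

    import numpy as np

    W = np.random.rand(64)
    D = np.random.rand(64)
    P = W * D                    # vectorizable: load W, load D, P := W * D, store P

    X = np.zeros(65)
    Y = np.zeros(65)
    Z = np.random.rand(64)
    for i in range(64):          # unparallelizable: X[i+1] needs X[i] from the prior iteration
        X[i + 1] = X[i] + Z[i]
        Y[i + 1] = X[i + 1] + Y[i]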
[Figure: load units A and B and a store unit connect memory to a vector register file, which feeds pipelined function units through forwarding muxes.]
Figure 26.1 Simplified generic structure of a vector processor.
Conflict-Free Memory Access
(a) Conventional row-major order (element (i, j) in bank j):

  Bank:  0      1      2     ...   62      63
         0,0    0,1    0,2   ...   0,62    0,63
         1,0    1,1    1,2   ...   1,62    1,63
         2,0    2,1    2,2   ...   2,62    2,63
         ...
         62,0   62,1   62,2  ...   62,62   62,63
         63,0   63,1   63,2  ...   63,62   63,63

(b) Skewed row-major order (element (i, j) in bank (i + j) mod 64):

  Bank:  0      1      2     ...   62      63
         0,0    0,1    0,2   ...   0,62    0,63
         1,63   1,0    1,1   ...   1,61    1,62
         2,62   2,63   2,0   ...   2,60    2,61
         ...
         62,2   62,3   62,4  ...   62,0    62,1
         63,1   63,2   63,3  ...   63,63   63,0
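With the skew of panel (b), element (i, j) sits in bank (i + j) mod 64, so accessing a whole row or a whole column touches every bank exactly once. A short Python sketch verifying this:

    B = 64                                   # number of memory banks

    def bank(i, j):
        return (i + j) % B                   # skewed row-major placement from panel (b)

    row_banks    = [bank(7, j) for j in range(B)]   # access all of row 7
    column_banks = [bank(i, 5) for i in range(B)]   # access all of column 5
    assert sorted(row_banks) == list(range(B))      # conflict-free: each bank used once
    assert sorted(column_banks) == list(range(B))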
[Figure: timeline of vector register usage (vector registers 0-5) for loading X, loading Y, computing, and storing Z, overlapped in time.]
[Diagram contrasting the multiplication and addition pipeline start-ups and their overlap, without and with chaining.]
Figure 26.4 Total latency of the vector computation S := X × Y + Z, without and with pipeline chaining.
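A rough latency model for this computation (a sketch with assumed start-up values, not numbers from the slide): without chaining, the add pipeline cannot start until all n products have been written back; with chaining, it starts as soon as the first product emerges from the multiply pipeline.

    def vector_latency(n, mul_startup, add_startup, chained):
        if chained:
            return mul_startup + add_startup + n        # add consumes products as they appear
        return (mul_startup + n) + (add_startup + n)    # add waits for the full product vector

    print(vector_latency(64, mul_startup=7, add_startup=6, chained=False))  # 141 cycles
    print(vector_latency(64, mul_startup=7, add_startup=6, chained=True))   # 77 cycles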
[Plot: per-element execution time versus vector length, 0 to 400.]
Figure 26.5 The per-element execution time in a vector processor as a function of the vector length.
[Figure: control organizations for an array of processing elements: (a) shared-control array processor, SIMD; (b) multiple shared controls, MSIMD; (c) separate controls, MIMD.]
[Figure 26.7: array processor with control broadcast, parallel I/O, and configuration switches.]
[Diagram of the In/Out connections for each switch state.]
Figure 26.9 I/O switch states in the array processor of Figure 26.7.
[Figure: processors 0 to p-1 connected by a processor-to-processor network and, through a processor-to-memory network, to memory modules 0 to m-1, with parallel I/O.]
Figure 27.1 Structure of a multiprocessor with centralized shared memory.
[Figure: example of interleaved memory-bank addressing; individual banks hold addresses 2, 6, 10, ..., 30; 3, 7, 11, ..., 31; 226, 230, ..., 254; and 227, 231, ..., 255.]
[Figure: the cache coherence problem: copies of shared variables w, x, y, and z held in processor caches and in memory may be consistent, singly inconsistent (one cache holds the only up-to-date copy), or invalid.]
[Figure: a dual-ported cache with tag and state arrays sits between a processor-side cache controller and a snoop-side cache controller; the snoop side watches commands and addresses on the system bus through buffers and tag comparators.]
Figure 27.7 Main structure for a snoop-based cache coherence algorithm.
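As one simplified possibility for what the snoop-side controller does, here is a sketch of MSI-style state transitions taken by a cache line when another processor's bus transaction is observed; the state names and bus commands are the textbook MSI ones, not necessarily the exact protocol behind Figure 27.7.

    INVALID, SHARED, MODIFIED = "I", "S", "M"

    def snoop_transition(state, bus_op):
        """Next state of the local copy when another cache's request appears on the bus."""
        if bus_op == "BusRd":                       # another processor reads the block
            # a Modified copy must be supplied to the requester and demoted to Shared
            return SHARED if state == MODIFIED else state
        if bus_op == "BusRdX":                      # another processor wants exclusive access
            return INVALID                          # any local copy is invalidated
        return state

    print(snoop_transition(MODIFIED, "BusRd"))      # -> S
    print(snoop_transition(SHARED, "BusRdX"))       # -> I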
[Figure: a processor executes "while z = 0 do x := x + y endwhile" on shared variables x (initially 0), y (initially 1), and z (initially 0) held in memory modules across the interconnection network.]
[Figure: processors 0 to p-1 with parallel input/output, linked by an interconnection network.]
[Figure: computing nodes (typically 1-4 CPUs and associated memory each), Node 0 through Node 3, linked by a ring network over the Scalable Coherent Interface (SCI), with connections to I/O controllers and to the interconnection network.]
[Figure: a distributed-memory machine: computing nodes 0 to p-1, each with parallel input/output, connected by an interconnection network.]
[Figure: a generic router/switch node: input channels and output channels, each with a link controller (LC) and a buffer queue (Q), connected through an internal switch.]
[Figure: timeline of a blocking message-passing interaction: process B reaches "receive x" and is suspended; process A later executes "send x"; after the communication latency, process B is awakened with the message.]
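The suspend/awaken behavior above can be mimicked with ordinary blocking message passing; this sketch uses Python's multiprocessing pipes purely as an illustration (the slides assume hardware message passing between nodes).

    from multiprocessing import Pipe, Process

    def process_b(conn):
        x = conn.recv()              # "receive x": B is suspended until the message arrives
        print("Process B awakened with", x)

    if __name__ == "__main__":
        a_end, b_end = Pipe()
        b = Process(target=process_b, args=(b_end,))
        b.start()
        a_end.send(42)               # "send x" in process A
        b.join()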
[Figure: a hierarchy of buses: level-1, level-2, and level-3 buses.]
[Figure: a message is divided into packets (first packet through last packet, with padding as needed); a transmitted packet consists of a header, data or payload, and a trailer, and is subdivided into flow control digits (flits).]
[Figure: wormhole routing: (a) two worms en route from their sources to their respective destinations, Worm 1 moving and Worm 2 blocked; (b) deadlock due to circular waiting of four blocked worms.]
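The deadlock in panel (b) is a circular wait: each blocked worm holds channels that the next worm needs. A tiny sketch (with hypothetical worm names) that detects such a cycle in a waits-for relation:

    def has_circular_wait(waits_for):
        """Return True if following the waits-for edges from some worm leads back to it."""
        for start in waits_for:
            seen, node = set(), start
            while node in waits_for and node not in seen:
                seen.add(node)
                node = waits_for[node]
            if node == start:
                return True
        return False

    # Panel (b): four blocked worms, each waiting on a channel held by the next.
    print(has_circular_wait({"worm1": "worm2", "worm2": "worm3",
                             "worm3": "worm4", "worm4": "worm1"}))   # True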
[Figure: building a parallel machine from commodity modules: each module holds CPU(s), memory, and disks (one variant with expansion slots, another with wireless connection surfaces); modules attach through the system or PC I/O bus to a fast network interface (NIC) with large memory, and are joined by a network built of high-speed wormhole switches.]
Still to be worked out as of the late 2000s: how to charge for compute usage.
Managing and upgrading are much more efficient in large, centralized facilities (warehouse-sized data centers or server farms).
[Images from Wikipedia and from IEEE Spectrum, June 2009.]