
Introduction:

Modern computer architecture

The stored program computer and its inherent bottlenecks


Multi- and manycore chips and nodes
Introduction: Moore’s law

1965: G. Moore claimed that the number of transistors on a "microchip" doubles every 12-24 months.

Transistor counts of recent chips:
Intel Sandy Bridge EP: 2.3 billion
Nvidia Kepler: 7 billion
Intel Broadwell: 7.2 billion
Nvidia Pascal: 15 billion



Multi-core today: Intel Xeon 2600v3 (2014)

Xeon E5-2600v3 "Haswell EP":
- Up to 18 cores running at 2+ GHz (+ "Turbo Mode": 3.5+ GHz)
- Simultaneous Multithreading: reports as 36-way chip
- 5.7 billion transistors @ 22 nm
- Die size: 662 mm2
- Optional: "Cluster on Die" (CoD) mode

[Figure: two such chips in a 2-socket server]



A deeper dive into core and chip architecture
General-purpose cache-based microprocessor core

Modern CPU core / stored-program computer:
- Implements the "Stored Program Computer" concept (Turing 1936)
- Similar designs on all modern systems
- (Still) multiple potential bottlenecks
- Flexible!



Basic resources on a stored program computer
Instruction execution and data movement

1. Instruction execution
This is the primary resource of the processor. All efforts in hardware design
are targeted towards increasing the instruction throughput.

Instructions are the concept of “work” as seen by processor designers.


Not all instructions count as “work” as seen by application developers!

Example: Adding two arrays A(:) and B(:)

  do i=1, N
    A(i) = A(i) + B(i)
  enddo

Processor work (per iteration):
  LOAD r1 = A(i)
  LOAD r2 = B(i)
  ADD r1 = r1 + r2
  STORE A(i) = r1
  INCREMENT i
  BRANCH -> top if i<N

User work:
  N Flops (ADDs)


Basic resources on a stored program computer
Instruction execution and data movement

2. Data transfer
Data transfers are a consequence of instruction execution and therefore a
secondary resource. Maximum bandwidth is determined by the request rate
of executed instructions and technical limitations (bus width, speed).

Example: Adding two arrays A(:) and B(:)

  do i=1, N
    A(i) = A(i) + B(i)
  enddo

Data transfers (per iteration):
  8 byte: LOAD r1 = A(i)
  8 byte: LOAD r2 = B(i)
  8 byte: STORE A(i) = r1
  Sum: 24 byte

Crucial question: What is the bottleneck?
- Data transfer?
- Code execution?
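A back-of-the-envelope estimate (with assumed, ballpark machine numbers, not figures taken from this slide): each iteration moves 24 bytes and performs 1 Flop, i.e., the code needs 24 bytes per Flop. A chip with, say, 500 GFlop/s peak and 50 GB/s memory bandwidth can only supply about 0.1 bytes per Flop, so for data in main memory this loop is limited by data transfer, not by code execution.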
From high level code to actual execution

  sum=0.d0
  do i=1, N
    sum=sum + A(i)
  enddo

Compiler (naive) output, annotated:
- Base address of A(i)
- sum in register xmm1
- ADDSD: add 1st argument to 2nd argument and store result in 2nd argument
- i in register rax, N in register rdx
- Register increment, compare register contents
- Jump back to the loop label if the loop continues


Microprocessors – Pipelining
Pipelining of arithmetic/functional units

- Idea:
  - Split complex instruction into several simple / fast steps (stages)
  - Each step takes the same amount of time, e.g., a single cycle
  - Execute different steps on different instructions at the same time (in parallel)

- Allows for shorter cycle times (simpler logic circuits), e.g.:
  - floating point multiplication takes 5 cycles, but
  - the processor can work on 5 different multiplications simultaneously
  - one result per cycle after the pipeline is full

- Drawbacks:
  - Pipeline must be filled -> startup times (#instructions >> pipeline stages)
  - Efficient use of pipelines requires a large number of independent instructions -> instruction-level parallelism (see the sketch after this list)
  - Requires complex instruction scheduling by compiler/hardware -> software pipelining / out-of-order execution

- Pipelining is widely used in modern computer architectures
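To make the "independent instructions" requirement concrete, here is a minimal C sketch (not from the slides; it reuses the summation loop shown earlier and assumes a floating-point add pipeline with a latency of a few cycles). With a single accumulator each add depends on the previous one, so the pipeline drains; several independent partial sums keep it full:

  /* naive: each add depends on the previous one ->
     at most one add per pipeline latency */
  double sum_naive(const double *a, long n)
  {
      double s = 0.0;
      for (long i = 0; i < n; ++i)
          s += a[i];
      return s;
  }

  /* four independent partial sums -> four adds can be "in flight"
     in the pipeline at the same time (instruction-level parallelism);
     assumes n is divisible by 4 for brevity */
  double sum_ilp(const double *a, long n)
  {
      double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
      for (long i = 0; i < n; i += 4) {
          s0 += a[i];
          s1 += a[i + 1];
          s2 += a[i + 2];
          s3 += a[i + 3];
      }
      return (s0 + s1) + (s2 + s3);
  }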



5-stage multiplication pipeline: A(i)=B(i)*C(i); i=1,...,N

[Figure: the five pipeline stages filling and draining over successive cycles]

First result is available after 5 cycles (= latency of the pipeline)!
Wind-up/wind-down phases: empty pipeline stages
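A rule of thumb that follows from this picture: a pipeline with m stages needs roughly N + m - 1 cycles for N independent operations. With m = 5, N = 1000 gives 1004 cycles (essentially one result per cycle), while N = 4 gives 8 cycles (only 0.5 results per cycle); this is the price of the wind-up/wind-down phases.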
Pipelining: The Instruction pipeline

- Besides the arithmetic/functional units, instruction execution itself is also pipelined, e.g., one instruction performs at least 3 steps:
  Fetch instruction from L1I -> Decode instruction -> Execute instruction

  Cycle 1: Fetch instruction 1
  Cycle 2: Fetch instruction 2 | Decode instruction 1
  Cycle 3: Fetch instruction 3 | Decode instruction 2 | Execute instruction 1
  Cycle 4: Fetch instruction 4 | Decode instruction 3 | Execute instruction 2

- Branches can stall this pipeline! (Speculative execution, predication)
- Each unit is pipelined itself (e.g., Execute = multiply pipeline)



Microprocessors – Superscalarity and
Simultaneous Multithreading
Superscalar Processors – Instruction Level Parallelism

- Multiple units enable the use of Instruction Level Parallelism (ILP):
  the instruction stream is "parallelized" on the fly

  Cycle 1: Fetch instructions 1-4
  Cycle 2: Fetch instructions 5-8   | Decode instructions 1-4
  Cycle 3: Fetch instructions 9-12  | Decode instructions 5-8  | Execute instructions 1-4
  Cycle 4: Fetch instructions 13-16 | Decode instructions 9-12 | Execute instructions 5-8

  "4-way superscalar": four instructions enter each pipeline stage per cycle

- Issuing m concurrent instructions per cycle: m-way superscalar

- Modern processors are 3- to 6-way superscalar and
  can perform 2 floating point instructions per cycle



Superscalar processors – executing multiple instructions concurrently

Instruction execution on three pipelined units with the latencies
LOAD: 4 cy, ADD: 3 cy, STORE: 2 cy, for the loop

  for(int i=1; i<n; ++i)
      a[i] = a[i] + s;

Cycle  1: load a[1]
Cycle  2: load a[2]
Cycle  3: load a[3]
Cycle  4: load a[4]
Cycle  5: load a[5]    add a[1]=s,a[1]
Cycle  6: load a[6]    add a[2]=s,a[2]
Cycle  7: load a[7]    add a[3]=s,a[3]
Cycle  8: load a[8]    add a[4]=s,a[4]    store a[1]
Cycle  9: load a[9]    add a[5]=s,a[5]    store a[2]
Cycle 10: load a[10]   add a[6]=s,a[6]    store a[3]
Cycle 11: load a[11]   add a[7]=s,a[7]    store a[4]
Cycle 12: load a[12]   add a[8]=s,a[8]    store a[5]
Cycle 13: load a[13]   add a[9]=s,a[9]    store a[6]
Cycle 14: load a[14]   add a[10]=s,a[10]  store a[7]
Cycle 15: load a[15]   add a[11]=s,a[11]  store a[8]
Cycle 16: load a[16]   add a[12]=s,a[12]  store a[9]
...

"Steady state": 3 instructions per cycle ("3-way superscalar")
Instructions Per Cycle: IPC = 3
Cycles Per Instruction: CPI = 0.33

Correct interleaving / reordering of the instruction streams:
Out-Of-Order (OOO) execution



Core details: Simultaneous multi-threading (SMT)

SMT principle (2-way example):

[Figure: pipeline utilization of a standard core vs. a 2-way SMT core]



Microprocessors –
Single Instruction Multiple Data (SIMD)
a.k.a. vectorization
Core details: SIMD processing

- Single Instruction Multiple Data (SIMD) operations allow the concurrent execution of the same operation on "wide" registers
- x86 SIMD instruction sets:
  - SSE: register width = 128 bit -> 2 double precision floating point operands
  - AVX: register width = 256 bit -> 4 double precision floating point operands
- Example: adding two registers holding double precision floating point operands (see the intrinsics sketch below)

  Scalar execution:                SIMD execution (256 bit):
  R2 = ADD [R0, R1]                R2 = V64ADD [R0, R1]
  one 64-bit add:                  four 64-bit adds at once:
  C[0] = A[0] + B[0]               C[0..3] = A[0..3] + B[0..3]
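A hedged illustration of the 256-bit AVX case in C (not from the slides; it uses the standard intrinsics from <immintrin.h> and assumes AVX-capable hardware and n divisible by 4):

  #include <immintrin.h>

  /* C[i] = A[i] + B[i] with AVX: four double precision adds per instruction */
  void add_avx(const double *A, const double *B, double *C, long n)
  {
      for (long i = 0; i < n; i += 4) {
          __m256d a = _mm256_loadu_pd(&A[i]);   /* load 4 doubles (256 bit) */
          __m256d b = _mm256_loadu_pd(&B[i]);
          __m256d c = _mm256_add_pd(a, b);      /* one SIMD add = 4 scalar adds */
          _mm256_storeu_pd(&C[i], c);           /* store 4 doubles */
      }
  }

  /* scalar reference: one add per loop iteration */
  void add_scalar(const double *A, const double *B, double *C, long n)
  {
      for (long i = 0; i < n; ++i)
          C[i] = A[i] + B[i];
  }

In practice a compiler will usually generate such SIMD code from the scalar loop automatically when vectorization is enabled.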



Microprocessors –
Memory Hierarchy
Von Neumann bottleneck reloaded: “DRAM gap”

[Figure: DP peak performance and peak main memory bandwidth for a single Intel processor (chip); the ratio is approximately 10 F/B]

Main memory access speed is not sufficient to keep the CPU busy...

-> Introduce fast on-chip caches, holding copies of recently used data items



Registers and caches: Data transfers in a memory hierarchy

- Caches help with getting instructions and data to the CPU "fast"
- How does data travel from memory to the CPU and back?
- Remember: caches are organized in cache lines (e.g., 64 bytes)
- Only complete cache lines are transferred between memory hierarchy levels (except to/from registers)
- MISS: a load or store instruction does not find the data in a cache level -> a cache line (CL) transfer is required
- Example: array copy A(:)=C(:)

[Figure: for the copy loop, LD C(1) and ST A(1) miss in the cache, triggering a CL load of C(:) and a write-allocate CL load of A(:) from memory; LD C(2..Ncl) and ST A(2..Ncl) then hit; the modified A(:) line is evicted back to memory (delayed). Total: 3 CL transfers between cache and memory.]
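A consequence of those 3 CL transfers (my arithmetic, not spelled out on the slide): per copied element, 3 x 8 byte = 24 byte cross the memory interface (load C, write-allocate A, evict A), although the source code only asks for 16 byte (read one element, write one element).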



From UMA to ccNUMA
Basic architecture of commodity multi-socket nodes
Yesterday (2006): Dual-socket Intel “Core2” node

- Uniform Memory Architecture (UMA)
- Flat memory; symmetric multiprocessing
- But: system "anisotropy"

Today: Dual-socket node
- Cache-coherent Non-Uniform Memory Architecture (ccNUMA)
- HT / QPI provide scalable bandwidth at the price of ccNUMA architectures: Where does my data finally end up? (see the first-touch sketch below)
- Intel Cluster on Die (CoD) -> ccNUMA within a socket! (AMD has been there long ago...)
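On Linux the answer is usually the first-touch policy: a memory page ends up in the ccNUMA domain of the core that first writes to it. A minimal sketch of the standard countermeasure (not from the slides; plain C with OpenMP) is to initialize data with the same parallel access pattern that the compute loops use later:

  #include <stdlib.h>

  int main(void)
  {
      long n = 100000000;
      double *a = malloc(n * sizeof(double));
      double *b = malloc(n * sizeof(double));

      /* parallel first-touch initialization: each thread touches "its" part,
         so those pages are placed in the ccNUMA domain where the thread runs */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i) {
          a[i] = 0.0;
          b[i] = 1.0;
      }

      /* compute loops with the same schedule then access mostly local memory */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; ++i)
          a[i] = a[i] + b[i];

      free(a); free(b);
      return 0;
  }

This only pays off if the threads are also pinned so that they stay in "their" domain.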
Cray XC30 “SandyBridge-EP” 8-core dual socket node

- 8 cores per socket @ 2.7 GHz (3.5 GHz turbo)
- DDR3 memory interface with 4 channels per chip
- Two-way SMT
- Two 256-bit SIMD FP units (SSE4.2, AVX)
- 32 kB L1 data cache per core
- 256 kB L2 cache per core
- 20 MB L3 cache per chip



There is no single driving force for single core performance!

Maximum floating point (FP) performance:

  P_core = n_super^FP * n_FMA * n_SIMD * f

    n_super^FP : superscalarity (FP instructions per cycle)
    n_FMA      : FMA factor (= 2 if fused multiply-add is supported, else 1)
    n_SIMD     : SIMD factor (operands per instruction)
    f          : clock speed

  Microarchitecture  n_super^FP   n_FMA  n_SIMD        Release  Representative  f [GHz]  P_core [GF/s]
                     [inst./cy]          [ops/inst.]
  Nehalem            2            1      2             Q1/2009  X5570           2.93     11.7
  Westmere           2            1      2             Q1/2010  X5650           2.66     10.6
  Sandy Bridge       2            1      4             Q1/2012  E5-2680         2.7      21.6
  Ivy Bridge         2            1      4             Q3/2013  E5-2660 v2      2.2      17.6
  Haswell            2            2      4             Q3/2014  E5-2695 v3      2.3      36.8
  Broadwell          2            2      4             Q1/2016  E5-2699 v4      2.2      35.2
  IBM POWER8         2            2      2             Q2/2014  S822LC          2.93     23.4
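Plugging in the Haswell row as an example: P_core = 2 inst./cy x 2 (FMA) x 4 (SIMD) x 2.3 GHz = 36.8 GFlop/s, which matches the last column.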
Interlude:
A glance at current accelerator technology

NVidia "Pascal" GP100
vs.
Intel Xeon Phi "Knights Landing"

NVidia Pascal GP100 block diagram (© NVIDIA Corp.)

Architecture:
- 15.3 B transistors
- ~1.4 GHz clock speed
- Up to 60 "SM" units, 64 (SP) "cores" each
- 5.7 TFlop/s DP peak
- 4 MB L2 cache
- 4096-bit HBM2
  - MemBW ~732 GB/s (theoretical)
  - MemBW ~510 GB/s (measured)
- 2:1 SP:DP performance



Intel Xeon Phi “Knights Landing” block diagram
[Figure: up to 36 tiles (max. 72 cores) on a 2D mesh; each tile holds two cores (P) with 4 SMT threads (T) each, two VPUs per core, 32 KiB L1 per core and a shared 1 MiB L2; MCDRAM devices and DDR4 channels sit at the edges of the mesh]

Architecture:
- 8 B transistors
- Up to 1.5 GHz clock speed
- Up to 2x36 cores (2D mesh)
- 2x 512-bit SIMD units per core
- 4-way SMT
- 3.5 TFlop/s DP peak (SP: 2x)
- 36 MiB L2 cache
- 16 GiB MCDRAM: MemBW ~470 GB/s (measured)
- Large DDR4 main memory: MemBW ~90 GB/s (measured)



Trading single thread performance for parallelism:
GPGPUs vs. CPUs

GPU vs. CPU "light speed" estimate (per device):
  MemBW: ~5-10x
  Peak:  ~6-15x

                         2x Intel Xeon E5-2697v4   Intel Xeon Phi 7250    NVidia Tesla P100
                         "Broadwell"               "Knights Landing"      "Pascal"
  Cores @ clock          2 x 18 @ ≥2.3 GHz         68 @ 1.4 GHz           56 SMs @ ~1.3 GHz
  SP performance/core    ≥73.6 GFlop/s             89.6 GFlop/s           ~166 GFlop/s
  Threads @ STREAM       ~8                        ~40                    >8000?
  SP peak                ≥2.6 TFlop/s              6.1 TFlop/s            ~9.3 TFlop/s
  STREAM BW (meas.)      2 x 62.5 GB/s             450 GB/s (HBM)         510 GB/s
  Transistors / TDP      ~2x7 billion / 2x145 W    8 billion / 215 W      14 billion / 300 W
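A quick sanity check of those ranges against the table (per device, i.e., one accelerator vs. one CPU chip): P100 vs. one Broadwell chip gives 510 / 62.5 ≈ 8x in measured memory bandwidth and ~9.3 / 1.3 ≈ 7x in SP peak, both within the quoted 5-10x and 6-15x windows.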



Node topology and
programming models
Parallelism in a modern compute node

- Parallel and shared resources within a shared-memory node

[Figure: dual-socket node with two GPUs attached via PCIe and other I/O; the numbers below refer to the marked resources]

Parallel resources:
  1. Execution/SIMD units
  2. Cores
  3. Inner cache levels
  4. Sockets / ccNUMA domains
  5. Multiple accelerators

Shared resources:
  6. Outer cache level per socket
  7. Memory bus per socket
  8. Intersocket link
  9. PCIe bus(es)
  10. Other I/O resources

How does your application react to all of those details?



Scalable and saturating behavior

- Clearly distinguish between "saturating" and "scalable" performance on the chip level:
  - Shared resources may show saturating performance
  - Parallel resources show scalable performance



Parallel programming models
on modern compute nodes

- Shared-memory (intra-node)
  - Good old MPI
  - OpenMP
  - POSIX threads
  - Intel Threading Building Blocks (TBB)
  - Cilk+, OpenCL, StarSs,... you name it
- Distributed-memory (inter-node)
  - MPI
  - PVM (gone)
- Hybrid
  - Pure MPI
  - MPI+OpenMP
  - MPI + any shared-memory model
  - MPI (+OpenMP) + CUDA/OpenCL/...

All models require awareness of topology and affinity issues for getting the best performance out of the machine!



Parallel programming models:
Pure MPI

- Machine structure is invisible to the user:
  - Very simple programming model
  - MPI "knows what to do"!?
- Performance issues:
  - Intranode vs. internode MPI
  - Node/system topology



Parallel programming models:
Pure threading on the node

- Machine structure is invisible to the user
  - Very simple programming model
  - Threading SW (OpenMP, pthreads, TBB,...) should know about the details
- Performance issues:
  - Synchronization overhead
  - Memory access
  - Node topology



Parallel programming models: Lots of choices
Hybrid MPI+OpenMP on a multicore multisocket cluster

- One MPI process / node
- One MPI process / socket: OpenMP threads on the same socket ("blockwise")
- OpenMP threads pinned "round robin" across cores in the node
- Two MPI processes / socket, OpenMP threads on the same socket


Conclusions about architecture

- Modern computer architecture has a rich "topology"

- Node-level hardware parallelism takes many forms
  - Sockets/devices – CPU: 1-4 or more, GPGPU/Phi: 1-6 or more
  - Cores – moderate (CPU: 4-24, GPGPU: 10-100, Phi: 64-72)
  - SIMD – moderate (CPU: 2-8, Phi: 8-16) to massive (GPGPU: 10's-100's)
  - Superscalarity (CPU/Phi: 2-6)

- Exploiting performance: parallelism + bottleneck awareness
  - "High Performance Computing" == computing at a bottleneck

- Performance of programs is sensitive to architecture
  - Topology/affinity influences overheads of popular programming models
  - Standards do not contain (many) topology-aware features
  - Things are starting to improve slowly (MPI 3.0, OpenMP 4.0)
  - Apart from overheads, performance features are largely independent of the programming model

