Documente Academic
Documente Profesional
Documente Cultură
Single-Processor Performance
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
Admin Bits
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
Introduction
Whats in a computer?
Whats in a computer?
Processor
Die
Whats in a computer?
Processor
Whats in a computer?
Whats in a computer?
Memory
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
A Basic Processor
Memory Interface
Address ALU
Address Bus
Data Bus
Register File
Flags
Internal Bus
Insn.
fetch
PC
Control Unit
Data ALU
A Basic Processor
Memory Interface
Address ALU
Address Bus
Data Bus
Register File
Flags
Internal Bus
Insn.
Bonus
fetch
Question:
PC
Whats a Control
bus? Unit
Data ALU
Memory Interface
Address ALU
Address Bus
Data Bus
Register File
Flags
Internal Bus
Insn.
fetch
PC
Control Unit
Data ALU
What operation?
Which register?
Which addressing mode?
Op
Specialized ALUs:
Floating Point Unit (FPU)
Address ALU
Machine Language
Often general-purpose
Sometimes special-purpose: Floating
%r0
%r1
%r2
%r3
%r4
%r5
%r6
%r7
D0..15
Memory
Processor
A0..15
R/W
CLK
D0..15
Memory
Processor
A0..15
R/W
CLK
D0..15
Memory
Processor
A0..15
R/W
CLK
D0..15
Memory
Processor
A0..15
R/W
CLK
D0..15
Memory
Processor
A0..15
R/W
CLK
D0..15
Memory
Processor
A0..15
R/W
CLK
D0..15
Memory
Processor
A0..15
R/W
CLK
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
int a = 5;
int b = 17;
int z = a b;
4:
b:
12:
15:
19:
1c:
c7
c7
8b
0f
89
8b
45
45
45
af
45
45
f4 05 00 00 00 movl
f8 11 00 00 00 movl
f4
mov
45 f8
imul
fc
mov
fc
mov
$0x5,0xc(%rbp)
$0x11,0x8(%rbp)
0xc(%rbp),%eax
0x8(%rbp),%eax
%eax,0x4(%rbp)
0x4(%rbp),%eax
Things to know:
Addressing modes (Immediate, Register, Base plus Offset)
0xHexadecimal
AT&T Form: (well use this)
Another Look
Memory Interface
Address ALU
Address Bus
Data Bus
Register File
Flags
Internal Bus
Insn.
fetch
PC
Control Unit
Data ALU
Another
Look
4: c7 45 f4 05 00 00 00 movl $0x5,0xc(%rbp)
Address ALU
b:
12:
15:
19:
1c:
c7
8b
0f
89
8b
45 f8 11 00 00 00 movl $0x11,0x8(%rbp)
45 f4
mov 0xc(%rbp),%eax
af 45 f8
imul 0x8(%rbp),%eax
45 fc
mov %eax,0x4(%rbp)
45 fc
mov 0x4(%rbp),%eax
Memory Interface
Address Bus
Data Bus
Register File
Flags
Internal Bus
Insn.
fetch
PC
Control Unit
Data ALU
c7
c7
8b
0f
89
8b
45
45
45
af
45
45
f4 05 00 00 00
f8 11 00 00 00
f4
45 f8
fc
fc
mov
mov
mov
imul
mov
mov
0:
1:
4:
b:
12:
14:
17:
1a:
1e:
22:
24:
27:
28:
55
48
c7
c7
eb
8b
01
83
83
7e
8b
c9
c3
89
45
45
0a
45
45
45
7d
f0
45
e5
f8 00 00 00 00
fc 00 00 00 00
fc
f8
fc 01
f8 09
f8
push
mov
movl
movl
jmp
mov
add
addl
cmpl
jle
mov
leaveq
retq
%rbp
%rsp,%rbp
$0x0,0x8(%rbp)
$0x0,0x4(%rbp)
1e <main+0x1e>
0x4(%rbp),%eax
%eax,0x8(%rbp)
$0x1,0x4(%rbp)
$0x9,0x8(%rbp)
14 <main+0x14>
0x8(%rbp),%eax
Things to know:
Condition Codes (Flags): Zero, Sign, Carry, etc.
Call Stack: Stack frame, stack pointer, base pointer
ABI: Calling conventions
0:
1:
4:
b:
12:
14:
17:
1a:
1e:
22:
24:
27:
28:
55
48
c7
c7
eb
8b
01
83
83
7e
8b
c9
c3
89
45
45
0a
45
45
45
7d
f0
45
e5
f8 00 00 00 00
fc 00 00 00 00
fc
f8
fc 01
f8 09
f8
push
mov
movl
movl
jmp
mov
add
addl
cmpl
jle
mov
leaveq
retq
%rbp
%rsp,%rbp
$0x0,0x8(%rbp)
$0x0,0x4(%rbp)
1e <main+0x1e>
0x4(%rbp),%eax
%eax,0x8(%rbp)
$0x1,0x4(%rbp)
$0x9,0x8(%rbp)
14 <main+0x14>
0x8(%rbp),%eax
Things to know:
Want to make those yourself?
Condition
(Flags): Zero, Sign, Carry, etc.
WriteCodes
myprogram.c.
Call Stack:
Stack
frame, stack pointer, base pointer
$ cc -c
myprogram.c
objdump
--disassemble myprogram.o
ABI: $
Calling
conventions
Evaluate
Know your peak throughputs (roughly)
Are you getting close?
Are you tracking the right limiting factor?
Intro Basics Assembly Memory Pipelines
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
Idea:
1 kB, 1 cycle
L1 Cache
10 kB, 10 cycles
L2 Cache
DRAM
Virtual Memory
(hard drive)
1 TB, 1 M cycles
1 kB, 1 cycle
L1 Cache
10 kB, 10 cycles
L2 Cache
DRAM
Virtual Memory
(hard drive)
1 TB, 1 M cycles
How might data locality
factor into this?
What is a working set?
Intro Basics Assembly Memory Pipelines
Main
Memory
Index
0
1
2
3
Data
xyz
pdq
abc
rgf
Cache
Memory
Index Tag Data
0 2 abc
1 0 xyz
Problem:
Goals at odds with each other: Access matching logic expensive!
Solution 1: More data per unit of access matching logic
Larger Cache Lines
Solution 2: Simpler/less access matching logic
Less than full Associativity
Other choices: Eviction strategy, size
Cache: Associativity
Direct Mapped
Memory
0
1
2
3
4
5
6
..
.
Cache
0
1
2
3
Cache
0
1
2
3
Cache: Associativity
0.1
Direct
2-way
4-way
8-way
Full
0.01
Direct Mapped
Memory
0
1
2
3
4
5
6
..
.
miss rate
0.001
Cache
0
0.0001
1
2
1e-005
3
1e-006
1K
4K
Memory
0
1
2
3
4
5
6
..
. 64K
16K
Cache
0
1
2
3
256K
1M
Inf
cache size
Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]
false
0x0 (0)
0x3 (3)
0x3f (63)
0x7 (7)
63
false
0x0 (0)
0x3 (3)
0x3f (63)
0x7 (7)
63
Time [s]
0.06
0.04
0.020 1 2 3 4 5 6 7 8 9 10
2 2 2 2 2 2 2 2 2 2 2
Stride
Intro Basics Assembly Memory Pipelines
6
Eff. Bandwidth [GB/s]
2
1
0
212 214 216 218 220 222 224 226
Array Size [Bytes]
Intro Basics Assembly Memory Pipelines
20
unsigned p = 0;
for (unsigned i = 0; i < steps; ++i)
15
{
ary [p] ++;
p += stride;
if (p >= array
10 size)
p = 0;
}
free (ary );
}
Compute
x x + 3a 5b.
Compute
x x + 3a 5b.
Matrix-Matrix Multiplication
Homework 1, posted.
Outline
Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines
IF Instruction fetch
ID Instruction Decode
EX Execution
MEM Memory Read/Write
WB Result Writeback
Solution: Pipelining
Pipelining
Waiting
Instructions
Stage 1: Fetch
PIPELINE
Possible issues:
Clock Cycle
0
Stage 2: Decode
Stage 3: Execute
Stage 4: Write-back
Completed
Instructions
32 Byte Pre-Decode,
Fetch Buffer
6 Instructions
18 Entry
Instruction Queue
Complex
Decoder
Microcode
Simple
Decoder
Simple
Decoder
4 ops
1 op
Simple
Decoder
1 op
1 op
7+ Entry op Buffer
4 ops
4 ops
ALU
SSE
Shuffle
ALU
128 Bit
FMUL
FDIV
ALU
SSE
Shuffle
MUL
128 Bit
FADD
Port 3
Port 5
Port 1
ALU
Branch
SSE
ALU
Store
Address
Port 4
Store
Data
Port 2
Load
Address
Store
128 Bit
Load
Memory Interface
New concept:
Instruction-level
parallelism
(Superscalar)
Instruction Fetch
128 Bit
32 Byte Pre-Decode,
Fetch Buffer
6 Instructions
18 Entry
Instruction Queue
Complex
Decoder
Microcode
Simple
Decoder
Simple
Decoder
4 ops
1 op
Simple
Decoder
1 op
1 op
7+ Entry op Buffer
4 ops
4 ops
ALU
SSE
Shuffle
ALU
128 Bit
FMUL
FDIV
ALU
SSE
Shuffle
MUL
128 Bit
FADD
Port 3
Port 5
Port 1
ALU
Branch
SSE
ALU
Store
Address
Port 4
Store
Data
Port 2
Load
Address
Store
128 Bit
Load
Memory Interface
A Puzzle
Which is faster?
. . . and why?
Software pipelining:
for (int i = 0; i < 1000; ++i)
{
do a( i );
do b(i );
}
SIMD
Control Units are large and expensive.
SIMD
PU
Data Pool
Instruction Pool
PU
PU
PU
attribute
v4si a, b, c;
c = a + b;
// +, , , /, unary minus, , |, &, , %
About HW1
Aik Bkj
Aik Bkj
Aik Bkj
Aik Bkj
Aik Bkj
Aik Bkj
Questions?
Image Credits
Q6600 Wikimedia Commons