
Lecture 2: Processor Design, Single-Processor Performance

G63.2011.002/G22.2945.001 September 14, 2010


Outline

Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines


Admin Bits

Lec. 1 slides posted


New here? Welcome!

Please send in survey info (see lec. 1 slides) via email


PASI
Please subscribe to mailing list
Near end of class: 5-min, 3-question concept check


Outline

Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines


Introduction

Goal for Today

High Performance Computing:
Discuss the actual computer end of this...
...and its influence on performance


What's in a computer?

What's in a computer?

Processor

Intel Q6600 Core2 Quad, 2.4 GHz

What's in a computer?

Processor

Intel Q6600 Core2 Quad, 2.4 GHz
Die: 2 × 143 mm², 2 × 2 cores
582,000,000 transistors
100 W

What's in a computer?
Memory

Outline

Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines


A Basic Processor

[Block diagram: Memory Interface (Address Bus, Data Bus), Address ALU, Register File, Flags, Internal Bus, Instruction Fetch, PC, Control Unit, Data ALU]

(loosely based on Intel 8086)

A Basic Processor

[Same block diagram as above]

Bonus question: What's a bus?

How all of this fits together

Everything synchronizes to the Clock.
Control Unit (CU): the brains of the operation. Everything connects to it.
Bus entries/exits are gated and (potentially) buffered.
The CU controls the gates and tells the other units what to do and how:
  What operation?
  Which register?
  Which addressing mode?

[Block diagram as before]

What is... an ALU?

Arithmetic Logic Unit
One or two operands A, B
Operation selector (Op):
  (Integer) Addition, Subtraction
  (Logical) And, Or, Not
  (Bitwise) Shifts (equivalent to multiplication by a power of two)
  (Integer) Multiplication, Division
Specialized ALUs:
  Floating Point Unit (FPU)
  Address ALU
Operates on binary representations of numbers. Negative numbers are represented by two's complement.
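A minimal C sketch of my own (not from the slides) showing two's complement storage and shift-as-multiplication:

#include <stdio.h>

int main(void)
{
    int a = -5;
    /* Two's complement: with 32-bit ints, -5 is stored as the bit pattern 0xfffffffb. */
    printf("-5 as raw bits: 0x%08x\n", (unsigned int) a);

    /* A left shift by k multiplies by 2^k (as long as the result fits). */
    int b = 3;
    printf("3 << 4 = %d  (same as 3 * 16)\n", b << 4);
    return 0;
}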

What is... a Register File?

Registers are On-Chip Memory

Directly usable as operands in Machine Language
Often general-purpose
Sometimes special-purpose: Floating point, Indexing, Accumulator
Small: x86-64 has 16 64-bit GPRs
Very fast (near-zero latency)

[Register file diagram: %r0 ... %r7]

How does computer memory work?

One (reading) memory transaction (simplified):

[Timing diagram, built up over several steps: Processor and Memory connected by a data bus D0..15 and an address bus A0..15, plus R/W and CLK signals.]

Observation: Access (and addressing) happens in bus-width-size chunks.

What is... a Memory Interface?

The Memory Interface fetches and stores binary words in off-chip memory.
Smallest granularity: bus width
Tells the outside memory
  where (through the address bus)
  what (through the data bus)
Computer main memory is Dynamic RAM (DRAM): slow, but small and cheap.

Outline

Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines


A Very Simple Program

int a = 5;
int b = 17;
int z = a * b;

 4:  c7 45 f4 05 00 00 00   movl  $0x5,-0xc(%rbp)
 b:  c7 45 f8 11 00 00 00   movl  $0x11,-0x8(%rbp)
12:  8b 45 f4               mov   -0xc(%rbp),%eax
15:  0f af 45 f8             imul  -0x8(%rbp),%eax
19:  89 45 fc               mov   %eax,-0x4(%rbp)
1c:  8b 45 fc               mov   -0x4(%rbp),%eax

Things to know:
Addressing modes (Immediate, Register, Base plus Offset)
0x prefix: hexadecimal
AT&T Form (we'll use this):

<opcode><size> <source>, <dest>

Another Look

 4:  c7 45 f4 05 00 00 00   movl  $0x5,-0xc(%rbp)
 b:  c7 45 f8 11 00 00 00   movl  $0x11,-0x8(%rbp)
12:  8b 45 f4               mov   -0xc(%rbp),%eax
15:  0f af 45 f8             imul  -0x8(%rbp),%eax
19:  89 45 fc               mov   %eax,-0x4(%rbp)
1c:  8b 45 fc               mov   -0x4(%rbp),%eax

[Shown step by step against the processor block diagram from before]

A Very Simple Program: Intel Form

 4:  c7 45 f4 05 00 00 00   mov   DWORD PTR [rbp-0xc],0x5
 b:  c7 45 f8 11 00 00 00   mov   DWORD PTR [rbp-0x8],0x11
12:  8b 45 f4               mov   eax,DWORD PTR [rbp-0xc]
15:  0f af 45 f8             imul  eax,DWORD PTR [rbp-0x8]
19:  89 45 fc               mov   DWORD PTR [rbp-0x4],eax
1c:  8b 45 fc               mov   eax,DWORD PTR [rbp-0x4]

Intel Form (you might see this on the net):

<opcode> <sized dest>, <sized source>

Goal: Reading comprehension.
Don't understand an opcode?
Google "<opcode> intel instruction".
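To see both flavors yourself, objdump can switch disassembly syntax; a quick usage sketch (the file name simple.c is just a placeholder):

$ cc -c simple.c
$ objdump -d simple.o              # AT&T syntax (the default)
$ objdump -d -M intel simple.o     # Intel syntax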

Machine Language Loops

int main()
{
  int y = 0, i;
  for (i = 0; y < 10; ++i)
    y += i;
  return y;
}

 0:  55                      push   %rbp
 1:  48 89 e5                mov    %rsp,%rbp
 4:  c7 45 f8 00 00 00 00    movl   $0x0,-0x8(%rbp)
 b:  c7 45 fc 00 00 00 00    movl   $0x0,-0x4(%rbp)
12:  eb 0a                   jmp    1e <main+0x1e>
14:  8b 45 fc                mov    -0x4(%rbp),%eax
17:  01 45 f8                add    %eax,-0x8(%rbp)
1a:  83 45 fc 01             addl   $0x1,-0x4(%rbp)
1e:  83 7d f8 09             cmpl   $0x9,-0x8(%rbp)
22:  7e f0                   jle    14 <main+0x14>
24:  8b 45 f8                mov    -0x8(%rbp),%eax
27:  c9                      leaveq
28:  c3                      retq

Things to know:
Condition Codes (Flags): Zero, Sign, Carry, etc.
Call Stack: Stack frame, stack pointer, base pointer
ABI: Calling conventions
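A quick trace of my own, to connect the C to the assembly: y takes the values 0, 0, 1, 3, 6, 10 as i runs 0 through 4; the cmpl $0x9 / jle pair keeps looping while y ≤ 9, so main returns 10.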

Machine Language Loops

[Same C code and disassembly as on the previous slide]

Want to make those yourself?
Write myprogram.c.
$ cc -c myprogram.c
$ objdump --disassemble myprogram.o

Things to know:
Condition Codes (Flags): Zero, Sign, Carry, etc.
Call Stack: Stack frame, stack pointer, base pointer
ABI: Calling conventions

We know how a computer works!

All of this can be built in about 4000 transistors.
(e.g. MOS 6502 in Apple II, Commodore 64, Atari 2600)

So what exactly is Intel doing with the other 581,996,000 transistors?

Answer: Make things go faster!

Goal now:
Understand sources of slowness, and how they get addressed.
Remember: High Performance Computing

The High-Performance Mindset

Writing high-performance codes
Mindset: What is going to be the limiting factor?
  ALU?
  Memory?
  Communication? (if multi-machine)
Benchmark the assumed limiting factor right away.
Evaluate:
  Know your peak throughputs (roughly)
  Are you getting close?
  Are you tracking the right limiting factor?
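As a sketch of "benchmark the assumed limiting factor right away" (my own code, not part of the course material; the array size and repeat count are arbitrary choices), here is a streaming-write bandwidth estimate using POSIX clock_gettime:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const size_t n = 64 * 1024 * 1024;        /* 64 Mi ints = 256 MB */
    int *buf = malloc(n * sizeof(int));
    if (!buf) return 1;
    volatile int *a = buf;                    /* keep the stores from being optimized away */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 10; ++rep)
        for (size_t i = 0; i < n; ++i)
            a[i] = 17;                        /* pure streaming write */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double gb   = 10.0 * n * sizeof(int) / 1e9;
    printf("wrote %.1f GB in %.2f s: %.2f GB/s\n", gb, secs, gb / secs);

    free(buf);
    return 0;
}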

Outline

Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines


Source of Slowness: Memory

Memory is slow.
Distinguish two different versions of "slow":
  Bandwidth
  Latency
Memory has long latency, but can have large bandwidth.

Size of die vs. distance to memory: big!
Dynamic RAM: long intrinsic latency!

Idea: Put a look-up table of recently-used data onto the chip.
→ Cache

The Memory Hierarchy

Hierarchy of increasingly bigger, slower memories:

  Registers:       1 kB,  1 cycle
  L1 Cache:       10 kB, 10 cycles
  L2 Cache:        1 MB, 100 cycles
  DRAM:            1 GB, 1000 cycles
  Virtual Memory
  (hard drive):    1 TB, 1 M cycles

How might data locality factor into this?
What is a working set?

Cache: Actual Implementation

Demands on cache implementation:
  Fast, small, cheap, low-power
  Fine-grained
  High hit-rate (few misses)

Main Memory          Cache Memory
Index  Data          Index  Tag  Data
0      xyz           0      2    abc
1      pdq           1      0    xyz
2      abc
3      rgf

Problem:
Goals at odds with each other: Access matching logic expensive!
Solution 1: More data per unit of access matching logic
  → Larger Cache Lines
Solution 2: Simpler/less access matching logic
  → Less than full Associativity
Other choices: Eviction strategy, size
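A small sketch of my own showing how an address splits into tag, set index, and line offset; the 64-byte lines and 64 sets are assumptions matching the L1 parameters of the Q6600 shown a few slides later:

#include <stdio.h>
#include <stdint.h>

/* Assumed parameters: 64-byte cache lines, 64 sets. */
#define LINE_BITS 6   /* 2^6 = 64-byte line */
#define SET_BITS  6   /* 2^6 = 64 sets      */

int main(void)
{
    uint64_t addr = 0x7fffdeadbeef;

    uint64_t offset = addr & ((1u << LINE_BITS) - 1);
    uint64_t index  = (addr >> LINE_BITS) & ((1u << SET_BITS) - 1);
    uint64_t tag    = addr >> (LINE_BITS + SET_BITS);

    /* The cache compares 'tag' against the tag(s) stored in set 'index';
       on a match, 'offset' selects the byte within the cached line. */
    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long) addr, (unsigned long long) tag,
           (unsigned long long) index, (unsigned long long) offset);
    return 0;
}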

Cache: Associativity

[Diagram: with a direct-mapped cache, each memory block maps to exactly one of the 4 cache slots; with a 2-way set associative cache, each memory block may go into either of the 2 slots of its set.]

Cache: Associativity

[Plot: miss rate (log scale, roughly 0.1 down to 1e-6) versus cache size (1K through 1M, and infinite), for direct-mapped, 2-way, 4-way, 8-way, and fully associative caches.]

Miss rate versus cache size on the Integer portion of SPEC CPU2000 [Cantin, Hill 2003]

Cache Example: Intel Q6600/Core2 Quad

--- L1 data cache ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L1 instruction ---
fully associative cache     = false
threads sharing this cache  = 0x0 (0)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0x7 (7)
number of sets - 1 (s)      = 63

--- L2 unified cache ---
fully associative cache     = false
threads sharing this cache  = 0x1 (1)
processor cores on this die = 0x3 (3)
system coherency line size  = 0x3f (63)
ways of associativity       = 0xf (15)
number of sets - 1 (s)      = 4095

More than you care to know about your CPU:
http://www.etallen.com/cpuid.html
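Reading those fields (my own arithmetic; the line size, ways, and set count are reported as "value minus 1"): each L1 is 64 bytes × 8 ways × 64 sets = 32 KB per core, and the shared L2 is 64 bytes × 16 ways × 4096 sets = 4 MB.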

Measuring the Cache I

void go(unsigned count, unsigned stride)
{
  const unsigned arr_size = 64 * 1024 * 1024;
  int *ary = (int *) malloc(sizeof(int) * arr_size);

  for (unsigned it = 0; it < count; ++it)
  {
    for (unsigned i = 0; i < arr_size; i += stride)
      ary[i] = 17;
  }

  free(ary);
}

Measuring the Cache I

[Same code as above, shown next to a plot: time (roughly 0.02 s to 0.16 s) versus stride (2^0 to 2^10).]
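A possible driver for go() (my own sketch; the count of 4 and the use of clock() are arbitrary choices):

#include <stdio.h>
#include <time.h>

/* go() from the slide above, assumed to be compiled and linked in. */
void go(unsigned count, unsigned stride);

int main(void)
{
    for (unsigned stride = 1; stride <= 1024; stride *= 2)
    {
        clock_t t0 = clock();
        go(4, stride);
        printf("stride %4u: %.3f s\n", stride,
               (double)(clock() - t0) / CLOCKS_PER_SEC);
    }
    return 0;
}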

Measuring the Cache II

void go(unsigned array_size, unsigned steps)
{
  int *ary = (int *) malloc(sizeof(int) * array_size);
  unsigned asm1 = array_size - 1;

  for (unsigned i = 0; i < steps; ++i)
    ary[(i * 16) & asm1]++;

  free(ary);
}

Measuring the Cache II

[Same code as above, shown next to a plot: effective bandwidth (roughly 0 to 6 GB/s) versus array size (2^12 to 2^26 bytes).]
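One detail worth noting (my remark, not from the slide): the & asm1 wrap-around only works because array_size is a power of two, so array_size - 1 is an all-ones bit mask.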

Measuring the Cache III

void go(unsigned array_size, unsigned stride, unsigned steps)
{
  char *ary = (char *) malloc(sizeof(int) * array_size);

  unsigned p = 0;
  for (unsigned i = 0; i < steps; ++i)
  {
    ary[p]++;
    p += stride;
    if (p >= array_size)
      p = 0;
  }

  free(ary);
}

Measuring the Cache III

[Same code as above, shown next to a plot: array size (roughly 10 to 20 MB) versus stride (100 to 600 bytes).]

Programming for the Cache

How can we rearrange programs to be cache-friendly?

Examples:
Large vectors x, a, b
  Compute x ← x + 3a − 5b.
Matrix-Matrix Multiplication

Homework 1, posted.
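For the vector example, one cache-friendly arrangement (a sketch of mine, not the homework solution) is a single fused pass, so every element of x, a, and b is touched only once while it is in cache:

/* Single fused pass: x <- x + 3a - 5b.
   Each element is loaded once, instead of sweeping the arrays
   repeatedly with separate "x += 3a" and "x -= 5b" loops. */
void update(int n, double *x, const double *a, const double *b)
{
    for (int i = 0; i < n; ++i)
        x[i] += 3.0 * a[i] - 5.0 * b[i];
}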

Outline

Intro
The Basic Subsystems
Machine Language
The Memory Hierarchy
Pipelines


Source of Slowness: Sequential Operation

IF   Instruction Fetch
ID   Instruction Decode
EX   Execution
MEM  Memory Read/Write
WB   Result Writeback

Solution: Pipelining

[Figure: instructions overlapping in the pipeline stages]

Pipelining

[Figure: pipelined MIPS processor datapath]
(MIPS, 110,000 transistors)

Issues with Pipelines

Pipelines generally help performance, but not always.

Possible issues:
  Stalls
  Dependent Instructions
  Branches (+ Prediction)
  Self-Modifying Code

Solution: Bubbling, extra circuitry

[Figure: pipeline with a bubble; waiting instructions enter stages Fetch, Decode, Execute, Write-back over successive clock cycles, and completed instructions leave.]
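To make "dependent instructions" concrete, a sketch of mine: both loops below compute the same sum, but the second keeps two independent accumulators, so consecutive additions do not have to wait on each other in the pipeline. (The floating-point result may differ in the last bits because the additions are reordered.)

/* One dependency chain: each addition waits for the previous one. */
double sum1(const double *a, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Two independent chains: adjacent additions can overlap in the pipeline. */
double sum2(const double *a, int n)
{
    double s0 = 0.0, s1 = 0.0;
    for (int i = 0; i + 1 < n; i += 2)
    {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (n % 2)
        s0 += a[n - 1];
    return s0 + s1;
}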

Intel Q6600 Pipeline

[Block diagram of the Core microarchitecture pipeline: 128-bit Instruction Fetch → 32-Byte Pre-Decode / Fetch Buffer (6 instructions) → 18-Entry Instruction Queue → one Complex Decoder (4 µops, backed by Microcode) and three Simple Decoders (1 µop each) → 7+ Entry µop Buffer → Register Alias Table and Allocator → 96-Entry Reorder Buffer (ROB) with Retirement Register File (Program-Visible State) → 32-Entry Reservation Station issuing to Ports 0, 1, and 5 (ALUs, SSE shuffle/ALU units, 128-bit FMUL/FDIV, 128-bit FADD, branch), Port 2 (Load Address), Port 3 (Store Address), and Port 4 (Store Data) → Memory Ordering Buffer (MOB) → 128-bit Load/Store → Memory Interface; results return on an Internal Results Bus.]

Intel Q6600 Pipeline

New concept:
Instruction-level parallelism (Superscalar)

[Same pipeline diagram as above]

Programming for the Pipeline

How to upset a processor pipeline:

for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
  {
    if (j % 2 == 0)
      do_something(i, j);
  }

... why is this bad?
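One way to repair it (a sketch of mine; do_something is the function from the slide): fold the perfectly predictable condition into the loop step, so the branch disappears from the body.

for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; j += 2)   /* visit only even j: no branch in the body */
        do_something(i, j);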

A Puzzle

int steps = 256 * 1024 * 1024;
int [] a = new int[2];

// Loop 1
for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }

Which is faster?
...and why?

Two useful Strategies

Loop unrolling:

for (int i = 0; i < 1000; ++i)
  do_something(i);

→

for (int i = 0; i < 1000; i += 2)
{
  do_something(i);
  do_something(i+1);
}

Software pipelining:

for (int i = 0; i < 1000; ++i)
{
  do_a(i);
  do_b(i);
}

→

for (int i = 0; i < 1000; i += 2)
{
  do_a(i);
  do_a(i+1);
  do_b(i);
  do_b(i+1);
}

SIMD

Control Units are large and expensive.
Functional Units are simple and cheap.

Increase the Function/Control ratio:
Control several functional units with one control unit.
All execute the same operation.

[SIMD concept figure: one Instruction Pool driving several PUs, all operating on a shared Data Pool]

GCC vector extensions:

typedef int v4si __attribute__ ((vector_size (16)));

v4si a, b, c;
c = a + b;
// +, -, *, /, unary minus, ^, |, &, ~, %

Will revisit for OpenCL, GPUs.
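A complete little program using the extension (my own sketch; needs GCC or a compatible compiler such as Clang):

#include <stdio.h>

typedef int v4si __attribute__ ((vector_size (16)));   /* four ints per vector */

int main(void)
{
    v4si a = {1, 2, 3, 4};
    v4si b = {10, 20, 30, 40};
    v4si c = a + b;                  /* one element-wise SIMD addition */

    for (int i = 0; i < 4; ++i)      /* subscripting vectors works in GCC >= 4.7 and Clang */
        printf("%d ", c[i]);
    printf("\n");
    return 0;
}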

About HW1

Open-ended! Want: coherent thought about caches, memory access ordering, latencies, bandwidths, etc.

See Berkeley CS267 (lec. 3, ...) for more hints.
Also: Introduction to the machinery.
(Clusters, running on them, SSH, git, forge)

Why so much machinery?
Linux lab machines: WWH 229/230

Rearranging Matrix-Matrix Multiplication

Matrix Multiplication:

  C_ij = Σ_k A_ik B_kj

[Figure, built up over several animation steps, illustrating how the matrices are traversed.]
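One common rearrangement (a sketch of mine, not the homework solution) loops over blocks so the working set of each block stays in cache; the block size BS is a tunable assumption:

/* Blocked C += A * B, with A, B, C stored row-major as n x n doubles.
   Assumes C is zero-initialized (or holds the values to accumulate into). */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    const int BS = 64;  /* block size: tune so ~3 * BS*BS doubles fit in cache */

    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                /* multiply one BS x BS block */
                for (int i = ii; i < ii + BS && i < n; ++i)
                    for (int k = kk; k < kk + BS && k < n; ++k)
                    {
                        double aik = A[i * n + k];
                        for (int j = jj; j < jj + BS && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}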

Questions?


Image Credits

Q6600: Wikimedia Commons
Mainboard: Wikimedia Commons
DIMM: sxc.hu/gobran11
Q6600 back: Wikimedia Commons
Core 2 die: Intel Corp. / lesliewong.us
Basic cache: Wikipedia
Cache associativity: based on Wikipedia
Cache associativity vs miss rate: Wikipedia
Q6600: Wikimedia Commons
Cache Measurements: Igor Ostrovsky
Pipeline stuff: Wikipedia
Bubbly Pipeline: Wikipedia
Q6600 Pipeline: Wikipedia
SIMD concept picture: Wikipedia

