
Intel Core 2 Architecture

Implications for software developers

Stephen Blair-Chappell
Technical Consulting Engineer
Intel Compiler Labs

Agenda
Why Multicore
Core 2 Microarchitecture
From high above
A closer look at the cores
In-depth pipeline walk-through
Implications for Developers
Pipeline stalls
Vectorization
Making use of dual core
Summary / Q&A

Challenge 1: Power
[Chart: Power Density Race: power density (W/cm²) on a log scale vs. year (1970 onward) for Intel processors from the 4004 through the Pentium family, trending toward the power density of a hot plate, a nuclear reactor, a rocket nozzle and the Sun's surface.]

Challenge 2: Memory Latency


[Chart: Memory Latency: CPU logic speed vs. memory speed (MHz), 1992-2002, showing a widening gap between logic and memory.]

Challenge 3: RC Delays

[Chart: RC delay (ps) of a 1 mm copper interconnect vs. clock period, log scale, across process generations from 0.35 µm down to 65 nm.]
i486: 16 clocks for a RAM access round trip
65 nm IC: 15 clocks to cross the chip

The Multi-Core Advantage


Dual-core vs. a single core at varying frequency and Vcc (performance and power relative to the single-core design point):

Over-clocked (+20%): 1.13x performance at 1.73x power
Design frequency: 1.00x performance at 1.00x power
Under-clocked (-20%): 0.87x performance at 0.51x power
Dual-core (both cores under-clocked by 20%): 1.73x performance at 1.02x power

Core 2 Duo 1 Meter above


!"# $ %&'
()*
+!,
(.

! ((# !$$$,&'01*
! # (.2 -3/
!)$

4 3

1 5 6%

1'
.

Core 2 1 cm above
[Block diagram]

Core 2 In the Core


[Block diagram]

Instruction Fetch and PreDecode


[Block diagram]

Instruction Queue
18-deep instruction queue


6 instructions written per cycle


5 instructions read per cycle

One Macro-fusion per cycle


Includes a Loop Stream Detector (LSD)
Potentially very high bandwidth instruction streaming
Maximum of 18 instructions in up to four 16-byte packets
No RET instructions (hence, little practical use for CALLs)
Up to four taken branches allowed
Most effective at 70+ iterations
Trade-off: the LSD versus conventional loop unrolling
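
The LSD trade-off above can be seen in ordinary C: a compact loop body stays within the 18-instruction limit, while aggressive unrolling may push the body out of the detector. A minimal, illustrative sketch (the function names are not from the original slides):

    #include <stddef.h>

    /* Small loop body: few enough instructions to fit the LSD's
     * 18-instruction / four 16-byte packet limit, so after the first
     * iterations the loop can stream from the LSD without re-fetching. */
    float sum_compact(const float *a, size_t n)
    {
        float s = 0.0f;
        for (size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    /* Unrolled variant: fewer loop-control instructions per element, but
     * the larger body may no longer fit in the LSD, so the front end falls
     * back to normal fetch/decode. Measure both on the target workload. */
    float sum_unrolled(const float *a, size_t n)
    {
        float s = 0.0f;
        size_t i = 0;
        for (; i + 4 <= n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; ++i)
            s += a[i];
        return s;
    }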

Instruction Decode
Macro Fusion


Without Macro-Fusion

Instruction Queue:
inc   esp
store [mem3], ebx
jne   targ
cmp   eax, [mem2]
load  eax, [mem1]

Read five instructions from the Instruction Queue
Each instruction is decoded separately
With four decoders (dec0, dec1, dec2, dec3), the five instructions take two decode cycles (Cycle 1 and Cycle 2)

With Intel's New Macro-Fusion

Instruction Queue:
inc   esp
store [mem3], ebx
jne   targ
cmp   eax, [mem2]
load  eax, [mem1]

Read five instructions from the Instruction Queue
Send the fusable pair to a single decoder
All five instructions decode in one cycle (dec0, dec1, dec2, dec3): cmp and jne are fused into cmpjne eax, [mem2], targ

Execution (In-order/Out-of-Order)
Up to six uOps dispatch per clock
Up to four results can be written back per clock

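One practical way to feed the wide dispatch is to avoid a single long dependency chain. A minimal, illustrative sketch (not from the original slides): splitting a reduction across two independent accumulators lets more operations be in flight per clock (note that reordering floating-point additions can change rounding slightly):

    #include <stddef.h>

    /* One accumulator: each addition depends on the previous one, so the
     * out-of-order core is limited by the latency of a single chain. */
    double dot_one_chain(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i)
            s += x[i] * y[i];
        return s;
    }

    /* Two independent accumulators: the two chains can proceed in parallel,
     * making better use of the dispatch width and the execution units. */
    double dot_two_chains(const double *x, const double *y, size_t n)
    {
        double s0 = 0.0, s1 = 0.0;
        size_t i = 0;
        for (; i + 2 <= n; i += 2) {
            s0 += x[i] * y[i];
            s1 += x[i + 1] * y[i + 1];
        }
        if (i < n)
            s0 += x[i] * y[i];
        return s0 + s1;
    }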

Memory
One load and one store per clock
Load/store conflict detection uses the lower 12 bits of the address


Loads and stores are subject to 4 KB memory address aliasing
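
Because the conflict check only sees the low 12 bits, a load and a pending store whose addresses differ by a multiple of 4 KB can be flagged as conflicting even though they never overlap, delaying the load. A minimal, illustrative sketch of the usual mitigation (buffer names and sizes are assumptions, not from the slides): keep the distance between the source and destination of a copy away from a multiple of 4096 bytes.

    #include <stddef.h>
    #include <stdlib.h>

    /* When dst and src are exactly a multiple of 4096 bytes apart, the low
     * 12 address bits of dst[i] and src[i] are identical, so loads running
     * ahead of still-pending stores can be falsely flagged as conflicting. */
    static void copy(float *dst, const float *src, size_t n)
    {
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i];
    }

    int main(void)
    {
        enum { N = 4096, PAD = 16 };   /* 16 floats = 64 bytes of padding */

        /* Carve both buffers from one allocation and offset the second one,
         * so the distance between src and dst is not a multiple of 4096. */
        float *block = malloc((2 * N + PAD) * sizeof *block);
        if (block == NULL)
            return 1;
        float *src = block;
        float *dst = block + N + PAD;

        for (size_t i = 0; i < N; ++i)
            src[i] = (float)i;
        copy(dst, src, N);

        free(block);
        return 0;
    }
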
Prefetchers

Each core has two data prefetchers:
One prefetches for specific instructions (the IP prefetcher)
One prefetcher detects streams

Two further prefetchers are shared by the cores:
An adjacent-line prefetcher
DPL (Data Prefetch Logic), which looks for more complex streams
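
The stream prefetchers help most with regular, sequential access; they cannot help with pointer chasing, where the next address is only known after the current load completes. A small illustrative sketch (not from the slides):

    #include <stddef.h>

    /* Sequential, unit-stride traversal: an access pattern the hardware
     * stream prefetchers can detect, so cache lines arrive ahead of use. */
    long sum_array(const int *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    /* Pointer chasing: the address of the next load is only known once the
     * current load completes, so there is no stream for the prefetchers to
     * detect and each miss pays the full memory latency. */
    struct node { int value; struct node *next; };

    long sum_list(const struct node *p)
    {
        long s = 0;
        for (; p != NULL; p = p->next)
            s += p->value;
        return s;
    }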

Implications of Core 2 Architecture for Developers

Some Words on Pipelines (1)


Modern CPUs are best understood through their basic design paradigm, the so-called pipeline. A pipeline breaks the processing of a single instruction into independent parts that ideally execute in identical time windows; these parts are called pipeline stages.
Since identical processing time in each stage cannot be guaranteed, most stages control a buffer or queue that supplies instructions when the previous stage is still busy, or in which instructions can be stored when the next stage is still busy.
Underflow or overflow of such a queue makes the respective stage run idle and causes a pipeline stall.

[Diagram: pipeline stages with queues between them]

Some Words on Pipelines (2)


In order to achieve the best performance:
Pipeline stalls must be avoided.
Since Core 2 makes heavy use of speculative execution, a mispredicted branch can force a pipeline flush to keep the executed instructions consistent.
Pipeline flushes must be avoided.
Understanding the Core 2 pipeline, and being able to detect pipeline problems, will greatly improve the performance of your software.
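
A common source of pipeline flushes is a data-dependent branch that the predictor cannot learn. A minimal, illustrative sketch (not from the slides): when the sign of the data is effectively random, a branch-free formulation can avoid the mispredictions, at the cost of always doing the work.

    #include <stddef.h>

    /* Unpredictable branch: if the sign of a[i] is essentially random, the
     * branch predictor mispredicts often and each miss flushes the pipeline. */
    long sum_positive_branchy(const int *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; ++i) {
            if (a[i] > 0)
                s += a[i];
        }
        return s;
    }

    /* Branch-free form: the selection becomes arithmetic (or a cmov when the
     * compiler chooses one), so there is nothing to mispredict. Profitable
     * only when the original branch is genuinely hard to predict. */
    long sum_positive_branchless(const int *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; ++i)
            s += (a[i] > 0) ? a[i] : 0;
        return s;
    }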

Core 2 Block Diagram


! "

7/

. /

#
#

%
8

( '$

%
8

( '$

%
( & *

0#
$5 '

,* !-

* %
!+

"

(
#

'
()

$
,

%
8

( '$

$ 6
#

! %
&
&

,*(-

0#

(
$

' 8
2
&
&
,'2 -

*
* 12
,*2 -3

* %

&
&

Vectorization Streaming SIMD Extensions


[Diagram]

Vectorization
Advanced Digital Media Boost: Single-Cycle SSE

[Diagram: 128-bit wide SSE operations executed in a single cycle]
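
The usual ways to exploit the 128-bit SSE units are compiler auto-vectorization or SSE intrinsics. A minimal sketch with intrinsics (the function name and the alignment/size assumptions are illustrative, not from the slides):

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stddef.h>

    /* Adds four packed single-precision floats per SSE instruction.
     * Assumes a, b and c are 16-byte aligned and n is a multiple of 4. */
    void add_arrays_sse(float *c, const float *a, const float *b, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);   /* load 4 floats from a */
            __m128 vb = _mm_load_ps(b + i);   /* load 4 floats from b */
            __m128 vc = _mm_add_ps(va, vb);   /* 4 additions at once  */
            _mm_store_ps(c + i, vc);          /* store 4 results      */
        }
    }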

Making use of Dual Core


Up to now: Performance tuning = Serial performance tuning
Core 2 is very fast even on a single core
Most crucial advantage: Multi-core CPU!
Parallel programming needed
Parallelization concept
Multi-threading
Synchronization
Communication
Load balancing
Multi-core means more effort for developers (a minimal threading sketch follows)
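
As a minimal illustration of that effort (not part of the original slides), an OpenMP reduction splits a loop across the two cores and handles the synchronization of the shared sum:

    #include <stddef.h>

    /* Distributes the iterations across the available cores; on a Core 2 Duo
     * each core handles roughly half of them. The reduction clause gives each
     * thread a private partial sum and combines them at the end, so no
     * explicit locking is needed. Compile with OpenMP enabled, e.g.
     * icc -openmp or gcc -fopenmp. */
    double dot_product(const double *x, const double *y, size_t n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < (long)n; ++i)
            sum += x[i] * y[i];
        return sum;
    }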

Summary
Core 2 Microarchitecture is the state of the art in x86 CPU design:
64-bit
Dual-core
A 14-stage-deep, 4-instruction-wide pipeline
128-bit SIMD registers
Understanding the Core 2 pipeline is essential for the best serial performance
Understanding concurrent programming and efficient parallel design is essential for the best multi-core performance
It's worth the time: multi-core is here to stay
