Documente Academic
Documente Profesional
Documente Cultură
Stephen Blair-Chappell
Technical Consulting Engineer
Intel Compiler Labs
Agenda
Why Multicore
Core 2 Microarchitecture
From high above
A closer look at the cores
In-depth pipeline walk-through
Implications for Developers
Pipeline stalls
vectorization
Making use of dual core
Summary / Q&A
Challenge 1: Power
Power Density Race
10,000
Power Density
(W/cm2)
1,000
100
Suns Surface
Rocket Nozzle
Nuclear Reactor
Hot Plate
8086
10 4004
8085
386
8008
286
486
8080
1
70
80
90
Pentium
processors
00
10
3000
Logic
2500
2000
MHz
GAP
1500
1000
500
0
1992
Memory
1994
1996
1998
2000
2002
Challenge 3: RC Delays
D e la y (p s )
10000
Interconnect RC Delay
Clock Period
1000
100
10
Copper Interconnect
1
0.35
0.25
180nm
130nm
90nm
Process Technology
i486: 16 clocks RAM access roundtrip
65 nm IC: 15 clocks to cross the chip
65nm
1.73x
Performance
1.73x
Power
1.13x
1.00x
0.87x
1.02x
0.51x
Over-clocked
(+20%)
Design
Frequency
Under
Dual--clocked
core
(-(20%)
-20%)
! ((# !$$$,&'01*
! # (.2 -3/
!)$
4 3
1 5 6%
1'
.
Core 2 1 cm above
!) 1
$ 7* 6!
$ 7* 6!3
!-
),* 1
!) 1
98 8
.
8
: ,
,;
),
1
6
:
@:
),* 1
6<
6<
6<
*
0
0,8
6
,,=;
11> ,,=;
11> ,,=;
11>
0/
0/
0/
6!3?
$ 7* 6!3
<
! (%*;
01*
<
*8 A
1
!-
$ 7* 6!
3?-6*
*8
!"
B
98 8
.
8
: ,
!(
,;
),
1
6
B*
5
8
3
)
:
:
@:
<
! (%*;
01*
<
*8 A
# $
B08
8
B
6<
6<
6<
*
0
0,8
6
,,=;
11> ,,=;
11> ,,=;
11>
0/
0/
0/
6!3?
3?-6*
*8
E8 8
Instruction Queue
18 deep instruction queue
8
0
/ 3
98 8
.
8
: ,
,;
),
1
6
3
)
:
:
@:
<
*8 A
1
<
! (%*;
01*
;
)
6<
6<
6<
*
0
0,8
6
,,=;
11> ,,=;
11> ,,=;
11>
0/
0/
0/
6!3?
3?-6*
*8
Instruction Decode
Macro Fusion
8
0
/ 3
6
98 8
.
8
: ,
,;
),
1
6
3
)
:
:
@:
<
*8 A
1
<
! (%*;
01*
;
)
6<
6<
6<
*
0
0,8
6
,,=;
11> ,,=;
11> ,,=;
11>
0/
0/
0/
6!3?
3?-6*
*8
Without
Macro-Fusion
Instruction Queue
inc
esp
targ
cmp
eax, [mem2]
load
eax, [mem1]
Cycle 1
Cycle 2
inc
esp
store [mem3], ebx
jne
targ
cmp eax, [mem2]
load eax, [mem1]
Instruction Queue
inc
esp
targ
cmp
eax, [mem2]
load
eax, [mem1]
inc
esp
store [mem3], ebx
cmpjne eax, [mem2], targ
load eax, [mem1]
Execution (In-order/Out-of-Order)
Up to six uOps dispatch per clock
8
0
/ 3
98 8
.
8
: ,
,;
),
1
6
3
)
:
:
@:
<
*8 A
1
<
! (%*;
01*
;
)
6<
6<
6<
*
0
0,8
6
,,=;
11> ,,=;
11> ,,=;
11>
0/
0/
0/
6!3?
3?-6*
*8
Memory
One load and one store per clock
8
0
/ 3
98 8
.
8
: ,
,;
),
1
6
3
)
:
:
@:
<
*8 A
1
<
! (%*;
01*
;
)
6<
6<
6<
*
0
0,8
6
,,=;
11> ,,=;
11> ,,=;
11>
0/
0/
0/
6!3?
3?-6*
*8
*0
88
0
3*8
>
*0
88
1
*8
>
*0
88
>
*0
88
>D
*88
:
*8
7/
. /
#
#
%
8
( '$
%
8
( '$
%
( & *
0#
$5 '
,* !-
* %
!+
"
(
#
'
()
$
,
%
8
( '$
$ 6
#
! %
&
&
,*(-
0#
(
$
' 8
2
&
&
,'2 -
*
* 12
,*2 -3
* %
&
&
1 ,3 >D
11>
@11>A
"! "?
!
8
38 !
0
! 0
)
>
<
$1 ,3 8
11>
D
8
F
38
0
! 0
Vectorization
Advanced Digital Media Boost Single Cycle SSE
((726
1 <: >
@11>;
11> ;
11>$A
!
=)
=$
=!
G)
G$
G!
11>; ;
$ /
3>1-
9
6
7 G 6>!
# 5
6
7 G 6>
=)
6
=)
G)
=$ G$ =
7 G 6>!
G)
=$ G$
=! G!
=! G!
Summary
Core 2 Microarchitecture is the state-of-the-art in x86 CPU design
64bit
Dual-core
14 stages deep 4 instructions wide pipeline
128-bit SIMD registers
Understanding the Core2 pipeline is compulsory for the best serial performance
Understanding concurent programming and efficient parallel design are compulsory for
best multi-core performance
Its worth the time Multi-Core is here to stay