Intel Processor Architecture-Core

Intel® Core™ Microarchitecture
Intel® Software College

Objectives
After completion of this module you will be able to describe

• Components of an IA processor
• Working flow of the instruction pipeline
• Notable features of the architecture
Intel® Processor Micro-architecture - Core® microarchitecture
Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.
Agenda
Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations
Industry Recognition
PC Format May 2006

“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”
Intel's Next Generation Microarchitecture Unveiled, Real World Tech

“Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”
Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com

“…the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session”
Intel Regains Performance Crown, Anandtech

“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance
than what we’ve seen here.”
Intel Reveals Conroe Architecture, Extremetech

“… And not only was the Intel system running at 2.66GHz— a slower
clock rate than the top Pentium 4—it was outpacing an overclocked
Athlon 64 FX-60. Wrap your brain around that idea for a bit…”
Conroe Benchmarks - Intel Showing Big Strength, Hot Hardware.com

“… Intel is poised to change the face of the desktop computing landscape…”
Performance Summary
Intel® Core™ Microarchitecture dramatically boosts Intel platform performance
• Conroe & Woodcrest drive clear Desktop/Server performance leadership
• Merom extends Intel Mobile performance leadership
Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
• Intel’s 3rd generation dual-core (while competition stuck on 1st generation)
• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient
Best Processor on the Planet: Energy-Efficient Performance
The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations
• 20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts (1)
(1) Based on SPECint*_rate_base2000
Agenda
Introduction
• Architecture VS Microarchitecture
• CISC VS RISC
• Performance Measurements
• Pipeline Design
• Power and Energy
• Chip Multi-Processing
Notable features
Architecture and Micro-architecture
What is Computer Architecture?

• Architecture is the set of features which are externally visible:
• Instruction set
• Registers
• Addressing modes
• Bus protocols
Intel Architectures (IA)
• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
• X87 (Floating Point extension)
• MMX (Multi-Media extension)
• SSE, SSE2, SSE3 (Streaming SIMD Extensions)
• Intel® 64/EM64T (64-bit Integer extension of IA32)
• IA64 (Intel new 64-bit architecture)
• Itanium/Itanium2 processor family
Architecture and Micro-architecture (cont.)
What is Micro-architecture?
• Same as µ-Architecture or u-Architecture
• “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)
• Programs run faster → Improved Performance
• Reduced Power consumption → Extended Battery life
• H/W fits into Smaller Form Factor
Intel® Architecture History
Architecture: instruction set definition and compatibility
• Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)
Microarchitecture: hardware implementation maintaining instruction set compatibility with high-level architecture
• Examples: P5, P6, Intel NetBurst®, Banias
Processors: productized implementation of microarchitecture
• Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Xeon®, Pentium® M
* IXA – Intel Internet Exchange Architecture / EPIC – Explicitly Parallel Instruction Computing
Intel® Core™ Microarchitecture Processors
Intel® NetBurst® Microarchitecture + Mobile Microarchitecture + New Innovations
→ Intel® Core™ 2 Duo/Quad/Extreme processors
RISC Approach to CPU design

(RISC = Reduced Instruction Set Computers)
Optimize H/W for common basic operations
• Fixed instruction length
• Shorter Execution Pipeline
• Ease of Instruction Level Parallelism
• Large number of registers
• Fewer memory accesses
• ‘Load/Store’ architecture
• Shorter Execution Pipeline
• Ease of advancing Loads
• Branch Hints
• Reduce pipeline flush events
• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
• No ‘complex’ H/W instructions
• Handle exceptional conditions in S/W
Examples: MIPS, IBM Power and PowerPC, Sun Sparc
Achieve maximum performance by right partitioning between H/W and S/W
CISC Approach to CPU design
(CISC = Complex Instruction Set Computers)

Rich architecture
• Variable length instructions.
• Complex addressing modes.
On-chip HW / SW partitioning required
• H/W keeps executing ‘simple’ stuff
• Complex instructions are ‘emulated’ using u-code routines
from ROM
• More instructions treated as ‘simple’ as more H/W is available
COMPATIBILITY has some major advantages:
• Large (and forever increasing) software base
• Code development tools
• Expertise
• H/W - S/W spiral
Example: Intel IA32, Motorola 680X0
Maximize information passed to the HW

Performance Measurement
Performance is the reciprocal of the “Time of execution”:
$$\text{Performance} \approx \frac{1}{\text{Time of Execution}} = \frac{1}{L \cdot CPI \cdot T_c}$$
Where:
L = Code Length (# of machine instructions)
CPI = Clock cycles Per Instruction
Tc = Clock period (nSecs)
Substitute:
IPC = Instructions Per Cycle = 1/CPI
F = Frequency = 1/Tc
$$\text{Performance} \approx \frac{IPC \cdot F}{L}$$
• Higher IPC: Improve ILP
• Higher F: Improve Timing
• Lower L: Arch Enhancements
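A worked example of the formula as a minimal C sketch. The function names and the sample figures in the note below are illustrative assumptions, not measurements.

```c
#include <assert.h>

/* Execution time in seconds: L * CPI * Tc, written as L / (IPC * F)
   using the substitutions IPC = 1/CPI and F = 1/Tc. */
double execution_time_s(double instructions, double ipc, double freq_hz)
{
    assert(ipc > 0.0 && freq_hz > 0.0);
    return instructions / (ipc * freq_hz);
}

/* Relative performance of configuration B over configuration A:
   performance is the reciprocal of time, so the ratio is time_a / time_b. */
double speedup(double time_a_s, double time_b_s)
{
    assert(time_b_s > 0.0);
    return time_a_s / time_b_s;
}
```

For example, 1e9 instructions at IPC 1.0 and 3.0 GHz take about 0.333 s, while the same code at IPC 2.0 and 2.66 GHz takes about 0.188 s: a lower clock can still win when IPC rises enough.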
Performance Measurement (cont.)
Performance considerations:
• Which Code/Application to run?
• Which OS?
• Which other components in the platform?
• Under which thermal conditions?
• Multithreading? Multiprocessing?
Benchmarks examples
• Industry Standard: Spec (ISPEC, FSPEC), TPC
• Commercial: SysMark, MobileMark, PCMark, Sandra, ScienceMark
• Applications: Video (Windows Media encoder, DivX), Audio (Lame MP3), Compression (RAR), Content creation (3DSM, Photoshop, Premiere), Latest Games (Doom III, FarCry, but changes fast)
• Specific industries use specific benchmarks: Linux compilation, POVRay, LinPack, lmbench
Design Considerations for Different Market Segments
Constraints:
• Desktop: thermally and area constrained
• Extreme: unconstrained
• Value: very area constrained
• Mobile: thermally, energy and area constrained
• Servers: thermally and energy constrained
Micro-architecture is the Art of Tradeoffs between:
• Schedule
• Requirements / Standards
• Performance
• Features
• Power / Energy
• Area / Cost
Design Metrics
IPC = Instructions per Cycle

• The more the better
Latency – same as Response Time
• The time interval between
• when any request for data is made and
• when the data transfer completes
• The less the better
Throughput
• The amount of work completed by the system per unit of time.
• The more the better
• ops/sec
CPU Pipeline
Break the work into smaller pieces

• Four basic stages of instruction life
• Fetch - bring instruction to core
• Decode - read operands from register
• Execute - perform the operation
• Writeback - save result to register
• Execution timing of simple instructions
(legend: “op src1,src2 dst”)
add eax, ebx eax F D E W
sub ecx, edx ecx F D E W
Increased throughput
• increased number of completed instructions per cycle
Pipeline Design - Explore Parallelism

A new instruction does not always depend on the previous one
• Can start new instruction before previous one is finished
• ...if different stages use different H/W resources
Run instructions in parallel (pipeline)
Add eax, ebx eax   F D E W
Sub ecx, edx ecx     F D E W
Or edi, esi edi        F D E W
Need to balance pipe stages
• Each stage should take same time for best throughput and utilization
Clock cycle is determined by the longest path!
Pipe stages: Fetch, Decode, Exec, WB
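The benefit of overlapping stages can be put into numbers with a small C sketch. It assumes an ideal pipeline with no stalls and equal stage times; the function names are illustrative.

```c
#include <assert.h>

/* Cycles for n instructions on a k-stage pipeline with no stalls:
   the first instruction needs k cycles, each later one finishes one cycle after. */
unsigned pipelined_cycles(unsigned n, unsigned stages)
{
    assert(stages > 0);
    return n == 0 ? 0 : stages + (n - 1);
}

/* Same work without overlap: every instruction occupies all stages by itself. */
unsigned unpipelined_cycles(unsigned n, unsigned stages)
{
    assert(stages > 0);
    return n * stages;
}
```

For the three-instruction, four-stage example this gives 6 cycles pipelined against 12 cycles without overlap.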

Pipeline Design – Fighting Stalls
Data flow dependency (instructions output/input)

• Solved by bypasses, renaming etc
Control flow dependencies
• Solved by branch prediction
Others (Cache misses, long latency instructions)
• Solved by other dynamic scheduling techniques
Race of CISC vs. RISC
In modern CPUs Advanced µ-Architecture Techniques minimize the advantages of RISC over CISC
• Branch Prediction
• Reduces the effect of extra pipeline stages
• Register Renaming
• Effectively Increase the Number of Registers
• Out Of Order
• Reduce Number of stalls caused by shortage of registers
• Speculative Execution
• Further Reduce Number of stalls
• Power saving features
• Reduce the overhead when not needed.
µop – Intel’s Take on the CISC/RISC Race
(CISC) instructions are translated into one or more (RISC) uops (micro-operations)
• Fixed format
• Wide and simple
• Temp registers
Usually one uop per instruction
A complex instruction can expand to thousands of uops
Stores are divided into two uops (STA and STD)
Fusion plays games here: it changes how many uops an instruction produces
Power and Energy
Maximum power (TDP):

• Cooling requirements
• Cooling solution
• Computer form factor and acoustic noise
Average power
• Battery life
• Electricity bill
General calculation:
• P = frequency * voltage^2 * activity factor * capacitance + leakage
Reducing TDP
• Fewer transistors and wires
• Smaller transistors and wires
• Power features → less activity
• Low leakage transistors
Reducing average power
• Energy efficiency
• Power states
• Lower leakage
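The general calculation can be written out as a short C sketch. The function name, parameter names and units are assumptions for illustration; real power models are far more detailed.

```c
#include <assert.h>

/* P = frequency * voltage^2 * activity factor * capacitance + leakage.
   Units assumed here: Hz, volts, farads (switched capacitance), watts. */
double power_watts(double freq_hz, double voltage_v, double activity,
                   double capacitance_f, double leakage_w)
{
    assert(activity >= 0.0 && activity <= 1.0);
    return freq_hz * voltage_v * voltage_v * activity * capacitance_f + leakage_w;
}
```

Because voltage enters squared, lowering frequency and voltage together to 90% cuts the dynamic term to roughly 0.9 × 0.9² ≈ 73% of its original value.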
Dual/Multi Core and SMT

Put more than one core per package
Architectural change:
• Software must be multi-threaded or multi-process
• …but backward compatible with multiprocessor systems (MP)
Several ways of implementing it
• All of them being used
[Figure: several arrangements of Cores, last-level caches (LLC) and I/O within a package]
SMT: Run two (or more) threads on the same core, simultaneously
Intel Approach
[Figure: timeline of Intel® processors by threads per package: Q4 2000 Pentium® 4 (1 thread); Q2 2003 Pentium® 4 with HT (2 threads); Q2 2005 Pentium® D (2 threads); Q3 2006 Core™ 2 Duo (2 threads); Q4 2006 QX6700* (4 threads); a future processor marked “?” (80 threads). Legend: State, Execution Units, Cache, Bus.]
While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread level parallelism.
An “Acronym Cheat Sheet” of Parallel Computing
CMP: Chip Multi Processor (two or more cores per package)
• Dual Core: two cores in same package
• Quad Core: four cores in same package
DP: Dual Processor (two packages)
MP: Multi Processor (four or more packages)
SMT: Simultaneous Multi Threading (virtual multi core: HyperThreading)
Agenda
Introduction
Notable features
• Wide Dynamic Execution
• Smart Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Intelligent Power Capability
Intel® Core® Micro-architecture Notable Features
Intel® Wide Dynamic Execution
• 14-stage efficient pipeline
• Wider execution path
• Advanced branch prediction
• Macro-fusion
• Roughly 15% of all instructions are conditional branches
• Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
• Micro-fusion
• Merges the load and operation micro-ops into one micro-op
• 64-Bit Support
• Merom, Conroe, and Woodcrest support EM64T
[Block diagram: Instruction Fetch and PreDecode; Instruction Queue; Decode with uCode ROM; Rename/Alloc; Retirement Unit (ReOrder Buffer); Schedulers; three ALU ports (Branch, FAdd, FMul; MMX/SSE and FPmove on each); Load; Store; L1 D-Cache and D-TLB; 2M/4M shared L2 Cache; FSB up to 10.4 Gb/s. Widths of 5, 4 and 4 are marked between stages.]

Features (cont.)
Intel® Advanced Memory Access
• Improved prefetching
• Memory disambiguation
• Advance load before a possible data dependency (pointer conflict)
• Earlier loads hide memory latencies

Features (cont.)
Intel® Advanced Smart Cache
• Multi-core optimization
• Shared between the two cores
• Advanced Transfer Cache architecture
• Reduced bus traffic
• Both cores have full access to the entire cache
• Dynamic Cache sizing

Features (cont.)
Advantages of Shared Cache
[Figure: CPU1 and CPU2 with separate L2 caches, connected to Memory over the Front Side Bus (FSB). Shipping an L2 cache line between them costs about half a memory access.]

Features (cont.)
Advantages of Shared Cache (cont.)
[Figure: CPU1 and CPU2 sharing one L2 cache, connected to Memory over the Front Side Bus (FSB). L2 is shared: no need to ship the cache line.]

Features (cont.)
Intel® Advanced Digital Media Boost
• Single Cycle SIMD Operation
• 8 Single Precision Flops/cycle
• 4 Double Precision Flops/cycle
• Wide Operations
• 128-bit packed Add
• 128-bit packed Multiply
• 128-bit packed Load
• 128-bit packed Store
• Support for Intel® EM64T instructions
[Figure: SIMD Operation (SSE/SSE2/SSE3/SSSE3) on 128-bit SOURCE (X4 X3 X2 X1) and DEST (Y4 Y3 Y2 Y1). Core™ µarch: X4opY4, X3opY3, X2opY2, X1opY1 all in clock cycle 1. Previous: X2opY2, X1opY1 in clock cycle 1, then X4opY4, X3opY3 in clock cycle 2.]

Features
Intel® Advanced Digital Media Boost
• Additional Media Instructions - Supplemental Streaming SIMD
Extensions 3 (SSSE3)
• 16 new packed integer instructions
• Targeting video encode/decode
• Significantly improved strings
• REP MOVS and REP STOS
• ~8 bytes / cycle throughput
• mileage may vary

Features
Intel® Advanced Digital Media Boost
• Supplemental SSE-3 (SSSE-3)
• Horizontal Addition/Subtraction: PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
• Packed Absolute Values: PABSB, PABSW, PABSD
• Multiply and Add Packed Signed/Unsigned bytes: PMADDUBSW
• Packed multiply High with Round and Scale: PMULHRSW
• Packed Shuffle Bytes: PSHUFB
• Packed SIGN: PSIGNB/W/D
• Packed Align Right: PALIGNR

Features (cont.)
Intelligent Power Capability
• Advanced power gating & Dynamic power coordination
• Multi-point demand-based switching
• Voltage-Frequency switching separation
• Supports transitions to deeper sleep modes
• Event blocking
• Clock partitioning and recovery
• Dynamic Bus Parking
• During periods of high performance execution, many parts of the
chip core can be shut off
Agenda
Introduction
Notable features
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Intel® Core® Micro-architecture Drill-down
[Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS, register alias table, ALLOC, Reservation Station, Re-Order Buffer, execution units (integer, FP, SIMD; 3x), load, store address, store data, memory order buffer, data cache unit, page miss handler]
Agenda
Introduction
Knowledge refreshment
Notable features
• Front End
Core® Micro-architecture Front End
Instruction preparation before execution
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
[Block diagram: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]
Instruction Queue
Buffer between instruction pre-decode unit and decoder

• up to six predecoded instructions written per cycle
• 18 Instructions contained in IQ
• up to 5 Instructions read from IQ
Potential Loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instruction
• Potential power saving
Macro - Fusion
Roughly 15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode
[Figure: Scheduler → “cmpjae eax, [mem], label” → Execution → Branch Eval → flags and target to Write back]
Macro-Fusion Absent
Read four instructions from Instruction Queue
Each instruction gets decoded into separate uops
Enabling Example
for (int i=0; i<100000; i++) {
…
}
Instruction Queue
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jge label
Cycle 1
addps xmm0, [EAX+16] → dec0
mulps xmm0, xmm0 → dec1
movps [EAX+240], xmm0 → dec2
cmp eax, 100000 → dec3
Cycle 2
jge label → dec0
Macro-Fusion Present
Read five instructions from Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions
Enabling Example
for (unsigned int i=0; i<100000; i++) {
…
}
Instruction Queue
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jae label
Cycle 1
addps xmm0, [EAX+16] → dec0
mulps xmm0, xmm0 → dec1
movps [EAX+240], xmm0 → dec2
cmpjae eax, 100000, label → dec3
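The fused example differs from the unfused one in using an unsigned loop counter and an unsigned jump (jae instead of jge). A minimal C sketch of such a loop follows; the function name is hypothetical, and whether a given compiler emits a cmp plus unsigned-jump pair, and whether that pair fuses, is an assumption that depends on the compiler and the processor.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array using an unsigned loop counter. The unsigned comparison
   i < count allows the loop test to be compiled to cmp followed by an
   unsigned conditional jump (jae/jb), the kind of pair shown fusing above. */
float sum_unsigned_index(const float *data, unsigned int count)
{
    assert(data != NULL || count == 0);
    float total = 0.0f;
    for (unsigned int i = 0; i < count; i++) {
        total += data[i];
    }
    return total;
}
```

Inspecting the compiler's assembly output is the only way to confirm which compare and jump instructions were actually generated.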
Instruction Decode / Micro-Op Fusion
Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation
Micro-op fusion effectively widens the pipeline
Instruction Decode / Micro-Fusion (cont.)
u-ops of a Store “movps [EAX+240], xmm0”
• sta eax+240
• std xmm0, [eax+240]
• Fused into one micro-op: st xmm0, [eax+240]
Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector
Branch mispredictions reduced by >20%
Agenda
Introduction
Notable features
• Front End
Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit
[Block diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer, execution units (integer, FP, SIMD; 3x), load, store address, store data]
Execution Core Building Blocks
[Block diagram: Renamer, RS and ROB with the issue port numbers. Ports 0, 1 and 5 feed the execution units (integer, SIMD including MUL, floating point); port 2 feeds Load; ports 3 and 4 feed Store; loads and stores go to the Memory Sub-system.]
Issue Ports and Execution Units

6 dispatch ports from RS
• 3 execution ports
• (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)
Retirement Unit
ReOrder Buffer (ROB)

• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• updates the architectural state in order
• manages ordering of exceptions
Agenda
Introduction
Notable features
• Front End
Core® Micro-architecture Memory Sub-System
Memory Ordering Buffer
• Store Address Buffer
• Stores the address of each store not actually performed
• Loads compare address to any store older than itself
• If it finds a hole…
• Store Data Buffer
• Stores data of each store not actually performed
• If a load hits in the SAB, it forwards the data from here
• Load Buffer
• Stores address of non-retired loads
• For snoops and re-dispatch
• One 128-bit load and one 128-bit store per cycle to different
memory locations
• Out of order Memory operations
Core® Micro-architecture Memory Sub-System (cont.)
32k D-Cache (8-way, 64 byte line size)
Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache
Cache to cache transfer
• improves producer / consumer style MP
Wider interface to L2
• reduced interference
• processor line fill is 2 cycles
Higher bandwidth from the L2 cache to the core
• ~14 clock latency and 2 clock throughput
Load & Store Access order
1. L1 cache of immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory
[Figure: Core1 and Core2 sharing a 2 MB L2 Cache connected to the Bus]
Advanced Memory Access / Enhanced Data Pre-fetch Logic
Speculates the next needed data and loads it into cache by HW and/or SW
[Figure: parking analogy. Door = L1 Cache; Valet Parking Area = L2 Cache; Main Parking Lot = External Memory]
Advanced Memory Access / Enhanced Data

Pre-fetch Logic (cont.)
• L1D cache prefetching
• Data Cache Unit Prefetcher
• Known as the streaming prefetcher
• Recognizes ascending access patterns in recently loaded data
• Prefetches the next line into the processor's cache
• Instruction Based Stride Prefetcher
• Prefetches based upon a load having a regular stride
• Can prefetch forward or backward 2 Kbytes
• 1/2 default page size
• L2 cache prefetching: Data Prefetch Logic (DPL)
• Prefetches data to the 2nd level cache before the DCU requests
the data
• Maintains 2 tables for tracking loads
• Upstream – 16 entries
• Downstream – 4 entries
• Every load is either found in the DPL or generates a new entry
• Upon recognition of the 2nd load of a “stream” the DPL will
prefetch the next load
Advanced Memory Access / Memory

Disambiguation
Memory Disambiguation predictor
• Loads that are predicted NOT to forward from preceding store
are allowed to schedule as early as possible
• increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement

• Extension to existing coherency mechanism
• Invisible to software and system
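A minimal C sketch of the situation the predictor deals with: a load that follows a store through a different pointer. The function name is illustrative. The C result is the same either way; what changes is whether the hardware may issue the load before the store's address is known.

```c
#include <assert.h>
#include <stddef.h>

/* The load of *src follows a store to *dst. If the pointers do not alias,
   the load is independent of the store and can be issued early. If they do
   alias, the load must observe the stored value. */
int store_then_load(int *dst, const int *src, int value)
{
    assert(dst != NULL && src != NULL);
    *dst = value;      /* older store */
    return *src + 1;   /* younger load: independent unless src == dst */
}
```

When the prediction is wrong (the pointers did alias), the check at retirement catches it, so software never sees a stale value.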

Disambiguation Absent
Load4 must WAIT until previous stores complete
[Figure: program order Store1 Y, Load2 Y, Store3 W, Load4 X, against memory holding Data W, Data Z, Data Y, Data X]

Disambiguation Present
Loads can decouple from stores
Load4 can get its data WITHOUT waiting for stores
[Figure: Load4 X is issued ahead of Store1 Y, Load2 Y and Store3 W; memory holds Data W, Data Z, Data Y, Data X]
Advanced Memory Access / Store Forwarding
If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load
[Figure: Store1 Y followed by Load2 Y; Data Y passes through internal buffers instead of Memory]

Forwarding: Aligned Store Cases
[Figure: for each aligned store, the aligned loads that fit inside it]
• store 16 bit: load 16; ld 8 ×2
• store 32 bit: load 32 bit; load 16 ×2; ld 8 ×4
• store 64 bit: load 64 bit; load 32 bit ×2; load 16 ×4; ld 8 ×8
• store 128 bit: load 128 bit; load 64 bit ×2; load 32 bit ×4; load 16 ×8; ld 8 ×16

Forwarding: Unaligned Cases
Note that unaligned store forward does not occur when the load crosses a cache line boundary
[Figure: loads inside each unaligned store; legend: “Store forwarded to load” and “No forwarding”]
• store 16 bit: load 16‡; ld 8 ×2
• store 32 bit: load 32 bit‡; load 16‡, load 16; ld 8 ×4
• store 64 bit: load 64 bit; load 32 bit‡, load 32 bit; load 16‡, load 16 ×3; ld 8 ×8
‡: No forwarding if the load crosses a cache line boundary
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding
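Two of the conditions above can be checked in code. The sketch below is a simplified model with a hypothetical function name: it tests only necessary conditions (the load lies entirely inside one store and does not cross a 64-byte cache line). The exact size and alignment cases shown in the figures are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u  /* 64 byte line size, as for the L1 D-cache */

/* Necessary, not sufficient, conditions for store-to-load forwarding:
   every byte the load reads was written by this one store, and the load
   does not cross a cache line boundary. */
bool may_forward(uint64_t store_addr, unsigned store_bytes,
                 uint64_t load_addr, unsigned load_bytes)
{
    assert(store_bytes > 0 && load_bytes > 0);
    bool contained = load_addr >= store_addr &&
                     load_addr + load_bytes <= store_addr + store_bytes;
    bool one_line = (load_addr / CACHE_LINE_BYTES) ==
                    ((load_addr + load_bytes - 1) / CACHE_LINE_BYTES);
    return contained && one_line;
}
```

A 4-byte load from the upper half of an 8-byte store passes the check; an 8-byte load that needs data from two separate 4-byte stores does not.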
Agenda
Introduction
Notable features
Optimizing for
Instruction Fetch and PreDecode
Avoid “Length Changing Prefixes” (LCPs)
• Affects instructions with immediate data or offset
• Operand Size Override (66H)
• Address Size Override (67H) [obsolete]
• LCPs change the length decoding algorithm – increasing the
processing time from one cycle to six cycles (or eleven cycles
when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
• The REX prefix does lengthen the instruction by one byte, so use
of the first eight general registers in EM64T is preferred
Optimizing for
Instruction Queue
Includes a “Loop Stream Detector” (LSD)
• Potentially very high bandwidth instruction streaming
• A number of requirements to make use of the LSD
• Maximum of 18 instructions in up to four 16-byte packets
• No RET instructions (hence, little practical use for CALLs)
• Up to four taken branches allowed
• Most effective at 70+ iterations
• LSD is after PreDecode so there is no added cost for LCPs
• Trade-off LSD with conventional loop unrolling
Optimizing for
Decode
Decoder issues up to 4 uOps for renaming/ allocation per clock
• This creates a trade off between more complex instruction
uOps versus multiple simple instruction uOps
• For example, a single four uOp instruction is all that can be
renamed/allocated in a single clock
• In some cases, multiple simple instructions may be a better
choice than a single complex instruction
• Single uOp instructions allow more decoder flexibility
• For example, 4-1-1-1 can be decoded in one clock
• However, 2-2-2-1 takes three clocks to decode
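The two decode examples can be reproduced with a simplified model, sketched here in C. It assumes one decoder that accepts instructions of up to four uOps and three decoders that accept single-uOp instructions only, all in program order; the function name is illustrative and real decoders have further restrictions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified 4-1-1-1 decoder model. Per clock, decoder 0 takes one
   instruction of up to 4 uOps, and decoders 1-3 each take one single-uOp
   instruction, in program order. Returns the clocks needed to decode. */
unsigned decode_clocks(const unsigned *uops, size_t count)
{
    unsigned clocks = 0;
    size_t i = 0;
    while (i < count) {
        assert(uops[i] >= 1 && uops[i] <= 4);  /* longer flows are not modeled */
        i++;                                   /* decoder 0 accepts this one */
        for (int d = 1; d <= 3 && i < count && uops[i] == 1; d++) {
            i++;                               /* decoders 1-3: single-uOp only */
        }
        clocks++;
    }
    return clocks;
}
```

With this model the sequence 4-1-1-1 decodes in one clock and 2-2-2-1 in three, because each two-uOp instruction has to wait for decoder 0.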
Optimizing for
Execution
Up to six uOps can be dispatched per clock
• “Store Data” and “Store Address” dispatch ports are combined on
the block diagram
Up to four results can be written back per clock
Single clock latency operations are best
• Differing latency operations can create writeback conflicts
• Separate multiple-clock uOps with several single uOp instructions
• Typical instructions here: ADC/SBB, RMW, CMOVcc
• In some cases, separating a RMW instruction into its pieces might be
faster (decode and scheduling flexibility)
When equivalent, PS preferred to PD (LCP)
• For example, MOVAPS over MOVAPD, XORPS over XORPD
Optimizing for
Execution (cont.)
Bypass register “access” preferred to register reads
Partial register accesses often lead to stalls
• Register size access that ‘conflicts’ with recent previous register
write
• Partial XMM updates subject to dependency delays
• Partial flag stalls can occur too, at much higher cost
• Use a TEST instruction between the shift and the conditional to prevent this
• Common zeroing instructions (e.g., XOR reg,reg) don’t stall
Avoid bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
Vectorization: careful packing/unpacking sequence
• Use MXCSR’s FZ and DAZ controls as appropriate
Optimizing for
Memory
Software prefetch instructions
• Can reach beyond a page boundary (including page walk)
• Prefetches only when it completes without an exception
General techniques to help these prefetchers
• Organize data in consecutive lines
• In general, increasing addresses are more easily prefetched
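A minimal C sketch of the two access patterns, using hypothetical function names. Both functions return the same sum; the first walks consecutive, increasing addresses, the second jumps a whole row ahead on every access.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major traversal of a row-major array: consecutive, increasing
   addresses, the pattern that is most easily prefetched. */
long sum_row_major(const int *matrix, size_t rows, size_t cols)
{
    assert(matrix != NULL || rows * cols == 0);
    long total = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            total += matrix[r * cols + c];   /* stride of one element */
        }
    }
    return total;
}

/* Column-major traversal of the same array: same result, but each access
   is cols elements away from the previous one. */
long sum_col_major(const int *matrix, size_t rows, size_t cols)
{
    assert(matrix != NULL || rows * cols == 0);
    long total = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            total += matrix[r * cols + c];   /* stride of cols elements */
        }
    }
    return total;
}
```

How large the difference is depends on array size and cache capacity, so measure on the target system.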
Summary
What has been covered

• Notable features of Core® Micro-architecture
• Wide Dynamic Execution
• Advanced Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Power Efficient Support
• Core® Micro-architecture components
• Front End
• OOO execution core
• Memory sub-system
Platform
Intel provides most of the silicon on any computer
Classical platform partition
• CPU – Computation
• MCH – high speed IO
• ICH – low speed IO
Graphics speed and memory latencies will require different partition
This presentation focuses on the core microarchitecture
[Block diagram: CPU (two Cores, LLC, Legacy & Debug I/O, FSB) connected over the FSB to the MCH (HD video, ME, MEM/DDR, Graphics, PCIe/PEG, Display, TVout/Analog, DMI), connected over DMI to the ICH (Wireless, PCI (IO), SATA, USB, KBRD, others)]
Intel® 64 = Extending IA-32 to 64 Bit
Extended Memory Addressability (64-Bit Pointers, Registers)
+ Additional Registers (8 SSE & 8 General Purpose)
+ Double Precision (64-bit) Integer Support
= IA-32 with 64-Bit Extension Technology
Added to Intel XEON™ and Pentium® 4 Processor in 2004; today available in all mainstream Intel IA-32 processors, in particular in all processors based on Intel® Core™ Architecture
Intel® 64 - New Modes of Operation
Mode | OS required | Compile required | 64-bit IP | RIP Rel. | New Regs | GPR Width | Default Addr Size | Default Operand Size
Long Mode: 64-bit Mode | New 64-bit OS | Yes | Yes | Yes | Yes | 64 | 64 | 32
Long Mode: Compatibility Mode | New 64-bit OS | No | Yes | No | No | 32 | 32 / 16 | 32 / 16
Legacy Mode (IA32 Mode) | Legacy 32-bit or 16-bit OS | No | No | No | No | 32 | 32 / 16 | 32 / 16
Registers : Extensions and Additions
[Figure: register set. RIP (64-bit) extends EIP. General-purpose registers RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP (bits 63-0) extend EAX … ESP (bits 31-0); R8-R15 are new. SSE registers XMM0-XMM7 (bits 127-0) are joined by new XMM8-XMM15. X87/MMX registers are 80 bits wide (bits 79-0).]
Registers : Availability in different modes
64-bit Mode of Operation
Default data size is 32-bits

• Override to 64-bits using new REX prefix
All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable
REX prefixes
• A family of 16 prefixes, encoded 0x40-0x4F
• Allows the use of general purpose registers as 64-bits
• Allows the use of new registers (like r8-r15)
Instructions that set a 32 bit register automatically zero extend
the upper 32-bits
REX Prefix
A new instruction-prefix byte used in 64-bit mode

• Specify the new GPRs and SSE registers
• Specify a 64-bit operand size.
• Specify extended control registers (used by system software)
An instruction can only have one REX prefix and, if used, it must immediately precede the opcode or the two-byte opcode escape prefix.
The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
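The 16 encodings 0x40-0x4F come from four flag bits in the low nibble of the prefix byte. A minimal C sketch of the layout follows; the function name is illustrative, and the bit names W, R, X and B follow the usual architecture-manual naming.

```c
#include <assert.h>
#include <stdint.h>

/* Build a REX prefix byte with the bit layout 0100WRXB.
   W = 64-bit operand size; R, X and B supply the extra high bit for the
   ModRM.reg, SIB.index and ModRM.rm/SIB.base register fields. */
uint8_t rex_prefix(int w, int r, int x, int b)
{
    assert(((w | r | x | b) >> 1) == 0);   /* each flag must be 0 or 1 */
    return (uint8_t)(0x40 | (w << 3) | (r << 2) | (x << 1) | b);
}
```

For example, rex_prefix(1, 0, 0, 0) gives 0x48, the leading byte in the encoding 48 89 D8 of mov rax, rbx.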
Physical and Linear Addressing
Linear Addressing
• Initial Intel® 64 implementations support 48 bits of virtual addressing.
• Addresses are required to be in canonical form: bits 47 through 63 must all be 1 or all be 0.
Physical Addressing
• Initial NetBurst™ Intel® 64 implementations support 36 bits; all current processors support at least 40 bits
• Entries in page tables expanded for up to 52 bits of physical address.
Intel®64 - Large Memory Considerations
Canonical addressing for 64 bit addresses

• Although the architecture now allows calculating flat
addresses to 64 bits, today’s processors limit virtual
addressing to 48 bits
• Canonical address definition: An address that has address
bit 63 through 47 set to either all ones or all zeros
• Canonical addresses are a requirement
• Values for addresses that are not canonical will cause faults
when put into locations expecting a valid address, such as
segment registers
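The canonical address definition translates directly into a bit test. A minimal C sketch with an illustrative function name; va_bits is the number of implemented virtual address bits (48 for the processors described here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when bits 63 down to (va_bits - 1) are all zeros or all ones.
   With va_bits = 48 this tests bits 63 through 47. */
bool is_canonical(uint64_t addr, unsigned va_bits)
{
    assert(va_bits >= 1 && va_bits <= 64);
    uint64_t upper = addr >> (va_bits - 1);           /* bits 63..va_bits-1 */
    uint64_t all_ones = UINT64_MAX >> (va_bits - 1);  /* same field, all set */
    return upper == 0 || upper == all_ones;
}
```

With 48 bits, 0x00007FFFFFFFFFFF and 0xFFFF800000000000 are canonical, while 0x0000800000000000 is not.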
Introducing SIMD: Single Instruction Multiple Data
Scalar processing
• traditional mode
• one operation produces one result
SIMD processing
• with SSE / SSE2
• one operation produces multiple results
[Figure: scalar X + Y = X+Y; SIMD (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)]
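The same addition written both ways, as a minimal C sketch using SSE intrinsics from xmmintrin.h. The function names are illustrative, and the SIMD version assumes the element count is a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Scalar: one operation produces one result. */
void add_scalar(const float *x, const float *y, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = x[i] + y[i];
    }
}

/* SIMD: each _mm_add_ps produces four sums at once. */
void add_sse(const float *x, const float *y, float *out, size_t n)
{
    assert(n % 4 == 0);   /* no tail handling in this sketch */
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned loads keep it simple */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
}
```

Aligned loads and stores (_mm_load_ps, _mm_store_ps) are an option when the data is known to be 16-byte aligned.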

X86 Register Sets
SSE-Registers introduced first in Pentium® 3
IA-INT Registers (32-bit: eax … edi)
• Fourteen 32-bit registers
• Scalar data & addresses
• Direct access to regs
MMX™ Technology / IA-FP Registers (80/64-bit: st0/mm0 … st7/mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability
SSE Registers (128-bit: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only: 4 x single FP numbers, 2 x double FP numbers, 128-bit packed integers
• Direct access to the registers
• Use simultaneously with FP / MMX Technology
Instruction Set Extensions
New Instructions Added to Intel® Processors
[Chart: number of new instructions (axis 0 to 160) per extension, with introduction date and process technology. Bar labels: 144, 70, 56, ~50, 32, 32, 13]
• MMX™: Jan-97, 350 nm
• Streaming SIMD Extensions (SSE): Feb-99, 250 nm
• Streaming SIMD Extensions 2 (SSE2): Dec-00, 180 nm
• Streaming SIMD Extensions 3 (SSE3): Feb-04, 90 nm
• Supplemental SSE3 (SSSE3): Jul-06, 65 nm
• Future SSE-4: 2008+, 45 nm
• Future Intel instruction set extensions: 45 nm
Beginning in 2008: ~50 new instructions in 13 groups
All function in 32-bit and 64-bit modes
Improvements in Commercial Data Integrity i-SCSI, Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance
SSE and SSE-2 Data Types
SSE
• 4x floats
SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer
SSE-Instructions Set Extensions
Introduced by Pentium® 3 in 1999; now frequently called SSE-1
Only new data type supported: 4x32Bit (Single Precision)
floating point data
Some 70 instructions
• Arithmetic, compare, convert operations on SSE SP FP data
• PACKED, UNPACKED
• Data load/store
• Prefetch
• Extension of MMX
• Streaming Store (store without using cache in between)
• …
2001 PTE Engineering Enabling Conference

SSE Sample: Branch Removal
R = (A < B)? C : D //remember: everything packed
A:                 0.0    0.0    -3.0   3.0
B:                 0.0    1.0    -5.0   5.0
cmplt (A < B):     00000  11111  00000  11111
and (mask, C):     00000  c2     00000  c0
nand (mask, D):    d3     00000  d1     00000
or (R):            d3     c2     d1     c0
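The compare/and/nand/or sequence above can be sketched in C roughly as follows. This is a minimal illustration, not code from the original deck: the function name `select_lt4` is made up for the example, index 0 of each array stands for the lowest element (x0), and the SSE path is only compiled when the compiler defines `__SSE__`; otherwise a portable fallback applies the same mask arithmetic to the raw 32-bit patterns.

```c
#include <stdint.h>
#include <string.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* r[i] = (a[i] < b[i]) ? c[i] : d[i] for four packed floats, without a branch.
   select_lt4 is a hypothetical helper name used only for this sketch. */
void select_lt4(const float a[4], const float b[4],
                const float c[4], const float d[4], float r[4])
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a), vb = _mm_loadu_ps(b);
    __m128 vc = _mm_loadu_ps(c), vd = _mm_loadu_ps(d);
    __m128 mask = _mm_cmplt_ps(va, vb);            /* all ones where a < b   */
    __m128 keep_c = _mm_and_ps(mask, vc);          /* c where the mask is set */
    __m128 keep_d = _mm_andnot_ps(mask, vd);       /* d where it is clear     */
    _mm_storeu_ps(r, _mm_or_ps(keep_c, keep_d));
#else
    /* Portable fallback: same and / and-not / or on the 32-bit patterns. */
    for (int i = 0; i < 4; ++i) {
        uint32_t ci, di, ri;
        uint32_t mask = (uint32_t)0 - (uint32_t)(a[i] < b[i]);
        memcpy(&ci, &c[i], sizeof ci);
        memcpy(&di, &d[i], sizeof di);
        ri = (ci & mask) | (di & ~mask);
        memcpy(&r[i], &ri, sizeof ri);
    }
#endif
}
```

With the slide's A and B values, elements 0 and 2 take their value from C and elements 1 and 3 from D, which matches the d3 c2 d1 c0 result row.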
SSE-2 Instructions Set Extensions
Introduced by Intel® Pentium® 4 processor in 2000
Some 140 new instructions
Added double precision floating point data
(2x64Bit) and all related instructions including
conversion
Again some extensions to MMX
Added all possible combinations of integer data to
SSE ( 1x128, 2x64, 4x32, 8x16, 16x8) and related
operations

SIMD Single vs. SIMD Double
4 x Single Precision (SSE-1):
SIMD SP FP Operand = 4 Elements
Element = SP FP Number
Operand bits 127..0: X3 X2 X1 X0
Element bits: 31 = S (sign), 30..23 = Exponent, 22..0 = Significand
2 x Double Precision (SSE-2):
SIMD DP FP Operand = 2 Elements
Element = DP FP Number
Operand bits 127..0: X1 X0
Element bits: 63 = S (sign), 62..52 = Exponent, 51..0 = Significand
Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion
SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared
__m128d x;
__m128i ix;
ix = _mm_cvtpd_epi32(x);
x:   x1 x0
ix:  00000 00000 (int)x1 (int)x0
SIMD Int → SIMD Double: conversion from two lower ints
x = _mm_cvtepi32_pd(ix);
ix:  ???? ???? ix1 ix0
x:   (double)ix1 (double)ix0
SSE3: No new Data Types but new Instructions
• FP to integer conversions: FISTTP
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• Video encoding: LDDQU
• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread synchronization: MONITOR, MWAIT
* Also benefits Complex and Vectorization
Streaming SIMD Extensions 3

13 new instructions
Three have limited use for application performance improvement
• FISTTP - X87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
• Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages
The other ten have some potential for specific application domains
SSE-3 Sample Complex Arithmetic: ADDSUBPS
ADDSUBPS OperandA OperandB

• OperandA (xmm register; 4 data elements)
• a3, a2, a1, a0
• OperandB (xmm reg. or memory addr; 4 data elements)
• b3, b2, b1, b0
• Result (Stored in OperandA)
• a3+b3, a2-b2, a1+b1, a0-b0
__m128 _mm_addsub_ps(__m128 a, __m128 b)
OperandA:   a3     a2     a1     a0
OperandB:   b3     b2     b1     b0
Operation:  Add    Sub    Add    Sub
Result:     a3+b3  a2-b2  a1+b1  a0-b0
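The lane pattern above can be written out as a scalar model in C. This is only a sketch of the arithmetic, not the intrinsic itself: `addsub_ps_model` and `complex_mul_model` are hypothetical names, index 0 stands for the lowest element (a0), and the complex multiply is included just to show why a subtract-low / add-high pattern suits complex arithmetic.

```c
/* Scalar model of the ADDSUBPS lane pattern: even lanes subtract,
   odd lanes add. Index 0 is the lowest element (a0). */
void addsub_ps_model(const float a[4], const float b[4], float r[4])
{
    for (int i = 0; i < 4; ++i)
        r[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* (ar + i*ai) * (br + i*bi) using the same pattern:
   out[0] = real part, out[1] = imaginary part. */
void complex_mul_model(float ar, float ai, float br, float bi, float out[2])
{
    float p[4] = { ar * br, ar * bi, 0.0f, 0.0f };
    float q[4] = { ai * bi, ai * br, 0.0f, 0.0f };
    float r[4];
    addsub_ps_model(p, q, r);  /* r[0] = ar*br - ai*bi, r[1] = ar*bi + ai*br */
    out[0] = r[0];
    out[1] = r[1];
}
```

For example, multiplying 1+2i by 3+4i through `complex_mul_model` yields the real part in `out[0]` and the imaginary part in `out[1]`.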
Sample SSSE-3 Inst.: Byte Permute
PSHUFB mm, mm/m64

PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding
destination byte
• Also includes force-byte-to-zero flag (bit 7)
Example (byte 7 on the left, byte 0 on the right):
src (control)  0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (before)  0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
dest (after)   0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
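The byte-permute rule can be checked with a small scalar model of the 64-bit form. This is a sketch under stated assumptions rather than the instruction itself: `pshufb64_model` is a hypothetical name, array index 0 is byte 0 (so the slide's rows appear reversed in the arrays), and only the low three bits of each control byte are used as the index, as appropriate for eight bytes.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of the 64-bit PSHUFB form. dest is permuted in place;
   ctrl plays the role of the source (control) operand. Index 0 is byte 0. */
void pshufb64_model(uint8_t dest[8], const uint8_t ctrl[8])
{
    uint8_t out[8];
    for (int i = 0; i < 8; ++i) {
        if (ctrl[i] & 0x80)
            out[i] = 0;                     /* bit 7 set: force byte to zero */
        else
            out[i] = dest[ctrl[i] & 0x07];  /* low 3 bits select origin byte */
    }
    memcpy(dest, out, sizeof out);
}
```

Feeding in the slide's control and destination bytes (written low byte first) gives the slide's result row.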
Ways to SSE/SIMD programming
Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manual scheduling) – discouraged: don’t do it!
• E.g.: How do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel® 64 without maintaining two versions?
Intel® compiler’s C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling etc.
Intel® compiler’s C++ Vector Class Library
• Use this if you are heavy into C++ classes
Vectorizer of Intel® C++ and Fortran Compilers
• Recommended for most cases – easy and efficient
Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)

Compiler Based Vectorization
Processor Specific
Generate code and optimize for | Linux* switches
Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE | -xK, -axK
Pentium® 4 compatible, Athlon 64, Opteron processors in 32 and 64 bit mode, including code generation for MMX, SSE and SSE2 | -xW, -axW
Pentium® 4 processors in 32 bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use -xW instead) | -xN, -axN
Pentium® M processors, including code generation for MMX, SSE and SSE-2 | -xB, -axB
Intel® processors with SSE3 capability, including Pentium 4 (both 32 and 64 bit mode), including code generation for MMX, SSE, SSE2 and SSE-3 | -xP, -axP
Intel® processors with MNI capability, Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI | -xT, -axT

Features (cont.): New Instructions
Instruction name | Description
psignb/w/d mm, mm/m64; psignb/w/d xmm, xmm/m128 | Per element, if the source operand is negative, multiply the destination operand by -1.
pabsb/w/d mm, mm/m64; pabsb/w/d xmm, xmm/m128 | Per element, overwrite destination with absolute value of source.
phaddw/d/sw mm, mm/m64; phaddw/d/sw xmm, xmm/m128 | Pairwise integer horizontal addition + pack.
phsubw/d/sw mm, mm/m64; phsubw/d/sw xmm, xmm/m128 | Pairwise integer horizontal subtract + pack.
PMADDUBSW mm, mm/m64; PMADDUBSW xmm, xmm/m128 | Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)
PMULHRSW mm, mm/m64; PMULHRSW xmm, xmm/m128 | Signed 16 bits multiply, return high bits.
PSHUFB mm, mm/m64; PSHUFB xmm, xmm/m128 | A complete byte-granularity permutation, including force-to-zero flag.
PALIGNR mm, mm/m64, imm8; PALIGNR xmm, xmm/m128, imm8 | Extract any continuous 16 (8 in the 64 bit case) bytes from the pair [dst, src] and store them to the dst register.
Dependencies and Bypasses
“Read-after-Write” Dependency - 1 clock stall assuming register file can be written-through
Instruction      Dest   Pipeline
add eax, ecx     eax    F D E W
sub ebx, eax     ebx      F D D E W
“E to D” Bypass - save clock penalty
add eax, ecx     eax    F D E W
sub ebx, eax     ebx      F D E W
Long Latency operations
Load [ecx+edi]   eax    F D E E E W
add ebx, eax     ebx      F D D D E W
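The three timing diagrams above follow one simple rule, which the sketch below writes down for the four-stage F D E W model. The function names are hypothetical, the producer is assumed to be fetched in cycle 1 with the consumer one cycle behind, and the long-latency case is read as having the bypass available; none of this is taken from a real pipeline specification.

```c
/* Cycle in which a dependent instruction enters Execute in the F D E W model.
   The producer is fetched in cycle 1 and spends exec_latency cycles in E;
   the consumer is fetched in cycle 2. */
int consumer_exec_cycle(int exec_latency, int has_bypass)
{
    int producer_last_e = 2 + exec_latency;   /* F=1, D=2, E starts at 3 */
    int producer_w      = producer_last_e + 1;
    int earliest_e      = 4;                  /* consumer: F=2, D=3, E=4 */
    /* With the bypass, the result is usable right after the last E cycle.
       Without it, the consumer decodes again during the producer's W cycle
       (write-through register file) and executes one cycle later. */
    int ready = has_bypass ? producer_last_e + 1 : producer_w + 1;
    return ready > earliest_e ? ready : earliest_e;
}

/* Extra Decode cycles the consumer spends waiting. */
int stall_cycles(int exec_latency, int has_bypass)
{
    return consumer_exec_cycle(exec_latency, has_bypass) - 4;
}
```

Under these assumptions a one-cycle producer costs one stall without the bypass and none with it, and the three-cycle load costs two stalls, as in the diagrams.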
Fighting Stalls: Branch Handling
Given the code:

for (i=100, a=0; i>0; i--) a+=B[i];
Compiler would generate
// eax initialized to zero, edi initialized to 100
loop: load B[edi] ebx // read B[i] from memory
add eax, ebx eax // a+=B[i]
add edi,-1 edi // i-=1
jnz edi, loop
store eax a // store result
Fighting Stalls: Branch Handling (cont.)

load B[edi] ebx     F D E W
add eax,ebx eax       F D E W
add edi,-1 edi          F D E W
jnz edi, loop             F D E W
store eax a                 F D E W   (fetched on the wrong path, flushed)
xxx                           F D E W   (fetched on the wrong path, flushed)
Only after the branch's Execute stage do we know that the next fetch was wrong
• Need to flush the pipe
• IPC: 4 instructions in 6 clocks (IPC = 0.66 vs. optimum IPC = 1)
• ‘Pipe break’ penalty = 2 clocks
• Adding a stage?: IPC = 0.57, ~14% slower!!!
Prolonging the pipeline achieves higher frequencies; however, the pipe break penalty increases!
MUST solve the pipe break penalty problem!
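The IPC figures on this slide come from a one-line calculation, sketched below. The function names are made up for the example, and the model assumes exactly one pipe break per pass through the loop body and no other stalls.

```c
/* IPC of a loop body of n instructions that pays one pipe break per pass. */
double ipc_with_flush(int n_instructions, int flush_penalty_clocks)
{
    return (double)n_instructions /
           (double)(n_instructions + flush_penalty_clocks);
}

/* Fractional slowdown when the flush penalty grows from old to new clocks. */
double slowdown(int n_instructions, int old_penalty, int new_penalty)
{
    return 1.0 - ipc_with_flush(n_instructions, new_penalty) /
                 ipc_with_flush(n_instructions, old_penalty);
}
```

With 4 instructions, a penalty of 2 clocks gives 4/6, a penalty of 3 clocks gives 4/7, and the ratio of the two is 6/7, which is where the roughly 14% figure comes from.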
Fighting Stalls: Branch Handling (cont.)
H/W can ‘learn’ about SW behavior

• Same branch goes same direction in most cases
• Learn branch address and target
• Branch Target Buffer (BTB)
• Predict based on branch history, surrounding branch behavior, loop
behavior.
• We are at ~95% correct prediction.
• Looks in BTB while fetching instruction
• Lee&Smith or Yeh&Patt algorithms
New (and correct) pointer calculated in Fetch stage of branch

add eax,ebx eax F D E W
add edi,-1 edi F D E W
jnz edi, loop F/P D E W
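To make "same branch goes same direction in most cases" concrete, here is a textbook two-bit saturating counter applied to the loop branch from the earlier example. It is an illustrative sketch only: it is not the predictor implemented in any Intel processor, it models a single branch with no BTB lookup, and all names are hypothetical.

```c
/* Two-bit saturating counter: 0 and 1 predict not-taken, 2 and 3 predict taken. */
typedef struct { unsigned counter; } Predictor2Bit;

int predict_taken(const Predictor2Bit *p) { return p->counter >= 2; }

void train(Predictor2Bit *p, int taken)
{
    if (taken) { if (p->counter < 3) p->counter++; }
    else       { if (p->counter > 0) p->counter--; }
}

/* Loop branch of `for (i = n; i > 0; i--)`: taken n-1 times, then it falls
   through once. Returns the number of mispredictions. */
int mispredictions_for_loop(int n, unsigned initial_counter)
{
    Predictor2Bit p = { initial_counter };
    int misses = 0;
    for (int i = n; i > 0; --i) {
        int taken = (i - 1) > 0;   /* outcome of jnz after i is decremented */
        if (predict_taken(&p) != taken)
            ++misses;
        train(&p, taken);
    }
    return misses;
}
```

Starting from the weakly not-taken state, the 100-iteration loop mispredicts only on the first and the last branch; every other prediction is correct.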
Advanced Pipeline Techniques
Limitations of the Typical Pipeline Scheme

• IPC is theoretically limited to 1
• Actually IPC is less than 1 because of long latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction) etc.
• Pipeline stages are frequently not balanced
• Cycle Time (Tc) is determined by the longest pipeline stage
Advanced Pipeline Techniques
• Super pipeline
• Super-scalar
102
Advanced Pipeline Techniques (cont.)
Super pipeline: shorter stages allow higher frequency

F1 F2 D1 D2 E1 E2 W1 W2
F1 F2 D1 D2 E1 E2 W1 W2
F1 F2 D1 D2 E1 E2 W1 W2
Super-scalar: perform more in a single cycle
F D E W
F D E W
F D E W
F D E W
103
Fighting stalls: Out Of Order Execution (OoO)
Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm)
1. Instruction Fetch and Decode.
2. Instruction queue @ Reservation Station.
3. Instruction
• waits in the queue until all input operands are available
• may leave the queue before earlier, older instructions.
(This avoids the stall that occurs at this stage in an in-order processor.)
4. Instruction Execution
5. Results are queued.
6. Instruction Reorder and Writeback.
Fighting stalls: Register Renaming
Creates new opportunities for OOO execution
• Eliminates Write-after-write (WAW) and Write-after-read (WAR) dependencies (hazards).
Architectural vs physical registers
1. mov eax, [m1]
2. add eax, 2
3. mov [m2], eax
4. mov eax, [m3]
5. add eax, 4
6. mov [m4], eax
4, 5, 6 can be executed in parallel with 1, 2, 3, but only after register renaming!!!
Example before renaming:
MULTD F4,F2,F2 (reads from F2)
ADDD F2,F0,F6 (writes to F2)
After renaming (assume F8 is unused):
MULTD F4,F2,F2
ADDD F8,F0,F6
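The effect of renaming on the six-instruction sequence can be sketched with a tiny rename table. This is a simplified model with hypothetical names: it hands out a fresh physical register for every write and never recycles one, which a real renamer would have to do.

```c
#define NUM_ARCH_REGS 8

/* Rename table: architectural register -> current physical register. */
typedef struct {
    int map[NUM_ARCH_REGS];
    int next_free;          /* next unused physical register (never recycled) */
} RenameTable;

void rename_init(RenameTable *rt)
{
    for (int i = 0; i < NUM_ARCH_REGS; ++i)
        rt->map[i] = i;
    rt->next_free = NUM_ARCH_REGS;
}

/* A source operand reads whatever physical register is currently mapped. */
int rename_source(const RenameTable *rt, int arch_reg)
{
    return rt->map[arch_reg];
}

/* Every write gets a fresh physical register, so a later write can no longer
   clobber a value that an earlier instruction still has to read. */
int rename_dest(RenameTable *rt, int arch_reg)
{
    rt->map[arch_reg] = rt->next_free++;
    return rt->map[arch_reg];
}
```

Walking the sequence through the table, instruction 3 reads one physical copy of eax while instruction 4 writes a different one, so instructions 4 to 6 no longer have to wait for 1 to 3.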
Fighting Stalls: Re-Order Buffer (ROB)
Mechanism for renaming and retirement
Table contains instructions in program order
• Instructions are entered in order
• Registers renamed by the entry number
• Once assigned: execution order unimportant
• After execution: entries marked
• An executed entry can be “retired” once all prior instructions have retired. On retirement:
• Update “real registers” with value of renamed regs
• Update memory
• Leave the ROB
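The enter-in-order, execute-in-any-order, retire-in-order rule can be sketched as a small circular buffer. This is a toy model with hypothetical names and an arbitrary size of 8 entries; it tracks only the "executed" mark and leaves out register and memory updates.

```c
#define ROB_SIZE 8

typedef struct {
    int done[ROB_SIZE];  /* 1 once the entry has executed */
    int head;            /* oldest entry not yet retired  */
    int tail;            /* next free slot                */
    int count;           /* entries currently in the ROB  */
} Rob;

void rob_init(Rob *r) { r->head = r->tail = r->count = 0; }

/* Allocate an entry in program order; returns its index, or -1 if full. */
int rob_alloc(Rob *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int idx = r->tail;
    r->done[idx] = 0;
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return idx;
}

/* Execution may complete in any order. */
void rob_mark_done(Rob *r, int idx) { r->done[idx] = 1; }

/* Retire from the head only while the oldest entry has executed.
   Returns how many entries left the ROB. */
int rob_retire(Rob *r)
{
    int retired = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        retired++;
    }
    return retired;
}
```

If the youngest of three entries finishes first, nothing retires until the older entries have executed as well.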
Fighting Stalls: Reservation Station(s)
Pool(s) of all “not yet executed” instructions

Maintains operands status “ready / not-ready”
Each cycle, executed instructions make more operands “ready”
Instructions whose operands are all “ready” can be “dispatched” for execution
Dispatcher chooses which of the “ready” instructions will be
executed next
107
Fighting Stalls: Memory Order Buffer (MOB)
Idea - allow out-of-order execution among memory operations
Problem - memory dependencies cannot be fully resolved statically (memory disambiguation)
Structure similar in concept to ROB
Every access is allocated an entry
Address & data (for stores) are updated when known
Load is checked against all previous stores

Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)
Many buses are sized for worst case data
• (x86 instruction of 15 bytes)
• (ALU can write-back 128 bits)
Improved Energy Efficiency

Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)
By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while keeping dynamic capacitance (C dynamic) closer to that of thinner buses
Improved Energy Efficiency
Agenda
Introduction
Notable features
Micro-architecture drill-down
• Front End
111
Intel® Core® Micro-architecture Overview
[Block diagram: System Bus, Bus Unit, 2nd Level Cache, 1st Level Cache (Data). Front End: Instruction Fetch Unit, Decode/IQ, Branch Prediction Unit. Execution Core: Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement).]
Intel® Core® Micro-architecture Drill-down
[Block diagram components: icache, branch prediction, predecode unit, instruction queue, instruction decode, MS; register alias table, ALLOC, Re-Order Buffer, Reservation Station; ports for store address, store data, load, and integer / FP / SIMD (3x); page miss handler, data cache unit, memory order buffer.]
Example Code to Be Used
…
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp EAX, 100000
jge label
…
114
Agenda
Introduction
Notable features
• Front End
115
Instruction preparation before execution

• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit
116
Instruction Fetch Unit

Instruction Queue
Instruction Decode Unit
117
Instruction Fetch Unit
• Prefetches instructions that are likely to be executed
• Caches frequently-used instructions
• Predecodes and buffers instructions
[Diagram: icache, branch prediction, predecode unit, instruction queue, instruction decode, and MS in the Front End (Instruction Fetch Unit, IQ/Decode, BTBs/Branch Prediction), next to the Execution Core (Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement)).]
Instruction Fetch Unit (cont.)
I-Cache (Instruction Cache)

• 32 KBytes / 8-way / 64-byte line
• 16 aligned bytes fetched per cycle
ITLB (Instruction Translation Lookaside Buffer)
• 128 4k pages, 8 2M pages
Instruction Prefetcher
• 16-byte aligned lookup through the ITLB into the instruction cache
and instruction prefetch buffers
Instruction Pre-decoder
• Instruction Length Decode (predecode)
• Avoid Length Changing Prefixes (LCP); for example, avoid in a loop: MOV dx, 1234h
• The REX (EM64T) prefix (4xH) is not an LCP
Instruction format: Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate

Instruction Queue
Buffer between instruction pre-decode unit and decoder
• up to six predecoded instructions written per cycle
• 18 Instructions contained in IQ
• up to 5 Instructions read from IQ
Potential Loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instruction
• Potential power saving
Instruction Decode
Decode the instructions into micro-ops
Ready for the execution in OOO core
Instruction Decode
Decoders
Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking
124
Instruction Decode / Decoders
Instructions converted to micro-ops (uops)

• 1-uop includes load+op, stores, indirect jump, RET...
4 decoders:1 “large” and 3 “small”
• All decoders handle “simple” 1-uop instructions
• One large decoder handles instructions up to 4 uops
All decoders work in parallel
• Four(+) instructions / cycle
Micro-Sequencer takes over for long flows (instructions of 2~4 uops are handled by the large decoder; the uCode ROM handles more complex ones)
Code Sequence in Front End

Instruction Queue (IQ) contents:
cmp EAX, 100000
jne label
movps [EAX+240], xmm0
mulps xmm0, xmm0
• These instructions took more than one fetch, as they are 22 bytes; the IQ buffers them together
• All instructions are decodable by all decoders: large (dec0), small (dec1), small (dec2), small (dec3)
• CMP and adjacent JCC are “fused” into a single uop
• Up to 5 instructions decoded per cycle
Decoded uops:
cmpjne EAX, 100000, label
sta_std [EAX+240], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
Instruction Decode
Decoders
Features
• Macro-fusion
• Micro-fusion
127
Instruction Decode / Macro - Fusion
Roughly 15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode
[Diagram: the fused uop “cmpjae eax, [mem], label” passes from the Scheduler through Execution and Branch Eval, with flags and target going to Write back.]
Instruction Decode / Macro-Fusion Absent
• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops
Enabling Example
for (int i=0; i<100000; i++) {
…
}
Instruction Queue:
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jge label
Decode, cycle 1:
addps xmm0, [EAX+16] → dec0
mulps xmm0, xmm0 → dec1
movps [EAX+240], xmm0 → dec2
cmp eax, 100000 → dec3
Decode, cycle 2:
jge label → dec0
Instruction Decode / Macro-Fusion Presented
• Read five instructions from Instruction Queue
• Send fusable pair to single decoder
• Single uop represents two instructions
Enabling Example
for (unsigned int i=0; i<100000; i++) {
…
}
Instruction Queue:
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jae label
Decode, cycle 1:
addps xmm0, [EAX+16] → dec0
mulps xmm0, xmm0 → dec1
movps [EAX+240], xmm0 → dec2
cmpjae eax, 100000, label → dec3
Instruction Decode / Macro – Fusion (cont.)
Benefits
• Reduces latency
• Increased renaming
• Increased retire bandwidth
• Increased virtual storage
• Power savings
Enabling Greater Performance & Efficiency
Instruction Decode
Decoders
Features
• Macro-fusion
• Micro-fusion
132
Instruction Decode / Micro-Op Fusion
Frequent pairs of micro-operations derived from the same Macro Instruction can be fused into a single micro-operation
Micro-op fusion effectively widens the pipeline
Instruction Decode / Micro-Fusion (cont.)
u-ops of a Store “movps [EAX+240], xmm0”
• sta eax+240
• std xmm0, [eax+240]
Fused into a single u-op:
• st xmm0, [eax+240]
Instruction Decode
Decoders
Features
• Macro-fusion
• Micro-fusion
135
Instruction Decode / Stack Pointer Tracker

(Extended Stack Pointer folding)
ESP is calculated by dedicated logic
• No explicit Micro-Ops updating ESP
• Micro-Ops saving
• Power saving
[Diagram: PUSH EAX, PUSH EDX, POP EBX sent to Decoder 0, Decoder 1, … Decoder N, with a tracked ESP delta (ESPd) and recovery information.]
Branch Prediction Unit
Allow executing instructions long before the branch outcome is decided
• Superset of Prescott / Pentium-M features
• One taken branch every other clock
• Branch predictions for 32 bytes at a time, twice the width of the fetch engine
Branch Prediction Unit (cont.)
16-entry Return Stack Buffer (RSB)

Front end queuing of BPU lookups
Type of predictions
• Direct Calls and Jumps
• Indirect Calls and Jumps
• Conditional branches
139
Branch Prediction Improvements
Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector
Branch mispredictions reduced by >20%
Agenda
Introduction
Notable features
• Front End
141
Core® Micro-architecture Execution Core
Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit
Execution Core Building Blocks
[Diagram: Renamer, RS, and ROB feed the Execution Unit through numbered ports. Ports 0, 1, 5: integer, SIMD integer, MUL, and SIMD floating point units. Port 2: Load. Ports 3, 4: Store. Below: Memory Sub-system.]
Rename and Resources
4 uops renamed / retired per clock

• one taken branch, any # of untaken
• one fxchg per cycle
Uops written to RS and ROB
• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS
• RS waits for sources to arrive allowing OOO execution
• Registers not “in flight” read from ROB during RS write
Issue Ports and Execution Units

6 dispatch ports from RS
• 3 execution ports (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)
FP data has one additional cycle bypass latency
• Do not mix SSE FP and SSE integer ops on same register
Avoid:                 Better:
addps xmm0, xmm1       addps xmm0, xmm1
pand  xmm0, xmm3       addps xmm2, xmm0
addps xmm2, xmm0       pand  xmm0, xmm3
The Out Of Order
each uop only takes a single RS entry

load + add dispatches twice (load, then add)
mulps dispatches once, after load + add has written back
sta + std dispatches twice
sta (address) can fire as early as possible
std must wait for mulps to write back
cmpjne dispatches only once (functionality is truly fused)
no dependency, can fire as early as it wants
cmpjne EAX, 100000, label RS

sta_std [EAX+240], xmm0
146
Dispatching to OOO EXE

[Diagram: uops waiting in the RS (cmpjne EAX, 100000, label; sta_std [EAX+240], xmm0; mulps xmm0, xmm0, xmm0; load_add xmm0, xmm0, [EAX+16]) are dispatched to the issue ports below.]
Port 0: GP (incl FP mul)
Port 1: GP (incl FP add)
Port 2: Load
Port 3: STA
Port 4: STD
Port 5: GP (incl jmp)
Advanced Memory Access
L1D: 3 clk latency and 1 clk throughput; L2: 14 clk latency and 2 clk throughput

Miss Latencies
• L1 miss hits L2 ~ 10 cycles
• L2 miss, access to memory ~300 cycles (server/FBD)
• L2 miss, access to memory ~165 cycles (Desk/DDR2)
• C step broadwater is reported to have ~50ns latency
Cache Bandwidth
• Bandwidth to cache ~ 8.5 bytes/cycle
Memory Bandwidth
• Desktop ~ 6 GB/sec/socket (linux)
• Server ~3.5 GB/sec/socket
148
Optimizing for Intel® Core™ Microarchitecture
Use CMP = employ both Cores
• Go to multithreading!
Prefer SSE as much as possible. If you didn’t do it so far,
vectorize the code now!!
• Intel Compiler has very good vectorization engine
Align data and data layout (sequential)
• To align use __declspec(align (16)) float a[1000];
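`__declspec(align(16))` is the spelling accepted by the Microsoft and Intel compilers. As a portable sketch, assuming a C11 compiler, the same request can be written with `alignas` from `<stdalign.h>`; the helper names below are hypothetical and exist only to make the alignment checkable.

```c
#include <stdalign.h>
#include <stdint.h>

/* C11 spelling of the 16-byte alignment request for a static array. */
static alignas(16) float a[1000];

/* Returns 1 if p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Accessor so that other code can reach the aligned buffer. */
float *aligned_buffer(void) { return a; }
```

Because each float is 4 bytes, every fourth element of the array (index 0, 4, 8, ...) starts on a 16-byte boundary and is a valid address for an aligned 128-bit load.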
Optimizing for Intel® Core™ Microarchitecture (advanced)
Use Intel VTune™ Performance Analyzer to reveal performance problems
• CPI
• Specific CPU events for Core-arch: RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL etc. - see VTune help
Front End Issue Debugging

Look for Front End optimization only when code is FE bound
• Reservation station (RS) is the front end and allocation target
• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as front
end issue
• If there are no issues in the FE the RS should be full above 30% of the time
Front End typical issues:
• Code is too big to fit in the L1:
• When L2_IFETCH.SELF.MESI happens every 10-15 instructions
• Code that could have been with CPI 1 will be around 2
• 14 cycles penalty for L1 demand miss
• Average instruction size above 6 bytes
• Happens typically with SSE code and more with EM64T
• Can have impact only in case of otherwise excellent CPI
• Code with length changing prefix issues (LCP)
• Penalty of 6 cycles or more
• Look at ILD_STALL VTune event
Front-End should not be the bottleneck. Focus on Front End issues only if it is the issue.
Execution micro architecture

The busiest port may determine the potential execution speed
Single clock latency operations are best
• Different latency operations can create writeback conflicts, creating a bubble in the port
Look at the dependency chains to see the potential parallelism
• Remember that the RS has only 32 entries and only those instructions are candidates for scheduling to the execution ports
• High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound
• The ROB has 96 entries
• High RESOURCE_STALLS.ROB_FULL percentage only if
• Code has long latency instructions (L2 misses)
• Other code can be executed while waiting
Execution stage: The key for good performance. Focus on port utilization and dependency chains
Execution micro architecture
The Divider is a big potential stall source
• DIV for the number of divide operations executed
• IDLE_DURING_DIV for number of cycles of no port issue while the divider is busy
• Try to find some useful work to do in parallel with divide operations
Extra cycle latency for bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
• DELAYED_BYPASS.FP
• DELAYED_BYPASS.LOAD
• DELAYED_BYPASS.SIMD
[Diagram: EXE ports 0, 1, 5 (integer, SIMD integer, MUL, floating point); Data Cache Unit with dtlb, memory ordering, and store forwarding; load on port 2, store (address) on port 3, store (data) on port 4.]
Enhancements and Optimization Opportunities

IP Prefetcher
• Prefetches stride loads associated with the same IP
• Uses History table
• Use VTune events to identify misses where prefetches were expected
Memory Disambiguation
• Predicts when it is OK to fire a load before preceding stores with unknown address
• Misprediction triggers a pipeline flush and load restart
• Disambiguation is temporarily disabled if it fails frequently
• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address
• In case not to the same address: possible reasons for not working: address collision with other load(s)
Other Opportunities for Performance Gain in the memory sub-system
4k Aliasing
• OOO engine can fire a Load before a preceding Store if it does not collide with the Store’s address
• Address collision serializes execution
• Address checking uses only the last 12 bits (4K)
• False blocking - if Load’s & Store’s addresses have 4KB offset
• e.g. accessing large, power of two, sized arrays in a loop
• Resolve 4K aliasing conflicts by changing memory layout
• VTune event LOAD_BLOCK.OVERLAP_STORE
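The 12-bit address check can be sketched as a small predicate. This is a simplified model with a hypothetical function name: it compares only the two starting addresses and ignores access sizes and partial overlaps, which the real hardware check also has to consider.

```c
#include <stdint.h>

/* Returns 1 when a load and an older store go to different addresses whose
   low 12 bits match, i.e. the addresses are a multiple of 4 KB apart. */
int is_4k_alias(uintptr_t load_addr, uintptr_t store_addr)
{
    if (load_addr == store_addr)
        return 0;   /* same address: a true dependency, not a false one */
    return ((load_addr ^ store_addr) & 0xFFFu) == 0;
}
```

Two arrays placed exactly 4096 bytes apart alias at every index; adding one 64-byte cache line of padding between them is one layout change that removes the match.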
Load block cases
• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched
• Store address unknown - LOAD_BLOCK.STA
• Loads blocked by a preceding store with unknown address
• Store data unknown - LOAD_BLOCK.STD
• Loads blocked by a preceding store with unknown data
• Loads blocked until retirement - LOAD_BLOCK.UNTIL_RETIRE
• This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)
