
Intel® Core™ Microarchitecture

Intel® Software College



Objectives

After completing this module, you will be able to describe:
• Components of an IA processor
• Working flow of the instruction pipeline
• Notable features of the architecture

Intel® Processor Micro-architecture - Core® microarchitecture

Copyright © 2006, Intel Corporation. All rights reserved.
Intel and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other brands and names are the property of their respective owners.

Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations

Industrial Recognition

PC Format, May 2006
“Intel Strikes Back! Conroe is the name. Pistol-whipping Athlon 64s into burger meat is the game.”

Intel's Next Generation Microarchitecture Unveiled, Real World Tech
“Just as important as the technical innovations in Core MPUs, this microarchitecture will have a profound impact on the industry.”

Intel Dishes the Knockout Punch to AMD with Conroe, GD Hardware.com
“…the results were far more than we could hope for and it'll be amusing to see AMD's response to this beat-down session”

Intel Regains Performance Crown, Anandtech
“… At 2.8 or 3.0GHz, a Conroe EE would offer even stronger performance than what we’ve seen here.”

Intel Reveals Conroe Architecture, Extremetech
“… And not only was the Intel system running at 2.66GHz, a slower clock rate than the top Pentium 4, it was outpacing an overclocked Athlon 64 FX-60. Wrap your brain around that idea for a bit…”

Conroe Benchmarks - Intel Showing Big Strength, Hot Hardware.com
“… Intel is poised to change the face of the desktop computing landscape…”

Performance Summary

Intel® Core™ Microarchitecture dramatically boosts Intel platform performance
• Conroe & Woodcrest drive clear Desktop/Server performance leadership
• Merom extends Intel Mobile performance leadership

Intel® Core™ Microarchitecture-based platforms set the bar in Performance and Energy Efficiency for the Multi-Core era
• Intel’s 3rd generation dual-core (while competition stuck on 1st generation)
• New Intel high-performance ‘engine’: Wider, Smarter, Faster, More Efficient

Best Processor on the Planet: Energy-Efficient Performance
20% (Merom), 40% (Conroe), 80% (Woodcrest) Performance Boosts¹!
The “Core™ Effect”: Intel® Core™ Microarchitecture ramp fuels broad roadmap accelerations

¹ Based on SPECint*_rate_base2000

Agenda

Introduction
Knowledge preparation
• Architecture VS Microarchitecture
• CISC VS RISC
• Performance Measurements
• Pipeline Design
• Power and Energy
• Chip Multi-Processing
Notable features
Micro-architecture tour
Coding considerations


Architecture and Micro-architecture

What is Computer Architecture?
• Architecture is the set of features which are externally visible:
  • Instruction set
  • Registers
  • Addressing modes
  • Bus protocols
Intel Architectures (IA)
• IA32/X86 (8-bit, 16-bit and 32-bit Integer architecture)
• X87 (Floating Point extension)
• MMX (Multi-Media extension)
• SSE, SSE2, SSE3 (Streaming SIMD Extensions)
• Intel® 64/EM64T (64-bit Integer extension of IA32)
• IA64 (Intel new 64-bit architecture)
  • Itanium/Itanium 2 processor family


Architecture and Micro-architecture (cont.)

What is Micro-architecture?
• Also written µ-Architecture or u-Architecture
• “Invisible” features that provide meaningful value to the end user (whatever makes you buy a new compatible PC)
  • Programs run faster → improved performance
  • Reduced power consumption → extended battery life
  • H/W fits into a smaller form factor


Intel® Architecture History


Architecture: instruction set definition and compatibility
  Examples: EPIC* (Itanium®), IA-32, IXA* (XScale)

Microarchitecture: hardware implementation maintaining instruction set compatibility with the high-level architecture
  Examples: P5, P6, Intel NetBurst®, Banias

Processors: productized implementation of a microarchitecture
  Examples: Pentium®, Pentium® Pro, Pentium® II/III, Pentium® 4, Pentium® D, Xeon®, Pentium® M

* IXA – Intel Internet Exchange Architecture; EPIC – Explicitly Parallel Instruction Computing


Intel® Core™ Microarchitecture Processors

Intel® NetBurst® + Mobile Microarchitecture + New Innovations
→ Intel® Core™ 2 Duo/Quad/Extreme processors



RISC Approach to CPU design

(RISC = Reduced Instruction Set Computers)
Optimize H/W for common basic operations
• Fixed instruction length
  • Shorter execution pipeline
  • Ease of instruction-level parallelism
• Large number of registers
  • Fewer memory accesses
• ‘Load/Store’ architecture
  • Shorter execution pipeline
  • Ease of advancing loads
• Branch hints
  • Reduce pipeline flush events
• ‘Exotic’ stuff to be implemented in S/W with minimal H/W support
  • No ‘complex’ H/W instructions
  • Handle exceptional conditions in S/W
Examples: MIPS, IBM Power and PowerPC, Sun SPARC

Achieve maximum performance by the right partitioning between H/W and S/W

CISC Approach to CPU design

(CISC = Complex Instruction Set Computers)


Rich architecture
• Variable length instructions.
• Complex addressing modes.
On-chip HW / SW partitioning required
• H/W keeps executing ‘simple’ stuff
• Complex instructions are ‘emulated’ using u-code routines
from ROM
• More instructions treated as ‘simple’ as more H/W is available
COMPATIBILITY has some major advantages:
• Large (and forever increasing) software base
• Code development tools
• Expertise
• H/W - S/W spiral
Example: Intel IA32, Motorola 680X0

Maximize information passed to the HW



Performance Measurement

Performance is the reciprocal of the “Time of execution”:

$$\text{Performance} \approx \frac{1}{\text{Time of Execution}} = \frac{1}{L \cdot CPI \cdot T_c}$$

Where:
L = Code Length (# of machine instructions)
CPI = Clock cycles Per Instruction
T_c = Clock period (ns)

Substitute:
IPC = Instructions Per Cycle = 1/CPI
F = Frequency = 1/T_c

$$\text{Performance} \approx \frac{IPC \cdot F}{L}$$

• Raise IPC: improve ILP
• Raise F: improve timing
• Reduce L: architecture enhancements
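
As a rough illustration of the formula above, the sketch below compares two hypothetical designs: one with a higher clock and one with a lower clock but higher IPC. The function names and all numbers are assumptions made up for this example, not measurements of any real processor.

```python
def execution_time(code_length, cpi, clock_period_s):
    """Time of execution = L * CPI * Tc (seconds)."""
    return code_length * cpi * clock_period_s


def performance(ipc, freq_hz, code_length):
    """Performance ~ IPC * F / L, the reciprocal of the execution time."""
    return ipc * freq_hz / code_length


# Hypothetical workload: one million machine instructions.
L = 1_000_000

# Design A: higher clock, lower IPC. Design B: lower clock, higher IPC.
perf_a = performance(ipc=1.0, freq_hz=3.0e9, code_length=L)
perf_b = performance(ipc=1.5, freq_hz=2.4e9, code_length=L)

# B wins despite the slower clock because IPC * F is larger.
speedup = perf_b / perf_a
print(f"B is {speedup:.2f}x faster than A")
```

The same workload length L is used for both designs, so only the IPC * F product differs; a change that shortens the code (smaller L) would raise performance in the same way.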


Performance Measurement (cont.)

Performance considerations:
• Which Code/Application to run?
• Which OS?
• Which other components in the platform?
• Under which thermal conditions?
• Multithreading? Multiprocessing?

Benchmarks examples
• Industry Standard
  • Spec (ISPEC, FSPEC)
  • TPC
• Commercial
  • SysMark
  • MobileMark
  • PCMark
  • Sandra
  • ScienceMark
• Applications
  • Video (Windows Media encoder, DivX)
  • Audio (Lame MP3)
  • Compression (RAR)
  • Content creation (3DSM, Photoshop, Premiere)
  • Latest Games (Doom III, FarCry, but changes fast)
• Specific industries use specific benchmarks
  • Linux compilation, POVRay, LinPack, lmbench


Design Considerations for Different Market Segments
Constraints:
• Thermally and area constrained → Desktop
• Unconstrained → Extreme
• Very area constrained → Value
• Thermally, energy and area constrained → Mobile
• Thermally and energy constrained → Servers
Micro-architecture is the Art of Tradeoffs between:
• Schedule
• Requirements / Standards
• Performance
• Features
• Power / Energy
• Area / Cost


Design Metrics

IPC = Instructions per Cycle


• The more the better
Latency – same as Response Time
• The time interval between
• when any request for data is made and
• when the data transfer completes
• The less the better
Throughput
• The amount of work completed by the system per unit of time.
• The more the better
• ops/sec


CPU Pipeline

Break the work into smaller pieces
• Four basic stages of instruction life
  • Fetch - bring instruction to core
  • Decode - read operands from register
  • Execute - perform the operation
  • Writeback - save result to register
• Execution timing of simple instructions
  (legend: “op src1,src2 → dst”)
  add eax, ebx → eax    F D E W
  sub ecx, edx → ecx    F D E W
Increased throughput
• Increased number of completed instructions per cycle


Pipeline Design - Explore Parallelism


A new instruction does not always depend on the previous one
• Can start a new instruction before the previous one is finished
• ...if different stages use different H/W resources
Run instructions in parallel (pipeline)
  add eax, ebx → eax    F D E W
  sub ecx, edx → ecx      F D E W
  or  edi, esi → edi        F D E W
Need to balance pipe stages
• Each stage should take the same time for best throughput and utilization

Clock cycle is determined by the longest path!

  Fetch   Decode  Exec    WB
          Fetch   Decode  Exec    WB
                  Fetch   Decode  Exec    WB
                          Fetch   Decode  Exec    WB
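
The overlap described on this slide can be sketched with a small model. It assumes an ideal four-stage pipeline (F, D, E, W) with one cycle per stage and no stalls; the helper names are made up for this illustration.

```python
STAGES = ["F", "D", "E", "W"]


def total_cycles(n_instructions, n_stages=len(STAGES), pipelined=True):
    """Cycles to finish n independent instructions, assuming no stalls."""
    if n_instructions == 0:
        return 0
    if pipelined:
        # The first instruction fills the pipe; each later one finishes
        # one cycle after its predecessor.
        return n_stages + n_instructions - 1
    # Without overlap every instruction occupies all stages by itself.
    return n_stages * n_instructions


def timing_diagram(instructions):
    """One row per instruction, shifted right by one stage per cycle."""
    rows = []
    for i, text in enumerate(instructions):
        rows.append(f"{text:<16}" + "  " * i + " ".join(STAGES))
    return "\n".join(rows)


program = ["add eax, ebx", "sub ecx, edx", "or edi, esi"]
print(timing_diagram(program))
print("pipelined:", total_cycles(len(program)), "cycles")
print("sequential:", total_cycles(len(program), pipelined=False), "cycles")
```

For long instruction streams the pipelined count approaches one instruction per cycle, which is why the slowest stage (the longest path) sets the achievable clock and therefore the throughput.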

Pipeline Design – Fighting Stalls

Data flow dependency (instructions output/input)
• Solved by bypasses, renaming, etc.
Control flow dependencies
• Solved by branch prediction
Others (cache misses, long-latency instructions)
• Solved by other dynamic scheduling techniques


Race of CISC vs. RISC

In modern CPUs, advanced µ-architecture techniques minimize the advantages of RISC over CISC
• Branch Prediction
  • Reduces the effect of extra pipeline stages
• Register Renaming
  • Effectively increases the number of registers
• Out-of-Order Execution
  • Reduces the number of stalls caused by shortage of registers
• Speculative Execution
  • Further reduces the number of stalls
• Power-saving features
  • Reduce the overhead when not needed


µop – Intel’s Take on the CISC/RISC Race

(CISC) instructions are translated into one or more (RISC) µops (micro-operations)
• Fixed format
• Wide and simple
• Temp registers
Usually one µop per instruction
A complex instruction can expand to thousands of µops
Stores are divided into two µops (STA: store address, STD: store data)
Fusion (micro-fusion and macro-fusion) changes these counts


Power and Energy

Maximum power (TDP) determines:
• Cooling requirements
• Cooling solution
• Computer form factor and acoustic noise
Average power determines:
• Battery life
• Electricity bill
General calculation:
• P = frequency * voltage^2 * activity factor * capacitance + leakage
Reducing TDP
• Fewer transistors and wires
• Smaller transistors and wires
• Power features → less activity
• Low-leakage transistors
Reducing average power
• Energy efficiency
• Power states
• Lower leakage
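
The general power calculation above can be tried out numerically. The sketch below is only an illustration: every value (frequency, voltage, activity factor, capacitance, leakage) is a made-up assumption, and lowering the voltage together with the frequency is an assumed operating policy, not something stated on the slide.

```python
def dynamic_power(freq_hz, voltage_v, activity, capacitance_f):
    """Switching power: frequency * voltage^2 * activity factor * capacitance."""
    return freq_hz * voltage_v ** 2 * activity * capacitance_f


def total_power(freq_hz, voltage_v, activity, capacitance_f, leakage_w):
    """P = frequency * voltage^2 * activity factor * capacitance + leakage."""
    return dynamic_power(freq_hz, voltage_v, activity, capacitance_f) + leakage_w


# Hypothetical operating point (not data for any real chip).
base = dict(freq_hz=3.0e9, voltage_v=1.2, activity=0.2, capacitance_f=50e-9)
p_base = total_power(**base, leakage_w=10.0)

# Lower frequency and voltage by 20% each: the dynamic part scales by 0.8^3.
scaled = dict(base, freq_hz=base["freq_hz"] * 0.8, voltage_v=base["voltage_v"] * 0.8)
p_scaled = total_power(**scaled, leakage_w=10.0)

ratio = dynamic_power(**scaled) / dynamic_power(**base)
print(f"baseline {p_base:.1f} W, scaled {p_scaled:.1f} W, dynamic ratio {ratio:.3f}")
```

Because voltage enters squared, reducing voltage along with frequency cuts dynamic power much faster than performance; the leakage term is unaffected in this simple model, which is why it is listed separately as something to reduce.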


Dual/Multi Core and SMT


Put more than one core per package
Architectural change:
• Software must be multi-threaded or multi-process
• …but backward compatible with multiprocessor systems (MP)
Several ways of implementing it
• All of them being used

[Figure: alternative dual-core package layouts built from Core, LLC (last-level cache) and I/O blocks, with the LLC and I/O either separate per core or shared]

SMT: Run two (or more) threads on the same core, simultaneously

Intel Approach

Introduced   Processor                      Threads
Q4 2000      Intel® Pentium®                1 Thread
Q2 2003      Intel® Pentium® With HT        2 Threads
Q2 2005      Intel® Pentium® D Processor    2 Threads
Q3 2006      Intel® Core 2 Duo®             2 Threads
Q4 2006      Intel® XQ6700*                 4 Threads
?            ?                              80 Threads

Legend: State, Execution Units, Cache, Bus

While single core performance has increased due to clock speed, increased cache and improved ILP, the biggest performance increases have come from thread level parallelism.

An “Acronym Cheat Sheet” of Parallel Computing

CMP: Chip Multi Processor (two or more cores per package)
• Dual Core: two cores in same package
• Quad Core: four cores in same package
DP: Dual Processor (two packages)
MP: Multi Processor (four or more packages)
SMT: Simultaneous Multi-Threading (virtual multi-core: Hyper-Threading)


Agenda

Introduction
Knowledge preparation
Notable features
• Wide Dynamic Execution
• Smart Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Intelligent Power Capability
Micro-architecture tour
Coding considerations


Intel® Core® Micro-architecture Notable Features
Intel® Wide Dynamic Execution
• 14-stage efficient pipeline
• Wider execution path
• Advanced branch prediction
• Macro-fusion
  • Roughly 15% of all instructions are conditional branches
  • Macro-fusion fuses a comparison and jump to reduce micro-ops running down the pipeline
• Micro-fusion
  • Merges the load and operation micro-ops into one micro-op
• 64-Bit Support
  • Merom, Conroe, and Woodcrest support EM64T

[Figure: pipeline block diagram. Instruction Fetch and PreDecode → Instruction Queue (5 wide) → Decode, with uCode ROM (4 wide) → Rename/Alloc → Retirement Unit (ReOrder Buffer) (4 wide) → Schedulers → execution ports (ALU, Branch, MMX/SSE, FPmove; ALU, FAdd, MMX/SSE, FPmove; ALU, FMul, MMX/SSE, FPmove; Load; Store) → L1 D-Cache and D-TLB. 2M/4M shared L2 cache; FSB up to 10.4 Gb/s.]



Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Smart Memory Access
• Improved prefetching
• Memory disambiguation
  • Advance a load ahead of a possible data dependency (pointer conflict)
  • Earlier loads hide memory latencies


Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Smart Cache
• Multi-core optimization
• Shared between the two cores
• Advanced Transfer Cache architecture
• Reduced bus traffic
• Both cores have full access to the entire cache
• Dynamic Cache sizing


Intel® Core® Micro-architecture Notable Features (cont.)
Advantages of Shared Cache
[Figure: CPU1 and CPU2 below Memory and the Front Side Bus (FSB), with a cache line moving between the CPUs. Labels: “Shipping L2 Cache Line”, “~Half access to memory”.]


Intel® Core® Micro-architecture Notable Features (cont.)
Advantages of Shared Cache (cont.)
[Figure: CPU1 and CPU2 below Memory and the Front Side Bus (FSB), both using the same cache line. Label: “L2 is shared: no need to ship cache line”.]


Intel® Core® Micro-architecture Notable Features (cont.)
Intel® Advanced Digital Media Boost
• Single Cycle SIMD Operation
  • 8 Single Precision Flops/cycle
  • 4 Double Precision Flops/cycle
• Wide Operations
  • 128-bit packed Add
  • 128-bit packed Multiply
  • 128-bit packed Load
  • 128-bit packed Store
• Support for Intel® EM64T instructions

SIMD Operation (SSE/SSE2/SSE3/SSSE3)
  SOURCE (bits 127..0):         X4      X3      X2      X1
  SSE/2/3 OP
  DEST:                         Y4      Y3      Y2      Y1

  Core™ µarch, clock cycle 1:   X4opY4  X3opY3  X2opY2  X1opY1
  Previous, clock cycle 1:                      X2opY2  X1opY1
  Previous, clock cycle 2:      X4opY4  X3opY3
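
The single-cycle claim and the flops-per-cycle figures can be checked with a small model. This is a sketch under two labeled assumptions: the earlier design splits a 128-bit operation over a 64-bit datapath (matching the two-cycle picture on the slide), and the peak figures count one packed add plus one packed multiply issued in the same cycle. The function names are invented for this example.

```python
import operator


def packed_op(op, x, y):
    """Apply op lane by lane, like one packed SIMD instruction."""
    return [op(a, b) for a, b in zip(x, y)]


def cycles_per_packed_op(register_bits=128, datapath_bits=128):
    """Passes needed to push a full register through the datapath."""
    return -(-register_bits // datapath_bits)  # ceiling division


def peak_flops_per_cycle(element_bits, register_bits=128, packed_ops_per_cycle=2):
    """Lanes per register times packed FP operations issued per cycle.

    packed_ops_per_cycle=2 is an assumption: one packed add and one packed
    multiply in the same cycle.
    """
    lanes = register_bits // element_bits
    return lanes * packed_ops_per_cycle


x = [1.0, 2.0, 3.0, 4.0]      # X1..X4
y = [10.0, 20.0, 30.0, 40.0]  # Y1..Y4
print(packed_op(operator.add, x, y))
print("128-bit datapath:", cycles_per_packed_op(128, 128), "cycle")
print("64-bit datapath:", cycles_per_packed_op(128, 64), "cycles")
print("single precision peak:", peak_flops_per_cycle(32), "flops/cycle")
print("double precision peak:", peak_flops_per_cycle(64), "flops/cycle")
```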


Intel® Core® Micro-architecture Notable Features
Intel® Advanced Digital Media Boost
• Additional Media Instructions - Supplemental Streaming SIMD
Extensions 3 (SSSE3)
• 16 new packed integer instructions
• Targeting video encode/decode
• Significantly improved strings
• REP MOVS and REP STOS
• ~8 bytes / cycle throughput
• mileage may vary


Intel® Core® Micro-architecture Notable Features
Intel® Advanced Digital Media Boost
• Supplemental SSE-3 (SSSE-3)

Horizontal Addition/Subtraction                 PHADDW, PHADDSW, PHADDD, PHSUBW, PHSUBSW, PHSUBD
Packed Absolute Values                          PABSB, PABSW, PABSD
Multiply and Add Packed Signed/Unsigned Bytes   PMADDUBSW
Packed Multiply High with Round and Scale       PMULHRSW
Packed Shuffle Bytes                            PSHUFB
Packed SIGN                                     PSIGNB/W/D
Packed Align Right                              PALIGNR
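
To make a few of these instructions concrete, the following sketch models the 128-bit register forms of PHADDW, PABSB and PSHUFB on plain Python lists, with element 0 as the lowest lane. These are reference models written for illustration only; the function names mirror the mnemonics but are not an existing library API.

```python
def _wrap16(value):
    """Wrap to a signed 16-bit result (PHADDW does not saturate)."""
    return ((value + 0x8000) & 0xFFFF) - 0x8000


def phaddw(dst, src):
    """Horizontal add of adjacent 16-bit words.

    The low four result words come from pairs in dst, the high four from
    pairs in src. Both inputs are lists of 8 signed 16-bit integers.
    """
    def pair_sums(words):
        return [_wrap16(words[i] + words[i + 1]) for i in range(0, 8, 2)]
    return pair_sums(dst) + pair_sums(src)


def pabsb(src):
    """Absolute value of each signed byte; results are unsigned (abs(-128) = 128)."""
    return [abs(b) for b in src]


def pshufb(dst, mask):
    """Byte shuffle: each mask byte selects a byte of dst, or zero if bit 7 is set."""
    return [0 if m & 0x80 else dst[m & 0x0F] for m in mask]


words_a = [1, 2, 3, 4, 5, 6, 7, 8]
words_b = [10, 20, 30, 40, 50, 60, 70, 80]
print(phaddw(words_a, words_b))

data = list(range(100, 116))
reverse_mask = list(range(15, -1, -1))
print(pshufb(data, reverse_mask))
```

PSHUFB with a reversing mask, as in the usage above, is a common way to do byte-order swaps in a single instruction.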

Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability
• Advanced power gating & Dynamic power coordination
• Multi-point demand-based switching
• Voltage-Frequency switching separation
• Supports transitions to deeper sleep modes
• Event blocking
• Clock partitioning and recovery
• Dynamic Bus Parking
• During periods of high performance execution, many parts of the
chip core can be shut off


Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Intel® Core® Micro-architecture Drill-down

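[Figure: Core microarchitecture block diagram.
 Front end: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS.
 Out-of-order core: register alias table, ALLOC, Reservation Station, Re-Order Buffer, execution units (integer, FP, SIMD; 3x).
 Memory sub-system: page miss handler, data cache unit, memory order buffer, load, store address, store data.]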



Core® Micro-architecture Front End

Prepares instructions before they are executed
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit

[Figure: front-end blocks: icache, branch prediction unit, predecode, instruction queue, instruction decode, MS]
Intel® Core™ Microarchitecture – Front End

Instruction Queue

Buffer between the instruction pre-decode unit and the decoder
• Up to six predecoded instructions written per cycle
• Holds up to 18 instructions
• Up to 5 instructions read from the IQ per cycle
Potential loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instructions
• Potential power saving


Macro-Fusion

Roughly 15% of all instructions are conditional branches.
Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
Not supported in EM64T long mode.

[Figure: Scheduler → “cmpjae eax, [mem], label” → Execution → Branch Eval → flags and target to Write back]


Macro-Fusion Absent

Read four instructions from the Instruction Queue
Each instruction gets decoded into separate uops
Enabling Example
for (int i=0; i<100000; i++) {
…
}

Instruction Queue
addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp eax, 100000
jge label

Cycle 1   addps xmm0, [EAX+16]     dec0
          mulps xmm0, xmm0         dec1
          movps [EAX+240], xmm0    dec2
          cmp eax, 100000          dec3
Cycle 2   jge label                dec0

Macro-Fusion Presented

Read five instructions from the Instruction Queue
Send fusable pair to single decoder
Single uop represents two instructions

Enabling Example
for (unsigned int i=0; i<100000; i++) {
    …
}

Instruction Queue
    addps xmm0, [EAX+16]
    mulps xmm0, xmm0
    movps [EAX+240], xmm0
    cmp eax, 100000
    jae label

Cycle 1
    addps xmm0, [EAX+16]         dec0
    mulps xmm0, xmm0             dec1
    movps [EAX+240], xmm0        dec2
    cmpjae eax, 100000, label    dec3

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Micro-Fusion (cont.)

u-ops of a store: “movps [EAX+240], xmm0”

    sta eax+240             (store address)
    std xmm0, [eax+240]     (store data)

are fused into a single micro-op:

    st xmm0, [eax+240]

Intel® Core™ Microarchitecture – Front End

Branch Prediction Improvements

Intel® Pentium® 4 Processor branch prediction
PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector

Branch mispredictions reduced by >20%


Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Core® Micro-architecture Execution Core

Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit

[Diagram: ALLOC and register alias table feed the Reservation Station and Re-Order Buffer; the Reservation Station dispatches to store address, store data, load, and integer / FP / SIMD (3x) ports]

Intel® Core™ Microarchitecture – Execution Core

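Execution Core Building Blocks

[Diagram: Renamer, RS and ROB feed the Execution Unit; numbers are issue ports]
• Ports 0, 1, 5: execution units labeled Integer, SIMD/Integer MUL, SIMD Integer, and Floating Point
• Port 2: Load
• Ports 3, 4: Store
• Load and Store connect to the Memory Sub-system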
Intel® Core™ Microarchitecture – Execution Core

Issue Ports and Execution Units


6 dispatch ports from RS
• 3 execution ports
• (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 cycles DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)

Intel® Core™ Microarchitecture – Execution Core

Retirement Unit

Re-Order Buffer (ROB)
• Holds micro-ops in various stages of completion
• Buffers completed micro-ops
• Updates the architectural state in order
• Manages ordering of exceptions

[Diagram: ALLOC, register alias table, Reservation Station, Re-Order Buffer]


Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture tour
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Core® Micro-architecture Memory Sub-System

Memory Ordering Buffer
• Store Address Buffer
    • Stores the address of each store not actually performed
    • Loads compare their address to any store older than themselves
    • If it finds a hole…
• Store Data Buffer
    • Stores the data of each store not actually performed
    • If a load hits in the SAB, the data is forwarded from here
• Load Buffer
    • Stores the address of non-retired loads
    • For snoops and re-dispatch
• One 128-bit load and one 128-bit store per cycle to different memory locations
• Out of order memory operations
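
The interplay of the Store Address Buffer and Store Data Buffer can be sketched as a toy software model. This is only an illustration under stated assumptions, not the hardware design: the names `store_buffer`, `sb_store`, `sb_load` and the capacity `SB_CAPACITY` are hypothetical, addresses are simple word indices into a small array, and only exact address matches are forwarded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_CAPACITY 8 /* hypothetical number of pending stores */

/* Pending (not yet performed) stores, oldest first. */
typedef struct {
    size_t   addr[SB_CAPACITY]; /* plays the role of the Store Address Buffer */
    uint64_t data[SB_CAPACITY]; /* plays the role of the Store Data Buffer    */
    size_t   count;
} store_buffer;

/* Record a store that has not been written to memory yet. */
static void sb_store(store_buffer *sb, size_t addr, uint64_t value)
{
    assert(sb->count < SB_CAPACITY);
    sb->addr[sb->count] = addr;
    sb->data[sb->count] = value;
    sb->count++;
}

/* A load compares its address against all older stores, youngest first.
 * On a hit the data is forwarded from the buffer; otherwise it is read
 * from memory. *forwarded reports which path was taken. */
static uint64_t sb_load(const store_buffer *sb, const uint64_t *memory,
                        size_t addr, int *forwarded)
{
    for (size_t i = sb->count; i-- > 0;) {
        if (sb->addr[i] == addr) {
            *forwarded = 1;
            return sb->data[i];
        }
    }
    *forwarded = 0;
    return memory[addr];
}
```

In this model a load to an address with a pending store returns the youngest pending value without touching `memory`, which is the behavior the Store Data Buffer bullet describes.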

Intel® Core™ Microarchitecture – Memory Sub-system

Core® Micro-architecture Memory Sub-System (cont.)

32k D-Cache (8-way, 64 byte line size)
Shared second level (L2) 2MB 8-way or 4MB 16-way instruction and data cache
Cache to cache transfer
• improves producer / consumer style MP
Wider interface to L2
• reduced interference
• processor line fill is 2 cycles
Higher bandwidth from the L2 cache to the core
• ~14 clock latency and 2 clock throughput
Load & Store Access order
1. L1 cache of immediate core
2. L1 cache of the other core
3. L2 cache
4. Memory

[Diagram: Core1 and Core2 share a 2 MB L2 Cache, which connects to the Bus]

Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access / Enhanced Data Pre-fetch Logic

Speculates the next needed data and loads it into cache by HW and/or SW

[Analogy: Door (L1 Cache) → Valet Parking Area (L2 Cache) → Main Parking Lot (External Memory)]

Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access / Enhanced Data Pre-fetch Logic (cont.)

• L1D cache prefetching
    • Data Cache Unit Prefetcher
        • Known as the streaming prefetcher
        • Recognizes ascending access patterns in recently loaded data
        • Prefetches the next line into the processor's cache
    • Instruction Based Stride Prefetcher
        • Prefetches based upon a load having a regular stride
        • Can prefetch forward or backward 2 Kbytes
            • 1/2 default page size
• L2 cache prefetching: Data Prefetch Logic (DPL)
    • Prefetches data to the 2nd level cache before the DCU requests the data
    • Maintains 2 tables for tracking loads
        • Upstream – 16 entries
        • Downstream – 4 entries
    • Every load is either found in the DPL or generates a new entry
    • Upon recognition of the 2nd load of a “stream” the DPL will prefetch the next load
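
The idea behind a stride-based prefetcher can be illustrated with a small software model. This is a sketch under assumptions, not the actual hardware algorithm: `stride_entry`, `stride_observe` and `MAX_REACH` are hypothetical names, one entry tracks one load instruction, and a prediction is made once the same non-zero stride (within the 2 Kbyte reach quoted above) has been seen twice in a row.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REACH 2048 /* forward/backward reach in bytes, taken from the slide */

/* Tracking state for one load instruction (hypothetical layout). */
typedef struct {
    uint64_t last_addr;  /* address of the previous load          */
    int64_t  stride;     /* distance between the last two loads   */
    int      loads_seen; /* 0, 1, or 2 (meaning "two or more")    */
} stride_entry;

/* Feed one load address into the model.
 * Returns 1 and writes the predicted next address to *next when the
 * current stride equals the previous one; returns 0 otherwise. */
static int stride_observe(stride_entry *e, uint64_t addr, uint64_t *next)
{
    int predicted = 0;
    assert(e != NULL && next != NULL);

    if (e->loads_seen >= 1) {
        int64_t delta = (addr >= e->last_addr)
                            ? (int64_t)(addr - e->last_addr)
                            : -(int64_t)(e->last_addr - addr);
        if (e->loads_seen >= 2 && delta == e->stride && delta != 0 &&
            delta <= MAX_REACH && delta >= -MAX_REACH) {
            *next = addr + (uint64_t)delta; /* wraps correctly for negative strides */
            predicted = 1;
        }
        e->stride = delta;
    }
    e->last_addr = addr;
    if (e->loads_seen < 2)
        e->loads_seen++;
    return predicted;
}
```

Feeding the addresses 1000, 1064, 1128 into a zero-initialized entry yields a prediction of 1192 on the third call; a jump to an unrelated address resets the pattern.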
Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access / Memory Disambiguation

Memory Disambiguation predictor
• Loads that are predicted NOT to forward from a preceding store are allowed to schedule as early as possible
    • increasing the performance of OOO memory pipelines
Disambiguated loads checked at retirement
• Extension to existing coherency mechanism
• Invisible to software and system

Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access / Memory Disambiguation Absent

Load4 must WAIT until previous stores complete

[Diagram: the instruction stream Store1 Y, Load2 Y, Store3 W, Load4 X accesses the memory locations Data W, Data X, Data Y, Data Z in program order]
Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access / Memory Disambiguation Presented

Loads can decouple from stores
Load4 can get its data WITHOUT waiting for stores

[Diagram: Load4 X is moved ahead of Store1 Y, Load2 Y, Store3 W; memory locations Data W, Data X, Data Y, Data Z]
Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access / Store Forwarding

If a load follows a store and reloads the data that the store writes to memory, the micro-architecture can forward the data directly from the store to the load

[Diagram: Store1 Y places Data Y in the internal buffers on its way to memory; Load2 Y receives Data Y from the internal buffers]

Advanced Memory Access / Store Forwarding: Aligned Store Cases

Loads shown for each aligned store size (each group tiles the stored data):

store 16        load 16 | 2 x ld 8
store 32 bit    load 32 bit | 2 x load 16 | 4 x ld 8
store 64 bit    load 64 bit | 2 x load 32 bit | 4 x load 16 | 8 x ld 8
store 128 bit   load 128 bit | 2 x load 64 bit | 4 x load 32 bit | 8 x load 16 | 16 x ld 8

Advanced Memory Access / Store Forwarding: Unaligned Cases

Note that unaligned store forward does not occur when the load crosses a cache line boundary

store 16        load 16‡ | 2 x ld 8
store 32 bit    load 32 bit‡ | load 16‡, load 16 | 4 x ld 8
store 64 bit    load 64 bit | load 32 bit‡, load 32 bit | load 16‡, load 16, load 16, load 16 | 8 x ld 8

Legend (box shading in the original figure): Store forwarded to load / No forwarding
‡: No forwarding if the load crosses a cache line boundary
Note: Unaligned 128-bit stores are issued as two 64-bit stores. This provides two alignments for store forwarding

Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture tour
Coding considerations


Optimizing for Instruction Fetch and PreDecode
Avoid “Length Changing Prefixes” (LCPs)
• Affects instructions with immediate data or offset
• Operand Size Override (66H)
• Address Size Override (67H) [obsolete]
• LCPs change the length decoding algorithm – increasing the
processing time from one cycle to six cycles (or eleven cycles
when the instruction spans a 16-byte boundary)
• The REX (EM64T) prefix (4xH) is not an LCP
• The REX prefix does lengthen the instruction by one byte, so use
of the first eight general registers in EM64T is preferred


Optimizing for Instruction Queue
Includes a “Loop Stream Detector” (LSD)
• Potentially very high bandwidth instruction streaming
• A number of requirements to make use of the LSD
• Maximum of 18 instructions in up to four 16-byte packets
• No RET instructions (hence, little practical use for CALLs)
• Up to four taken branches allowed
• Most effective at 70+ iterations
• LSD is after PreDecode so there is no added cost for LCPs
• Trade-off LSD with conventional loop unrolling


Optimizing for Decode
Decoder issues up to 4 uOps for renaming/allocation per clock
• This creates a trade off between more complex instruction
uOps versus multiple simple instruction uOps
• For example, a single four uOp instruction is all that can be
renamed/allocated in a single clock
• In some cases, multiple simple instructions may be a better
choice than a single complex instruction
• Single uOp instructions allow more decoder flexibility
• For example, 4-1-1-1 can be decoded in one clock
• However, 2-2-2-1 takes three clocks to decode
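
The two decode patterns above can be reproduced with a simplified model of a 4-1-1-1 decoder group. This sketch assumes that decoder 0 accepts any instruction of one to four uOps, that decoders 1 to 3 accept only single-uOp instructions, and that instructions are decoded strictly in order; `decode_clocks` is a hypothetical helper, and the model ignores every other front-end limit.

```c
#include <assert.h>
#include <stddef.h>

/* Count the clocks needed to decode a sequence of instructions,
 * given the number of uOps (1..4) each instruction produces. */
static unsigned decode_clocks(const unsigned *uops, size_t n)
{
    unsigned clocks = 0;
    size_t i = 0;

    while (i < n) {
        /* Decoder 0 takes the next instruction, whatever its size. */
        assert(uops[i] >= 1 && uops[i] <= 4);
        i++;
        /* Decoders 1..3 each take one following single-uOp instruction;
         * a multi-uOp instruction has to wait for decoder 0 next clock. */
        for (int d = 1; d <= 3 && i < n && uops[i] == 1; d++)
            i++;
        clocks++;
    }
    return clocks;
}
```

With this model the sequence 4-1-1-1 needs one clock and 2-2-2-1 needs three, matching the figures on the slide.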


Optimizing for Execution
Up to six uOps can be dispatched per clock
• “Store Data” and “Store Address” dispatch ports are combined on the block diagram
Up to four results can be written back per clock
Single clock latency operations are best
• Differing latency operations can create writeback conflicts
• Separate multiple-clock uOps with several single uOp instructions
• Typical instructions here: ADC/SBB, RMW, CMOVcc
• In some cases, separating an RMW instruction into its pieces might be faster (decode and scheduling flexibility)
When equivalent, PS preferred to PD (LCP)
• For example, MOVAPS over MOVAPD, XORPS over XORPD


Optimizing for Execution (cont.)
Bypass register “access” preferred to register reads
Partial register accesses often lead to stalls
• Register size access that ‘conflicts’ with recent previous register write
• Partial XMM updates subject to dependency delays
• Partial flag stall can occur, too → much higher cost
    • Use TEST instruction between shift and conditional to prevent
• Common zeroing instructions (e.g., XOR reg,reg) don’t stall
Avoid bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
Vectorization: careful packing/unpacking sequence
• Use MXCSR’s FZ and DAZ controls as appropriate


Optimizing for Memory
Software prefetch instructions
• Can reach beyond a page boundary (including page walk)
• Prefetches only when it completes without an exception
General techniques to help these prefetchers
• Organize data in consecutive lines
• In general, increasing addresses are more easily prefetched


Summary

What has been covered


• Notable features of Core® Micro-architecture
• Wide Dynamic Execution
• Advanced Memory Access
• Advanced Smart Cache
• Advanced Digital Media Boost
• Power Efficient Support
• Core® Micro-architecture components
• Front End
• OOO execution core
• Memory sub-system


Platform

Intel provides most of the silicon on any computer
Classical platform partition
• CPU – Computation
• MCH – high speed IO
• ICH – low speed IO
Graphics speed and memory latencies will require different partition
This presentation focuses on the core microarchitecture

[Diagram: CPU (Core, Core, LLC) → FSB → MCH → DMI → ICH; other labels in the figure: MEM, DDR, ME, HD video, Graphics, PCIe PEG, Display, TVout, Analog, Wireless, PCI (IO), SATA, USB, KBRD, Legacy & Debug I/O, others]


Intel® 64 = Extending IA-32 to 64 Bit

• Extended Memory Addressability: 64-Bit Pointers, Registers
• Additional Registers: 8 SSE & 8 Gen Purpose
• Double Precision (64-bit) Integer Support
= With 64-Bit Extension Technology

Added to Intel XEON™ and Pentium® 4 Processor in 2004; today available in all mainstream Intel IA-32 processors – in particular in all processors based on Intel® Core™ Architecture


Intel® 64 - New Modes of Operation

                                                             Compile    New Features                      GPR     Defaults
Mode                             OS Req'd                    required   64-bit IP   RIP Rel.   New Regs   Width   Addr Size   Operand Size
Long Mode: 64-bit Mode           New 64-bit OS               Yes        Yes         Yes        Yes        64      64          32
Long Mode: Compatibility Mode    New 64-bit OS               No         Yes         No         No         32      32 / 16     32 / 16
Legacy Mode (IA32 Mode)          Legacy 32-bit or 16-bit OS  No         No          No         No         32      32 / 16     32 / 16


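Registers: Extensions and Additions

[Figure: register file with Intel® 64]
• Instruction pointer: EIP extended to RIP
• General purpose registers (bits 63..32 added above bits 31..0): RAX/EAX, RBX/EBX, RCX/ECX, RDX/EDX, RBP/EBP, RSI/ESI, RDI/EDI, RSP/ESP
• New general purpose registers: R8 to R15
• SSE registers (bits 127..0): XMM0 to XMM7, plus new XMM8 to XMM15
• X87/MMX registers (bits 79..0)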


Registers: Availability in different modes


64-bit Mode of Operation

Default data size is 32 bits
• Override to 64 bits using new REX prefix
All registers are 64-bit, 32-bit, 16-bit and 8-bit addressable
REX prefixes
• A family of 16 prefixes, encoded 0x40-0x4F
• Allows the use of general purpose registers as 64 bits
• Allows the use of new registers (like r8-r15)
Instructions that set a 32-bit register automatically zero the upper 32 bits (zero extension)


REX Prefix

A new instruction-prefix byte used in 64-bit mode
• Specify the new GPRs and SSE registers
• Specify a 64-bit operand size
• Specify extended control registers (used by system software)
An instruction can have only one REX prefix, which, if used, must immediately precede the opcode or the two-byte opcode escape prefix.
The legacy instruction-size limit of 15 bytes still applies to instructions that contain a REX prefix.
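
The REX byte can be picked apart with a few bit operations. In 64-bit mode the values 0x40-0x4F have the bit layout 0100WRXB: W selects a 64-bit operand size, and R, X and B each supply the extra high bit for one register field (ModRM.reg, SIB.index, and ModRM.rm / SIB.base / opcode register). The sketch below uses hypothetical names (`rex_fields`, `rex_decode`) and only classifies a single byte; it is not a full instruction decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The four flag bits carried by a REX prefix. */
typedef struct {
    int w; /* 1 = 64-bit operand size            */
    int r; /* extends the ModRM.reg field        */
    int x; /* extends the SIB.index field        */
    int b; /* extends ModRM.rm / SIB.base / reg  */
} rex_fields;

/* Returns 1 and fills *out if `byte` lies in 0x40-0x4F, else returns 0. */
static int rex_decode(uint8_t byte, rex_fields *out)
{
    assert(out != NULL);
    if ((byte & 0xF0u) != 0x40u)
        return 0;
    out->w = (byte >> 3) & 1;
    out->r = (byte >> 2) & 1;
    out->x = (byte >> 1) & 1;
    out->b = byte & 1;
    return 1;
}
```

For example, 0x48 decodes to W=1 with R, X and B clear, which is the prefix commonly seen on 64-bit operations that use only the first eight registers.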


Physical and Linear Addressing

Linear Addressing
• Initial Intel® 64 implementations support 48 bits of virtual addressing.
• Addresses are required to be in canonical form – bits 47 through 63 must all be 1 or all be 0.
Physical Addressing
• Initial NetBurst™ Intel® 64 implementations supported 36 bits; today all current processors support at least 40 bits
• Entries in page tables expanded for up to 52 bits of physical address.

Intel®64 - Large Memory Considerations

Canonical addressing for 64-bit addresses
• Although the architecture now allows calculating flat addresses to 64 bits, today’s processors limit virtual addressing to 48 bits
• Canonical address definition: an address that has address bits 63 through 47 set to either all ones or all zeros
• Canonical addresses are a requirement
• Values for addresses that are not canonical will cause faults when put into locations expecting a valid address, such as segment registers
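
The canonical-form rule translates directly into a check on the upper bits. The sketch below assumes the 48-bit virtual address width quoted above; `is_canonical` and `VIRT_ADDR_BITS` are illustrative names, not part of any Intel API.

```c
#include <assert.h>
#include <stdint.h>

#define VIRT_ADDR_BITS 48u /* implemented virtual address width (assumption) */

/* Returns 1 if bits 63..47 of addr are all zeros or all ones. */
static int is_canonical(uint64_t addr)
{
    assert(VIRT_ADDR_BITS > 1u && VIRT_ADDR_BITS < 64u);
    const unsigned shift = VIRT_ADDR_BITS - 1u;                    /* 47 */
    const uint64_t upper = addr >> shift;                          /* bits 63..47 */
    const uint64_t all_ones = (UINT64_C(1) << (64u - shift)) - 1u; /* 17 one-bits */
    return upper == 0 || upper == all_ones;
}
```

The highest canonical address in the lower half is 0x00007FFFFFFFFFFF and the lowest in the upper half is 0xFFFF800000000000; everything in between fails the check.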


Introducing SIMD: Single Instruction Multiple Data

Scalar processing
• traditional mode
• one operation produces one result
      X + Y = X+Y

SIMD processing
• with SSE / SSE2
• one operation produces multiple results
        X:   x3      x2      x1      x0
      + Y:   y3      y2      y1      y0
      X+Y:   x3+y3   x2+y2   x1+y1   x0+y0



X86 Register Sets

SSE registers introduced first in Pentium® 3

IA-INT Registers (32 bits wide: eax … edi)
• Fourteen 32-bit registers
• Scalar data & addresses
• Direct access to regs

MMX™ Technology / IA-FP Registers (80 / 64 bits wide: st0 / mm0 … st7 / mm7)
• Eight 80/64-bit registers
• Hold data only
• Stack access to FP0..FP7
• Direct access to MM0..MM7
• No MMX™ Technology / FP interoperability

SSE Registers (128 bits wide: xmm0 … xmm7)
• Eight 128-bit registers
• Hold data only:
    • 4 x single FP numbers
    • 2 x double FP numbers
    • 128-bit packed integers
• Direct access to the registers
• Use simultaneously with FP / MMX Technology

Instruction Set Extensions

New Instructions Added to Intel® Processors

[Bar chart: number of new instructions per extension; vertical axis 0 to 160; data labels 144, 70, 56, ~50, 32, 32, 13]

Date      Extension                                  Process (nm)
Jan-97    MMX™                                       350
Feb-99    Streaming SIMD Extensions (SSE)            250
Dec-00    Streaming SIMD Extensions 2 (SSE2)         180
Feb-04    Streaming SIMD Extensions 3 (SSE3)         90
Jul-06    Supplemental SSE3 (SSSE3)                  65
2008+     Future SSE-4                               45
Future    Future Intel instruction set extensions    45

Beginning in 2008: ~50 new instructions in 13 groups
All function in 32-bit and 64-bit modes
Improvements in Commercial Data Integrity i-SCSI, Video Processing, String and Text Processing, 2D & 3D Imaging, Vectorizing Compiler Performance


SSE and SSE-2 Data Types

SSE
• 4x floats
SSE-2
• 2x doubles
• 16x bytes
• 8x 16-bit shorts
• 4x 32-bit integers
• 2x 64-bit integers
• 1x 128-bit(!) integer


SSE Instruction Set Extensions

Introduced by Pentium® 3 in 1999; now frequently called SSE-1
Only new data type supported: 4x32-bit (single precision) floating point data
Some 70 instructions
• Arithmetic, compare, convert operations on SSE SP FP data
• PACKED, UNPACKED
• Data load/store
• Prefetch
• Extension of MMX
• Streaming Store (store without using cache in between)
• …

2001 PTE Engineering Enabling Conference


SSE Sample: Branch Removal

R = (A < B) ? C : D    // remember: everything packed

A                     0.0     0.0     -3.0    3.0
B                     0.0     1.0     -5.0    5.0
mask = cmplt(A, B)    00000   11111   00000   11111

C                     c3      c2      c1      c0
D                     d3      d2      d1      d0
mask and C            00000   c2      00000   c0
mask nand (and-not) D d3      00000   d1      00000

R = or of both        d3      c2      d1      c0
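
The compare/and/and-not/or sequence above can be modelled one lane at a time in portable C. This is a minimal sketch, not SSE code: the names select_lt4, lane_cmplt, float_bits and bits_float are hypothetical, and the loop stands in for the four lanes that the packed instructions process at once.

```c
#include <stdint.h>
#include <string.h>

/* Model of one cmplt lane: all ones if a < b, all zeros otherwise. */
static uint32_t lane_cmplt(float a, float b)
{
    return (a < b) ? 0xFFFFFFFFu : 0u;
}

/* Reinterpret a float as its 32-bit pattern and back (no value conversion). */
static uint32_t float_bits(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float bits_float(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* r[i] = (a[i] < b[i]) ? c[i] : d[i], built as (mask AND c) OR (NOT mask AND d). */
void select_lt4(const float a[4], const float b[4],
                const float c[4], const float d[4], float r[4])
{
    for (int i = 0; i < 4; i++) {
        uint32_t mask = lane_cmplt(a[i], b[i]);
        uint32_t bits = (mask & float_bits(c[i])) | (~mask & float_bits(d[i]));
        r[i] = bits_float(bits);
    }
}
```

With the A and B values from the slide, the lanes where A < B holds receive the C element and the remaining lanes receive the D element.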

SSE-2 Instruction Set Extensions

Introduced by Intel® Pentium® 4 processor in 2000
Some 140 new instructions
Added double precision floating point data (2x64-bit) and all related instructions including conversion
Again some extensions to MMX
Added all possible combinations of integer data to SSE (1x128, 2x64, 4x32, 8x16, 16x8) and related operations


SIMD Single vs. SIMD Double

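4 x Single Precision (SSE-1)
• SIMD SP FP operand = 4 elements; element = SP FP number
• 128-bit operand, bits 127..0: X3 | X2 | X1 | X0
• Each element: bit 31 = S (sign), bits 30..23 = Exponent, bits 22..0 = Significand

2 x Double Precision (SSE-2)
• SIMD DP FP operand = 2 elements; element = DP FP number
• 128-bit operand, bits 127..0: X1 | X0
• Each element: bit 63 = S (sign), bits 62..52 = Exponent, bits 51..0 = Significand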

Sample for SSE-2: SIMD Double ↔ SIMD Int Conversion

SIMD Double → SIMD Int: conversion to two lower ints, two higher ints cleared

    // intrinsics declared in <emmintrin.h>
    __m128d x;                  // x1 | x0
    __m128i ix;
    ix = _mm_cvtpd_epi32(x);    // 00000 | 00000 | (int)x1 | (int)x0

SIMD Int → SIMD Double: conversion from two lower ints

    x = _mm_cvtepi32_pd(ix);    // ???? | ???? | ix1 | ix0  becomes  (double)ix1 | (double)ix0

SSE3: No new Data Types but new Instructions

• FP to integer conversions: FISTTP
• Complex arithmetic: ADDSUBPD, ADDSUBPS, MOVDDUP, MOVSHDUP, MOVSLDUP
• Video encoding: LDDQU
• SIMD FP using AOS format*: HADDPD, HSUBPD, HADDPS, HSUBPS
• Thread synchronization: MONITOR, MWAIT

* Also benefits Complex and Vectorization


Streaming SIMD Extensions 3


13 new instructions

Three have limited use for application performance improvement
• FISTTP - x87 to integer conversion (requires –longdouble switch)
• MONITOR/MWAIT - thread synchronization
  • Available today in Ring 0 only; being used by newer Windows* and Linux* thread packages

The other ten have some potential for specific application domains


SSE-3 Sample Complex Arithmetic: ADDSUBPS

ADDSUBPS OperandA, OperandB
• OperandA (xmm register; 4 data elements)
  • a3, a2, a1, a0
• OperandB (xmm register or memory address; 4 data elements)
  • b3, b2, b1, b0
• Result (stored in OperandA)
  • a3+b3, a2-b2, a1+b1, a0-b0

Intrinsic: __m128 _mm_addsub_ps(__m128 a, __m128 b)

OperandA    a3      a2      a1      a0
OperandB    b3      b2      b1      b0
            Add     Sub     Add     Sub
Result      a3+b3   a2-b2   a1+b1   a0-b0
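
To see why the alternating subtract/add pattern suits complex arithmetic, here is a scalar model in portable C. It is a sketch under stated assumptions: addsub_ps_model and complex_mul_model are hypothetical names, element 0 is the lowest lane, and a complex number is stored as {real, imaginary}.

```c
/* Scalar model of ADDSUBPS: even lanes (0, 2) subtract, odd lanes (1, 3) add.
   dst plays the role of OperandA and is overwritten with the result. */
void addsub_ps_model(float dst[4], const float src[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = (i % 2 == 0) ? dst[i] - src[i] : dst[i] + src[i];
}

/* (x[0] + i*x[1]) * (y[0] + i*y[1]) using the subtract-even / add-odd pattern:
   real = x0*y0 - x1*y1, imaginary = x1*y0 + x0*y1. Upper lanes are unused. */
void complex_mul_model(const float x[2], const float y[2], float out[2])
{
    float lhs[4] = { x[0] * y[0], x[1] * y[0], 0.0f, 0.0f };
    float rhs[4] = { x[1] * y[1], x[0] * y[1], 0.0f, 0.0f };
    addsub_ps_model(lhs, rhs);
    out[0] = lhs[0];
    out[1] = lhs[1];
}
```

The real part needs a subtraction and the imaginary part an addition, which is exactly the lane pattern of the instruction.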

Sample SSSE-3 Inst.: Byte Permute

PSHUFB mm, mm/m64
PSHUFB xmm, xmm/m128
• A complete byte-granularity permutation
• The source operand is used as the control field (variable control)
• The destination operand gets permuted
• Each byte of the source field selects the origin of the corresponding destination byte
• Also includes force-byte-to-zero flag (bit 7)

src (control)   0x07 0x07 0xFF 0x80 0x01 0x00 0x00 0x00
dest (before)   0x04 0x01 0x07 0x03 0x02 0x02 0xFF 0x01
dest (after)    0x04 0x04 0x00 0x00 0xFF 0x01 0x01 0x01
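
The permutation rule can be checked against the example with a scalar model of the 64-bit form. A minimal sketch: pshufb64_model is a hypothetical name, index 0 is the least significant byte, and the slide is assumed to list bytes from most significant (left) to least significant (right).

```c
#include <stdint.h>

/* Scalar model of PSHUFB mm, mm/m64.
   For each destination byte i: if bit 7 of ctrl[i] is set the byte is zeroed,
   otherwise it is taken from the ORIGINAL destination byte selected by the
   low 3 bits of ctrl[i] (the 128-bit form would use the low 4 bits). */
void pshufb64_model(uint8_t dst[8], const uint8_t ctrl[8])
{
    uint8_t old[8];
    for (int i = 0; i < 8; i++)
        old[i] = dst[i];
    for (int i = 0; i < 8; i++) {
        if (ctrl[i] & 0x80)
            dst[i] = 0;
        else
            dst[i] = old[ctrl[i] & 0x07];
    }
}
```

Copying the destination first matters because every output byte selects from the value the register held before the instruction started.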


Ways to SSE/SIMD programming

Coding using SSE/SSE2/3/4 assembler instructions
• Very tedious (manual scheduling) – discouraged: don’t do it!
• E.g.: how do you exploit the benefit of now having 16 instead of 8 SSE registers on Intel® 64 without maintaining two versions?

Intel® compiler’s C/C++ SIMD intrinsics
• No need to take care of register allocation, scheduling, etc.

Intel® compiler’s C++ Vector Class Library
• Use this if you are heavy into C++ classes

Vectorizer of Intel® C++ and Fortran Compilers
• Recommended for most cases – easy and efficient

Use ready-to-go vectorized code from a library like Intel® Math Kernel Library (MKL)
Compiler Based Vectorization

Processor-specific switches (Linux*): generate code and optimize for
• Pentium® 3 compatible and Athlon XP processors, including code generation for MMX and SSE: -xK, -axK
• Pentium® 4 compatible, Athlon 64 and Opteron processors in 32- and 64-bit mode, including code generation for MMX, SSE and SSE2: -xW, -axW
• Pentium® 4 processors in 32-bit mode, including code generation for MMX, SSE and SSE2 (deprecated switch: use -xW instead): -xN, -axN
• Pentium® M processors, including code generation for MMX, SSE and SSE-2: -xB, -axB
• Intel® processors with SSE3 capability, including Pentium 4 (both 32- and 64-bit mode), including code generation for MMX, SSE, SSE2 and SSE-3: -xP, -axP
• Intel® processors with MNI capability, i.e. Intel® Core™2 Duo processors (Conroe, Merom, Woodcrest), including code generation for MMX, SSE, SSE2, SSE-3 and MNI: -xT, -axT

Intel® Core® Micro-architecture Notable Features (cont.): New Instructions

Instruction name: Description
• psignb/w/d mm, mm/m64; psignb/w/d xmm, xmm/m128: Per element, if the source operand is negative, multiply the destination operand by -1.
• pabsb/w/d mm, mm/m64; pabsb/w/d xmm, xmm/m128: Per element, overwrite destination with absolute value of source.
• phaddw/d/sw mm, mm/m64; phaddw/d/sw xmm, xmm/m128: Pairwise integer horizontal addition + pack.
• phsubw/d/sw mm, mm/m64; phsubw/d/sw xmm, xmm/m128: Pairwise integer horizontal subtract + pack.
• PMADDUBSW mm, mm/m64; PMADDUBSW xmm, xmm/m128: Multiply signed & unsigned bytes. Accumulate result to signed words. (Multiply Accumulate)
• PMULHRSW mm, mm/m64; PMULHRSW xmm, xmm/m128: Signed 16-bit multiply, return high bits.
• PSHUFB mm, mm/m64; PSHUFB xmm, xmm/m128: A complete byte-granularity permutation, including force-to-zero flag.
• PALIGNR mm, mm/m64, imm8; PALIGNR xmm, xmm/m128, imm8: Extract any contiguous 16 (8 in the 64-bit case) bytes from the pair [dst, src] and store them to the dst register.

Dependencies and Bypasses

“Read-after-Write” dependency: 1 clock stall, assuming the register file can be written-through
    add eax, ecx → eax      F D E W
    sub ebx, eax → ebx        F D D E W
“E to D” bypass: saves the clock penalty
    add eax, ecx → eax      F D E W
    sub ebx, eax → ebx        F D E W
Long latency operations
    load [ecx+edi] → eax    F D E E E W
    add ebx, eax → ebx        F D D D E W
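
The three diagrams follow one rule, sketched below for a simple in-order F/D/E/W pipeline. The function names are hypothetical and the model covers only a consumer issued directly behind its producer: without a bypass the consumer repeats D until the producer's W cycle (write-through register file); with an E-to-D bypass it repeats D only until the producer's last E cycle.

```c
#include <string.h>

/* Extra D cycles a dependent instruction spends behind its producer.
   exec_cycles is the number of E cycles the producer needs (1 for add, 3 for the load). */
int raw_stall_cycles(int exec_cycles, int has_bypass)
{
    return has_bypass ? exec_cycles - 1 : exec_cycles;
}

/* Writes the consumer's stage sequence, e.g. "F D D E W", into out.
   out must hold at least 8 + 2 * stalls characters. */
void consumer_stages(int stalls, char *out)
{
    strcpy(out, "F D");
    for (int i = 0; i < stalls; i++)
        strcat(out, " D");
    strcat(out, " E W");
}
```

For example, raw_stall_cycles(1, 0) corresponds to the first diagram and raw_stall_cycles(3, 1) to the long-latency load.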


Fighting Stalls: Branch Handling

Given the code:

    for (i=100, a=0; i>0; i--) a+=B[i];

Compiler would generate:

    // eax initialized with zero, edi initialized with 100
    loop: load B[edi] → ebx     // read B[i] from memory
          add eax, ebx → eax    // a+=B[i]
          add edi, -1 → edi     // i-=1
          jnz edi, loop
          store eax → a         // store result


Fighting Stalls: Branch Handling (cont.)

    load B[edi] → ebx     F D E W
    add eax, ebx → eax      F D E W
    add edi, -1 → edi         F D E W
    jnz edi, loop               F D E W
    store eax → a                 F D E W        (wrong path, flushed)
    xxx                             F D E W      (wrong path, flushed)
    load B[edi] → ebx                 F D E W

Only after the branch's Execute stage do we know that the next fetch was wrong
• Need to flush the pipe
• IPC: 4 instructions in 6 clocks (IPC = 0.67 vs. optimum IPC = 1)
• ‘Pipe break’ penalty = 2 clocks
• Adding a stage?: IPC = 0.57, ~14% slower!!!
Prolonging the pipeline achieves higher frequencies; however, the pipe break penalty increases!
MUST solve the pipe break penalty problem!
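
The IPC figures above are simple ratios, and a short sketch makes the arithmetic explicit. The function names are hypothetical; the model assumes one flush per loop iteration and no other stalls.

```c
/* Loop IPC when each iteration pays a fixed pipeline-flush penalty. */
double loop_ipc(int instructions, int flush_penalty_clocks)
{
    return (double)instructions / (double)(instructions + flush_penalty_clocks);
}

/* Fractional slowdown when the flush penalty grows from old_penalty to new_penalty. */
double slowdown(int instructions, int old_penalty, int new_penalty)
{
    return 1.0 - loop_ipc(instructions, new_penalty)
               / loop_ipc(instructions, old_penalty);
}
```

With 4 instructions per iteration, a 2-clock penalty gives 4/6 and a 3-clock penalty gives 4/7; the ratio of the two is 6/7, which is where the roughly 14% figure comes from.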

Fighting Stalls: Branch Handling (cont.)

H/W can ‘learn’ about SW behavior
• Same branch goes same direction in most cases
• Learn branch address and target
• Branch Target Buffer (BTB)
• Predict based on branch history, surrounding branch behavior, loop behavior.
• We are at ~95% correct prediction.
• Looks in BTB while fetching instruction
• Lee&Smith or Yeh&Patt algorithms
New (and correct) pointer calculated in Fetch stage of branch

    load B[edi] → ebx     F D E W
    add eax, ebx → eax      F D E W
    add edi, -1 → edi         F D E W
    jnz edi, loop               F/P D E W
    load B[edi] → ebx             F D E W
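
As an illustration of "same branch goes same direction in most cases", here is a textbook 2-bit saturating counter applied to the loop branch of the earlier example. This is a generic scheme chosen for illustration, not a description of the predictor in any Intel processor; all names and the initial counter value are assumptions.

```c
/* 2-bit saturating counter: values 0..3, predict "taken" when the value is 2 or 3. */
typedef struct { int counter; } predictor2;

int predict_taken(const predictor2 *p)
{
    return p->counter >= 2;
}

void train(predictor2 *p, int taken)
{
    if (taken) {
        if (p->counter < 3) p->counter++;
    } else {
        if (p->counter > 0) p->counter--;
    }
}

/* Counts mispredictions of the jnz in: for (i = iterations; i > 0; i--) ...
   The branch is taken while the decremented counter is still non-zero. */
int count_loop_mispredicts(int iterations)
{
    predictor2 p = { 1 };  /* assumed start state: weakly not-taken */
    int misses = 0;
    for (int i = iterations; i > 0; i--) {
        int taken = (i - 1) != 0;
        if (predict_taken(&p) != taken)
            misses++;
        train(&p, taken);
    }
    return misses;
}
```

Under these assumptions the 100-iteration loop mispredicts only on its first and last iteration, since the counter needs one taken outcome to warm up and the final fall-through is always a surprise.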


Advanced Pipeline Techniques

Limitations of the Typical Pipeline Scheme
• IPC is theoretically limited to 1
• Actual IPC is less than 1 because of long latency operations, stalls (e.g. cache miss), pipeline flushes (due to branch misprediction), etc.
• Pipeline stages are frequently not balanced
• Cycle Time (Tc) is determined by the longest pipeline stage
Advanced Pipeline Techniques
• Super pipeline
• Super-scalar


Advanced Pipeline Techniques (cont.)

Super pipeline: shorter stages allow a higher frequency
F1 F2 D1 D2 E1 E2 W1 W2
F1 F2 D1 D2 E1 E2 W1 W2
F1 F2 D1 D2 E1 E2 W1 W2
Super-scalar: perform more in a single cycle
F D E W
F D E W
F D E W
F D E W


Fighting stalls: Out Of Order Execution (OoO)

Instructions are executed based on “data flow” rather than program order (Tomasulo’s algorithm)
1. Instruction Fetch and Decode.
2. Instruction queue @ Reservation Station.
3. Instruction
• waits in the queue until all input operands are available (this avoids the stall that occurs at this stage in an in-order processor)
• may leave the queue before earlier, older instructions.
4. Instruction Execution
5. Results are queued.
6. Instruction Reorder and Writeback.


Fighting stalls: Register Renaming

Creates new opportunities for OOO execution
• Eliminates Write-after-write (WAW) and Write-after-read (WAR) dependencies (hazards).
Architectural vs physical registers dispatch

Example:
    MULTD F4,F2,F2    reads from F2
    ADDD  F2,F0,F6    writes to F2
After renaming (assume F8 is unused):
    MULTD F4,F2,F2
    ADDD  F8,F0,F6

Example:
    1. mov eax, [m1]
    2. add eax, 2
    3. mov [m2], eax
    4. mov eax, [m3]
    5. add eax, 4
    6. mov [m4], eax
4, 5, 6 can be executed in parallel with 1, 2, 3, but after registers renaming only!!!
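
The effect on the six-instruction example can be shown with a small rename table. This is a sketch only: the names are hypothetical, and the pool of physical registers is treated as unbounded, whereas real hardware recycles a finite set.

```c
#define NUM_ARCH_REGS 8   /* assumed size: eax .. edi */

typedef struct {
    int map[NUM_ARCH_REGS];  /* architectural register -> current physical register */
    int next_free;           /* next unused physical register (unbounded in this sketch) */
} rename_table;

void rename_init(rename_table *t)
{
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        t->map[r] = r;       /* start with an identity mapping */
    t->next_free = NUM_ARCH_REGS;
}

/* A source operand reads whatever physical register currently holds the value. */
int rename_read(const rename_table *t, int arch_reg)
{
    return t->map[arch_reg];
}

/* A destination operand always gets a fresh physical register. */
int rename_write(rename_table *t, int arch_reg)
{
    t->map[arch_reg] = t->next_free++;
    return t->map[arch_reg];
}
```

Because instruction 4 writes eax into a fresh physical register, it no longer has to wait for instructions 1 to 3 to finish with the old one; only the true read-after-write chains 1-2-3 and 4-5-6 remain.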

Fighting Stalls: Re-Order Buffer (ROB)

Mechanism for renaming and retirement
Table contains instructions in order
• Instructions are entered in order
• Registers renamed by the entry number
• Once assigned: execution order unimportant
• After execution: entries marked
• An executed entry can be “retired” once all prior instructions have retired. That is:
• Update “real registers” with value of renamed regs
• Update memory
• Leave the ROB
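
The retirement rule (execute in any order, retire only in order) can be sketched as a circular buffer. All names and the buffer size are hypothetical, and the sketch tracks only the "executed" mark, not register values or memory updates.

```c
#define ROB_SIZE 8   /* assumed capacity for the sketch */

typedef struct {
    int done[ROB_SIZE];  /* 1 once the entry has executed */
    int head;            /* index of the oldest entry */
    int count;           /* number of live entries */
} rob;

void rob_init(rob *r)
{
    r->head = 0;
    r->count = 0;
    for (int i = 0; i < ROB_SIZE; i++)
        r->done[i] = 0;
}

/* Enter an instruction in program order; returns its entry number or -1 if full. */
int rob_alloc(rob *r)
{
    if (r->count == ROB_SIZE)
        return -1;
    int entry = (r->head + r->count) % ROB_SIZE;
    r->done[entry] = 0;
    r->count++;
    return entry;
}

/* Execution may complete in any order. */
void rob_mark_done(rob *r, int entry)
{
    r->done[entry] = 1;
}

/* Retire from the oldest entry while it is executed; returns how many retired. */
int rob_retire(rob *r)
{
    int retired = 0;
    while (r->count > 0 && r->done[r->head]) {
        r->done[r->head] = 0;
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
        retired++;
    }
    return retired;
}
```

An entry that finishes early simply waits: rob_retire stops at the first entry that has not executed yet, so younger results never become architecturally visible ahead of older ones.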


Fighting Stalls: Reservation Station(s)

Pool(s) of all “not yet executed” instructions
Maintains operand status “ready / not-ready”
Each cycle, executed instructions make more operands “ready”
Instructions whose operands are all “ready” can be “dispatched” for execution
Dispatcher chooses which of the “ready” instructions will be executed next


Fighting Stalls: Memory Order Buffer (MOB)

Idea: allow out-of-order execution among memory operations
Problem: memory dependencies cannot be fully resolved statically (memory disambiguation)
Structure similar in concept to ROB
Every access is allocated an entry
Address & data (for stores) are updated when known
Load is checked against all previous stores


Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)

Many buses are sized for worst case data
• x86 instruction of 15 bytes
• ALU can write-back 128 bits

Improved Energy Efficiency



Intel® Core® Micro-architecture Notable Features (cont.)
Intelligent Power Capability - Split Busses (core power feature)

By splitting buses to deal with varying data widths, we can gain the performance benefit of bus width while maintaining C dynamic closer to that of thinner buses

Improved Energy Efficiency



Agenda

Introduction
Knowledge refreshment
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Intel® Core® Micro-architecture Overview

[Figure: block diagram. Blocks: System Bus; Bus Unit; 2nd Level Cache; 1st Level Cache (Data); Instruction Fetch Unit; Instruction Decode/IQ; Renamer/Allocator; Scheduler; Execution Unit; Buffers (Retirement); Branch Prediction Unit. Group labels: Front End, Execution Core.]


Intel® Core® Micro-architecture Drill-down

[Figure: detailed block diagram. Blocks: icache; page miss handler; branch prediction; predecode unit; instruction queue; instruction decode; MS; register alias table; ALLOC; Reservation Station; Re-Order Buffer; execution units (store address, integer / FP / SIMD (3x), load, store data); data cache unit; memory order buffer.]


Example Code to Be Used


addps xmm0, [EAX+16]
mulps xmm0, xmm0
movps [EAX+240], xmm0
cmp EAX, 100000
jge label


Core® Micro-architecture Front End

Instruction preparation before execution
• Instruction Fetch Unit
• Instruction Queue
• Instruction Decode Unit
• Branch Prediction Unit

Intel® Core™ Microarchitecture – Front End

Instruction Fetch Unit

• Prefetches instructions that are likely to be executed
• Caches frequently-used instructions
• Predecodes and buffers instructions

[Figure: block diagrams repeated from the overview and drill-down slides]


Instruction Fetch Unit (cont.)

I-Cache (Instruction Cache)
• 32 KBytes / 8-way / 64-byte line
• 16 aligned bytes fetched per cycle
ITLB (Instruction Translation Lookaside Buffer)
• 128 4k pages, 8 2M pages
Instruction Prefetcher
• 16-byte aligned lookup through the ITLB into the instruction cache and instruction prefetch buffers
Instruction Pre-decoder
• Instruction Length Decode (predecode)
• Avoid Length Changing Prefixes (LCP); for example, avoid in a loop:
      MOV dx, 1234h
• The REX (EM64T) prefix (4xH) is not an LCP

Instruction format: Instruction Prefixes (66H/67H) | Opcode | ModR/M | SIB | Displacement | Immediate


Instruction Queue

Buffer between instruction pre-decode unit and decoder
• up to six predecoded instructions written per cycle
• 18 instructions contained in IQ
• up to 5 instructions read from IQ
Potential loop cache
Loop Stream Detector (LSD) support
• Re-use of decoded instructions
• Potential power saving

[Figure: block diagrams repeated from the overview and drill-down slides]

Instruction Decode

• Decode the instructions into micro-ops
• Ready for the execution in OOO core

[Figure: block diagrams repeated from the overview and drill-down slides]

Instruction Decode

Decoders
Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking


Instruction Decode / Decoders

Instructions converted to micro-ops (uops)
• 1-uop includes load+op, stores, indirect jump, RET...
4 decoders: 1 “large” and 3 “small”
• All decoders handle “simple” 1-uop instructions
• One large decoder handles instructions up to 4 uops
All decoders work in parallel
• Four(+) instructions / cycle
Micro-Sequencer takes over for long flows (instructions containing 2~4 uops are handled by the large decoder; the uCode ROM handles more complex ones)


Code Sequence in Front End

Instruction Queue (IQ) contents:
    addps xmm0, [EAX+16]
    mulps xmm0, xmm0
    movps [EAX+240], xmm0
    cmp EAX, 100000
    jne label

• These instructions took more than one fetch, as they are 22 bytes; the IQ buffers them together
• All instructions are decodable by all decoders: large (dec0) and small (dec1, dec2, dec3)
• CMP and adjacent JCC are “fused” into a single uop
• Up to 5 instructions decoded per cycle

Decoded uops:
    load_add xmm0, xmm0, [EAX+16]
    mulps xmm0, xmm0, xmm0
    sta_std [EAX+240], xmm0
    cmpjne EAX, 100000, label

Instruction Decode / Macro-Fusion

• Roughly 15% of all instructions are conditional branches.
• Macro-fusion merges two instructions into a single micro-op, as if the two instructions were a single long instruction.
• Enhanced Arithmetic Logic Unit (ALU) for macro-fusion. Each macro-fused instruction executes with a single dispatch.
• Not supported in EM64T long mode

[Figure: the Scheduler dispatches the fused uop "cmpjae eax, [mem], label" to Execution (Branch Eval); flags and target go to Write back]


Instruction Decode / Macro-Fusion Absent

• Read four instructions from Instruction Queue
• Each instruction gets decoded into separate uops

Enabling Example

    for (int i=0; i<100000; i++) {
        …
    }

Instruction Queue
    addps xmm0, [EAX+16]
    mulps xmm0, xmm0
    movps [EAX+240], xmm0
    cmp eax, 100000
    jge label

Decode
    Cycle 1   addps xmm0, [EAX+16]     dec0
              mulps xmm0, xmm0         dec1
              movps [EAX+240], xmm0    dec2
              cmp eax, 100000          dec3
    Cycle 2   jge label                dec0

Instruction Decode / Macro-Fusion Presented

• Read five instructions from Instruction Queue
• Send fusable pair to single decoder
• Single uop represents two instructions

Enabling Example
    for (unsigned int i=0; i<100000; i++) {
        …
    }

Instruction Queue
    addps xmm0, [EAX+16]
    mulps xmm0, xmm0
    movps [EAX+240], xmm0
    cmp eax, 100000
    jae label

Cycle 1
    addps xmm0, [EAX+16]         dec0
    mulps xmm0, xmm0             dec1
    movps [EAX+240], xmm0        dec2
    cmpjae eax, 100000, label    dec3
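
The two enabling examples differ only in the signedness of the loop counter: the signed loop is shown closing with cmp + jge (two uops), the unsigned loop with cmp + jae (one fused uop). A minimal sketch of the two source forms follows. The function names are hypothetical, and whether a given compiler actually emits these exact instruction pairs depends on the compiler and its flags, which the slides do not state.

```c
#include <assert.h>

/* Signed counter: the slides show this loop form closing with
   cmp + jge, a pair that is decoded into two separate uops. */
float sum_signed(const float *a, int n)
{
    float s = 0.0f;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unsigned counter: the slides show this loop form closing with
   cmp + jae, the pair that macro-fusion merges into one uop. */
float sum_unsigned(const float *a, unsigned int n)
{
    float s = 0.0f;
    for (unsigned int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions compute the same result; only the compare-and-branch pair at the loop end is expected to differ.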

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Macro – Fusion (cont.)

Benefits
• Reduces latency
• Increased renaming
• Increased retire bandwidth
• Increased virtual storage
• Power savings

Enabling Greater Performance & Efficiency

Intel® Core™ Microarchitecture – Front End

Instruction Decode

Decoders
Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Micro-Op Fusion

Frequent pairs of micro-operations derived from the same macro instruction can be fused into a single micro-operation

Micro-op fusion effectively widens the pipeline

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Micro-Fusion (cont.)

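u-ops of a Store “movps [EAX+240], xmm0”

Without micro-fusion, two u-ops:
    sta eax+240               (store address)
    std xmm0, [eax+240]       (store data)
With micro-fusion, a single fused u-op:
    st xmm0, [eax+240]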

Intel® Core™ Microarchitecture – Front End

Instruction Decode

Decoders
Features
• Macro-fusion
• Micro-fusion
• Stack Pointer Tracking

Intel® Core™ Microarchitecture – Front End

Instruction Decode / Stack Pointer Tracker (Extended Stack Pointer folding)

ESP is calculated by dedicated logic
• No explicit Micro-Ops updating ESP
• Micro-Ops saving
• Power saving

[Diagram: PUSH EAX, PUSH EDX and POP EBX pass through Decoder 0, Decoder 1, … Decoder N; the tracker holds the stack-pointer delta (ESPd=8 in the example) together with recovery information]
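
The folding idea can be sketched as a small software model: the decoder keeps a running ESP delta, so each PUSH or POP addresses memory at ESP + delta without issuing a separate uop to update ESP. The type and function names below are hypothetical, and the sign convention (negative delta after a PUSH, 4-byte operands) is an assumption. The slides only show the magnitude, ESPd=8 after two pushes.

```c
#include <assert.h>

typedef enum { OP_PUSH, OP_POP } stack_op;

typedef struct {
    int delta;       /* accumulated ESP offset held by the tracker, in bytes */
    int uops_saved;  /* ESP-update uops that never had to be issued */
} sp_tracker;

void sp_tracker_init(sp_tracker *t)
{
    t->delta = 0;
    t->uops_saved = 0;
}

/* Models decoding one stack instruction. Returns the offset, relative to
   the architectural ESP, at which the instruction accesses memory. */
int sp_tracker_decode(sp_tracker *t, stack_op op)
{
    int access_offset;
    assert(op == OP_PUSH || op == OP_POP);
    if (op == OP_PUSH) {
        t->delta -= 4;             /* PUSH pre-decrements the stack pointer */
        access_offset = t->delta;  /* store goes to [ESP + delta] */
    } else {
        access_offset = t->delta;  /* load comes from [ESP + delta] */
        t->delta += 4;             /* POP post-increments the stack pointer */
    }
    t->uops_saved++;               /* no explicit ESP-update uop was needed */
    return access_offset;
}
```

In this model, PUSH EAX, PUSH EDX, POP EBX leaves a delta of -4 and saves three ESP-update uops.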


Core® Micro-architecture Front End

Instruction Fetch Unit


Instruction Queue
Instruction Decode Unit
Branch Prediction Unit

Intel® Core™ Microarchitecture – Front End

Branch Prediction Unit

Allow executing instructions long before the branch outcome is decided
• Superset of Prescott / Pentium-M features
• One taken branch every other clock
• Branch predictions for 32 bytes at a time, twice the width of the fetch engine

[Diagram: icache → predecode unit → instruction queue → instruction decode, steered by the branch prediction unit. Overview: 2nd Level Cache, 1st Level Cache (Data), Front End (Instruction Fetch Unit, IQ/Decode, BTBs/Branch Prediction, MS), Execution Core (Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement))]
Intel® Core™ Microarchitecture – Front End

Branch Prediction Unit (cont.)

16-entry Return Stack Buffer (RSB)


Front end queuing of BPU lookups
Types of predictions
• Direct Calls and Jumps
• Indirect Calls and Jumps
• Conditional branches
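
A return stack buffer can be pictured as a small circular stack: a call pushes its return address, and a return pops the most recent one as the predicted target. The sketch below models a 16-entry buffer. The names are hypothetical, and the overflow policy (the oldest entry is overwritten) is an assumption, since the slides give only the entry count.

```c
#include <assert.h>
#include <stdint.h>

#define RSB_ENTRIES 16

typedef struct {
    uint32_t addr[RSB_ENTRIES];
    unsigned top;    /* index of the next free slot, wraps around */
    unsigned count;  /* number of valid entries, capped at RSB_ENTRIES */
} rsb;

void rsb_init(rsb *r)
{
    r->top = 0;
    r->count = 0;
}

/* On a call: remember where the matching return should go. */
void rsb_call(rsb *r, uint32_t return_addr)
{
    r->addr[r->top] = return_addr;          /* may overwrite the oldest entry */
    r->top = (r->top + 1) % RSB_ENTRIES;
    if (r->count < RSB_ENTRIES)
        r->count++;
}

/* On a return: pop the predicted target. Returns 1 and writes *predicted,
   or returns 0 when the buffer holds no valid entry. */
int rsb_return(rsb *r, uint32_t *predicted)
{
    assert(predicted != 0);
    if (r->count == 0)
        return 0;
    r->top = (r->top + RSB_ENTRIES - 1) % RSB_ENTRIES;
    r->count--;
    *predicted = r->addr[r->top];
    return 1;
}
```

In this model, call chains deeper than 16 lose the outermost return addresses, so those returns get no prediction from the buffer.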

Intel® Core™ Microarchitecture – Front End

Branch Prediction Improvements

Intel® Pentium® 4 Processor branch prediction PLUS the following two improvements:
• Indirect Branch Predictor
• Loop Detector

Branch mispredictions reduced by >20%


Agenda

Introduction
Knowledge preparation
Notable features
Micro-architecture drill-down
• Front End
• Out-Of-Order Execution Core
• Memory Sub-system
Coding considerations


Core® Micro-architecture Execution Core

Accepts decoded u-ops, assigns resources, executes and retires u-ops
• Renamer
• Reservation station (RS)
• Issue ports
• Execution Unit

[Diagram: register alias table, ALLOC, Reservation Station and Re-Order Buffer feeding the ports for store address, store data, load, and integer / FP / SIMD (3x). Overview: 2nd Level Cache, 1st Level Cache (Data), Front End (Instruction Fetch Unit, IQ/Decode, BTBs/Branch Prediction), Execution Core (Renamer/Allocator, Scheduler, Execution Unit, Buffers (Retirement))]

Intel® Core™ Microarchitecture – Execution Core

Execution Core Building Blocks

[Diagram: Renamer, RS and ROB feed the Execution Unit and the Memory Sub-system. Ports (number): 0,1,5 for the Integer, SIMD Integer, SIMD/Integer MUL and Floating Point units; 2 for Load; 3,4 for Store]
Intel® Core™ Microarchitecture – Execution Core

Rename and Resources

4 uops renamed / retired per clock
• one taken branch, any # of untaken
• one fxch per cycle
Uops written to RS and ROB
• Decoded uops are renamed and allocated resources by the RAT, then sent to ROB read and the RS
• RS waits for sources to arrive, allowing OOO execution
• Registers not “in flight” are read from the ROB during RS write

[Diagram: register alias table, ALLOC, Reservation Station, Re-Order Buffer]
Intel® Core™ Microarchitecture – Execution Core

Issue Ports and Execution Units


6 dispatch ports from RS
• 3 execution ports (shared for integer / fp / simd)
• load
• store (address)
• store (data)
128-bit SSE implementation
• Port 0 has packed multiply (4 cycles SP, 5 DP, pipelined)
• Port 1 has packed add (3 cycles all precisions)
FP data has one additional cycle bypass latency
• Do not mix SSE FP and SSE integer ops on same register

Avoid:                      Better:
    addps xmm0, xmm1            addps xmm0, xmm1
    pand  xmm0, xmm3            addps xmm2, xmm0
    addps xmm2, xmm0            pand  xmm0, xmm3

[Diagram: RS dispatch ports for store address, store data, load, and integer / FP / SIMD (3x)]
Intel® Core™ Microarchitecture – Execution Core

The Out Of Order

each uop only takes a single RS entry


load + add dispatches twice (load, then add)
mulps dispatches once, after load + add writes back
sta + std dispatches twice
sta (address) can fire as early as possible
std must wait for mulps to write back
cmpjne dispatches only once (functionality is truly fused)
no dependency, can fire as early as it wants

cmpjne EAX, 100000, label RS


sta_std [EAX+240], xmm0
mulps xmm0, xmm0, xmm0
load_add xmm0, xmm0, [EAX+16]
Intel® Core™ Microarchitecture – Execution Core

Dispatching to OOO EXE


RS
    cmpjne EAX, 100000, label
    sta_std [EAX+240], xmm0
    mulps xmm0, xmm0, xmm0
    load_add xmm0, xmm0, [EAX+16]

    cmpjne EAX, 100000, label
    sta_std [EAX+244], xmm0
    mulps xmm0, xmm0, xmm0
    load_add xmm0, xmm0, [EAX+16]

    cmpjne EAX, 100000, label
    sta_std [EAX+248], xmm0
    mulps xmm0, xmm0, xmm0
    load_add xmm0, xmm0, [EAX+16]

    cmpjne EAX, 100000, label
    sta_std [EAX+24C], xmm0
    mulps xmm0, xmm0, xmm0
    load_add xmm0, xmm0, [EAX+16]

Dispatch ports
    5  GP (incl jmp)
    4  STD
    3  STA
    2  Load
    1  GP (incl FP add)
    0  GP (incl FP mul)
Intel® Core™ Microarchitecture – Memory Sub-system

Advanced Memory Access

L1D: 3-clock latency, 1-clock throughput; L2: 14-clock latency, 2-clock throughput

Miss Latencies
• L1 miss hits L2 ~ 10 cycles
• L2 miss, access to memory ~300 cycles (server/FBD)
• L2 miss, access to memory ~165 cycles (Desk/DDR2)
• C-step Broadwater is reported to have ~50 ns latency
Cache Bandwidth
• Bandwidth to cache ~ 8.5 bytes/cycle
Memory Bandwidth
• Desktop ~ 6 GB/sec/socket (Linux)
• Server ~3.5 GB/sec/socket
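
To see how these latencies combine, the sketch below computes an average load latency from hit rates. It is a rough illustration only: the function name and the hit rates passed in are hypothetical, and it assumes the 3-cycle and 14-cycle figures are total load-to-use latencies for L1D and L2 hits, with the memory figure (for example ~165 cycles on desktop) taken as the total cost of an L2 miss.

```c
#include <assert.h>

/* Average load latency in cycles for given L1D and L2 hit rates.
   mem_cycles is the assumed total latency of a load served from memory. */
double avg_load_latency(double l1_hit_rate, double l2_hit_rate, double mem_cycles)
{
    const double l1_cycles = 3.0;   /* L1D latency from the slide */
    const double l2_cycles = 14.0;  /* L2 latency from the slide */
    double l1_miss;

    assert(l1_hit_rate >= 0.0 && l1_hit_rate <= 1.0);
    assert(l2_hit_rate >= 0.0 && l2_hit_rate <= 1.0);

    l1_miss = 1.0 - l1_hit_rate;
    return l1_hit_rate * l1_cycles
         + l1_miss * l2_hit_rate * l2_cycles
         + l1_miss * (1.0 - l2_hit_rate) * mem_cycles;
}
```

With hypothetical hit rates of 95% in L1D and 90% in L2, the desktop figure gives roughly 4.3 cycles per load, which shows how quickly a small fraction of memory accesses dominates the average.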


Optimizing for Intel® Core™ Microarchitecture

Use CMP (chip multiprocessing): employ both cores
• Go to multithreading!
Prefer SSE as much as possible. If you have not done so yet, vectorize the code now!
• Intel Compiler has a very good vectorization engine
Align data and keep the data layout sequential
• To align use __declspec(align (16)) float a[1000];
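
The __declspec(align(16)) form is specific to Microsoft-compatible compilers. A minimal sketch of a portable wrapper follows. The macro name ALIGN16 and the helper functions are hypothetical, and the GCC/Clang branch uses their aligned attribute as an assumed equivalent.

```c
#include <assert.h>
#include <stdint.h>

#if defined(_MSC_VER)
#define ALIGN16 __declspec(align(16))           /* form shown on the slide */
#else
#define ALIGN16 __attribute__((aligned(16)))    /* GCC / Clang equivalent */
#endif

/* 16-byte aligned array, suitable for aligned 128-bit SSE loads and stores. */
ALIGN16 float a[1000];

/* Returns nonzero when p sits on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Debug helper: traps on a misaligned pointer before it reaches SSE code. */
float *checked_aligned16(float *p)
{
    assert(is_aligned16(p));
    return p;
}
```

Checking alignment at the point where pointers enter vectorized code catches buffers that were allocated elsewhere without the alignment qualifier.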


Optimizing for Intel® Core™ Microarchitecture (advanced)

Use Intel VTune™ Performance Analyzer to reveal performance problems
• CPI
• Specific CPU events for Core-arch: RESOURCE_STALLS.RS_FULL, L2_IFETCH.SELF.MESI, RESOURCE_STALLS.ROB_FULL etc. (see VTune help)


Front End Issue Debugging

Look for Front End optimization only when code is FE bound
• Reservation station (RS) is the front end and allocation target
• Low RESOURCE_STALLS.RS_FULL and poor CPI should be debugged as a front end issue
• If there are no issues in the FE, the RS should be full more than 30% of the time
Front End typical issues:
• Code is too big to fit in the L1:
    • When L2_IFETCH.SELF.MESI happens every 10-15 instructions
    • Code that could have run at a CPI of 1 will run at around 2
    • 14-cycle penalty for an L1 demand miss
• Average instruction size above 6 bytes
    • Happens typically with SSE code and more with EM64T
    • Can have an impact only in case of otherwise excellent CPI
• Code with length changing prefix (LCP) issues
    • Penalty of 6 cycles or more
    • Look at the ILD_STALL VTune event

Front-End should not be the bottleneck.
Focus on Front End issues only if it is the issue.
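
The triage rule above (poor CPI together with a rarely full RS points at the front end) can be written as a small check over collected counter values. This is a sketch under assumptions: the struct and function names are hypothetical, RESOURCE_STALLS.RS_FULL is treated as a cycle count, and the CPI threshold that counts as "poor" is left to the caller because the slides do not define one.

```c
#include <assert.h>

typedef struct {
    double cycles;          /* unhalted core cycles in the sample */
    double instructions;    /* instructions retired in the sample */
    double rs_full_cycles;  /* RESOURCE_STALLS.RS_FULL, assumed to count cycles */
} perf_sample;

/* Cycles per instruction for the sample. */
double cpi(const perf_sample *s)
{
    assert(s->instructions > 0.0);
    return s->cycles / s->instructions;
}

/* Returns 1 when the sample matches the slide's front-end pattern:
   CPI worse than poor_cpi while the RS is full less than 30% of the time. */
int looks_front_end_bound(const perf_sample *s, double poor_cpi)
{
    double rs_full_fraction;
    assert(s->cycles > 0.0);
    rs_full_fraction = s->rs_full_cycles / s->cycles;
    return cpi(s) > poor_cpi && rs_full_fraction < 0.30;
}
```

A sample that fails this check but still shows poor CPI is a candidate for the execution-side and memory-side analysis on the following slides.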

Execution micro architecture


The busiest port may determine the potential execution speed

Single clock latency operations are best
• Different latency operations can create writeback conflicts, creating a bubble in the port

Look at the dependency chains to see the potential parallelism
• Remember that the RS has only 32 entries and only those instructions are candidates for scheduling to the execution ports
• High RESOURCE_STALLS.RS_FULL percentage if the code is latency bound
• The ROB has 96 entries
• High RESOURCE_STALLS.ROB_FULL percentage only if code has long latency instructions (L2 misses)
    • Other code can be executed while waiting

Execution stage: The key for good performance.
Focus on port utilization and dependency chains
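
One common way to shorten a dependency chain is to split a reduction across several independent accumulators, so that consecutive adds do not wait on each other. The sketch below contrasts the two forms. The function names are hypothetical, the choice of four chains is an illustrative assumption, and floating-point reassociation can change rounding in the general case.

```c
#include <assert.h>
#include <stddef.h>

/* One long chain: every add depends on the result of the previous add. */
float sum_one_chain(const float *a, size_t n)
{
    float s = 0.0f;
    size_t i;
    assert(a != NULL || n == 0);
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent chains: adds from different chains can overlap
   in the out-of-order core while earlier adds are still in flight. */
float sum_four_chains(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    assert(a != NULL || n == 0);
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether the second form is faster on a given build should be confirmed with measurements such as the RESOURCE_STALLS.RS_FULL event mentioned above.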
Execution micro architecture

The Divider is a big potential stall source
• DIV for the number of divide operations executed
• IDLE_DURING_DIV for the number of cycles of no port issue while the divider is busy
• Try to find some useful work to do in parallel with divide operations
Extra cycle latency for bypass between execution domains
• For example: FP (ADDPS) and logical ops (PAND) on XMMn
• DELAYED_BYPASS.FP
• DELAYED_BYPASS.LOAD
• DELAYED_BYPASS.SIMD

[Diagram: EXE units on ports 0,1,5 (Integer, SIMD Integer, SIMD integer / Integer MUL, Floating Point) and the Data Cache Unit (dtlb, memory ordering, store forwarding) with load on port 2, store (address) on port 3, store (data) on port 4]


Enhancements and Optimization Opportunities


IP Prefetcher
• Prefetches strided loads associated with the same IP
• Uses a history table
• Use VTune events to identify misses where prefetches were expected

Memory Disambiguation
• Predicts when it is OK to fire a load before preceding stores with unknown address
• Misprediction triggers a pipeline flush and load restart
• Disambiguation is temporarily disabled if it fails frequently
• LOAD_BLOCK.STA counts loads blocked by a preceding store with unknown address
• If the load and store are not to the same address, a possible reason for disambiguation not working is an address collision with other load(s)


Other Opportunities for Performance Gain in the Memory Sub-system

4K Aliasing
• OOO engine can fire a Load before a preceding Store if it does not collide with the Store’s address
• Address collision serializes execution
• Address checking uses only the last 12 bits (4K)
• False blocking occurs if the Load’s and Store’s addresses are offset by a multiple of 4 KB
    • e.g. accessing large, power-of-two sized arrays in a loop
• Resolve 4K aliasing conflicts by changing memory layout
• VTune event LOAD_BLOCK.OVERLAP_STORE
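
The 12-bit address check can be mirrored in a few lines to test whether two buffers are laid out in a way that risks false blocking. This is a sketch only: the function names are hypothetical, and the padding of one 64-byte cache line is an illustrative choice of layout change, not a value given in the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when two different addresses share their low 12 bits,
   which is the pattern the slide describes as 4K aliasing. */
int may_4k_alias(const void *load_addr, const void *store_addr)
{
    uintptr_t l = (uintptr_t)load_addr;
    uintptr_t s = (uintptr_t)store_addr;
    return l != s && ((l ^ s) & 0xFFFu) == 0;
}

/* Layout fix: when a row or array size is a multiple of 4 KB, add padding
   so that consecutive rows or arrays no longer share their low 12 bits. */
size_t padded_stride(size_t bytes)
{
    assert(bytes > 0);
    return (bytes % 4096 == 0) ? bytes + 64 : bytes;
}
```

For example, two arrays placed exactly 4096 bytes apart alias element by element, while a 4160-byte spacing does not.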
Load block cases
• Increase the distance between the store and the dependent load, so that the store data/address is known at the time the load is dispatched
• Store address unknown: LOAD_BLOCK.STA
    • Loads blocked by a preceding store with unknown address
• Store data unknown: LOAD_BLOCK.STD
    • Loads blocked by a preceding store with unknown data
• Loads blocked until retirement: LOAD_BLOCK.UNTIL_RETIRE
    • This includes mainly uncacheable loads and split loads (loads that cross the cache line boundary)
