
Hyper-Threading Technology in the Netburst Microarchitecture

Debbie Marr
Hyper-Threading Technology Architect

Intel Corporation August 19, 2002

Copyright 2001 Intel Corporation.

Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results

Copyright 2002 Intel Corporation.



Hyper-Threading Technology
- Simultaneous multi-threading
  - 2 logical processors (LPs) simultaneously share one physical processor's execution resources
- Appears to software as 2 processors (a 2-way shared-memory multiprocessor)
  - A shrink-wrapped operating system schedules software threads/processes onto both logical processors
  - Fully compatible with existing multiprocessor system software and hardware
- Integral part of the Intel Netburst Microarchitecture


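Because the logical processors look like ordinary CPUs to the operating system, no special code is needed to use them. As a minimal illustration (modern Python, not part of the original talk), the OS-reported CPU count includes logical processors, and plain threads are all it takes to put work on both:

```python
import os
import threading

# The OS counts logical processors as CPUs, so a single-core part with
# Hyper-Threading enabled reports 2 here, not 1.
print("logical CPUs visible to the OS:", os.cpu_count())

# A shrink-wrapped OS needs no changes: it schedules ordinary software
# threads, and the hardware interleaves them on the shared core.
def worker(name):
    total = sum(i * i for i in range(100_000))  # stand-in for real work
    print(name, "done:", total)

threads = [threading.Thread(target=worker, args=(f"thread-{i}",))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```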

Intel Processors with Netburst Microarchitecture
- Intel Xeon MP Processor: 256KB 2nd-level cache, 1MB 3rd-level cache, 0.18-micron process
- Intel Xeon Processor: 256KB 2nd-level cache, 0.18-micron process
- Intel Xeon Processor: 512KB 2nd-level cache, 0.13-micron process


Die Size Increase is Small
- Total die area added is small
  - A few small structures duplicated
  - Some additional control logic and pointers


What Was Added
[Diagram of the duplicated per-logical-processor structures:]
- Instruction streaming buffers
- Instruction TLB
- Trace cache next-IP pointers
- Trace cache fill buffers
- Register alias tables


Complexity is Large
- Challenged many basic assumptions
- New microarchitecture algorithms
  - To address new uop (micro-operation) prioritization issues
  - To solve potential new livelock scenarios
- High logic design complexity
- Validation effort
  - Explosion of the validation space

Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results


Managing Resources
- Choices
  - Partition: half of the resource dedicated to each logical processor
  - Threshold: flexible resource sharing, with a limit on maximum resource usage
  - Full sharing: flexible resource sharing, with no limit on maximum resource usage
  - Others (not discussed in this talk)
- Considerations
  - Throughput and fairness
  - Die size and complexity



Partitioning
- Half of the resource dedicated to each logical processor
  - Simple, low complexity
- Good for structures where
  - Occupancy time can be high and unpredictable
  - Average utilization is high
- The major pipeline queues are a good example
  - Provide buffering to avoid pipeline stalls
  - Allow slip between logical processors


Execution Pipeline
[Pipeline diagram: Trace Cache / I-Fetch (next IP) -> Fetch Queue -> Rename (register alias tables) / Allocate -> Uop Queue -> Schedulers -> Register Read (registers) -> Execute (L1 D-Cache, store buffer) -> Register Write -> Retire (ROB) -> Retire Queue]


Execution Pipeline
[Same pipeline diagram, annotated: the front end (fetch through the uop queue) is in-order, scheduling through register write is out-of-order, and retirement is again in-order]


Execution Pipeline
[Same pipeline diagram, highlighting the queues between stages]
- The queues between major pipestages of the pipeline are partitioned



Partitioned Queue Example
- With full sharing, a slow thread can get an unfair share of resources
  - It can prevent a faster thread from making rapid progress


Partitioned Queue Example
- Green thread stalled; yellow thread not stalled
[Cycle-by-cycle animation, cycles 0 through 4, comparing a fully shared queue against a partitioned queue with a maximum of 2 entries per logical processor. Each cycle the stalled green thread inserts uops that it cannot drain. In the shared queue the green entries accumulate until, by cycle 4, they fill every slot and the yellow thread is blocked. In the partitioned queue the green thread stops at its 2-entry limit and the yellow thread keeps flowing.]
- Partitioning the resource ensures fairness and ensures progress for both logical processors
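The slide sequence above can be condensed into a few lines of Python. This is a deliberately simplified model (one queue, one insertion attempt per logical processor per cycle), not the actual hardware, but it reproduces the blocking behavior the animation shows:

```python
def simulate(capacity, per_lp_cap, cycles):
    """Two logical processors feed one queue. LP0 ('green') is stalled and
    never drains its uops; LP1 ('yellow') retires one of its uops per cycle.
    Returns how many uops LP1 retired."""
    queue = []           # each entry is tagged with the LP that owns it
    retired = 0
    for _ in range(cycles):
        for lp in (0, 1):   # each LP tries to insert one uop per cycle
            if len(queue) < capacity and queue.count(lp) < per_lp_cap:
                queue.append(lp)
        if 1 in queue:      # LP1 is not stalled, so one of its uops retires
            queue.remove(1)
            retired += 1
        # LP0 is stalled: its uops stay in the queue indefinitely
    return retired

# Fully shared 4-entry queue: the stalled LP0 gradually fills it, and
# from cycle 4 onward LP1 can no longer insert or retire anything.
print(simulate(capacity=4, per_lp_cap=4, cycles=10))   # → 3
# Partitioned (2 entries per LP): LP1 keeps retiring one uop every cycle.
print(simulate(capacity=4, per_lp_cap=2, cycles=10))   # → 10
```

With full sharing the stalled thread eventually owns every entry and the running thread starves; capping each logical processor at half the entries is exactly the guarantee the partitioned pipeline queues provide.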

Thresholds
- Flexible resource sharing, with a limit on maximum resource usage
- Good for small structures where
  - Occupancy time is low and predictable
  - Average utilization is low, with occasional high peaks
- The schedulers are a good example
  - Throughput is high because of data speculation (get data regardless of cache hit)
  - uops pass through the scheduler very quickly
  - Schedulers are kept small for speed


Schedulers, Queues
- 5 schedulers: MEM, ALU0, ALU1, FP Move, FP/MMX/SSE
[Diagram: per-LP memory uop queues (LP0, LP1) feed the memory uop scheduler through per-LP occupancy counters; the ALU0, ALU1, FP Move, and FP/MMX/SSE schedulers are fed similarly]
- A threshold prevents one logical processor from consuming all entries
- Entries are granted round-robin until a logical processor reaches its threshold

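The counter scheme can be sketched as follows. The entry count, the threshold value, and the single allocation burst (no deallocation) are assumptions made for the example, not the real scheduler parameters:

```python
def allocate(requests, entries, threshold):
    """Grant scheduler entries round-robin between two logical processors,
    refusing an LP once its occupancy counter reaches `threshold` (a sketch
    of the per-LP counter scheme, not the actual hardware)."""
    held = {0: 0, 1: 0}   # per-LP occupancy counters
    lp = 0                # round-robin pointer
    for _ in range(entries):
        # Try the round-robin favorite first, then the other LP.
        for candidate in (lp, 1 - lp):
            if requests[candidate] > 0 and held[candidate] < threshold:
                held[candidate] += 1
                requests[candidate] -= 1
                lp = 1 - candidate   # advance round-robin
                break
    return held

# LP0 floods the scheduler with 12 uops while LP1 asks for only 3:
# round-robin serves LP1 first whenever it wants an entry.
print(allocate({0: 12, 1: 3}, entries=8, threshold=6))   # → {0: 5, 1: 3}
# With no competition at all, the threshold still stops LP0 at 6 of the
# 8 entries, so slots stay free the moment LP1 needs them.
print(allocate({0: 12, 1: 0}, entries=8, threshold=6))   # → {0: 6, 1: 0}
```

The first call shows ordinary round-robin sharing; the second shows why the threshold matters, since a flooding logical processor can never consume every entry.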

Scheduler Occupancy Histogram: Image Composition Workload
[Histogram: % of time (0-40%) vs. % of entries occupied (0-100%) for the alu0, alu1, memory, fp move, and fp/mmx/sse schedulers]
Measurement of an image composition workload on an Intel Xeon Processor



Scheduler Occupancy Histogram: Transaction Processing Workload
[Histogram: % of time (0-100%) vs. % of entries occupied (0-100%) for the alu0, alu1, memory, fp move, and fp/mmx/sse schedulers]
Measurement of a transaction processing workload on a 4P Intel Xeon MP Processor system



Memory Scheduler Occupancy Over Time
[Plot: occupancy (0-8 entries) over time for logical processor 0 and logical processor 1]
Measurement of a transaction processing workload on a 4P Intel Xeon MP Processor system
Variable partitioning allows a logical processor to use most of the resource when the other doesn't need it

Full Sharing
- Flexible resource sharing, with no limit on maximum resource usage
- Good for large structures where
  - Working-set sizes are variable
  - Sharing between logical processors is possible
  - It is not possible for one logical processor to starve the other
- The caches are a good example
  - All caches are shared
    - Better overall performance vs. partitioned caches
    - Some applications share code and/or data
  - High set associativity minimizes conflict misses
    - The level 2 and level 3 caches are 8-way set associative
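The claim that sharing helps when applications share code or data can be illustrated with a toy LRU cache model. All sizes, associativities, and the access pattern below are invented for the example (illustrative only, not measured Netburst behavior):

```python
from collections import OrderedDict

class Cache:
    """Tiny set-associative cache with LRU replacement (illustrative only)."""
    def __init__(self, sets, ways):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways
        self.hits = 0
        self.accesses = 0

    def access(self, line):
        self.accesses += 1
        s = self.sets[line % len(self.sets)]
        if line in s:
            self.hits += 1
            s.move_to_end(line)        # refresh LRU position
        else:
            if len(s) == self.ways:
                s.popitem(last=False)  # evict the least recently used line
            s[line] = True

# Two logical processors repeatedly touching the same 64-line working
# set, e.g. two threads of one process sharing code and data.
trace = [(lp, line) for line in range(64) for lp in (0, 1)] * 4

shared = Cache(sets=16, ways=8)                    # one shared 128-line cache
split = [Cache(sets=8, ways=8) for _ in range(2)]  # 64 lines statically per LP

for lp, line in trace:
    shared.access(line)
    split[lp].access(line)

# Shared: one LP's fill serves the other, so only 64 cold misses in total.
print(shared.hits / shared.accesses)               # → 0.875
# Split: each LP must fetch every line itself, doubling the cold misses.
print(sum(c.hits for c in split) / len(trace))     # → 0.75
```

Each line is fetched once into the shared cache and then serves both logical processors, while the statically split caches each take their own cold misses; this is the effect behind the shared cache's hit-rate advantage on workloads that share data.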

Shared Cache vs. Partitioned Cache
[Chart: shared-cache improvement over a split cache (0.80x to 2.00x scale), in both cache hit rate and performance, for 26 SPEC CPU2000 benchmarks from 255.vortex to 254.gap plus their average; two hit-rate outliers reach 2.2x and 3.1x]
Results are for 2 copies of an application run simultaneously, measured on an Intel Xeon Processor (C0 step). Cache miss statistics use the EMON event "2nd Level Cache Load Misses Retired".
On average, the shared cache has a 40% better hit rate and 12% better performance for these applications.

Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results


Server Performance
[Charts: relative performance with Hyper-Threading off vs. on. Transaction processing workload on 1, 2, and 4 processors and e-commerce workload on 2 and 4 processors, with per-configuration gains between 10% and 20%]
Good performance benefit from a small die-area investment


Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.


Multi-tasking
[Chart: Hyper-Threading Technology speedup, from 1.00 to 1.20, for pairs of applications run together: Integer (same), Integer (different), FP (same), FP (different), and Integer vs. FP]
The Intel Xeon Processor platform is a prototype system
Larger gains can be realized by running dissimilar applications, due to their different resource requirements

Conclusions
- Hyper-Threading Technology is an integral part of the Netburst Microarchitecture
  - Very little additional die area needed
  - Compelling performance
  - Currently enabled for server processors
- Microarchitecture design choices
  - Resource-sharing policy matched to traffic and performance requirements
- A new, challenging microarchitecture direction
  - Continuous improvements in future processors for years to come

