Debbie Marr
Hyper-Threading Technology Architect
Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results
Hyper-Threading Technology
- Simultaneous Multi-threading: 2 logical processors (LPs) simultaneously share one physical processor's execution resources
- Appears to software as two processors
- Integral to the microarchitecture
- A few small structures duplicated
- Some additional control logic and pointers
Complexity is Large
- Challenged
Managing Resources
- Choices:
  - Partition: half of the resource dedicated to each logical processor
  - Threshold: flexible resource sharing, with a limit on maximum resource usage
  - Full Sharing: flexible resource sharing, with no limit on maximum resource usage
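The three choices above can be sketched as a single admission check. This is a hypothetical illustration, not Intel's hardware logic: `may_allocate`, the LP ids, and the entry counts are all assumptions made for the example.

```python
# Hypothetical sketch (not Intel's hardware logic): the three
# resource-management choices, expressed as an admission check that
# decides whether a logical processor (LP) may claim one more entry
# in a structure with `size` entries in total.

def may_allocate(policy, used, lp, size, threshold=None):
    """True if `lp` may take one more entry under `policy`.

    `used` maps LP id -> entries that LP currently holds.
    """
    if sum(used.values()) >= size:      # structure completely full
        return False
    if policy == "partition":           # half dedicated to each LP
        return used[lp] < size // 2
    if policy == "threshold":           # flexible, but a per-LP cap
        return used[lp] < threshold
    if policy == "full":                # flexible, no per-LP limit
        return True
    raise ValueError(policy)

# 8-entry structure; LP0 holds 4 entries, LP1 holds none:
used = {0: 4, 1: 0}
print(may_allocate("partition", used, 0, 8))               # False: LP0 is at its half
print(may_allocate("threshold", used, 0, 8, threshold=6))  # True: still under the cap
print(may_allocate("full", used, 0, 8))                    # True: no per-LP limit
```

Note how the policies only differ in how far one logical processor may grow when the other is idle: not at all, up to a cap, or up to the whole structure.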
Partitioning
- Half of the resource dedicated to each logical processor
- Good for queues that:
  - Provide buffering to avoid pipeline stalls
  - Allow slip between logical processors
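The "slip" benefit can be shown with a toy model. This is my own illustration, not the actual hardware: an in-order queue drains from its head, so in a shared queue a uop from a stalled logical processor can block the other LP's uops, while partitioned queues let the unstalled LP keep draining.

```python
from collections import deque

# Toy model of "slip" (an assumption-laden sketch, not the real machine):
# an in-order queue drains one uop per cycle from the head, and uops
# belonging to a stalled logical processor cannot leave the queue.

def drain(queue, stalled_lp, cycles):
    """Drain up to one uop per cycle; return how many uops left the queue."""
    done = 0
    for _ in range(cycles):
        if queue and queue[0] != stalled_lp:
            queue.popleft()
            done += 1
        # else: the head uop belongs to the stalled LP, so everything waits
    return done

uops = [0, 1, 0, 1]   # interleaved uops from LP0 and LP1; LP0 is stalled
shared = drain(deque(uops), stalled_lp=0, cycles=4)
partitioned = sum(drain(deque(u for u in uops if u == lp), 0, 4)
                  for lp in (0, 1))
print(shared, partitioned)   # 0 2: LP1 makes progress only when partitioned
```

With a shared queue, LP0's stalled uop at the head blocks all four cycles; with per-LP queues, LP1's two uops drain while LP0 waits.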
Execution Pipeline

[Figure: the Netburst execution pipeline: I-Fetch (IP, Trace Cache) -> Fetch Queue -> Rename -> Uop Queue -> Allocate -> Sched -> Register Read (Registers) -> Execute (L1 D-Cache, Store Buffer) -> Register Write (Registers, ROB) -> Retire Queue. The front end is in-order, the scheduling/execution core is out-of-order, and retirement is in-order.]
[Figure: cycle-by-cycle animation (cycles 0-4) contrasting a shared queue with a partitioned queue as uops from the two logical processors enter and drain.]
Thresholds
- Flexible resource sharing, with a limit on maximum resource usage
- Good for small structures where:
  - Occupancy time is low and predictable
  - Average utilization is low, with occasional high peaks
- Example: the schedulers
  - Throughput is high because of data speculation (get data regardless of cache hit)
  - uOps pass through the scheduler very quickly
  - Schedulers are small, for speed
Schedulers, Queues
- 5 schedulers: MEM, ALU0, ALU1, FP Move, FP/MMX/SSE

[Figure: per-logical-processor memory uop queues (Mem Queue LP0, Mem Queue LP1) feeding the MEM scheduler; the ALU0 scheduler and the others are organized similarly.]
Occupancy

[Figure: scheduler occupancy (0-5 entries) over time, and the % of time (up to 80-100%) spent at each occupancy level.]

Variable partitioning allows a logical processor to use most of a resource when the other doesn't need it.

Copyright 2002 Intel Corporation.
Full Sharing
- Flexible resource sharing, with no limit on maximum resource usage
- Good for large structures where:
  - Working-set sizes are variable
  - Sharing between logical processors is possible
  - It is not possible for one logical processor to starve the other
- Example: the caches
[Figure: relative performance of a shared cache (0.80 to 1.80) for SPEC CPU2000 benchmarks: 255.vortex, 186.crafty, 300.twolf, 189.lucas, 253.perlbmk, 168.wupwise, 200.sixtrack, 181.mcf, 175.vpr, 173.applu, 188.ammp, 187.facerec, 191.fma3d, 252.eon, 171.swim, 179.art, 183.equake, 197.parser, 178.galgel, 172.mgrid, 177.mesa, 256.bzip2, 164.gzip, 301.apsi, 176.gcc, 254.gap, and the average.]
Results are for 2 copies of an application run simultaneously, measured on an Intel Xeon processor, C0 step. Cache-miss statistics use the EMON event: 2nd Level Cache Load Misses Retired.
On average, a shared cache has a 40% better hit rate and 12% better performance for these applications.
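The intuition behind the shared cache's advantage can be sketched with a toy LRU simulation. The access patterns and sizes below are made up for illustration (they are not the measured SPEC data): when the two logical processors have different working-set sizes, a fully shared cache adapts, while a statically split cache makes the larger working set thrash.

```python
from collections import OrderedDict

# Illustrative sketch with invented access patterns: a fully shared
# cache vs. one statically split in half between two logical processors.

class LRUCache:
    def __init__(self, capacity):
        self.cap, self.lines = capacity, OrderedDict()
        self.hits = self.accesses = 0

    def access(self, addr):
        self.accesses += 1
        if addr in self.lines:
            self.hits += 1
            self.lines.move_to_end(addr)        # refresh recency
        else:
            self.lines[addr] = True
            if len(self.lines) > self.cap:
                self.lines.popitem(last=False)  # evict least-recently used

def hit_rate(shared):
    if shared:
        c0 = c1 = LRUCache(8)                # one 8-entry cache, fully shared
    else:
        c0, c1 = LRUCache(4), LRUCache(4)    # statically split in half
    for _ in range(10):                      # interleave the two LPs' accesses
        for a in "ab":                       # LP0: small working set (2 lines)
            c0.access(a)
        for a in "cdefgh":                   # LP1: large working set (6 lines)
            c1.access(a)
    caches = {id(c0): c0, id(c1): c1}.values()   # avoid double-counting if shared
    return sum(c.hits for c in caches) / sum(c.accesses for c in caches)

print(hit_rate(shared=True), hit_rate(shared=False))   # 0.9 0.225
```

Shared, the combined 8-line footprint fits exactly, so only the first pass misses; split, LP1's 6-line working set cycles through a 4-entry half and never hits.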
Server Performance

[Figure: relative performance at 1, 2, and 4 processors with HT Off vs. HT On, for a Transaction Processing workload and an E-Commerce workload. The Hyper-Threading gains shown are 20%, 10%, 14%, 17%, and 20%.]
Multi-tasking
[Figure: Hyper-Threading Technology speedup (1.00 to 1.20) for multi-tasking pairs of applications: Integer (same), Integer (different), FP (same), FP (different), and Integer vs. FP.]
Larger gains can be realized by running dissimilar applications due to different resource requirements
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, see www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
Conclusions
- Hyper-Threading
- Microarchitecture design choices