
Hyper-Threading Technology in the Netburst Microarchitecture

Debbie Marr
Hyper-Threading Technology Architect

Intel Corporation August 19, 2002

Copyright 2001 Intel Corporation.

Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results

Copyright 2002 Intel Corporation.



Hyper-Threading Technology
- Simultaneous multi-threading
  - 2 logical processors (LPs) simultaneously share one physical processor's execution resources
- Appears to software as 2 processors (a 2-way shared-memory multiprocessor)
  - A shrink-wrapped operating system schedules software threads/processes onto both logical processors
  - Fully compatible with existing multiprocessor system software and hardware
- Integral part of the Intel Netburst Microarchitecture


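Because the logical processors look like ordinary CPUs to the operating system, no special code is needed to use them. As a minimal illustration (modern Python, not part of the original talk), the OS-reported CPU count includes logical processors, and plain threads are all it takes to put work on both:

```python
import os
import threading

# The OS counts logical processors as CPUs, so a single-core part with
# Hyper-Threading enabled reports 2 here, not 1.
print("logical CPUs visible to the OS:", os.cpu_count())

# A shrink-wrapped OS needs no changes: it schedules ordinary software
# threads, and the hardware interleaves them on the shared core.
def worker(name):
    total = sum(i * i for i in range(100_000))  # stand-in for real work
    print(name, "done:", total)

threads = [threading.Thread(target=worker, args=(f"thread-{i}",))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```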

Intel Processors with Netburst Microarchitecture
- Intel Xeon MP Processor: 256KB 2nd-level cache, 1MB 3rd-level cache, 0.18-micron process
- Intel Xeon Processor: 256KB 2nd-level cache, 0.18-micron process
- Intel Xeon Processor: 512KB 2nd-level cache, 0.13-micron process


Die Size Increase is Small
- Total die area added is small
  - A few small structures duplicated
  - Some additional control logic and pointers


What Was Added
[Diagram of the duplicated per-logical-processor structures:]
- Instruction streaming buffers
- Instruction TLB
- Trace cache next-IP pointers
- Trace cache fill buffers
- Register alias tables


Complexity is Large
- Challenged many basic assumptions
- New microarchitecture algorithms
  - To address new uop (micro-operation) prioritization issues
  - To solve potential new livelock scenarios
- High logic design complexity
- Validation effort
  - Explosion of the validation space

Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results


Managing Resources
- Choices
  - Partition: half of the resource dedicated to each logical processor
  - Threshold: flexible resource sharing, with a limit on maximum resource usage
  - Full sharing: flexible resource sharing, with no limit on maximum resource usage
  - Others (not discussed in this talk)
- Considerations
  - Throughput and fairness
  - Die size and complexity



Partitioning
- Half of the resource dedicated to each logical processor
  - Simple, low complexity
- Good for structures where
  - Occupancy time can be high and unpredictable
  - Average utilization is high
- The major pipeline queues are a good example
  - Provide buffering to avoid pipeline stalls
  - Allow slip between logical processors


Execution Pipeline
[Pipeline diagram: Trace Cache / I-Fetch (next IP) -> Fetch Queue -> Rename (register alias tables) / Allocate -> Uop Queue -> Schedulers -> Register Read (registers) -> Execute (L1 D-Cache, store buffer) -> Register Write -> Retire (ROB) -> Retire Queue]


Execution Pipeline
[Same pipeline diagram, annotated: the front end (fetch through the uop queue) is in-order, scheduling through register write is out-of-order, and retirement is again in-order]


Execution Pipeline
[Same pipeline diagram, highlighting the queues between stages]
- The queues between major pipestages of the pipeline are partitioned



Partitioned Queue Example
- With full sharing, a slow thread can get an unfair share of resources
  - It can prevent a faster thread from making rapid progress


Partitioned Queue Example
- Green thread stalled; yellow thread not stalled
[Cycle-by-cycle animation, cycles 0 through 4, comparing a fully shared queue against a partitioned queue with a maximum of 2 entries per logical processor. Each cycle the stalled green thread inserts uops that it cannot drain. In the shared queue the green entries accumulate until, by cycle 4, they fill every slot and the yellow thread is blocked. In the partitioned queue the green thread stops at its 2-entry limit and the yellow thread keeps flowing.]
- Partitioning the resource ensures fairness and ensures progress for both logical processors
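The slide sequence above can be condensed into a few lines of Python. This is a deliberately simplified model (one queue, one insertion attempt per logical processor per cycle), not the actual hardware, but it reproduces the blocking behavior the animation shows:

```python
def simulate(capacity, per_lp_cap, cycles):
    """Two logical processors feed one queue. LP0 ('green') is stalled and
    never drains its uops; LP1 ('yellow') retires one of its uops per cycle.
    Returns how many uops LP1 retired."""
    queue = []           # each entry is tagged with the LP that owns it
    retired = 0
    for _ in range(cycles):
        for lp in (0, 1):   # each LP tries to insert one uop per cycle
            if len(queue) < capacity and queue.count(lp) < per_lp_cap:
                queue.append(lp)
        if 1 in queue:      # LP1 is not stalled, so one of its uops retires
            queue.remove(1)
            retired += 1
        # LP0 is stalled: its uops stay in the queue indefinitely
    return retired

# Fully shared 4-entry queue: the stalled LP0 gradually fills it, and
# from cycle 4 onward LP1 can no longer insert or retire anything.
print(simulate(capacity=4, per_lp_cap=4, cycles=10))   # → 3
# Partitioned (2 entries per LP): LP1 keeps retiring one uop every cycle.
print(simulate(capacity=4, per_lp_cap=2, cycles=10))   # → 10
```

With full sharing the stalled thread eventually owns every entry and the running thread starves; capping each logical processor at half the entries is exactly the guarantee the partitioned pipeline queues provide.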

Thresholds
- Flexible resource sharing, with a limit on maximum resource usage
- Good for small structures where
  - Occupancy time is low and predictable
  - Average utilization is low, with occasional high peaks
- The schedulers are a good example
  - Throughput is high because of data speculation (get data regardless of cache hit)
  - uops pass through the scheduler very quickly
  - Schedulers are kept small for speed


Schedulers, Queues
- 5 schedulers: MEM, ALU0, ALU1, FP Move, FP/MMX/SSE
[Diagram: per-LP memory uop queues (LP0, LP1) feed the memory uop scheduler through per-LP occupancy counters; the ALU0, ALU1, FP Move, and FP/MMX/SSE schedulers are fed similarly]
- A threshold prevents one logical processor from consuming all entries
- Entries are granted round-robin until a logical processor reaches its threshold

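The counter scheme can be sketched as follows. The entry count, the threshold value, and the single allocation burst (no deallocation) are assumptions made for the example, not the real scheduler parameters:

```python
def allocate(requests, entries, threshold):
    """Grant scheduler entries round-robin between two logical processors,
    refusing an LP once its occupancy counter reaches `threshold` (a sketch
    of the per-LP counter scheme, not the actual hardware)."""
    held = {0: 0, 1: 0}   # per-LP occupancy counters
    lp = 0                # round-robin pointer
    for _ in range(entries):
        # Try the round-robin favorite first, then the other LP.
        for candidate in (lp, 1 - lp):
            if requests[candidate] > 0 and held[candidate] < threshold:
                held[candidate] += 1
                requests[candidate] -= 1
                lp = 1 - candidate   # advance round-robin
                break
    return held

# LP0 floods the scheduler with 12 uops while LP1 asks for only 3:
# round-robin serves LP1 first whenever it wants an entry.
print(allocate({0: 12, 1: 3}, entries=8, threshold=6))   # → {0: 5, 1: 3}
# With no competition at all, the threshold still stops LP0 at 6 of the
# 8 entries, so slots stay free the moment LP1 needs them.
print(allocate({0: 12, 1: 0}, entries=8, threshold=6))   # → {0: 6, 1: 0}
```

The first call shows ordinary round-robin sharing; the second shows why the threshold matters, since a flooding logical processor can never consume every entry.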

Scheduler Occupancy Histogram: Image Composition Workload
[Histogram: % of time (0-40%) vs. % of entries occupied (0-100%) for the alu0, alu1, memory, fp move, and fp/mmx/sse schedulers]
Measurement of an image composition workload on an Intel Xeon Processor



Scheduler Occupancy Histogram: Transaction Processing Workload
[Histogram: % of time (0-100%) vs. % of entries occupied (0-100%) for the alu0, alu1, memory, fp move, and fp/mmx/sse schedulers]
Measurement of a transaction processing workload on a 4P Intel Xeon MP Processor system



Memory Scheduler Occupancy Over Time
[Plot: occupancy (0-8 entries) over time for logical processor 0 and logical processor 1]
Measurement of a transaction processing workload on a 4P Intel Xeon MP Processor system
Variable partitioning allows a logical processor to use most of the resource when the other doesn't need it

Full Sharing
- Flexible resource sharing, with no limit on maximum resource usage
- Good for large structures where
  - Working-set sizes are variable
  - Sharing between logical processors is possible
  - It is not possible for one logical processor to starve the other
- The caches are a good example
  - All caches are shared
    - Better overall performance vs. partitioned caches
    - Some applications share code and/or data
  - High set associativity minimizes conflict misses
    - The level 2 and level 3 caches are 8-way set associative
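The claim that sharing helps when applications share code or data can be illustrated with a toy LRU cache model. All sizes, associativities, and the access pattern below are invented for the example (illustrative only, not measured Netburst behavior):

```python
from collections import OrderedDict

class Cache:
    """Tiny set-associative cache with LRU replacement (illustrative only)."""
    def __init__(self, sets, ways):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways
        self.hits = 0
        self.accesses = 0

    def access(self, line):
        self.accesses += 1
        s = self.sets[line % len(self.sets)]
        if line in s:
            self.hits += 1
            s.move_to_end(line)        # refresh LRU position
        else:
            if len(s) == self.ways:
                s.popitem(last=False)  # evict the least recently used line
            s[line] = True

# Two logical processors repeatedly touching the same 64-line working
# set, e.g. two threads of one process sharing code and data.
trace = [(lp, line) for line in range(64) for lp in (0, 1)] * 4

shared = Cache(sets=16, ways=8)                    # one shared 128-line cache
split = [Cache(sets=8, ways=8) for _ in range(2)]  # 64 lines statically per LP

for lp, line in trace:
    shared.access(line)
    split[lp].access(line)

# Shared: one LP's fill serves the other, so only 64 cold misses in total.
print(shared.hits / shared.accesses)               # → 0.875
# Split: each LP must fetch every line itself, doubling the cold misses.
print(sum(c.hits for c in split) / len(trace))     # → 0.75
```

Each line is fetched once into the shared cache and then serves both logical processors, while the statically split caches each take their own cold misses; this is the effect behind the shared cache's hit-rate advantage on workloads that share data.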

Shared Cache vs. Partitioned Cache
[Chart: shared-cache improvement over a split cache (0.80x to 2.00x scale), in both cache hit rate and performance, for 26 SPEC CPU2000 benchmarks from 255.vortex to 254.gap plus their average; two hit-rate outliers reach 2.2x and 3.1x]
Results are for 2 copies of an application run simultaneously, measured on an Intel Xeon Processor (C0 step). Cache miss statistics use the EMON event "2nd Level Cache Load Misses Retired".
On average, the shared cache has a 40% better hit rate and 12% better performance for these applications.

Agenda
- Hyper-Threading Technology in the Netburst Microarchitecture
- Microarchitecture Choices & Tradeoffs
- Performance Results


Server Performance
[Charts: relative performance with Hyper-Threading off vs. on. Transaction processing workload on 1, 2, and 4 processors and e-commerce workload on 2 and 4 processors, with per-configuration gains between 10% and 20%]
Good performance benefit from a small die-area investment


Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/procs/perf/limits.htm or call (U.S.) 1-800-628-8686 or 1-916-356-3104.


Multi-tasking
[Chart: Hyper-Threading Technology speedup, from 1.00 to 1.20, for pairs of applications run together: Integer (same), Integer (different), FP (same), FP (different), and Integer vs. FP]
The Intel Xeon Processor platform is a prototype system
Larger gains can be realized by running dissimilar applications, due to their different resource requirements

Conclusions
- Hyper-Threading Technology is an integral part of the Netburst Microarchitecture
  - Very little additional die area needed
  - Compelling performance
  - Currently enabled for server processors
- Microarchitecture design choices
  - Resource-sharing policy matched to traffic and performance requirements
- A new, challenging microarchitecture direction
  - Continuous improvements in future processors for years to come

