
MULTITHREADING

The most important measure of performance for a processor is the rate at which it executes
instructions. This can be expressed as

    MIPS rate = f × IPC

where f is the processor clock frequency, in MHz, and IPC (instructions per cycle) is the average number of
instructions executed per cycle. Accordingly, designers have pursued performance on two fronts: increasing
clock frequency and increasing the number of instructions executed or, more properly, the number of
instructions that complete during a processor cycle.
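The relationship above can be checked with a small calculation. The sketch below is illustrative only; the clock frequency and IPC values are assumptions chosen for the example, not figures from the text.

```python
# Instruction execution rate: MIPS rate = f x IPC, where f is the clock
# frequency in MHz and IPC is the average instructions completed per cycle.
# The example values below are illustrative assumptions.

def mips_rate(f_mhz: float, ipc: float) -> float:
    """Return the execution rate in millions of instructions per second."""
    return f_mhz * ipc

# A 2,000 MHz processor completing 1.5 instructions per cycle on average:
print(mips_rate(2000, 1.5))  # -> 3000.0 MIPS
```

The formula makes clear that the two levers named in the text (clock frequency and completed instructions per cycle) contribute multiplicatively.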
IPC can be increased by using an instruction pipeline and then by using multiple parallel instruction
pipelines in a superscalar architecture. With pipelined and multiple-pipeline designs, the principal problem is to
maximize the utilization of each pipeline stage.
To improve throughput, processors execute some instructions in a different order from the way they
occur in the instruction stream and begin execution of instructions that may never be needed. This approach
may be reaching a limit due to complexity and power consumption concerns.
An alternative approach, which allows for a high degree of instruction-level parallelism without
increasing circuit complexity or power consumption, is called multithreading.
The instruction stream is divided into several smaller streams, known as threads, such that the threads can
be executed in parallel.
Implicit and Explicit Multithreading
The concept of thread used in multithreaded processors may or may not be the same as the concept of
software threads in a multiprogrammed operating system.
Process: An instance of a program running on a computer. A process embodies
two key characteristics:
Resource ownership: A process includes a virtual address space to hold the process image;
the process image is the collection of program, data, stack, and attributes that define the
process. From time to time, a process may be allocated control or ownership of resources,
such as main memory, I/O channels, I/O devices, and files.
Scheduling/execution: The execution of a process follows an execution path (trace)
through one or more programs. This execution may be interleaved with that of other
processes. Thus, a process has an execution state (Running, Ready, etc.) and a dispatching
priority and is the entity that is scheduled and dispatched by the operating system.
Process switch: An operation that switches the processor from one process to another, by saving all the
process control data, registers, and other information for the first and replacing them with the process
information for the second.
Thread: A dispatchable unit of work within a process. It includes a processor context (which includes
the program counter and stack pointer) and its own data area for a stack (to enable subroutine branching).
A thread executes sequentially and is interruptible so that the processor can turn to another thread.
Thread switch: The act of switching processor control from one thread to another within the same
process. Typically, this type of switch is much less costly than a process switch.
Thus, a thread is concerned with scheduling and execution, whereas a process is concerned with both
scheduling/execution and resource ownership.
The multiple threads within a process share the same resources. This is why a thread switch is
much less time consuming than a process switch. Traditional operating systems, such as earlier versions of
UNIX, did not support threads. Most modern operating systems, such as Linux, other versions of UNIX,
and Windows, do support threads.
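The resource sharing described above can be illustrated with a small sketch using Python's threading module; this is an illustration of the OS-level concept, not an implementation from the text. Because all threads live inside one process, they read and write the same variable with no copying of address-space state.

```python
# Sketch: threads within one process share the same address space.
# All worker threads increment the same counter, illustrating the
# resource sharing that makes a thread switch cheaper than a process
# switch (no address space or resource ownership to swap).
import threading

counter = 0
lock = threading.Lock()

def worker(increments: int) -> None:
    global counter
    for _ in range(increments):
        with lock:            # shared data requires synchronization
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # -> 4000: every thread updated the same shared variable
```

Separate processes, by contrast, would each see their own private copy of `counter` unless explicit inter-process communication were used.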
A distinction is made between user-level threads, which are visible to the application program, and
kernel-level threads, which are visible only to the operating system. Both of these may be referred to as
explicit threads, defined in software. All of the commercial processors and most of the experimental
processors so far have used explicit multithreading. These systems concurrently execute instructions from
different explicit threads, either by interleaving instructions from different threads on shared pipelines or by
parallel execution on parallel pipelines.
Implicit multithreading refers to the concurrent execution of multiple threads extracted from a
single sequential program. These implicit threads may be defined either statically by the compiler or
dynamically by the hardware.
Approaches to Explicit Multithreading
At minimum, a multithreaded processor must provide a separate program counter for each thread of
execution to be executed concurrently. The designs differ in the amount and type of additional hardware
used to support concurrent thread execution.
In general, instruction fetching takes place on a thread basis. The processor treats each thread
separately and may use a number of techniques for optimizing single-thread execution, including branch
prediction, register renaming, and superscalar techniques. What is achieved is thread-level parallelism,
which may provide for greatly improved performance when married to instruction-level parallelism.
Broadly speaking, there are four principal approaches to multithreading:
• Interleaved multithreading:
This is also known as fine-grained multithreading. The processor deals with two or more thread
contexts at a time, switching from one thread to another at each clock cycle. If a thread is blocked because
of data dependencies or memory latencies, that thread is skipped and a ready thread is executed.
• Blocked multithreading:
This is also known as coarse-grained multithreading. The instructions of a thread are executed
successively until an event occurs that may cause delay, such as a cache miss. This event induces a
switch to another thread. This approach is effective on an in-order processor that would stall the
pipeline for a delay event such as a cache miss.
• Simultaneous multithreading (SMT):
Instructions are simultaneously issued from multiple threads to the execution units of a
superscalar processor. This combines the wide superscalar instruction issue capability with the use of
multiple thread contexts.
• Chip multiprocessing:
In this case, the entire processor is replicated on a single chip and each processor handles
separate threads. The advantage of this approach is that the available logic area on a chip is used
effectively without depending on ever-increasing complexity in pipeline design.
For the first two approaches, instructions from different threads are not executed simultaneously.
Instead, the processor is able to rapidly switch from one thread to another, using a different set of registers and
other context information. This results in a better utilization of the processor's execution resources and avoids
a large penalty due to cache misses and other latency events.
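The difference between the first two approaches can be sketched with a toy cycle-by-cycle scheduler. This model is an illustrative assumption, not a real pipeline: each thread is a list of instructions, and 'M' marks an instruction that causes a stall event such as a cache miss.

```python
# Sketch: cycle-by-cycle traces for interleaved (fine-grained) and
# blocked (coarse-grained) multithreading on a single pipeline.
# 'M' marks an instruction that triggers a stall (e.g., a cache miss).

def interleaved(threads, cycles):
    """Round-robin: switch to the next thread with work every cycle."""
    pc = [0] * len(threads)
    trace = []
    turn = 0
    for _ in range(cycles):
        for k in range(len(threads)):
            t = (turn + k) % len(threads)
            if pc[t] < len(threads[t]):
                trace.append((t, threads[t][pc[t]]))
                pc[t] += 1
                turn = t + 1
                break
        else:
            trace.append(None)  # no ready thread: wasted cycle
    return trace

def blocked(threads, cycles):
    """Run one thread until it would stall, then switch threads."""
    pc = [0] * len(threads)
    trace = []
    cur = 0
    for _ in range(cycles):
        if pc[cur] >= len(threads[cur]):
            cur = (cur + 1) % len(threads)
        if pc[cur] < len(threads[cur]):
            instr = threads[cur][pc[cur]]
            trace.append((cur, instr))
            pc[cur] += 1
            if instr == 'M':        # stall event induces a thread switch
                cur = (cur + 1) % len(threads)
        else:
            trace.append(None)
    return trace

threads = [['A1', 'M', 'A2'], ['B1', 'B2', 'B3']]
print(interleaved(threads, 6))
print(blocked(threads, 6))
```

In the interleaved trace the processor alternates threads every cycle; in the blocked trace thread 0 runs until its stalling instruction, after which thread 1 runs to completion before thread 0 resumes.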
The SMT approach involves true simultaneous execution of instructions from different threads, using
replicated execution resources. Chip multiprocessing also enables simultaneous execution of instructions from
different threads. Figure 1 illustrates some of the possible pipeline architectures that involve multithreading
and contrasts these with approaches that do not use multithreading. Each horizontal row represents the
potential issue slot or slots for a single execution cycle; that is, the width of each row corresponds to the
maximum number of instructions that can be issued in a single clock cycle.
The vertical dimension represents the time sequence of clock cycles. An empty (shaded) slot represents
an unused execution slot in one pipeline. A no-op is indicated by N.
The first three illustrations in Figure 1 show different approaches with a scalar (i.e., single-issue)
processor:
• Single-threaded scalar:
This is the simple pipeline found in traditional RISC and CISC machines, with no
multithreading.
• Interleaved multithreaded scalar:
This is the easiest multithreading approach to implement. By switching from one thread to
another at each clock cycle, the pipeline stages can be kept fully occupied, or close to fully occupied.
The hardware must be capable of switching from one thread context to another between cycles.
Figure 1 Approaches to Executing Multiple Threads
• Blocked multithreaded scalar:
In this case, a single thread is executed until a latency event occurs that would stop the
pipeline, at which time the processor switches to another thread.
Figure 1c shows a situation in which the time to perform a thread switch is one cycle, whereas Figure 1b
shows that thread switching occurs in zero cycles.
In the case of interleaved multithreading, it is assumed that there are no control or data dependencies
between threads, which simplifies the pipeline design and therefore should allow a thread switch with no delay.
However, depending on the specific design and implementation, blocked multithreading may require a clock
cycle to perform a thread switch, as illustrated in Figure 1c.
This is true if a fetched instruction triggers the thread switch and must be discarded from the pipeline.
Although interleaved multithreading appears to offer better processor utilization than blocked multithreading,
it does so at the sacrifice of single-thread performance. The multiple threads compete for cache resources,
which raises the probability of a cache miss for a given thread.
More opportunities for parallel execution are available if the processor can issue multiple instructions
per cycle. Figures 1d through 1i illustrate a number of variations among processors that have hardware for
issuing four instructions per cycle. In all these cases, only instructions from a single thread are issued in a
single cycle.
The following alternatives are illustrated:
• Superscalar: This is the basic superscalar approach with no multithreading. Until relatively recently, this was
the most powerful approach to providing parallelism within a processor. Note that during some cycles, not all
of the available issue slots are used. During these cycles, less than the maximum number of instructions is
issued; this is referred to as horizontal loss. During other instruction cycles, no issue slots are used; these are
cycles when no instructions can be issued; this is referred to as vertical loss.
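The two kinds of loss can be counted mechanically from an issue-slot diagram. The sketch below models each cycle of a 4-issue superscalar as a row of slots (1 = used, 0 = empty); the grid contents are illustrative assumptions.

```python
# Sketch: counting horizontal and vertical losses in an issue-slot grid.
# Each row is one cycle of a 4-issue superscalar; 1 marks a used slot,
# 0 an empty one.

def issue_losses(grid):
    """Return (horizontal_loss_slots, vertical_loss_cycles)."""
    width = len(grid[0])
    horizontal = 0   # empty slots in cycles that issued something
    vertical = 0     # whole cycles in which nothing issued
    for row in grid:
        used = sum(row)
        if used == 0:
            vertical += 1
        else:
            horizontal += width - used
    return horizontal, vertical

grid = [
    [1, 1, 1, 0],   # one slot wasted  -> horizontal loss
    [1, 0, 0, 0],   # three slots wasted -> horizontal loss
    [0, 0, 0, 0],   # nothing issued   -> vertical loss
    [1, 1, 1, 1],   # full issue, no loss
]
print(issue_losses(grid))  # -> (4, 1)
```

Multithreading attacks exactly these two quantities: switching threads fills vertical-loss cycles, while SMT (below) can also fill horizontal-loss slots within a cycle.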
• Interleaved multithreading superscalar: During each cycle, as many instructions as possible are issued
from a single thread. With this technique, potential delays due to thread switches are eliminated, as previously
discussed. However, the number of instructions issued in any given cycle is still limited by dependencies that
exist within any given thread.
• Blocked multithreaded superscalar: Again, instructions from only one thread may be issued during any
cycle, and blocked multithreading is used.
• Very long instruction word (VLIW): A VLIW architecture, such as IA-64, places multiple instructions in a
single word. Typically, a VLIW is constructed by the compiler, which places operations that may be executed
in parallel in the same word. In a simple VLIW machine (Figure 1g), if it is not possible to completely fill the
word with instructions to be issued in parallel, no-ops are used.
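The no-op padding described above can be sketched as a compiler-style packing step. The operation names and groupings below are illustrative assumptions; the point is only that a fixed-width word is filled with 'nop' when too few independent operations are available.

```python
# Sketch: packing groups of independent operations into fixed-width
# VLIW words, padding with no-ops when a word cannot be filled.

WORD_WIDTH = 4

def pack_vliw(parallel_groups):
    """Pack each group of ops (assumed independent) into one word,
    padding the remaining slots with 'nop'."""
    words = []
    for group in parallel_groups:
        word = list(group[:WORD_WIDTH])
        word += ['nop'] * (WORD_WIDTH - len(word))
        words.append(word)
    return words

groups = [['add', 'mul', 'load'], ['store'], ['add', 'sub', 'mul', 'div']]
for w in pack_vliw(groups):
    print(w)
# -> ['add', 'mul', 'load', 'nop']
#    ['store', 'nop', 'nop', 'nop']
#    ['add', 'sub', 'mul', 'div']
```

The padded slots are the VLIW analogue of horizontal loss: issue bandwidth the compiler could not fill.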
• Interleaved multithreading VLIW: This approach should provide similar efficiencies to those provided by
interleaved multithreading on a superscalar architecture.
• Blocked multithreaded VLIW: This approach should provide similar efficiencies to those provided by
blocked multithreading on a superscalar architecture.
The final two approaches illustrated in Figure 1 enable the parallel, simultaneous execution of multiple
threads:
• Simultaneous multithreading: Figure 1j shows a system capable of issuing 8 instructions at a time. If one
thread has a high degree of instruction-level parallelism, it may on some cycles be able to fill all of the horizontal
slots. On other cycles, instructions from two or more threads may be issued. If sufficient threads are active, it
should usually be possible to issue the maximum number of instructions on each cycle, providing a high level
of efficiency.
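This slot-filling behavior can be sketched as a toy issue stage. The model and the per-cycle ready-instruction counts are illustrative assumptions: each cycle, the stage takes as many instructions as possible from the first thread and fills any leftover slots from the remaining threads.

```python
# Sketch: an SMT-style issue stage filling up to ISSUE_WIDTH slots per
# cycle from whichever threads have ready instructions.

ISSUE_WIDTH = 8

def smt_issue(ready_per_cycle):
    """ready_per_cycle: list of per-cycle lists, one ready-instruction
    count per thread. Returns total instructions issued each cycle."""
    issued = []
    for ready in ready_per_cycle:
        slots = ISSUE_WIDTH
        total = 0
        for r in ready:            # drain threads in priority order
            take = min(r, slots)
            total += take
            slots -= take
        issued.append(total)
    return issued

# Three threads whose instruction-level parallelism varies by cycle:
print(smt_issue([[8, 2, 1], [3, 4, 6], [1, 1, 2]]))  # -> [8, 8, 4]
```

When one thread can fill the whole cycle it does so; when it cannot, other threads' instructions occupy the slots that a single-threaded superscalar would waste as horizontal loss.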
• Chip multiprocessor (multicore): Figure 1k shows a chip containing four processors, each of which is a
two-issue superscalar processor. Each processor is assigned a thread, from which it can issue up to two
instructions per cycle. Comparing Figures 1j and 1k, we see that a chip multiprocessor with the same
instruction issue capability as an SMT processor cannot achieve the same degree of instruction-level parallelism. This is
because the chip multiprocessor is not able to hide latencies by issuing instructions from other threads. On the
other hand, the chip multiprocessor should outperform a superscalar processor with the same instruction issue
capability, because the horizontal losses will be greater for the superscalar processor. In addition, it is possible
to use multithreading within each of the processors on a chip multiprocessor, and this is done on some
contemporary machines.
EXAMPLE SYSTEMS
PENTIUM 4: More recent models of the Pentium 4 use a multithreading technique that Intel refers to as
hyperthreading. The Pentium 4 approach is to use SMT with support for two threads. Thus, the single
multithreaded processor is logically two processors.
IBM POWER5: The IBM Power5 chip, which is used in high-end PowerPC products, combines
chip multiprocessing with SMT. The chip has two separate processors, each of which is a
multithreaded processor capable of supporting two threads concurrently using SMT. Interestingly,
the designers simulated various alternatives and found that having two two-way SMT processors on
a single chip provided superior performance to a single four-way SMT processor. The simulations
showed that additional multithreading beyond the support for two threads might decrease
performance because of cache thrashing, as data from one thread displaces data needed by another
thread.
Figure 2 shows the IBM Power5's instruction flow diagram. Only a few of the elements in the
processor need to be replicated, with separate elements dedicated to separate threads. Two program counters
are used. The processor alternates fetching instructions, up to eight at a time, between the two threads. All the
instructions are stored in a common instruction cache and share an instruction translation facility, which does a
partial instruction decode. When a conditional branch is encountered, the branch prediction facility predicts the
direction of the branch and, if possible, calculates the target address. For predicting the target of a subroutine
return, the processor uses a return stack, one for each thread.
Instructions then move into two separate instruction buffers. Then, on the basis of thread priority, a
group of instructions is selected and decoded in parallel. Next, instructions flow through a register-renaming
facility in program order. Logical registers are mapped to physical registers. The Power5 has 120 physical
general-purpose registers and 120 physical floating-point registers. The instructions are then moved into issue
queues. From the issue queues, instructions are issued using simultaneous multithreading. That is, the
processor has a superscalar architecture and can issue instructions from one or both threads in parallel. At the
end of the pipeline, separate thread resources are needed to commit the instructions.

Figure 2 Power5 Instruction Data Flow
