
UNIT I

FUNDAMENTALS OF COMPUTER DESIGN AND PIPELINING
I. INTRODUCTION
1. Technological Improvements
During the first 25 years of electronic computers, performance improved by about 25% per year. Only after 1970, with the rise of the microprocessor, did both the importance of microprocessors and much higher rates of improvement become clear. A new set of architectures with simple instructions, called RISC (Reduced Instruction Set Computer) architectures, emerged.
[Figure: Growth in processor performance from 1978 to 2006, marked by machines such as the 64-bit Intel Xeon 3.6 GHz, Intel Pentium 4, AMD 1.6 GHz, Intel Pentium III 1.0 GHz, Alpha 0.7 GHz, PowerPC, HP, IBM and MIPS processors.]

3. Physical Implementation and 4. Design Validation.

II. MEASURING AND REPORTING PERFORMANCE
Response time: the time between the start and completion of an event, also referred to as execution time. Throughput: the total amount of work done in a given time. To compare the performance of two computers X and Y, Execution Time_Y / Execution Time_X = n means that X is n times faster than Y. The most straightforward definition of time is wall-clock time, response time or elapsed time, which is the latency to complete a task, including disk accesses, memory accesses, input/output activities and operating system overhead.
Reporting performance results: The guiding principle of reporting performance measurements should be reproducibility. A SPEC (Standard Performance Evaluation Corporation) report contains the actual performance times, in both tabular form and graphs.

III. QUANTITATIVE PRINCIPLES OF COMPUTER DESIGN
1. Take advantage of parallelism: Parallelism is one of the most important methods for improving performance. Being able to expand memory and the number of processors and disks is called scalability, and it is a valuable asset for servers.
2. Principle of locality: Programs tend to reuse data and instructions they have used recently. This applies both to data accesses and to code accesses. There are two types of locality. Temporal locality states that recently accessed items are likely to be accessed again soon. Spatial locality says that items whose addresses are near one another tend to be referenced close together in time.
3. Focus on the common case: The frequent case is often simpler and can be made faster than the infrequent case.

1. Instruction-Level Parallelism (ILP) 2. Thread-Level Parallelism (TLP) 3. Data-Level Parallelism (DLP)
Fundamentals of Computer Design: Computer architecture is the science and art of selecting and interconnecting hardware components to create computers. Computer architecture, or digital computer organization, is a blueprint: a description of the requirements and basic design implementations for the various parts of a computer. It is the conceptual design and fundamental operational structure of a computer system. Computer architecture has three main categories: 1. Instruction Set Architecture (ISA) 2. Microarchitecture and 3. System Design. Once the instruction set architecture and microarchitecture are described, a practical machine needs to be designed; this design process is called the implementation. Its types are, 1. Logic Implementation 2. Circuit Implementation

The overall performance is improved by optimizing for the common case.

4. Amdahl's Law
Amdahl's Law defines the speedup that can be gained by using a particular enhancement. Speedup is the ratio

Speedup = Performance for the entire task using the enhancement / Performance for the entire task without using the enhancement

or, equivalently,

Speedup = Execution time for the entire task without using the enhancement / Execution time for the entire task using the enhancement when possible

Speedup tells us how much faster a task will run using the computer with the enhancement. Amdahl's Law gives us a quick way to find the speedup, which depends on two factors: i. the fraction of the computation time in the original computer that can be converted to take advantage of the enhancement (Fraction_enhanced) and ii. the improvement gained by the enhanced execution mode (Speedup_enhanced). The new execution time is

Execution time_new = Execution time_old x [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

and the overall speedup is the ratio of the execution times:

Speedup_overall = Execution time_old / Execution time_new = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

5. The processor performance equation
All computers are constructed using a clock running at a constant rate. These discrete time events are called ticks, clock ticks, clock periods, clocks, cycles or clock cycles. The clock can be expressed two ways: as a clock period, measured in nanoseconds, or as a clock rate, measured in GHz (the two are reciprocals). The CPU time for a program is then

CPU time = CPU clock cycles for a program x Clock cycle time = CPU clock cycles for a program / Clock rate

The instruction path length, or instruction count (IC), is the number of instructions executed. From it we can calculate the average number of clock cycles per instruction (CPI):

CPI = CPU clock cycles for a program / Instruction count

Clock cycles can therefore be written as IC x CPI, which allows us to use CPI in the execution time:

CPU time = Instruction count x Cycles per instruction x Clock cycle time
         = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle) = Seconds / Program

Processor performance is therefore dependent upon three characteristics: a) clock cycle time (or rate) b) clock cycles per instruction and c) instruction count. When a program contains n instruction classes with counts IC_i and per-class cycle counts CPI_i, the total number of processor clock cycles is

CPU clock cycles = sum over i = 1..n of (IC_i x CPI_i)

so that

Overall CPU time = [sum over i = 1..n of (IC_i x CPI_i)] x Clock cycle time

and

Overall CPI = [sum over i = 1..n of (IC_i x CPI_i)] / Instruction count = sum over i = 1..n of (IC_i / Instruction count) x CPI_i
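As a quick illustration of these formulas, here is a small sketch in C; the instruction mix, CPI values, clock rate and enhancement parameters are made-up numbers chosen only to show how the equations are applied.

#include <stdio.h>

int main(void) {
    /* Processor performance equation: CPU time = IC x CPI x clock cycle time.
       Assume a hypothetical 2 GHz processor and an invented instruction mix. */
    double clock_rate = 2.0e9;                  /* 2 GHz -> cycle time = 0.5 ns  */
    double ic[]  = {500e6, 300e6, 200e6};       /* instruction counts per class  */
    double cpi[] = {1.0, 2.0, 3.0};             /* CPI of each class             */

    double cycles = 0.0, icount = 0.0;
    for (int i = 0; i < 3; i++) {               /* CPU clock cycles = sum IC_i x CPI_i */
        cycles += ic[i] * cpi[i];
        icount += ic[i];
    }
    double overall_cpi = cycles / icount;       /* weighted average CPI          */
    double cpu_time    = cycles / clock_rate;   /* = IC x CPI x cycle time       */
    printf("Overall CPI = %.2f, CPU time = %.4f s\n", overall_cpi, cpu_time);

    /* Amdahl's Law: speed up 40% of the original time by a factor of 10. */
    double f = 0.4, s = 10.0;
    double speedup = 1.0 / ((1.0 - f) + f / s);
    printf("Overall speedup = %.2f\n", speedup); /* about 1.56 */
    return 0;
}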

IV. INSTRUCTION SET PRINCIPLES
1. Classifying Instruction Set Architectures
The instruction set architecture is the machine-language interface visible to the programmer and compiler writer: it includes the instruction set, word size, memory addressing modes, processor registers, and address and data formats. The instruction set architecture defines the items in the computer that are visible through the set of instructions. Its types are: 1. stack architecture 2. accumulator architecture 3. general-purpose register (register-memory) architecture and 4. load-store (register-register) architecture. An instruction set architecture can be classified according to: i) the register model ii) the number of operands for instructions iii) the addressing modes iv) the operations provided in the instruction set v) the type and size of operands vi) control flow instructions and vii) encoding. An architecture may require that data be aligned: i) bytes are always aligned ii) half words (16 bits) are aligned at byte offsets 0, 2, 4, 6, ... iii) words (32 bits) are aligned at byte offsets 0, 4, 8, 12, ... and iv) double words (64 bits) are aligned at byte offsets 0, 8, 16, 24, ... Types of addressing modes are: i) register ii) immediate iii) displacement iv) register indirect v) indexed vi) direct or absolute vii) memory indirect viii) auto-increment ix) auto-decrement and x) scaled. Types and sizes of operands are: i) integer ii) floating point (single precision) iii) character iv) packed decimal, etc.

Types of operations are: i) arithmetic (+, -, * and /) ii) logical (and, or, xor) iii) data transfer (load, store and move) iv) control flow (branch, jump, procedure call, return and traps) v) system (OS calls, virtual memory instructions) vi) floating point (add, subtract, multiply, divide, compare) vii) decimal (add, multiply and divide) viii) string (move, compare and search) and ix) graphics (compression/decompression). Types of control instructions are: i) conditional branches ii) unconditional branches and iii) procedure calls/returns. There are three primary components in an instruction set: i) the operations ii) the data types and iii) the addressing modes.
2. Design Issues
The instruction set architecture should be designed to ease compilation: i) provide enough general-purpose registers to ease register allocation ii) provide regular instruction sets by keeping the operations, data types and addressing modes orthogonal iii) provide primitive constructs rather than trying to map to a high-level language iv) simplify trade-offs among alternatives and v) allow compilers to help make the common case fast. The metrics are: i) orthogonality - no special registers or special cases for any data type or instruction type ii) completeness - support for a wide range of operations and target applications iii) regularity - no overloading of the meanings of instruction fields and iv) streamlined design - resource needs are easily determined and trade-offs are simplified. Further criteria are i) ease of compilation ii) ease of implementation and iii) scalability.

V. PIPELINING
1. Basic Concepts
Pipelining is the key implementation technique used to make fast CPUs. A pipeline is like an assembly line: it has several steps, and each step is called a pipe stage or a pipe segment. Different stages complete different parts of different instructions in parallel.

[Figure: Simplified view - instructions flow through Fetch and Execute stages and produce a result. Expanded view - while one instruction executes, the next instruction waits at the new address to be fetched.]

The throughput of a pipeline is determined by how often an instruction exits the pipeline. The time required to move an instruction one step down the pipeline is a processor cycle; in a computer, this processor cycle is usually one clock cycle. The pipeline designer's goal is to balance the length of each pipeline stage. Pipelining causes a reduction in the average execution time per instruction. If the starting point is a processor that takes multiple clock cycles per instruction, pipelining is viewed as reducing the CPI; if the starting point is a processor that takes one (long) clock cycle per instruction, pipelining decreases the clock cycle time.

The basics of a RISC instruction set
RISC architectures are characterized by a few key properties: 1. All operations on data apply to data in registers and change the entire register. 2. The only operations that affect memory are load and store operations, which move data from memory to a register or from a register to memory. 3. The instruction formats are few in number, with all instructions being one size.

RISC Instruction Set Implementation
Most RISC architectures have three classes of instructions: 1. ALU instructions 2. load and store instructions and 3. branches and jumps. Every instruction in this RISC subset can be implemented in at most five clock cycles: 1. Instruction Fetch cycle (IF) 2. Instruction Decode / Register Fetch cycle (ID) 3. Execution / Effective Address cycle (EX) 4. Memory Access (MEM) and 5. Write-Back cycle (WB). A simple RISC pipeline overlaps these cycles (F, D, E, M, W) across successive instructions:

Instruction        Clock number:  1   2   3   4   5   6   7   8   9
Instruction i                     F   D   E   M   W
Instruction i+1                       F   D   E   M   W
Instruction i+2                           F   D   E   M   W
Instruction i+3                               F   D   E   M   W
Instruction i+4                                   F   D   E   M   W

Issues in Pipelining
Pipelining does not reduce the execution time of an individual instruction; the increase in instruction throughput means that a program runs faster and has lower total execution time. Pipeline overhead arises from the combination of pipeline register delay and clock skew. Clock skew, the maximum difference in clock arrival times at any two registers, also contributes to the lower limit on the clock cycle.
2. Hazards
In general, a hazard is a situation that poses a potential threat; a hazardous situation that has come to pass is called an incident. In a pipeline, a hazard is a situation that prevents the next instruction in the instruction stream from executing during its designated clock cycle. Hazards are problems with the instruction pipeline in CPU microarchitectures that can potentially result in incorrect computation. There are three types of hazards: i) data hazards ii) structural hazards and iii) control hazards. Hazards reduce the performance from the ideal speedup gained by pipelining. There are several methods used to deal with hazards, including pipeline stalls (pipeline bubbling), register forwarding and, in the case of out-of-order execution, scoreboarding and the Tomasulo algorithm. A hazard occurs when two or more overlapped instructions conflict.
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

or, equivalently,

Speedup from pipelining = [1 / (1 + Pipeline stall cycles per instruction)] x Pipeline depth
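As an illustration (the numbers are made up): for a pipeline of depth 5 that adds an average of 0.25 stall cycles per instruction, the speedup over the unpipelined processor is 5 / (1 + 0.25) = 4.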

iii) Control Hazards: These are also known as branch hazards because they occur with branches. To avoid control hazards a microarchitecture can a) insert a pipeline bubble, which is guaranteed to increase latency, or b) use branch prediction and estimate which instructions to insert.
3. Pipeline Implementation
We focus on a pipeline for an integer subset of MIPS that consists of load-store word, branch equal zero and integer ALU operations. Every MIPS instruction in this subset can be implemented in five clock cycles:
i) Instruction fetch cycle (IF): IR <- Mem[PC]; NPC <- PC + 4. The IR holds the instruction that will be needed on subsequent clock cycles; the register NPC holds the next sequential PC.
ii) Instruction decode / register fetch cycle (ID): decode the instruction and access the register file to read the registers (rs and rt are the register specifiers): A <- Regs[rs]; B <- Regs[rt]; Imm <- sign-extended immediate field of IR.
iii) Execution / effective address cycle (EX): the ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the MIPS instruction type: a) memory reference b) register-register ALU instruction c) register-immediate ALU instruction and d) branch.
iv) Memory access / branch completion cycle (MEM): the PC is updated for all instructions (PC <- NPC), and the cycle completes either a) a memory reference or b) a branch.
v) Write-back cycle (WB): write the result into the register file, whether it comes from the memory system or from the ALU.
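These register transfers can be followed in a tiny, self-contained C sketch for a two-instruction subset (LW and ADD). The word-indexed memory and the simplified encoding are inventions for the example, not a faithful MIPS model.

#include <stdint.h>
#include <stdio.h>

static uint32_t mem[1024];     /* unified instruction/data memory (word addressed) */
static uint32_t regs[32];      /* register file                                    */
static uint32_t PC = 0;        /* program counter (word-indexed for brevity)       */

enum { OP_LW = 0x23, OP_ADD = 0x00 };           /* simplified opcodes        */
#define OPC(ir) ((ir) >> 26)
#define RS(ir)  (((ir) >> 21) & 0x1F)
#define RT(ir)  (((ir) >> 16) & 0x1F)
#define RD(ir)  (((ir) >> 11) & 0x1F)
#define IMM(ir) ((int16_t)((ir) & 0xFFFF))

static void step(void) {
    uint32_t IR  = mem[PC];                      /* IF : IR <- Mem[PC]              */
    uint32_t NPC = PC + 1;                       /*      NPC <- PC + 4 (word index) */
    uint32_t A = regs[RS(IR)], B = regs[RT(IR)]; /* ID : read rs, rt                */
    int32_t  Imm = IMM(IR);                      /*      sign-extend immediate      */
    uint32_t ALUOut = (OPC(IR) == OP_LW)         /* EX : effective address or       */
                      ? A + Imm : A + B;         /*      register-register add      */
    PC = NPC;                                    /* MEM: update PC                  */
    uint32_t LMD = (OPC(IR) == OP_LW) ? mem[ALUOut] : 0;
    if (OPC(IR) == OP_LW) regs[RT(IR)] = LMD;    /* WB : write back the result      */
    else                  regs[RD(IR)] = ALUOut;
}

int main(void) {
    mem[100] = 42;  regs[1] = 100;
    mem[0] = ((uint32_t)OP_LW  << 26) | (1u << 21) | (2u << 16);              /* LW  r2, 0(r1)  */
    mem[1] = ((uint32_t)OP_ADD << 26) | (2u << 21) | (2u << 16) | (3u << 11); /* ADD r3, r2, r2 */
    step(); step();
    printf("r2=%u r3=%u\n", (unsigned)regs[2], (unsigned)regs[3]);  /* expect 42 and 84 */
    return 0;
}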

i) Data Hazards: There are three situations in which a data hazard can occur: 1. read after write (RAW), a true dependence 2. write after read (WAR) and 3. write after write (WAW). Pipelining changes the relative timing of instructions by overlapping their execution; this overlap introduces data hazards and control hazards. Data hazards are sometimes called race hazards, because ignoring them can result in race conditions. A pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared.
ii) Structural Hazards: A structural hazard occurs when a part of the processor's hardware is needed by two or more instructions at the same time. It appears when some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute.
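The three cases can be seen in a short hypothetical register-level sequence, written here as C assignments over variables that stand in for machine registers:

/* Hypothetical register-level sequence illustrating the three data hazards. */
void hazards(void) {
    int r1 = 1, r3 = 2, r5 = 3, r6 = 4, r7 = 5, r8 = 6, r2, r4;
    r2 = r1 + r3;   /* I1: writes r2                                            */
    r4 = r2 + r3;   /* I2: reads r2  -> RAW (true dependence) on I1             */
    r3 = r5 + r6;   /* I3: writes r3 -> WAR (antidependence) with I2's read     */
    r3 = r7 + r8;   /* I4: writes r3 -> WAW (output dependence) with I3         */
    (void)r4; (void)r3;
}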

1) A basic pipeline for MIPS: The pipeline registers carry both data and control from one pipeline stage to the next. The registers are labeled with the names of the stages they connect. The fields of the instruction register (IR), which is part of the IF/ID register, are labeled when they are used to supply register names. Any instruction is active in exactly one stage of the pipeline at a time. Multiplexers are used to control the data path.
Implementing the control for the MIPS pipeline: For the MIPS integer pipeline, all the data hazards can be checked during the ID phase of the pipeline. We can determine what forwarding will be needed during ID and set the appropriate controls then. Once a hazard has been detected, the control unit must insert the pipeline stall and prevent the instructions in IF and ID from advancing. When we detect a hazard, we need only change the control portion of the ID/EX pipeline register to all 0s; hazards are detected by comparing some set of pipeline registers.
Dealing with branches in the pipeline: In some processors, branch hazards are even more expensive in clock cycles.
4. Multicycle Operations
The floating-point pipeline allows a longer latency for its operations. The EX cycle may be repeated as many times as needed to complete the operation, and the number of repetitions can vary for different operations. There are four separate functional units in our MIPS implementation: 1) the main integer unit that handles loads and stores, integer ALU operations and branches 2) an FP and integer multiplier 3) an FP adder that handles FP add, subtract and conversion and 4) an FP and integer divider. To describe such a pipeline, we must define both the latency of the functional units and the initiation interval (or repeat interval). Latency is the number of intervening cycles between an instruction that produces a result and an instruction that uses it; the initiation or repeat interval is the number of cycles that must elapse between issuing two operations of a given type. Pipeline stages that take multiple clock cycles, such as the divide unit, are subdivided to show the latency of the pipeline stages. The ID/EX register must be expanded to connect ID to EX, DIV, M1 and A1, giving a pipeline that supports multiple outstanding FP operations. The FP multiplier and FP adder are fully pipelined, with depths of seven and four stages respectively. The pipeline control keeps separate register files for integer and floating-point data, so the pipeline can accept multicycle operations easily.

UNIT II INSTRUCTION LEVEL PARALLELISM WITH DYNAMIC APPROACHES

1. Concepts
Pipelining overlaps the execution of instructions to improve performance; the potential overlap among instructions is called instruction-level parallelism (ILP), since the instructions can be evaluated in parallel. One way to increase instruction-level parallelism is to exploit parallelism among iterations of a loop; this type of parallelism is called loop-level parallelism. Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap.
Data dependences: Dependences are a property of programs. There are three different types of dependences: 1. data dependence 2. name dependence and 3. control dependence. An instruction j is data dependent on instruction i if either of the following holds: i) instruction i produces a result that may be used by instruction j, or ii) instruction j is data dependent on instruction k and instruction k is data dependent on instruction i. If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. A data dependence conveys three things: 1. the possibility of a hazard 2. the order in which results must be calculated and 3. an upper bound on how much parallelism can possibly be exploited. A dependence can be overcome in two different ways: i) maintaining the dependence but avoiding a hazard, or ii) eliminating the dependence by transforming the code.
Name Dependence: A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two types of name dependences.

1. An antidependence between instructions i and j occurs when instruction j writes a register or memory location that instruction i reads. 2. An output dependence occurs when instructions i and j write the same register or memory location. Register renaming removes these name dependences; it can be done either statically by a compiler or dynamically by the hardware.
Control Dependence: A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order. For example, in
if P1 { S1; };
if P2 { S2; }
S1 is control dependent on P1, and S2 is control dependent on P2 but not on P1.
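A sketch of how renaming removes the WAR and WAW cases from the earlier hazard example, using fresh (hypothetical) registers:

/* After renaming, I3 and I4 write fresh registers r9 and r10, so only the
   true (RAW) dependence of I2 on I1 remains and I3/I4 can run in parallel. */
void renamed(void) {
    int r1 = 1, r3 = 2, r5 = 3, r6 = 4, r7 = 5, r8 = 6, r2, r4, r9, r10;
    r2  = r1 + r3;   /* I1                                     */
    r4  = r2 + r3;   /* I2: RAW on I1 (cannot be renamed away) */
    r9  = r5 + r6;   /* I3: was "r3 = ..."; WAR eliminated     */
    r10 = r7 + r8;   /* I4: was "r3 = ..."; WAW eliminated     */
    (void)r4; (void)r9; (void)r10;
}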

2. Dynamic Scheduling
Dynamic scheduling offers several advantages: 1. it simplifies the compiler 2. it allows the processor to tolerate unpredictable delays such as cache misses and 3. it allows code compiled for one pipeline to run efficiently on a different pipeline. Its main cost is a significant increase in hardware complexity, and a dynamically scheduled processor does not change the data flow of the program. To gain the full advantage of dynamic scheduling we will allow the pipeline to issue any combination of two instructions in a clock, using the scheduling hardware to actually assign operations to the integer and floating-point units; the pipeline must be able to complete and commit multiple instructions per clock. A dynamically scheduled pipeline can yield very high performance, but the performance can be limited by the single common data bus (CDB): each common data bus must interact with each reservation station and the associative tag-matching hardware. In Tomasulo's scheme two different techniques are combined: 1. the renaming of the architectural registers to a larger set of registers and 2. the buffering of source operands from the register file. There are three stages in the Tomasulo algorithm: 1. Issue - get an instruction from the FP operation queue: if a reservation station is free (no structural hazard), the control issues the instruction and sends the operands (renaming registers) 2. Execute - operate on the operands (EX): when both operands are ready, execute; if not ready, watch the common data bus for the result 3. Write result - finish execution (WB): write the result on the common data bus to all awaiting units and mark the reservation station available.
Normal data bus: data + destination (the data goes to the destination). Common data bus: data + source (the result comes from the source unit and is broadcast to all units waiting for it).
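A highly simplified sketch of the bookkeeping behind these stages, using hypothetical reservation-station and register-status structures; it is meant only to show where renaming and the CDB broadcast happen, not to be a complete implementation.

#include <stdbool.h>

/* Hypothetical reservation-station state for the Tomasulo stages.
   Stations are numbered 1..7; a tag of 0 means "operand value is ready". */
typedef struct {
    bool busy;
    int  op;            /* operation to perform                           */
    long Vj, Vk;        /* operand values, when available                 */
    int  Qj, Qk;        /* RS numbers producing the operands (0 = ready)  */
} RS;

RS   rs[8];             /* reservation stations (index 0 unused)          */
int  reg_status[32];    /* which RS will write each register (0 = none)   */
long regfile[32];

/* Issue: rename the source registers to RS tags or captured values. */
void issue(int r, int op, int src1, int src2, int dest) {
    rs[r].busy = true; rs[r].op = op;
    if (reg_status[src1]) { rs[r].Qj = reg_status[src1]; } else { rs[r].Vj = regfile[src1]; rs[r].Qj = 0; }
    if (reg_status[src2]) { rs[r].Qk = reg_status[src2]; } else { rs[r].Vk = regfile[src2]; rs[r].Qk = 0; }
    reg_status[dest] = r;                       /* dest will come from RS r */
}

/* Execute: a station may begin only when both of its operands are ready. */
bool can_execute(int r) { return rs[r].busy && rs[r].Qj == 0 && rs[r].Qk == 0; }

/* Write result: broadcast (tag, value) on the CDB to every waiting unit. */
void write_result(int r, long value) {
    for (int i = 1; i < 8; i++) {
        if (rs[i].Qj == r) { rs[i].Vj = value; rs[i].Qj = 0; }
        if (rs[i].Qk == r) { rs[i].Vk = value; rs[i].Qk = 0; }
    }
    for (int d = 0; d < 32; d++)
        if (reg_status[d] == r) { regfile[d] = value; reg_status[d] = 0; }
    rs[r].busy = false;                         /* station becomes available */
}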

3. Hardware-Based Speculation
Hardware-based speculation combines three key ideas: i) dynamic branch prediction, to choose which instructions to execute ii) speculation, to allow the execution of instructions before the control dependences are resolved and iii) dynamic scheduling, to deal with the scheduling of different combinations of basic blocks. Hardware-based speculation follows the predicted flow of data values to choose when to execute instructions; this method of executing programs is essentially a data flow execution. The most common implementation is based on a modification of Tomasulo's algorithm. The idea is to separate the point when an instruction writes its results from the point when those results are committed to the processor state. Instructions can execute out of order, but must commit in order. Results are stored in the reorder buffer (ROB) between instruction completion and commit, and instructions are tracked by the reorder buffer. The ROB holds the result of an instruction between the time the operation associated with the instruction completes and the time the instruction commits, so the ROB is also a source of operands for instructions: it supplies operands in the interval between completion of instruction execution and instruction commit. There are four stages of instruction execution: 1) issue 2) execute 3) write results and 4) commit. Each entry in the ROB contains four fields: 1) the instruction type 2) the destination field 3) the value field and 4) the ready field.

[Figure: The basic structure of a processor with hardware-based speculation - an instruction fetch and decode unit issues instructions to reservation stations in front of the ALU, FP unit, load/store unit and branch unit; results are broadcast on the common data bus (64 bits of data plus 4 bits of reservation-station source address; a unit writes its result when the tag matches the expected RS) to the waiting reservation stations, the register file and the commit unit, which retires instructions in order.]

4. Multiple Issue
Let us assume we want to extend Tomasulo's algorithm to support a two-issue superscalar pipeline with separate integer and floating-point units, each of which can initiate an operation on every clock. Modern superscalar processors that issue four or more instructions per clock often include both approaches: they both pipeline and widen the issue logic.

1) Instruction type: it indicates whether the instruction is a branch, a store, or a register operation. 2) Destination field: it supplies the register number or memory address where the instruction result should be written. 3) Value field: it is used to hold the value of the instruction result until the instruction commits.

4) Ready field: it indicates that the instruction has completed execution and the value is ready. Once an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated.
Advantages of hardware speculation: i) memory reference disambiguation is easier, since addresses are known at run time ii) dynamic branch prediction can be better iii) precise exceptions are maintained iv) no fix-up code is required and v) it works on legacy code. A reorder buffer entry therefore carries the fields: Busy, Instruction, State, Destination and Value.
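These fields map naturally onto a small structure; the following is only an illustrative sketch, with field widths, types and the buffer size chosen arbitrarily.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative reorder buffer entry; types and widths are arbitrary. */
typedef enum { ROB_BRANCH, ROB_STORE, ROB_REG_OP } rob_type_t;

typedef struct {
    bool       busy;         /* entry in use                                   */
    rob_type_t type;         /* 1) instruction type                            */
    uint32_t   destination;  /* 2) register number or memory address to update */
    int64_t    value;        /* 3) result value, held until commit             */
    bool       ready;        /* 4) execution has completed, value is valid     */
} rob_entry_t;

/* The ROB itself is a circular queue; instructions commit from the head
   strictly in program order, and entries are allocated at the tail. */
typedef struct {
    rob_entry_t entry[32];
    int head, tail;
} rob_t;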

Alias analysis by inspection does not make much of a difference, and analysis cannot be perfect at compile time. There are three models of memory alias analysis: a) global/stack perfect - the best compiler-based analysis b) inspection - aliases determined at compile time by inspection and c) none - all memory references are assumed to conflict.
5. Case Studies
Suppose the processor otherwise acts as a perfect model; the remaining limits are: 1) WAW and WAR hazards through memory - register renaming eliminates WAW and WAR hazards through registers, but not through memory 2) unnecessary dependences - certain code generation conventions introduce unneeded dependences and 3) overcoming the data flow limit - if value prediction worked with high accuracy, it could overcome the data flow limit.

6. Limitations of ILP
i) Branch and jump prediction: a) good branch prediction by hardware or software results in a good amount of ILP being exploited b) the accuracy of the branch predictor and the branch frequency dominate the achievable ILP and c) jump prediction further reduces the penalty on indirect jumps.
ii) Unrolling: accumulators are not re-associated, so the available parallelism decreases.
iii) Window size: the set of instructions that is examined for simultaneous execution is called the window; the bigger the window, the better the performance. The total window size is limited by the required storage, the comparisons and a limited issue rate.
iv) Usage of discrete windows: a window full of instructions has less parallelism than a continuous window.
v) Alias analysis and register renaming.

UNIT III
INSTRUCTION LEVEL PARALLELISM WITH SOFTWARE APPROACHES
1. COMPILER TECHNIQUES FOR EXPOSING ILP
1. Basic pipeline scheduling and loop unrolling
A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline. To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction. We look at how the compiler can increase the amount of available ILP by transforming loops; to improve the performance of a loop, we exploit this parallelism for a MIPS pipeline whose latencies are given as a table of the form: instruction producing the result | instruction using the result | latency in clock cycles.

In this case, we can eliminate the data use stalls by creating additional independent instructions within the loop body. We want to use different registers for each iteration, which increases the required number of registers. The gain from scheduling on the unrolled loop is even larger than on the original loop.
2. Summary of loop unrolling and scheduling
Loop unrolling is a simple way to increase the amount of straight-line code available for scheduling. Three kinds of limits bound the gains from loop unrolling: i) a decrease in the amount of overhead amortized with each unroll ii) code size limitations and iii) compiler limitations. For larger loops, the code size growth may be a concern, particularly if it causes an increase in the instruction cache miss rate. Register pressure arises because scheduling code to increase ILP causes the number of live values to increase.
2. STATIC BRANCH PREDICTION
Static branch prediction occurs at compile time and is used to schedule around data hazards. Its benefit depends both on the accuracy of the scheme and on the frequency of conditional branches, and code scheduling such as loop unrolling depends on branch prediction. A more accurate technique is to predict branches on the basis of profile information collected from earlier runs, since an individual branch is often highly biased toward taken or untaken. A small memory indexed by the lower bits of the branch instruction's address is called a branch-prediction buffer or branch-history table; the memory contains a bit that says whether the branch was recently taken or not. Branch predictors that use the behavior of other branches to make a prediction are called correlating predictors or two-level predictors.
3. VLIW (Very Long Instruction Word) Processors
The VLIW is one kind of multiple-issue processor; the goal is to allow multiple instructions to issue in a clock cycle. Most designers choose to implement either a VLIW or a

A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code. Loop unrolling can also be used to improve scheduling: because it eliminates the branch, it allows instructions from different iterations to be scheduled together, as the sketch below shows.
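A sketch in C of a loop and its four-way unrolled form, using the usual scalar-plus-vector example:

/* Original loop: add a scalar to a vector (x[i] = x[i] + s). */
void add_scalar(double *x, double s, int n) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* The same loop unrolled four times: one branch and one index update now
   cover four element operations, and the four adds are independent, so a
   compiler can schedule them to hide the FP add latency. (n is assumed to
   be a multiple of 4 here; a real compiler adds cleanup code otherwise.) */
void add_scalar_unrolled(double *x, double s, int n) {
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
}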

dynamically scheduled superscalar, because a VLIW issues a fixed number of instructions formatted either as one long instruction or as a fixed instruction packet, with the parallelism among instructions made explicit. VLIW processors are scheduled by the compiler, and their advantage is an increased maximum issue rate. Three types of scheduling techniques are considered: 1. local scheduling - loop unrolling generates straight-line code 2. global scheduling - scheduling code across branches and 3. trace scheduling - scheduling along the most likely path (trace) of branch outcomes. A VLIW architecture places multiple instructions in a single word; a VLIW is constructed by the compiler, which places operations that may be executed in parallel in the same word.

An interleaved multithreading VLIW should provide efficiencies similar to those provided by interleaved multithreading on a superscalar architecture, and a blocked multithreaded VLIW should provide efficiencies similar to those provided by blocked multithreading on a superscalar architecture.

[Figure: Issue-slot diagrams comparing a single-threaded scalar processor, a plain VLIW, an interleaved multithreading VLIW and a blocked multithreading VLIW - slots are filled by instructions from threads A, B, C and D, N marks an empty (no-op) slot, and thread switches occur either every cycle (interleaved) or at long-latency events (blocked).]

IV. ADVANCED COMPILER SUPPORT
A compiler is used to implement an instruction set. The first goal of the compiler is correctness; the second goal is speed of the compiled code. Other goals are fast compilation, debugging support and interoperability among languages. A typical compiler is organized as a series of passes, each with its own dependences and function:

Pass                       Dependencies                                      Function
Front end per language     Language dependent; machine independent           Transform the language to a common intermediate form
High-level optimizations   Somewhat language dependent; largely machine      For example, procedure integration (inlining)
                           independent
Global optimizer           Small language dependencies; slight machine       Including global and local optimizations plus
                           dependencies                                      register allocation
Code generator             Highly machine dependent; language independent    Detailed instruction selection and machine-dependent
                                                                             optimizations; may include or be followed by assembler
This structure helps to manage the complexity of the transformations and makes writing a bug-free compiler easier; code optimization is where compiler correctness is hardest to maintain, and the multiple-pass structure helps reduce compiler complexity. Compilers must choose whether to expand procedure calls inline before they know the exact size of the procedure being called; this is an instance of the phase-ordering problem. The register allocator will allocate the temporaries to registers.
1. Classification of optimizations
Optimizations performed by modern compilers can be classified by the style of the transformation: i) high-level optimizations - processor independent ii) local optimizations - within straight-line code (a basic block) iii) global optimizations - across branches iv) register allocation and v) processor-dependent optimizations - dependent on knowledge of the processor. The interaction of high-level languages and compilers affects how programs use an instruction set architecture. Three areas are used to allocate data (variables): i) the stack - used to allocate local variables; values are pushed and popped on the stack ii) the global data area - used to allocate global variables and constants and iii) the heap - used to allocate dynamic objects accessed with pointers. Processor-dependent optimizations are done in the code generator. Some guidelines help an instruction set support compilers that generate efficient and correct code: i) provide regularity

ii) provide primitives, not solutions iii) simplify trade-offs among alternatives and iv) provide instructions that bind quantities known at compile time as constants.
2. Compiler support for multimedia instructions
A vector microprocessor architecture builds the vector register size into the architecture; Intel, for example, expanded its multimedia extensions from 64-bit to 128-bit vectors with the Streaming SIMD Extensions (SSE). A major advantage of vector computers is hiding the latency of memory access. The goal of vector addressing modes is to collect data scattered about memory: vector computers added strided addressing and gather/scatter addressing, which increase the number of programs that can be vectorized. Strided addressing skips a fixed number of words between each access, while plain sequential addressing is called unit-stride addressing. Since the data for multimedia applications are streams that start and end in memory, strided and gather/scatter addressing modes are essential to successful vectorization. A 64-bit-wide vector computer can calculate 8 pixels simultaneously, using 3 vector loads (to get RGB), 3 vector multiplies (to convert R), 6 vector multiply-adds (to convert G and B), 3 vector shifts (to divide by 32,768), 2 vector adds (to add 128) and 3 vector stores (to store YUV).
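As a sketch of the kind of loop these addressing modes target, here is a scalar C version of an RGB-to-luma conversion over interleaved pixel data; the fixed-point coefficients are illustrative, and a vectorizing compiler would map the r/g/b accesses to strided (or gathered) vector loads and the y[] accesses to unit-stride stores.

#include <stdint.h>

/* Scalar sketch of RGB-to-luma conversion over interleaved RGB pixels.
   The r, g and b loads are strided (3 bytes apart per element), while the
   y[] stores are unit stride; the coefficients sum to 256 so >> 8 divides. */
void rgb_to_luma(const uint8_t *rgb, uint8_t *y, int npixels) {
    for (int i = 0; i < npixels; i++) {
        int r = rgb[3 * i + 0];
        int g = rgb[3 * i + 1];
        int b = rgb[3 * i + 2];
        y[i] = (uint8_t)((77 * r + 150 * g + 29 * b) >> 8);
    }
}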

V. HARDWARE SUPPORT FOR EXPOSING MORE PARALLELISM
1. Conditional or predicated instructions
A predicated instruction refers to a condition which is evaluated as part of the instruction's execution. If the condition is true, the instruction is executed normally; if the condition is false, execution continues as if the instruction were a no-op.
2. Compiler speculation with hardware support
i) Hardware support for preserving exception behavior:

There are four methods for supporting speculation without introducing erroneous exception behavior: 1. the hardware and operating system cooperatively ignore exceptions for speculative instructions; this preserves exception behavior for correct programs, but not for incorrect ones 2. speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur 3. a set of status bits, called poison bits, is attached to the result registers written by speculated instructions when those instructions cause exceptions and 4. a mechanism is provided to indicate that an instruction is speculative, and the hardware buffers the instruction result until it is certain that the instruction is no longer speculative.
ii) Hardware support for memory reference speculation: a special instruction to check for address conflicts is left at the original location of the load instruction and acts like a guardian, while the load itself is moved up across one or more stores. When the speculated load is executed, the hardware saves the address of the accessed memory location; if a subsequent store changes that location before the check instruction, the speculation has failed. Speculation failure is handled as follows: if only the load was speculated, the load is redone at the point of the check instruction; if additional instructions that depend on the load were also speculated, they are redone as well.

Speculated instructions may slow down the computation when the prediction is incorrect, and this difference is significant; one consequence is that even statically scheduled processors normally include dynamic branch prediction. Hardware-based speculation maintains a completely precise exception model even for speculated instructions; recent software-based approaches have added special support to allow this as well. Hardware-based speculation does not require the compensation or bookkeeping code needed by ambitious software speculation mechanisms. Compiler-based approaches may benefit from the ability to see further ahead in the code sequence, resulting in better code scheduling than a purely hardware-driven approach. Hardware-based speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture.
VII. CASE STUDY
The purpose of this case study is to demonstrate the interaction of hardware and software factors in producing instruction-level parallel execution and to illustrate the various limits on instruction-level parallelism. The example used to illustrate these interactions is a hash table data structure.

VI. HARDWARE VERSUS SOFTWARE SPECULATION MECHANISMS
To speculate extensively, we must be able to disambiguate memory references; this disambiguation allows us to move loads past stores at run time. Hardware-based speculation works better when hardware-based branch prediction is superior to software-based branch prediction done at compile time.


UNIT IV
MULTIPROCESSORS AND MULTICORE ARCHITECTURES
The use of multiple processors, both to increase performance and to improve availability, dates back to the earliest electronic computers. Multiprocessors fall into four categories: 1. single instruction stream, single data stream (SISD) - this category is the uniprocessor 2. single instruction stream, multiple data streams (SIMD) - each processor has its own data memory, but there is a single instruction memory and control processor 3. multiple instruction streams, single data stream (MISD) and 4. multiple instruction streams, multiple data streams (MIMD) - this category exploits thread-level parallelism. When a single main memory has a symmetric relationship to all processors and a uniform access time from any processor, the multiprocessors are called symmetric (shared-memory) multiprocessors (SMPs), and this style of architecture is sometimes called uniform memory access (UMA): all processors have a uniform latency from memory.

[Figure: The basic structure of a distributed-memory multiprocessor - each node consists of a processor with its caches, local memory and I/O, and the nodes are connected by an interconnection network.]

I. SYMMETRIC AND DISTRIBUTED SHARED MEMORY ARCHITECTURES
Symmetric shared-memory machines support the caching of both shared and private data. Private data are used by a single processor, while shared data are used by multiple processors.
1. Multiprocessor cache coherence
Two different processors can end up with two different values for the same location; this difficulty is referred to as the cache coherence problem. The first aspect of memory system behavior is coherence, which defines the behavior of reads and writes to the same memory location; the second aspect is consistency, which defines the behavior of reads and writes with respect to accesses to other memory locations.
2. Basic schemes for enforcing coherence
In a coherent multiprocessor, the caches provide both migration and replication of shared data items. Coherent caches provide migration, since a data item can be moved to a local cache and used there in a transparent fashion; this migration reduces both the latency to access a shared data item that is allocated remotely and the bandwidth demand on the shared memory. Coherent caches also provide replication for shared data that are being simultaneously read, since the caches

[Figure: The basic structure of a centralized shared-memory multiprocessor - several processors, each with one or more levels of cache, share a single main memory and I/O system.]

The second group consists of multiprocessors with physically distributed memory.

make a copy of the data item in the local cache. Replication reduces both the latency of access and the contention for a read-shared data item. The protocols used to maintain coherence for multiple processors are called cache coherence protocols; the key to implementing a cache coherence protocol is tracking the state of any sharing of a data block. There are two classes of protocols, which use different techniques to track the sharing status: i) directory based - the sharing status of a block of physical memory is kept in just one location, called the directory and ii) snooping - all cache controllers monitor, or snoop on, the broadcast medium to determine whether or not they have a copy of a block.
3. Snooping Protocols
The simple protocol has three states: i) invalid ii) shared and iii) modified. Snooping is implemented over a broadcast medium such as a shared bus. There are two types of snooping protocols: i) write invalidate protocol - a write requires exclusive access, so any copy held by a reading processor must be invalidated; if two processors attempt to write the same data simultaneously, one of them wins the race, causing the other processor's copy to be invalidated and ii) write update (write broadcast) protocol - all the cached copies of a data item are updated when that item is written.
4) Basic implementation techniques
All processors continuously snoop on the bus, watching the addresses, and check whether the address on the bus is in their cache; if so, the corresponding data in the cache are invalidated. If a processor finds that it has a dirty copy of the requested cache block, it provides that cache block in response to the read request and causes the memory access to be aborted. The protocol states determine the type of cache action taken for each request, as summarized below; the cache actions are normal hit, normal miss, replacement and coherence.

Request     Source      State of addressed cache block    Type of cache action
Read hit    Processor   Shared or modified                Normal hit
Read miss   Processor   Invalid                           Normal miss
Read miss   Processor   Shared                            Replacement
Read miss   Processor   Modified                          Coherence

1. Normal hit - read/write the data in the cache 2. Normal miss - place a read/write miss on the bus 3. Replacement - address conflict miss: place a read/write miss on the bus 4. Coherence - place an invalidate on the bus; these operations are called upgrade or ownership misses 5. No action - allow memory to service the read miss.

[Figures: State diagrams of cache transitions based on requests from the CPU and cache transitions based on requests from the bus.]
The state in each node of the diagram represents the state of the selected cache block specified by the processor or bus request. The three states of the protocol are duplicated to represent transitions based on processor requests versus transitions based on bus requests. All of the states in this cache protocol would be needed in a uniprocessor cache, where they would correspond to the invalid, valid (clean) and dirty states. Any transition to the exclusive (modified) state requires an invalidate or write miss to be placed on the bus.
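A compact sketch of these transitions in C, assuming a hypothetical block_t per cache block; the bus_* functions are placeholders standing in for whatever bus actions a real controller would emit.

/* Illustrative MSI transitions for one cache block. */
typedef enum { INVALID, SHARED, MODIFIED } coh_state_t;
typedef struct { coh_state_t state; /* plus tag and data in a real cache */ } block_t;

void bus_read_miss(void);    /* placeholder bus actions, assumed elsewhere */
void bus_write_miss(void);
void bus_invalidate(void);

/* Transitions caused by this processor's own requests. */
void cpu_read(block_t *b) {
    if (b->state == INVALID) {        /* miss: fetch the block, others may share it */
        bus_read_miss();
        b->state = SHARED;
    }                                  /* SHARED or MODIFIED: read hit, no change   */
}

void cpu_write(block_t *b) {
    if (b->state == INVALID) bus_write_miss();   /* fetch with intent to modify      */
    if (b->state == SHARED)  bus_invalidate();   /* upgrade: invalidate other copies */
    b->state = MODIFIED;                          /* this cache now owns the block    */
}

/* Transitions caused by requests snooped from other processors on the bus. */
void snoop_write_miss(block_t *b) { b->state = INVALID; }  /* write back first if MODIFIED */
void snoop_read_miss(block_t *b)  { if (b->state == MODIFIED) b->state = SHARED; /* supply data */ }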

[Figure: A multiprocessor with uniform memory access - several processors, each with one or more levels of cache, connected to multiple memory modules through an interconnection network.]

6. Implementing snoopy cache coherence
In a system without a bus, two processors that attempt to write the same block at the same time - a situation called a race - must be strictly ordered: one write is processed and completed before the next begins. In a snoopy system, ensuring that a race has only one winner is accomplished by using broadcast for all misses, together with some basic properties of the interconnection network.
II. Performance of symmetric shared-memory multiprocessors
The overall cache performance is a combination of the behavior of uniprocessor cache miss traffic and the traffic caused by communication. The misses that arise from interprocessor communication are called coherence misses; they have two sources: i) true sharing misses - the word accessed is actually shared, so the block must be transferred and ii) false sharing misses - the block is shared, but the particular word accessed is not actually shared.
1) A commercial workload: Each processor is a four-issue processor with a three-level cache hierarchy: L1 consists of a pair of 8 KB direct-mapped on-chip caches, one for instructions and one for data; L2 is a 96 KB on-chip unified 3-way set-associative cache; and L3 is an off-chip, combined, direct-mapped 2 MB cache with 64-byte blocks, also using write back.

Any memory block in the shared state is always up to date in memory, which simplifies the implementation. Such designs also use larger on-chip caches to reduce bus utilization.
5. Limitations in symmetric shared-memory multiprocessors and snooping protocols
In the simple case of a bus-based multiprocessor, the bus must support both the coherence traffic and the normal memory traffic arising from the caches. To increase the communication bandwidth between processors and memory, designers have used multiple buses as well as interconnection networks. A snoopy cache coherence protocol can be used without a centralized bus, provided that a broadcast can be performed for every potentially shared cache block.

[Figure: Memory cycles per instruction and the breakdown of execution time (% user mode, % kernel, % CPU idle) for the applications as the processor count increases from 1 to 5.]

The frequency of I/O increases both the kernel time and the idle time.
2) Performance measurements of the commercial workload
The measurement study consists of three applications: i) an OLTP workload ii) a DSS workload and iii) a web index search (AltaVista). These workloads consist of a set of client processes that generate requests and a set of servers that handle them. The measurements cover instruction execution, L2 and L3 cache accesses, memory accesses, the true sharing miss rate, the false sharing miss rate and the capacity miss rate. Increasing the block size tends to decrease the instruction, capacity and true sharing miss rates; increasing the block size from 32 to 256 bytes affects four of the miss rate components: i) the true sharing miss rate ii) the compulsory miss rate iii) the conflict/capacity misses and iv) the false sharing miss rate.
[Figures: Relative execution time for the OLTP, DSS and AltaVista (AV) workloads; memory cycles per instruction as the cache size grows from 1 to 8 MB; misses per 1000 instructions as the block size is varied.]
3. A multiprogramming and OS workload
The multiprogramming workload consists of both user activity and OS activity. The compile phase of this workload is a parallel compilation using eight processors. The workload has three distinct phases: 1) compiling, which involves substantial user and OS activity 2) installing the object files and 3) removing the object files. For the workload measurements, we assume the following memory and I/O systems: i) a level 1 instruction cache ii) a level 1 data cache iii) a level 2 cache iv) main memory and v) a disk system.

[Figure: Breakdown of this workload into user execution, kernel execution, synchronization wait and CPU idle (waiting for I/O), shown both as % of instruction execution and as % of execution time.]

Execution time is broken into four components: i) idle ii) user iii) synchronization and iv) kernel.
4. Performance of the multiprogramming and OS workload
User processes execute more instructions for two reasons: i) the kernel initializes all pages before allocating them and ii) the kernel shares data.

III. Distributed Shared Memory and Directory-Based Coherence


[Figure: A distributed-memory multiprocessor with directories - each node contains a processor with caches, memory, I/O and a directory, and the nodes communicate over an interconnection network.]

A directory keeps the state of every block that may be cached. The alternative to a snoop-based coherence protocol is a directory protocol, which is used to reduce the bandwidth demands of a centralized shared-memory machine. Each directory entry corresponds to a memory block of the same size as an L2 or L3 cache block.
1) Directory-based cache coherence protocols: the basics
As in a snooping protocol, there are two primary operations: i) handling a read miss and ii) handling a write to a shared, clean cache block. To implement these operations, a directory must track the state of each cache block; the states are i) shared ii) uncached and iii) modified. The coherence messages are summarized below:

Message type        Source             Destination        Message contents
Read miss           Local cache        Home directory     P, A
Write miss          Local cache        Home directory     P, A
Invalidate          Local cache        Home directory     A
Invalidate          Home directory     Remote cache       A
Fetch               Home directory     Remote cache       A
Fetch/invalidate    Home directory     Remote cache       A
Data value reply    Home directory     Local cache        D
Data write back     Remote cache       Home directory     A, D

[Figures: Kernel and user miss rates as the data cache size is increased from 32 KB to 256 KB and as the block size is increased from 16 to 128 bytes - data cache size vs block size for the kernel and user components.]

Increasing the data cache size affects the user miss rate more than it affects the kernel miss rate. The kernel misses are broken into three classes: i) compulsory ii) coherence and iii) capacity. Increasing the block size reduces the compulsory miss rate.


For example, an atomic exchange can be built from LL and SC:

try:  MOV  R3, R4      ; move exchange value
      LL   R2, 0(R1)   ; load linked
      SC   R3, 0(R1)   ; store conditional
      BEQZ R3, try     ; branch if store fails
      MOV  R4, R2      ; put loaded value in R4
Here P is the requesting processor number, A is the requested address and D is the data contents. The home node is the node where the memory location and the directory entry of an address reside; the local node is the node where a request originates. The directory must be accessed even when the home node is the local node, since copies may exist in a third node, called a remote node; a remote node may be the same as either the local or the home node. In a directory-based protocol, a message sent to a directory causes two different types of actions: i) updating the directory state and ii) sending additional messages to satisfy the request.
IV. SYNCHRONIZATION ISSUES
1) Hardware Primitives
Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions. Implementing synchronization requires atomic primitives. One basic synchronization operation is the atomic exchange, which interchanges a value in a register for a value in memory. Another synchronization operation is test-and-set, which tests a value and sets it if the value passes the test. Yet another atomic synchronization operation is fetch-and-increment, which returns the value of a memory location and atomically increments it. The hardware cannot allow any other operations between the read and the write. An alternative is to use a pair of instructions: a special load called load linked or load locked (LL) and a special store called store conditional (SC).

These instructions are implemented by keeping track of the address specified in the LL instruction in a register called the link register.
2) Implementing locks using coherence
To implement a spin lock, a processor repeatedly tries to acquire the lock using an atomic exchange operation and tests whether the exchange returned the lock as free; to release the lock, the processor simply stores the value 0 to the lock. Each processor trying to obtain the lock wants the lock variable in an exclusive state in its cache.
V. MODELS OF MEMORY CONSISTENCY
The most straightforward model for memory consistency is called sequential consistency. Memory consistency involves operations among different variables: the two accesses to be ordered are to different memory locations. From the programmer's view sequential consistency is simple, but it has the disadvantage of restricting high-performance implementations. A program is synchronized if all access to shared data is ordered by synchronization operations; a data reference is then ordered by a synchronization operation. Accesses to shared data that are not ordered by synchronization operations are called data races, and another name for synchronized programs is therefore data-race-free. The idea in relaxed consistency models is to allow reads and writes to complete out of order, but to use synchronization operations to enforce ordering. The various relaxed models are classified by which read and write orderings they relax; we specify the orderings by rules of the form X->Y, meaning that operation X must complete before operation Y. Sequential consistency requires maintaining all four possible orderings: i) R->W ii) R->R iii) W->R and iv) W->W.
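A minimal spin lock of the kind described above can be written with C11 atomics, which compile down to the machine's atomic exchange or an LL/SC retry loop; the acquire and release orderings used here are exactly the kind of synchronization-enforced ordering that relaxed consistency models rely on. This is only an illustrative sketch, with no backoff or fairness.

#include <stdatomic.h>

/* Minimal spin lock built on an atomic exchange (0 = free, 1 = held). */
static atomic_int lock = 0;

void acquire(void) {
    /* Spin on a plain read first so waiting processors keep the block in the
       Shared state and generate no bus traffic, then try the atomic exchange. */
    for (;;) {
        while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
            ;                                   /* spin locally in the cache */
        if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
            return;                             /* exchange returned "free"  */
    }
}

void release(void) {
    atomic_store_explicit(&lock, 0, memory_order_release);  /* store 0 to free the lock */
}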

The relaxed models are defined by which orderings they relax: i) relaxing the W->R ordering yields a model known as total store ordering or processor consistency ii) relaxing the W->W ordering yields a model known as partial store order and iii) relaxing the R->W and R->R orderings yields a variety of models including weak ordering, the PowerPC consistency model and release consistency.
VI. SOFTWARE AND HARDWARE MULTITHREADING
Multithreaded computers have hardware support to execute multiple threads; the threads share the resources of a single core (the computing units, the CPU caches and the translation lookaside buffer, TLB). The aim is to increase the utilization of a single core by exploiting thread-level parallelism as well as ILP. The types are 1) blocked multithreading 2) interleaved multithreading and 3) simultaneous multithreading.
1) Software Multithreading: Each thread is analyzed using the thread-frames/nodes model. This implementation comes from three domains where code merging takes place to increase the available parallelism: the first is the microarchitecture domain, where TLP is used as a means to fill up the processor resources; the second is the compiler domain, where automated compiler techniques expand the scope of scheduling beyond the basic block; and finally there are source code transformations which merge code from different threads.
[Figure: In software multithreading, code from several threads (Threads 1, 2, 3) is merged by software before it reaches the CPU's functional units (FUs).]
The thread mixing takes place at compile time and the resulting code runs on a standard processor.
2) Hardware Multithreading
In hardware multithreading, the thread mixing takes place at run time on a hardware unit. Chip multithreading allows multiple hardware threads of execution (also known as strands) on the same chip, through multiple cores per chip, multiple threads per core, or a combination of both. The techniques of hardware multithreading are: 1) multiple cores per chip and 2) multiple threads per core, which has two types: i) vertical multithreading and ii) horizontal multithreading. Vertical multithreading - instructions can be issued only from a single thread in any given CPU cycle; it has two variants: i) interleaved multithreading and ii) blocked multithreading. Interleaved multithreading - instructions of a different thread are fetched and fed into the execution pipeline at each processor cycle, so a context switch occurs on every CPU cycle.

[Figure: Issue-slot diagrams comparing a single-threaded superscalar, an interleaved multithreading superscalar and a blocked multithreading superscalar - slots are filled by instructions from threads A, B, C and D, N marks an empty slot, and the arrows mark thread switches.]

Blocked multithreading - the instructions of a single thread are executed until an event occurs in the current execution thread that may cause latency; this delay event induces a context switch. Horizontal multithreading - instructions can be issued from multiple threads in any given cycle; this is called simultaneous multithreading (SMT), since instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor.
3. SMT
SMT means simultaneous multithreading. It is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading.
[Figure: Issue-slot utilization per cycle for one thread versus two threads on a machine with 8 units (M - load/store, FX - fixed point, FP - floating point, BR - branch, CC - condition codes).]

[Figure: Simplified SMT pipeline - a PC and rename table per thread, with shared fetch, decode & rename, reorder buffer (ROB), physical register file, execution units (ALUs, branch units, load/store) and commit stage.]
It permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. SMT is one of the two main implementations of multithreading, the other form being temporal multithreading. In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time; in simultaneous multithreading, instructions from more than one thread can be executing in any given pipeline stage at a time. The benefits of SMT are hiding memory latency, increasing efficiency and increasing the throughput of computation per amount of hardware used.

4. CMP - Chip-Level Multiprocessing (Multicore Microprocessor)
CMP integrates two or more processors into one chip, each executing threads independently (for example, the AMD Opteron). The chip has two complete processor cores, sharing a bus to memory.
[Figure: Dual-core AMD Opteron - two CPU cores, each with a 1 MB L2 cache, share a system request interface, a crossbar switch, a memory controller and HyperTransport (HT) links.]

The second core is a complete duplication of the first, including its pipeline and caches. Software threads are scheduled onto the processor cores by the operating system; at least two threads are required to keep both cores busy. A multicore processor is a single computing component with two or more independent actual processors (called cores), which are the units that read and execute program instructions. The types are: 1. dual-core processors with 2 cores (AMD Phenom II X2 and Intel Core Duo) 2. quad-core processors with 4 cores (AMD Phenom II X4 and Intel's 2010 Core line) 3. hexa-core processors with 6 cores and 4. octa-core processors with 8 cores. A multicore processor implements multiprocessing in a single physical package; typical application areas are embedded systems, networking, digital signal processing and graphics.
5. Design Issues
The processor may sacrifice some throughput when the preferred thread stalls; a larger register file is needed to hold multiple contexts; more instructions need to be considered for issue each cycle; it is not easy to identify which instructions to commit, although this does not affect the clock cycle time; and some designs are limited to accessing contiguous memory locations.

The coherence states are denoted M, S and I for Modified, Shared and Invalid.

[Table: Snooping coherence latencies for Implementation 1 and Implementation 2, expressed in terms of the parameters N_Memory, N_Cache, N_Invalidate and N_Writeback.]

[Table: Directory coherence latencies per implementation for the actions send_msg, send_data, rcv_data, read_memory, write_memory, inv, ack, req_msg and data_msg.]
VII. CASE STUDIES This unit of the following concepts is illustrated. These concepts are, 1. Snooping Coherence Protocol Transitions 2. Coherence Protocol Performance 3. Coherence Protocol Optimization and 4. Synchronization

