
1. CISC vs. RISC Architecture:
RISC architectures lend themselves to pipelining more readily than CISC architectures, for several reasons. Because RISC architectures have a smaller instruction set than CISC architectures, the time required to fetch and decode an instruction is predictable; in a CISC pipeline it is not. Variable instruction length in CISC also hinders the fetch and decode stages of a pipeline: a single-byte instruction following an 8-byte instruction must be handled in a way that does not slow down the whole pipeline. In RISC architectures the fetch and decode cycle is more predictable, since most instructions have similar length.

CISC architectures, as their very name implies, also have more complex instructions with complex addressing modes, which makes the whole cycle of processing an instruction more involved. Pipelining requires that the fetch-to-execute cycle be split into stages where each stage does not interfere with the next and each instruction is doing something at each stage. RISC architectures, because of their simplicity and small instruction set, are easy to split into stages; CISC instructions are harder to split, and stages that are important for one CISC instruction may not be required for another.

The rich set of addressing modes available in CISC architectures can cause data hazards once pipelining is introduced. Data hazards that are unlikely to occur in RISC architectures, thanks to the smaller instruction set and the use of load/store instructions for all memory access, become a problem in CISC architectures. CISC instructions that write results back to memory must be handled carefully: the forwarding logic that makes a result written to a register available as input to the next instruction becomes far more complex when memory locations, addressable in various modes, can also be written.
Write-after-read hazards must also be taken care of: a CISC instruction may auto-increment a register early in its stages while the previous instruction still needs the old value at a later stage. This added CISC complexity makes for longer pipelines, to account for the extra decoding and checking. Using the equation from Larson and Davidson's paper "Cost-effective Design of Special-Purpose Processors: A Fast Fourier Transform Case Study" for calculating the optimum number of pipeline stages for a processor, it can be shown that RISC architectures suit smaller pipelines. Keeping the values for instruction stream length and per-stage logic gates fixed, the optimum pipeline length increases with the size of the fetch-execute cycle: when the fetch-execute logic contains a large number of gates, the additional gates required for staging have less relative impact. Since RISC instruction sets are simpler than CISC ones, the number of gates involved in the fetch-execute cycle is far lower than in a CISC architecture, so RISC architectures tend to have smaller optimum pipeline lengths than more general processors. RISC architectures therefore suit pipelining better than CISC architectures and lend themselves to shorter pipelines. This does not mean, however, that CISC architectures cannot gain from pipelining, or that a large number of pipeline stages is bad (although the flushing of a long pipeline becomes a concern). Other features typically found in RISC architectures are:
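The trade-off described above can be sketched with a simple cost/performance model of pipeline depth. This is an illustrative formulation, not necessarily the exact Larson-Davidson equation: assume the un-pipelined logic has total delay T and gate count G, and each pipeline stage adds a latch of delay S and gate cost L; minimizing cost times time-per-instruction, (G + kL)(T/k + S), gives an optimum depth of k = sqrt(GT/LS). All numeric values below are made-up for illustration.

```python
import math

def optimal_stages(T, G, S, L):
    """Optimum pipeline depth under a simple cost/performance model.

    T: total fetch-execute logic delay (gate delays)
    G: total gate count of the un-pipelined logic
    S: latch (flip-flop) delay added per stage
    L: gate cost of one inserted latch stage

    Minimizing cost * time, with cost = G + k*L and time per
    instruction = T/k + S, gives k = sqrt(G*T / (L*S)).
    """
    return math.sqrt((G * T) / (L * S))

# A simple RISC core: fewer gates, shorter fetch-execute logic path...
k_risc = optimal_stages(T=20, G=2_000, S=10, L=400)
# ...versus a CISC core with a longer, larger fetch-execute path.
k_cisc = optimal_stages(T=60, G=20_000, S=10, L=400)

print(f"optimum stages, RISC-like: {k_risc:.1f}")   # 3.2
print(f"optimum stages, CISC-like: {k_cisc:.1f}")   # 17.3
```

With the same latch cost and delay, the larger fetch-execute cycle drives the optimum depth up, matching the claim that simpler RISC logic suits shorter pipelines.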

- Uniform instruction format (fixed-size instructions), using a single word with the opcode in the same bit positions in every instruction, demanding less decoding; instructions are executed in a single clock cycle.
- A much faster execution unit, due to the simple and uniform instructions.
- A large number of general-purpose registers, to avoid storing variables in stack memory.
- Only load and store instructions refer to memory.
- Fewer, simpler instructions rather than complex ones.
- Simple and few addressing modes, to simplify reference to operands.
- Few data types in hardware; some CISCs have byte-string instructions or support complex numbers, which is so far unlikely to be found on a RISC.

2. Von Neumann and Harvard Architecture:


The von Neumann architecture is a design model for a stored-program digital computer that uses a processing unit and a single separate storage structure to hold both instructions and data. A single bus used to transfer both instructions and data leads to the Von Neumann bottleneck, which limits throughput (data transfer rate) between the CPU and memory. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data: the CPU is continuously forced to wait for needed data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between them, the bottleneck has become more of a problem.

Figure 1: Von Neumann architecture

The most obvious characteristic of the Harvard Architecture is that it has physically separate signals and storage for code and data memory. It is possible to access program memory and data memory simultaneously, thereby creating potentially faster throughput and less of a bottleneck. Typically, code (or program) memory is read-only and data memory is read-write. Therefore, it is impossible for program contents to be modified by the program itself. In a computer using the Harvard architecture, the CPU can both read an instruction and perform a data memory access at the same time, even without a cache. A Harvard architecture computer can thus be faster for a given circuit complexity because instruction fetches and data access do not contend for a single memory pathway.

Figure 2: Harvard architecture

2.1 SIMD Processing: Some DSPs have multiple data memories in distinct address spaces to facilitate SIMD and VLIW processing. SIMD exploits data-level parallelism by operating on a small to moderate number of data items in parallel. The true SIMD architecture contains a single control unit (CU) with multiple processing elements (PEs) acting as arithmetic units (AUs). In this situation, the arithmetic units are slaves to the control unit: the AUs cannot fetch or interpret any instructions, they are merely units with capabilities of addition, subtraction, multiplication, and division. Each AU has access only to its own memory, so if an AU needs information contained in a different AU, it must put in a request to the CU, and the CU must manage the transfer of that information. The advantage of this type of architecture is the ease of adding more memory and AUs to the computer.

Figure 3: SIMD processing

The disadvantage can be found in the time wasted by the CU managing all memory exchanges. Also, not all algorithms can be vectorized: a flow-control-heavy task like code parsing, for example, would not benefit from SIMD. Currently, implementing an algorithm with SIMD instructions usually requires human labor; most compilers do not generate SIMD instructions from a typical C program, for instance. Vectorization in compilers is an active area of computer science research. (Compare vector processing.)

SIMD computers require less hardware than MIMD computers (a single control unit). However, SIMD processors are specially designed, and tend to be expensive and to have long design cycles. Not all applications are naturally suited to SIMD processing. Conceptually, MIMD computers cover the SIMD case as well.
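The lockstep organization described above can be sketched in plain Python. This is a toy model of true SIMD, not real vector hardware: a single control unit broadcasts one operation to every processing element, and each PE applies it only to its own private memory.

```python
class ProcessingElement:
    """A slave arithmetic unit (AU) with its own private memory."""
    def __init__(self, data):
        self.memory = list(data)

    def apply(self, op, operand):
        # The PE cannot fetch or interpret instructions; it only
        # executes whatever the control unit broadcasts to it.
        self.memory = [op(x, operand) for x in self.memory]

class ControlUnit:
    """The single CU that broadcasts one instruction to all PEs."""
    def __init__(self, pes):
        self.pes = pes

    def broadcast(self, op, operand):
        for pe in self.pes:        # same instruction, different data
            pe.apply(op, operand)

pes = [ProcessingElement([1, 2]), ProcessingElement([3, 4])]
cu = ControlUnit(pes)
cu.broadcast(lambda x, y: x + y, 10)   # one "add 10" issued to all PEs
print([pe.memory for pe in pes])       # [[11, 12], [13, 14]]
```

Note that any transfer between two PEs' memories would have to go through the CU, which is exactly the coordination overhead mentioned above.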

2.2 MIMD Processing: A MIMD computer has many interconnected processing elements, each of which has its own control unit; see Fig. 4. Each processing element works on its own data with its own instructions. Tasks executed by different processing units can start or finish at different times; they may send results to a central location and may share memory space. They are not lockstepped, as in SIMD computers, but run asynchronously.

Figure 4: MIMD processing
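MIMD-style asynchrony can be sketched with ordinary threads. This is only an analogy, assuming each thread stands in for a processing element running its own instruction stream on its own data, finishing whenever it finishes, and sending its result to a central, shared location.

```python
import threading

results = {}                 # central location for results
lock = threading.Lock()

def worker(pid, data, op):
    """One 'processing element': its own instructions (op), its own
    data, run asynchronously with no lockstep between elements."""
    value = op(data)
    with lock:               # access to the shared store is synchronized
        results[pid] = value

tasks = [
    (0, [1, 2, 3], sum),     # each element runs a different operation
    (1, [4, 5, 6], max),
    (2, [7, 8, 9], min),
]
threads = [threading.Thread(target=worker, args=t) for t in tasks]
for t in threads:
    t.start()
for t in threads:            # elements finish at different times
    t.join()
print(results)               # {0: 6, 1: 6, 2: 7} (completion order varies)
```

The final contents are deterministic here, but the order in which the three elements complete is not, which is the defining contrast with SIMD lockstep.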

3. Very long instruction word (VLIW) Processors:


VLIW refers to a CPU architecture designed to take advantage of instruction-level parallelism (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. Performance can be improved by executing different sub-steps of sequential instructions simultaneously (this is pipelining), or by executing multiple instructions entirely simultaneously, as in superscalar architectures.

Further improvement can be achieved by executing instructions in an order different from the order in which they appear in the program; this is called out-of-order execution. As often implemented, these techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions have no interdependencies; for example, if a first instruction's result is used as a second instruction's input, the two clearly cannot execute at the same time, and the second cannot be executed before the first. Modern out-of-order processors devote substantial hardware resources to scheduling instructions and determining interdependencies. In a VLIW processor, by contrast, the order of execution of operations (including which operations can execute simultaneously) is determined by the compiler, so the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.

3.1 Superscalar: A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU, such as an arithmetic logic unit, a bit shifter, or a multiplier.

Figure 5: Superscalar architecture employing ILP

While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance enhancement techniques. The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

- Instructions are issued from a sequential instruction stream.
- CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).
- The CPU accepts multiple instructions per clock cycle.
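The run-time dependency check in the second characteristic can be sketched as follows. This is a deliberately simplified model (register names and the tuple encoding are illustrative): two adjacent instructions from the sequential stream may issue together only if the second neither reads nor writes a register the first writes.

```python
def can_dual_issue(i1, i2):
    """Decide whether two sequential instructions may issue together.

    Each instruction is encoded as (dest_register, source_registers).
    RAW: i2 reads what i1 writes.  WAW: both write the same register.
    WAR within one issue group is safe in this model, because reads
    happen before writes within the group.
    """
    d1, _ = i1
    d2, s2 = i2
    raw = d1 in s2       # read-after-write dependency
    waw = d1 == d2       # write-after-write dependency
    return not (raw or waw)

add = ("r3", ("r1", "r2"))     # r3 = r1 + r2
mul = ("r4", ("r3", "r5"))     # r4 = r3 * r5  -> reads r3 (RAW)
sub = ("r6", ("r1", "r5"))     # r6 = r1 - r5  -> independent

print(can_dual_issue(add, mul))   # False: RAW on r3, must issue serially
print(can_dual_issue(add, sub))   # True: independent, may dual-issue
```

Real superscalar issue logic performs checks of this kind in hardware, across a wider window and against in-flight instructions as well, which is where much of the hardware complexity comes from.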

3.2 Pipeline: An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. Fig. 6 shows a pipeline of five stages (Fetch, Decode, Execute, Memory access, and Write Back) for processing instructions in a microprocessor. Once the pipeline is full, after the first five clock cycles, one processed instruction completes at every clock cycle. The speed of a pipeline can be measured in terms of CPI (clock cycles per instruction), which should ideally be 1; due to latency and pipeline hazards (explained later), the value of CPI is greater than one in all practical cases.

Figure 6: Pipeline operation for five stages

Assume all stages in the pipeline have the same delay tC; then the pipeline clock cycle time = tC. Let there be k stages and n instructions to be executed.

Processing time:
Tk = [k + (n-1)] tC   (pipelined processor)
T1 = n k tC           (non-pipelined processor)

Speed-up factor: Sk = T1 / Tk = nk / [k + (n-1)]
Efficiency: Ek = Sk / k = n / [k + (n-1)]
Pipeline throughput = number of instructions executed per unit time = n / {[k + (n-1)] tC}

If the delays of the pipeline stages are unequal, let the delay of the slowest stage be tcw, which becomes the pipeline clock cycle time. The processing time to execute n instructions is then:
Tk = [k + (n-1)] tcw          (pipelined)
T1 = n (d1 + d2 + ... + dk)   (non-pipelined)

Ideally, with N stages in a pipeline architecture:
Instruction execution time (pipelined) = instruction execution time (non-pipelined) / N

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the clock signal arrives, the flip-flops take their new values and the logic then requires a period of time to decode them. Then the next clock pulse arrives, the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces, the delay before the logic gives valid outputs is reduced, and in this way the clock period can be shortened. For example, the classic RISC pipeline is broken into five stages with a set of flip-flops between each pair of stages:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
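The timing model with k equal-delay stages and n instructions can be checked numerically. A minimal sketch: Tk = [k + (n-1)] tC for the pipelined processor versus T1 = n k tC without pipelining, with speed-up, efficiency, and throughput derived from them.

```python
def pipeline_metrics(k, n, tc):
    """Timing for k pipeline stages, n instructions, stage delay tc."""
    t_pipe = (k + (n - 1)) * tc        # Tk = [k + (n-1)] * tc
    t_seq  = n * k * tc                # T1 = n * k * tc
    speedup    = t_seq / t_pipe        # Sk = nk / [k + (n-1)]
    efficiency = speedup / k           # Ek = Sk / k
    throughput = n / t_pipe            # instructions per unit time
    return t_pipe, t_seq, speedup, efficiency, throughput

t_pipe, t_seq, s, e, thr = pipeline_metrics(k=5, n=100, tc=1)
print(f"Tk = {t_pipe}, T1 = {t_seq}")   # Tk = 104, T1 = 500
print(f"speedup = {s:.2f}")             # speedup = 4.81
print(f"efficiency = {e:.2f}")          # efficiency = 0.96
```

As n grows with k fixed, the speed-up approaches k, the number of stages, which is the ideal N-fold improvement mentioned above.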

1. Instruction fetch
The Instruction fetch on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched from the cache. The PC predictor sends the Program Counter (PC) to the Instruction Cache to read the current instruction. At the same time, the PC predictor predicts the address of the next instruction by incrementing the PC by 4 (all instructions were 4 bytes long). This prediction was always wrong in the case of a taken branch, jump, or exception (see delayed branches, below). Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.
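The PC predictor's behavior is trivial but worth making concrete. A sketch of the fall-through guess (addresses are illustrative): the predictor always proposes PC + 4, which is correct for sequential code and always wrong for a taken branch, jump, or exception.

```python
def predict_next_pc(pc):
    """Fetch-stage guess: fall through to the next 4-byte instruction."""
    return pc + 4

# Sequential code: the guess is always right.
print(hex(predict_next_pc(0x1000)))    # 0x1004

# A taken branch at 0x1004 jumping to 0x2000 is always mispredicted,
# so the wrongly fetched instruction must be discarded.
actual_target = 0x2000
mispredicted = predict_next_pc(0x1004) != actual_target
print(mispredicted)                    # True
```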

2. Decode
Unlike earlier microcoded machines, the first RISC machines had no microcode. Once fetched from the instruction cache, the instruction bits were shifted down the pipeline, so that simple combinational logic in each pipeline stage could produce the control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage. If the instruction decoded was a branch or jump, the target address of the branch or jump was computed in parallel with reading the register file. The branch condition is computed after the register file is read, and if the branch is taken or if the instruction is a jump, the PC predictor in the first stage is assigned the branch target, rather than the incremented PC that has been computed.

3. Execute
Instructions on these simple RISC machines can be divided into three latency classes according to the type of the operation:

- Register-register operation (single-cycle latency): add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.
- Memory reference (two-cycle latency): all loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.
- Multi-cycle instructions (many-cycle latency): integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating the writeback stage and issue logic, multi-cycle instructions wrote their results to a separate set of registers.

4. Memory Access
During this stage, the results of data-processing instructions produced by the execute stage are forwarded to the next stage. If the instruction is a load, the data is read from the data cache or data memory; if the instruction is a store, the register data is written to data memory at the address computed by the execute stage.

5. Writeback
During this stage, both single-cycle and two-cycle instructions write their results into the register file.

Pipeline Hazards:
> Data hazards occur when an instruction, scheduled blindly, would attempt to use data before it is available in the register file (e.g. an instruction depends on the result of a previous instruction).
> Control hazards occur whenever there is a change in the normal execution flow of the program: events such as branches, interrupts, exceptions, and returns from interrupts. The hazard arises because branches, interrupts, etc. are not caught until the instruction is executed; by that time, the following instructions have already entered the pipeline and need to be flushed out.
> Structural hazards occur when two instructions might attempt to use the same resource at the same time. Classic RISC pipelines avoided these hazards by replicating hardware; in particular, branch instructions could have used the ALU to compute the target address of the branch.

Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.
2. Some combinational circuits such as adders or multipliers can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry versus a more complex combinational circuit.

Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent, because extra flip-flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor has a stable instruction bandwidth, whereas the performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
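The cost of hazards can be tied back to the CPI figure from earlier: the ideal pipelined CPI of 1 is inflated by stall cycles charged per instruction. A sketch with made-up stall frequencies and penalties:

```python
def effective_cpi(base_cpi, stall_events):
    """CPI once hazard stalls are charged.

    stall_events: list of (frequency per instruction, penalty cycles).
    Effective CPI = base CPI + sum of frequency * penalty.
    """
    return base_cpi + sum(freq * penalty for freq, penalty in stall_events)

stalls = [
    (0.20, 1),   # load-use data hazards: 20% of instructions, 1 cycle
    (0.05, 3),   # mispredicted branches: 5%, 3-cycle pipeline flush
]
cpi = effective_cpi(1.0, stalls)
print(f"effective CPI = {cpi:.2f}")   # effective CPI = 1.35
```

Because stall frequencies differ between programs (branch-heavy code flushes more often), this is also why pipelined performance varies more widely across workloads than non-pipelined performance.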
