Lecture 4 8405 Computer Architecture

ECE8405 Class Notes

CHAPTER 5 - CPU IMPLEMENTATION

The CPU consists of the datapath, control, cache, and I/O peripherals and interfaces. This week, we will consider the first two, which form the backbone of the CPU. Today we will consider a non-pipelined implementation of the datapath.

R-FORMAT: op(6) rs(5) rt(5) rd(5) shift(5) arith_op(6)

Let's do it from the perspective of the instruction types. Starting with a register-type instruction, what elements do we need in our datapath?

[Figure: R-format datapath. The IR supplies the rs, rt, and rd addresses to the Registers block; data(rs) and data(rt) feed the ALU; the ALU result is written back to rd. Control inputs: regWrite and ALU_op.]
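The R-format field layout above can be sketched as plain bit-field extraction. This is only an illustration: the names `RFormat` and `decode_r_format` are made up for this sketch, and the bit positions follow the layout used in these notes (op in bits 26-31, arith_op in bits 0-5).

```python
from typing import NamedTuple


class RFormat(NamedTuple):
    """The six R-format fields, in the order op(6) rs(5) rt(5) rd(5) shift(5) arith_op(6)."""
    op: int
    rs: int
    rt: int
    rd: int
    shift: int
    arith_op: int


def decode_r_format(ir: int) -> RFormat:
    """Split a 32-bit instruction word into its R-format fields."""
    return RFormat(
        op=(ir >> 26) & 0x3F,     # bits 26-31
        rs=(ir >> 21) & 0x1F,     # bits 21-25
        rt=(ir >> 16) & 0x1F,     # bits 16-20
        rd=(ir >> 11) & 0x1F,     # bits 11-15
        shift=(ir >> 6) & 0x1F,   # bits 6-10
        arith_op=ir & 0x3F,       # bits 0-5
    )


# Example word: add $3, $1, $2 (op = 0, arith_op = 0x20 in standard MIPS).
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | (0 << 6) | 0x20
fields = decode_r_format(word)
```

In hardware no computation is involved: each field is just a bundle of IR output wires routed to the register file or the ALU.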
The Instruction Register (IR) stores the instruction currently being executed. The operation can be almost completely combinatorial: updating the IR starts executing the instruction, the addresses flow to the registers, data to the ALU, and the ALU's result flows back to the registers and is written. The only problem: if dest_write_enable (regWrite) is enabled before the address of the destination has settled, the write can spuriously hit a wrong location. Usually this is taken care of by enabling the write only during the latter half of the clock cycle during which the operation is executed.

Where does the instruction come from? We need circuitry to update addresses and read the instruction (see next page). The driving force here is the clock, which allows the PC to update, which causes the next instruction to be read and placed into the IR. We can assume for now that the clock period is long enough to fetch the instruction AND execute it!
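[Figure: instruction fetch circuit. The Clock drives the PC; an ADDER increments the PC; the PC addresses the Program Memory, whose output is loaded into the IR.]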
Missing is the control circuitry, which decodes the op bits (IR26-IR31) and, combined with the low ALUop bits (IR0-IR5), derives control bits for the ALU and registers. Note that the shift bits are routed to the ALU also, in case the op is a shift.

I-FORMAT:

I-format instructions require us to make quite a few changes. First, note that each I-format instruction type (immediate, load/store, and branch) requires different hardware features. Let's treat them separately.

Immediate (op6, rs5, rd5, immed16)

To treat an immediate arithmetic operation, we have two problems:
1) The destination reg address rd is in different bits from R-format
2) The ALU needs to operate on the immediate data from the IR

To solve the first, we need to use a mux to select whether rd comes from bits 11-15 (R-format) or bits 16-20 (I-format) of the IR. In any case, bits 16-20 can go to rt, because a read can be done whether or not the data is used! To solve the second, another mux can be used on the lower ALU data input to select whether the data source is the register read rt or the instruction. If the data is in the instruction, a sign extension operation is required (copy the MS bit) to make the data 32 bits wide with the correct sign.
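The two muxes and the sign extension described above can be sketched in a few lines. The function names are hypothetical, and the mux polarities are assumptions taken from the rest of these notes: regDst = 1 selects IR bits 11-15 (R-format), and ALUsrc = 1 selects the sign-extended immediate.

```python
def sign_extend16(value: int) -> int:
    """Extend a 16-bit field to 32 bits by copying its MS bit (bit 15) upward."""
    value &= 0xFFFF
    if value & 0x8000:
        value |= 0xFFFF0000
    return value


def dest_register(ir: int, reg_dst: int) -> int:
    """regDst mux: 1 -> IR bits 11-15 (R-format), 0 -> IR bits 16-20 (I-format)."""
    if reg_dst:
        return (ir >> 11) & 0x1F
    return (ir >> 16) & 0x1F


def lower_alu_input(ir: int, data_rt: int, alu_src: int) -> int:
    """ALUsrc mux: 1 -> sign-extended IR bits 0-15, 0 -> register data(rt)."""
    if alu_src:
        return sign_extend16(ir)
    return data_rt
```

For example, an immediate field of 0xFFFC (that is, -4) reaches the ALU as 0xFFFFFFFC, while 0x7FFF passes through unchanged.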
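[Figure: immediate datapath. Two muxes are added: one controlled by regDst on the destination register address, and one controlled by ALUsrc on the lower ALU input, selecting between data(rt) and the sign-extended IR bits 0-15. regWrite enables the register write.]

LOAD/STORE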
To perform load/store, a few more blocks need to be added. The ALU is used to perform address calculations, using exactly the circuit above. The only difference is that data memory must be included in our system. Thus, we need another mux to select whether the data returned to the register is from memory or from ALU:
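[Figure: load/store datapath. The data memory (MEM) has inputs Addr and din, output dout, and controls memRead and memWrite; a mux controlled by memToReg selects the value returned to the registers.]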
Note that two memory accesses are required: instruction fetch and data read or write. If all of this is done in one clock cycle, then the clock rate will have to be very low (1-10MHz). This is truly the limiting factor on clock rate!
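A minimal sketch of the load/store path may help here: the ALU forms the address, the data memory is read or written, and the memToReg mux picks what is returned to the register file. The function `load_store`, the dictionary used as memory, and the register values are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Extend a 16-bit field to 32 bits by copying bit 15 upward."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value


def load_store(regs, mem, rs, rt, immed16, mem_read, mem_write, mem_to_reg):
    """One pass through ALU, data memory, and the memToReg mux.

    Returns the value presented to the register file write port.
    """
    addr = (regs[rs] + sign_extend16(immed16)) & MASK32  # ALU: address calculation
    dout = mem.get(addr, 0) if mem_read else 0           # data memory read
    if mem_write:
        mem[addr] = regs[rt]                             # data memory write (din = data(rt))
    return dout if mem_to_reg else addr                  # memToReg mux


regs = [0] * 32
regs[1], regs[2] = 0x100, 42
mem = {}  # hypothetical data memory, keyed by byte address

# sw $2, 8($1): store register 2 at address 0x108
load_store(regs, mem, rs=1, rt=2, immed16=8, mem_read=0, mem_write=1, mem_to_reg=0)
# lw $3, 8($1): load the same word back into register 3
regs[3] = load_store(regs, mem, rs=1, rt=3, immed16=8, mem_read=1, mem_write=0, mem_to_reg=1)
```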
BRANCH

To do a branch, consider what is required: two registers are compared, and the resulting flag is used to decide whether or not to branch. The ALU is used to make the comparison, so it can't also be used for adding the offset to the PC. A second adder is needed. A mux serves to decide whether the PC is updated with the branch or no-branch address:
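[Figure: branch datapath. One ADDER adds 4 to the PC; a second ADDER adds the sign-extended immediate, shifted left by 2. A mux driven by Branch and cond_true selects the address loaded into the PC on the clock. The PC addresses the MEMORY, which loads the IR.]

J-FORMAT (op6, addr26)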
The last instruction type is the jump (j, jr, jal). J is easy: we just need to shift the 26-bit address left by 2 as in the branch, take the upper four bits from the PC (no sign extension is needed), and add another input to the MUX. The others require a little more thought, and will be left as an exercise.
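The next-PC selection for the no-branch, branch, and jump cases can be sketched as follows. This is a hedged illustration: the function name and argument names are hypothetical, and it assumes (as in the textbook datapath) that the branch offset and the upper jump bits are taken relative to PC + 4.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(value: int) -> int:
    """Extend a 16-bit field to 32 bits by copying bit 15 upward."""
    value &= 0xFFFF
    return (value | 0xFFFF0000) if value & 0x8000 else value


def next_pc(pc: int, ir: int, branch: int = 0, cond_true: int = 0, jump: int = 0) -> int:
    """Model the PC input mux: no-branch, branch, or jump address."""
    pc_plus_4 = (pc + 4) & MASK32                       # first adder
    if jump:
        # upper 4 bits of the PC, then the 26-bit address shifted left by 2
        return (pc_plus_4 & 0xF0000000) | ((ir & 0x03FFFFFF) << 2)
    if branch and cond_true:
        # second adder: PC + 4 plus the sign-extended, shifted immediate
        return (pc_plus_4 + (sign_extend16(ir) << 2)) & MASK32
    return pc_plus_4
```

With pc = 0x1000, an untaken branch gives 0x1004, a taken branch with immediate 3 gives 0x1010, and a taken branch with immediate 0xFFFF (that is, -1) loops back to 0x1000.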
CONTROL (p 360 in text)

Note that there are remarkably few control lines in this implementation. Fig 5.19 on page 360 shows the control lines for the jump-less MIPS:

RegDst     allows mux to select register address from IR 11-15 or 16-20
Branch     indicates that the instruction is a branch (enables offset add)
MemRead    allows the data memory to perform a read operation
MemToReg   controls the mux selecting readData or ALUdata
ALUop      multiple bits that may specify the ALU operation
MemWrite   specifies a data memory write
ALUSrc     specifies what is input to the lower ALU input
RegWrite   enables writing a result to register

In MIPS, the value 000000 was selected as the op (highest 6 bits) to indicate an R-type instruction, which defers ALU control to the lowest 6 bits. Let's continue with that, since it's easy to detect an all-zero field using just a NOR gate. The book's simplified ALU instruction set includes only 5 functions: 000=AND, 001=OR, 010=add, 110=subtract and 111=set on less than. Note that in this case only 3 control inputs are needed. Let's expand on that to implement the instruction set given on the inside of the back cover, but we'll skip the jump instructions for now.

INSTRUCTIONS ARE: add, sub, addi, addu, subu, addiu, mfc0, mult, multu, div, divu, mfhi, mflo / and, or, andi, ori, sll, srl / lw, sw, lbu, sb, lui / beq, bne, slt, slti, sltu, sltiu / j, jr, jal.

So, how do we implement the control bits for these instructions? We need to consider which datapaths are needed for each instruction! Let's start with the easy cases:

RegDst   => high only for R-format instructions (op = 0): NOR of the op bits.
Branch   => high only for branch instructions (op = 4 or 5).
MemRead  => high for lw, lb, etc. (lw: op = 35, 100011)
MemWrite => high for sw etc. (sw: op = 43, 101011)
MemToReg => same as MemRead

The rest are more complex, because they are activated by a variety of instructions:
ALUSrc   => low for R-format (op = 0) instructions and for beq, bne, slt, sltu, which take the lower ALU input from the register; high for instructions that use the immediate (e.g. addi, lw, sw).
RegWrite => all BUT sw, sb, beq, bne, j, jr, jal.

NOTE that in some cases the control signal can be either value since, for example, the instruction dataflow does not flow through the mux (e.g. MemToReg for a sw operation). As the text points out, this can be treated as a don't-care to simplify the select circuitry.

Which leaves the worst: ALUop! Which of these instructions require individual ALU_op patterns? add, addu, sub, subu, mfc0, mult, multu, div, divu, mfhi, mflo, and, or, srl, sll, beq, slt, sltu, lui. How many bits to represent? 5 (plus shift bits). We are assuming here that the coprocessor is close to the ALU on the layout, so that mfc0 can be easily implemented in the ALU, rather than externally with yet another mux.

If the instruction is R-format, we need to take the bits from the lowest 6 bits of the IR; otherwise we must decode from the IR op bits (26-31).

[Figure: ALUop generation from the op bits IR26-IR31 and the function bits IR0-5.]

So, how best to code these? Just pass IR0-5 through to the ALU directly, and use a decoder to extract ALUop from the op field for non-R-format instructions? That may work. Smart selection of op patterns may minimize control logic, too.
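The simple control signals can be sketched as a small decoder on the op field. This is a sketch under stated assumptions, not the book's circuit: it covers only R-format, beq, bne, addi, lw, and sw; it assumes ALUSrc = 1 selects the sign-extended immediate (the polarity used in the worked examples later in these notes); it resolves don't-cares to 0; and the addi opcode (8) is the standard MIPS encoding, which these notes do not list. ALUop is left out.

```python
SUPPORTED_OPS = {0: "R-format", 4: "beq", 5: "bne", 8: "addi", 35: "lw", 43: "sw"}


def main_control(op: int) -> dict:
    """Derive the single-bit control lines from the 6-bit op field."""
    if op not in SUPPORTED_OPS:
        raise ValueError(f"op {op} is not covered by this sketch")
    r_format = op == 0        # NOR of the op bits
    branch = op in (4, 5)
    load = op == 35
    store = op == 43
    return {
        "RegDst": int(r_format),
        "Branch": int(branch),
        "MemRead": int(load),
        "MemWrite": int(store),
        "MemToReg": int(load),                        # same as MemRead
        "ALUSrc": int(not (r_format or branch)),      # immediate for addi, lw, sw
        "RegWrite": int(not (store or branch)),
    }
```

For example, `main_control(35)` (lw) asserts MemRead, MemToReg, ALUSrc, and RegWrite, while `main_control(4)` (beq) asserts only Branch.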
MULTICYCLE IMPLEMENTATION (Fig 5.33/5.34 p383/384)

The single-cycle machine has a low clock speed that is limited by the instruction that takes the longest. That would probably be a load, which requires all functional units and includes two memory accesses. While the machine has a CPI of 1.0, the clock rate would be severely limited. With a faster clock, we could then use as many clock cycles as needed for each instruction: an R-format instruction might execute in one cycle (two for mul and div), branches in two, and loads in 5. This would allow a much higher clock rate, and substantially improved throughput.

Problem: if 10% of instructions are loads, and 80% simple R-format instructions (not mul/div), what is the speedup of the multicycle (MC) CPU? Take the remaining 10% to be branches. Since the single-cycle (SC) system is limited by the load, we can say that it has a CPI of 5 for all instructions (or, that the new clock rate is 5x the old):

CPI(SC)/CPI(MC) = 5/(0.1*5 + 0.8*1 + 0.1*2) = 5/1.5 = 3.33

It should be noted that this topic is more theoretical/historical than practical, since today all high-performance CPUs are pipelined. Pipelining is more efficient than multicycle implementations. Note that most older/simple microcontrollers are multicycle, however.

PRINCIPLE: execution of each instruction proceeds in a number of steps, where each step takes one clock cycle. Because a functional unit may be reused during different steps of an instruction, this allows us to:
1. Use a unified memory for program and data
2. Reuse the ALU for math/logic, updating PC, and branch calculations
3. Wait more than one clock if necessary for slow memory (e.g. the 68000)

In order to reuse functional units, we need to save the older results in registers while the units are being reused. This leads to:
1. Instruction Register
2. Memory data register
3. Register set output registers, used also because one cycle is used to read register values, another to do the ALU operation
4. ALU output register(s)
5. PC
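The speedup problem in this section is easy to check numerically. The sketch below assumes the instruction mix used in the formula (10% loads at 5 cycles, 80% R-format at 1 cycle, 10% branches at 2 cycles); the function name `average_cpi` is made up for the example.

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (fraction, cycles) pairs."""
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix)


# (fraction of instructions, cycles per instruction)
mix = [(0.10, 5),   # loads
       (0.80, 1),   # simple R-format
       (0.10, 2)]   # branches

cpi_mc = average_cpi(mix)   # multicycle CPI, about 1.5
cpi_sc = 5                  # single-cycle machine expressed in fast-clock cycles
speedup = cpi_sc / cpi_mc   # about 3.33
```

Changing the mix shows how sensitive the result is to the load fraction: at 50% loads and 50% R-format the multicycle CPI rises to 3.0 and the speedup drops to about 1.67.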
To implement the multicycle datapath, we start from the single-cycle system, and make the following modifications:
1. Add the intermediate registers
2. Add multiplexers/MUX inputs to allow reusing the ALU for calculating instruction addresses
3. Relocate the memory, and add an address mux to allow unified data/instruction access
4. Add control lines for all the new registers

What is the cost of this upgrade? Registers are reasonably cheap, VLSI-wise. But the control circuitry is now very complex. Instead of a few combinatorial gates, we need a sophisticated state machine that knows the steps required for each instruction.

How should we allocate clock cycles? The slowest step determines the max clock rate, so no step should be much longer than any other. Mem access may require more than one step, as may div or FP ops. Register access and simple ALU ops should each take one clock: these are our rate-limiting primitives.

What is a consistent series of steps that serves to execute an instruction on this multicycle machine?

1. Instruction fetch: read the instruction and update the address at the input of the PC:
   IR = Memory[PC];   -- latch instruction in IR, uses memory
   PC = PC + 4;       -- uses ALU
2. Instruction decode and register fetch, calculate branch offset just in case:
   A = Reg[IR[25-21]];   -- uses register
   B = Reg[IR[20-16]];
   ALUout = PC + (sign-extend(IR[15-0]) << 2);   -- uses ALU
3. Execute instruction by calculating result, memory address, or doing branch conditional (and PC update if true):
   Mem ref:  ALUout = A + sign-extend(IR[15-0]);
   R-format: ALUout = A op B;
   Branch:   if (A==B) PC = ALUout;
   Jump:     PC = PC[31-28] || (IR[25-0]<<2);
4. Memory access or write-back (completion):
   MemLoad:  MDR = Memory[ALUout];
   MemStore: Memory[ALUout] = B;
   R-format: Reg[IR[15-11]] = ALUout;
   I-type:   Reg[IR[20-16]] = ALUout;   -- immediate ALU ops, slt
5. Memory read completion ONLY:
   Reg[IR[20-16]] = MDR;

MULTICYCLE CONTROL (Fig 5.42 p 396)

The difficulty with control is that each TYPE of instruction requires an individual state machine! Last semester, we looked at several ways to implement logic, which included synthesis by gates, and table-lookup (e.g. by ROM). These are the two ways that multicycle control may be implemented. Let's start by considering the initial steps that ALL instructions require:

[Figure: common FSM states. Reset leads to the fetch state (MemRead, ALUSrcA=0, IorD=0, IRWrite, ALUSrcB=01, ALUOp=00, PCWrite, PCSource=00), followed by the decode state (ALUSrcA=0, ALUSrcB=11, ALUOp=00). Decode branches to the INDIVIDUAL STATE MACHINES (Mem, R-type, Branch, Jump, I-type), each of which returns to fetch.]
Go over FSMs with a few examples.
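As one such example, the step sequence for each instruction class can be sketched as a tiny state machine. The state names are hypothetical labels for the five steps listed above, not the state numbers of Fig 5.42, and the I-type ALU path is omitted to keep the sketch short.

```python
def next_state(state: str, kind: str) -> str:
    """Return the state that follows `state` for an instruction of class `kind`."""
    if state == "fetch":
        return "decode"
    if state == "decode":
        # dispatch to the individual state machine for this instruction class
        return {
            "load": "mem_addr",
            "store": "mem_addr",
            "r_type": "execute",
            "branch": "branch_completion",
            "jump": "jump_completion",
        }[kind]
    if state == "mem_addr":
        return "mem_read" if kind == "load" else "mem_write"
    if state == "mem_read":
        return "load_write_back"
    if state == "execute":
        return "r_write_back"
    return "fetch"  # every completion state returns to fetch


def trace(kind: str) -> list:
    """List the states visited by one instruction, starting at fetch."""
    states = ["fetch"]
    while True:
        following = next_state(states[-1], kind)
        if following == "fetch":
            return states
        states.append(following)
```

Under these assumptions `trace("load")` visits five states, stores and R-type instructions four, and branches and jumps three, which matches the five-step breakdown given earlier.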
MICROPROGRAMMING

The FSM discussed previously is not all that complex, and so can easily be implemented with the techniques we covered last semester. When adding the MIPS floating point instructions, some instructions take up to 20 clock cycles. In cases where the control is horrendously complex (hundreds or thousands of states, as is often the case with CISC processors), a technique called microprogramming is often used.

Microprogramming is a processor within a processor. Each instruction is implemented as a series of MICROINSTRUCTIONS that specify the control signals needed for one state of that instruction's FSM. Also, the microinstruction indicates which microinstruction must be executed next, if it is not the next one in the sequence (i.e. indicates a branch). As in programming, the key is to use an easily-understood assembly language that can be assembled into the control hardware. The hardware looks something like:
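[Figure: microprogram controller. The microprogram storage outputs the control signals (to datapath) and the sequencing info; the Sequencer uses the sequencing info to produce the next addr for the microprogram storage.]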
The microprogram vectors give the control signals as well as sequencing bits that indicate which vector should be executed next.
Let's look at the MIPS microprogramming assembly language. The microinstruction contains eight fields:
1. ID string: a label that identifies this instruction (for jump-tos)
2. ALU control: ALU function (Add, Sub, Fn)
3. SRC1: source for first ALU operand (PC, A)
4. SRC2: source for second ALU operand (B, 4, ext, ext_shift)
5. Register control: source for write (read, write ALU, write MDR)
6. Memory: (Read Addr=PC, Read ALU, Write ALU)
7. PCWrite control: (ALU, ALUout-cond, Jump_addr)
8. Sequencing: give next microinstruction label or seq

So, for these 8 fields, the microprogram for the first two cycles is:

1       2     3     4         5      6        7     8
Fetch   Add   PC    4                ReadPC   ALU   Seq
        Add   PC    ExtShft   Read                  Dispatch 1
Let's examine these: The first cycle specifies that the ALU adds, with ALU inputs PC and 4, the memory reads the instruction pointed to by the PC, and the PC is updated from the ALU output. The second cycle indicates that the registers are read and that the ALU adds the PC (now updated) to the sign-extended and shifted immediate field (branch offset calculation). ALU write is deferred until the instruction is decoded!
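A rough sketch of how a sequencer could step through these two microinstructions is given below. Only the two rows from the notes are encoded; the dictionary keys and the function `run_until_dispatch` are hypothetical names chosen for the illustration, and the dispatch targets are not modelled because the rest of the microprogram is not given here.

```python
# The first two microinstructions, one dict per row; None marks an empty field.
MICROPROGRAM = [
    {"label": "Fetch", "alu": "Add", "src1": "PC", "src2": "4",
     "reg": None, "mem": "ReadPC", "pcwrite": "ALU", "seq": "Seq"},
    {"label": None, "alu": "Add", "src1": "PC", "src2": "ExtShft",
     "reg": "Read", "mem": None, "pcwrite": None, "seq": "Dispatch 1"},
]


def run_until_dispatch(program):
    """Issue microinstructions in sequence until one requests something other than Seq.

    Returns the control fields issued on each cycle and the final sequencing request.
    """
    addr = 0
    issued = []
    while True:
        micro = program[addr]
        # control signals sent to the datapath: every non-empty field except label/seq
        controls = {key: value for key, value in micro.items()
                    if key not in ("label", "seq") and value is not None}
        issued.append(controls)
        if micro["seq"] == "Seq":
            addr += 1            # sequencer: simply take the next microinstruction
        else:
            return issued, micro["seq"]


issued, request = run_until_dispatch(MICROPROGRAM)
```

After the second cycle the sequencer sees "Dispatch 1" and would use the decoded op field to select the microroutine for the instruction class.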
EXCEPTIONS AND INTERRUPTS (Handout: Figs 5.48, 5.49)

What are exceptions and interrupts? Terminology differs between platforms! For MIPS, an interrupt is an external hardware request for action that is independent of instructions. An exception is caused by instructions (or data) or restart/reset.

Exception types:
   Power-on reset
   Math errors (overflow, divide by zero)
   Software interrupt!!! (e.g. syscall)
   Memory access violation
   Non-existent instruction

Exception processing:
   Save program counter into special register (EPC)
   Update PC to special location (exception processing routine)

Note that power-on reset/reset is handled differently: the processor initializes all internal registers and starts executing at a standard address (often 0000).

How are other exceptions/interrupts handled? One of three ways:
1) Cause of exception/interrupt is saved in a special cause register, and all exceptions and interrupts go to the same (service routine) vector address. PC = serviceAddress
2) Vectored: each exception/source has its own standard address for the service routine, usually spaced 4-32 bytes apart (jumps are used to go to the individual service routines). PC = serviceAddress[i]
3) Indirect: adjacent vectors contain ADDRESSES of service routines (popular Motorola ploy). Efficient mechanism. PC = mem[vector[i]]

The MIPS implementation is simple, with just arithmetic overflow (cause = 1) and illegal instruction (cause = 0). It uses Method 1 above, where the service routine is located at address 0xC0000000.
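The three dispatch methods differ only in how the new PC is formed, which a short sketch can make concrete. All function names, the vector spacing, and the table contents below are hypothetical; only the address 0xC0000000 and the cause codes come from these notes.

```python
def single_vector(service_address: int) -> int:
    """Method 1: every exception goes to the same address (cause is read from a register)."""
    return service_address


def vectored(base: int, i: int, spacing: int) -> int:
    """Method 2: source i has its own entry point, `spacing` bytes after the previous one."""
    return base + i * spacing


def indirect(mem: dict, table_base: int, i: int) -> int:
    """Method 3: entry i of a table in memory holds the ADDRESS of the service routine."""
    return mem[table_base + 4 * i]


def take_exception(cpu: dict, cause: int) -> None:
    """MIPS-style handling as described above: save PC in EPC, record cause, jump."""
    cpu["EPC"] = cpu["PC"]
    cpu["Cause"] = cause
    cpu["PC"] = single_vector(0xC0000000)


cpu = {"PC": 0x400, "EPC": 0, "Cause": 0}
take_exception(cpu, cause=1)   # arithmetic overflow
```

Method 3 costs one extra memory read per exception but lets the service routines live anywhere, which is why the notes call it efficient.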
How can we extend the model to incorporate these exceptions?
   Add cause and EPC registers
   Add Boolean control signal to update cause register bit 0
   Add path to update EPC from PC upon exception
   Add means to load PC with vector address 0xC0000000
   Add logic to detect exceptions in ALU and control
   Add instructions that allow accessing cause and EPC

Hardware modifications for the first 4 are shown in Fig. 5.48. Talk through these changes (note ALU subtracts 4 from PC!). The modifications to the state machine require just the addition of two states, one for each exception. One happens when a signed R-type instruction causes an overflow, and the other when the decoded instruction out of state 1 is other than the implemented types.

Examples:

1) (5.1) Describe the effect that a single stuck-at-0 fault would have on the multiplexors in the single-cycle datapath of Figure 5.19. Which instructions, if any, would still work? Consider each of the following faults separately: RegDst=0, ALUSrc=0, MemtoReg=0, Zero=0.

If RegDst=0, all R-format instructions would not work properly since we would specify the wrong address to write to. If ALUSrc=0, then all I-format instructions except branch will not work because we will not be able to get the sign-extended 16 bits into the ALU. If MemtoReg=0, then loads will not work. If Zero=0, the branch instructions will never branch, even when they should.

2) (5.5) We wish to add the instruction addi to the single-cycle datapath. Add any necessary datapaths and control signals to the single-cycle datapath of Figure 5.19.

No new additions are required. The new control is similar to the load word because we want to use the ALU to add the immediate to a register. So, RegDst=0, ALUSrc=1, ALUOp=00. The new control is also similar to an
R-format instruction because we want to write the result of the ALU to a register, so MemtoReg=0, RegWrite=1, and since we aren't using branches or memory, Branch=0, MemRead=0, MemWrite=0.

3) (5.7) Same as 2, but we want to add branch not equal (bne).

One possible solution is to add a new control signal called Invzero that selects whether Zero or inverted Zero is an input to the AND gate used for choosing what the new PC should be (so this means a new mux). The new control signal Invzero would be a don't care whenever Branch is 0. Many other solutions are possible.

4) (5.10) A friend is proposing that the control signal MemtoReg be eliminated. The multiplexer that has MemtoReg as an input will instead use the control signal MemRead. Will this work? Consider both datapaths.

MemtoReg and MemRead are identical except for sw and beq, for which MemtoReg is a don't care. Thus, the modification will work for single-cycle. For multicycle it will also work, assuming that the finite state machine is changed so that MemRead is asserted whenever MemtoReg is.

5) (5.17) We wish to add the instruction jump and link (jal) to the multicycle datapath. Add any necessary datapath and control signals.

We need to expand the two multiplexors controlled by RegDst and MemtoReg. The execution steps would be:
   Instruction fetch (unchanged)
   Instruction decode and register fetch (unchanged)
   Jal: Reg[31] = PC; PC = PC[31-28] || (IR[25-0]<<2);

So we are writing the PC, after it has already been incremented by 4 (in the instruction fetch step), into register $ra. We therefore need PC to be an input to the MemtoReg multiplexer, and 31 needs to be an input to the RegDst multiplexer. We need to modify existing states to show proper values of RegDst and MemtoReg, and add a new state that performs the jal instruction (and then returns to state 0). That state (say state 10) would have: PCWrite, PCSource=10, RegWrite, and the appropriate values for MemtoReg and RegDst.
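For the bne extension in example 3, the Invzero mux followed by the existing AND gate reduces to a one-line Boolean expression. The sketch below is only one way to write that solution; the function name `take_branch` is hypothetical.

```python
def take_branch(branch: int, zero: int, invzero: int) -> int:
    """PC-source select: Branch AND (Zero, or inverted Zero when Invzero = 1).

    A 2-input mux choosing between Zero and NOT Zero under control of Invzero
    is equivalent to Zero XOR Invzero.
    """
    selected = zero ^ invzero     # mux: Invzero=0 -> Zero (beq), Invzero=1 -> NOT Zero (bne)
    return branch & selected      # existing AND gate


# beq: Invzero = 0, branch taken when the operands are equal (Zero = 1)
# bne: Invzero = 1, branch taken when the operands differ (Zero = 0)
```

When Branch = 0 the output is 0 for every value of Invzero, which is why Invzero can be treated as a don't-care for non-branch instructions.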
Homework:
1) (5.2) Describe the effect that a stuck-at-1 fault would have on the multiplexors in the single-cycle datapath of Figure 5.19. Which instructions, if any, would still work? Consider RegDst=1, ALUSrc=1, MemtoReg=1, and Zero=1.
2) (5.6) We wish to add the jump and link (jal) instruction to the single-cycle datapath. Add any necessary datapaths and control signals. You can just add these to Figure 5.19.
3) (5.15) We wish to add the instruction addi to the multicycle datapath. Add any necessary datapaths and update the finite state machine. (Figures 5.33 and 5.42)