Processor Report - ECE 174

California State University, Fresno
Final Project Report
Pipelined Processor
Author:
Abhijit Suprem
Instructor:
Dr. Tarek Elarabi
December 11, 2014
Contents
1 Statement of Objectives
1.1 Processor Design Objectives
2 Background
2.1 Verilog - Language and Workflow
2.2 Software Tools
2.3 Design Overview - Processor
2.4 MIPS Architecture
2.5 Pipelined Architecture
3 Pipelining
3.1 ALU
3.1.1 Logic and Shift Operations
3.1.2 Arithmetic Operations
4 Verilog Code for Processor Units
4.1 Instruction Fetch
4.2 Instruction Memory
4.3 Instruction Decode
4.4 Registers
4.5 Execute
4.6 Stack
4.7 Data Memory
4.8 Write Back
4.9 Processor
5 Testbench Procedures
5.1 Testbench for Processor
5.2 Programming the Processor
5.2.1 Arithmetic program
5.2.2 Load/Store Program
5.2.3 Bubble Sort
6 Conclusions
7 Appendix A
7.1 Instruction Decode
7.2 Stack
List of Figures
2.1 MIPS Pipeline block diagram
3.1 Pipelined processor block diagram
3.2 Format for arithmetic, branching, and data memory instructions
3.3 Prototypical module for logical and shift operations
5.1 Simulation of Arithmetic operations
5.2 Simulation of Load/Store operations
5.3 Sorting: Worst Case
5.4 Sorting: Realistic
1. Statement of Objectives
The objective of this final project is to conceptualize, design, and validate a pipelined processor based on
the MIPS architecture, but with simplifying modifications. An Arithmetic Logic Unit (ALU) with various
arithmetic and logical operations is designed in conjunction. The processor is built from the ground
up in Verilog HDL with the Quartus II development environment and simulated on the ModelSim-Altera RTL
simulation platform.
Included in this report are the following:
- Overview of the processor through a block diagram
- Detailed description of each of the processor stages and their implementation in Verilog
- Simulation results of the processor with a few programs:
  - Arithmetic program that performs operations
  - Basic Load/Store program demonstrating memory accesses
  - 8-number bubble sort combining these operations
- Discussion of future work and possible improvements
1.1 Processor Design Objectives
The processor is designed around 8-bit registers and data. The instructions are 32 bits long. They
are divided into four 8-bit components (see Table 1).
Table 1: Instruction Length
Bits     Name                 Function
8 bits   OPCODE               Operation code
8 bits   REG DEST, PC NEXT    Address of destination register, or address of next instruction (branching)
8 bits   REG 1, MEM           Address of 1st register or memory address
8 bits   REG 2, LITERAL       Address of 2nd register or literal value
The processor has the standard 5-stage pipeline with Instruction Fetch, Instruction Decode, Execute,
Data Memory Access, and Write Back. Each of these is discussed in Section 3. For the processor, the
instruction set is defined first. The instructions are categorized into four groups: arithmetic instructions
(23 instructions), branch instructions (4 instructions), memory access instructions (4 instructions), and no
operation instructions (1 instruction).
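As a minimal sketch of the instruction layout in Table 1, the following self-contained testbench splits one
32-bit word into its four 8-bit fields. The module name and signal names are illustrative assumptions and are
not taken from the processor source; the sample word 32'h01030403 is the one that appears in the example
program of the instruction memory listing.

```verilog
// Hypothetical testbench: names are illustrative, not from the processor source.
module instruction_fields_tb;
    reg  [31:0] instruction;
    // Four 8-bit fields, most significant byte first (Table 1)
    wire [7:0] opcode   = instruction[31:24]; // operation code
    wire [7:0] reg_dest = instruction[23:16]; // destination register or next PC
    wire [7:0] reg_1    = instruction[15:8];  // first register or memory address
    wire [7:0] reg_2    = instruction[7:0];   // second register or literal

    initial begin
        instruction = 32'h01030403;
        #1; // let the continuous assignments settle
        $display("opcode=%0d dest=%0d r1=%0d r2=%0d", opcode, reg_dest, reg_1, reg_2);
        $finish;
    end
endmodule
```

Run in a simulator, this should print opcode=1 dest=3 r1=4 r2=3, i.e. an instruction with opcode 1 that
targets register 3 and reads registers 4 and 3.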
2. Background
2.1 Verilog - Language and Workflow
Verilog is a standardized (IEEE 1364) Hardware Description Language used for modeling electronic systems.
Hardware Description Languages (HDLs) are used for various reasons, including readability, earlier
simulations, feasibility studies, and abstraction.
The Verilog design flow contains the following steps (summarized):
1. Specification
2. HDL Coding
3. Synthesis
4. Place and Route
5. Timing Analysis
6. FPGA implementation
This project includes the first 5 steps, plus ModelSim simulation in lieu of FPGA implementation.
Specification entails conceptualizing the problem and determining I/O pins and other requirements such as
timing constraints. HDL Coding produces a high-level RTL abstraction of the design that can then be synthesized.
Synthesis entails converting the HDL code into a bit-code that can be programmed into the Field-Programmable
Gate Array (FPGA). Place and Route involves selecting the actual location of placement on the FPGA
chip. The final, synthesized design can then be tested by software simulation through ModelSim or other
third-party tools. The testing phase may include Formal Verification and/or Assertion Verification in order
to validate the module's function.
2.2 Software Tools
The following tools were used for this project: (i) Quartus II and (ii) ModelSim-Altera. A brief description
of each follows.
Quartus II: Quartus II is produced by Altera Corporation. It is an FPGA platform for logic design,
synthesis, and implementation. It is used to program the modules and view the resultant RTL circuits,
Post-Map Netlist, and State Machine Diagram, if one exists. Quartus also provides various hardware synthesizers
to simulate real-world implementation. Quartus II also performs the synthesis functions by converting the
HDL code to FPGA bit-code.
ModelSim-Altera: ModelSim-Altera is the simulation tool for validating the synthesized designs from
Quartus II. Timing information from Quartus II is used to run the simulation. ModelSim can be used to run
standalone simulations as well.
2.3 Design Overview - Processor
The processor, also known as the central processing unit, is the circuit that carries out the arithmetic,
I/O, and control operations as specified by a user. The term has been used for over fifty years and yet
the definition has not changed. The processor today contains two basic units: an arithmetic logic unit to
perform operations and a control unit to fetch instructions and execute them. A processor can also contain
peripherals such as registers to temporarily store data for faster access, cache to store recently used data, a
data memory unit to store long-term data, and an I/O interface to communicate with human operators or the
environment and to deliver information.
2.4 MIPS Architecture
A processor is only useful with an instruction set: the set of instructions the processor can perform. A simple
architecture useful for rudimentary processor design is the MIPS architecture, a RISC architecture widely
used in academia. It has a simple instruction set, which has been further simplified and reduced for this
project. The MIPS architecture also takes advantage of pipelining. The generalized five-stage pipeline is
shown in Figure 2.1.
2.5 Pipelined Architecture
To understand a pipelined architecture, it is necessary to first examine a non-pipelined architecture. In such an
architecture, a processor executes one instruction each clock cycle. Thus, future instructions wait for the
current instruction to finish before they can begin. A processor can significantly increase efficiency
by splitting an instruction into several stages and executing each separately. If a processor's execution is
split into five stages, then in the ideal case, each stage needs 1/5 of the original clock cycle (in a
real scenario, different stages take different amounts of time and the clock is based on the slowest
stage). As such, the processor can begin the next instruction when the first stage of the current
instruction is completed. Table 2 shows an example. As seen, in the first clock cycle, the first stage of the first
instruction is completed. In the second clock cycle, the first stage of the second instruction and the second
stage of the first instruction are completed. In the third clock cycle, the first stage of the third instruction, the
second stage of the second instruction, and the third stage of the first instruction are completed.
Figure 2.1: MIPS Pipeline block diagram

Table 2: Pipelined Instruction
Instruction    Clock 1  Clock 2  Clock 3  Clock 4  Clock 5  Clock 6  Clock 7  Clock 8  Clock 9  Clock 10
Instruction 1  Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
Instruction 2           Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
Instruction 3                    Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
Instruction 4                             Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
Instruction 5                                      Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
Instruction 6                                               Stage 1  Stage 2  Stage 3  Stage 4  Stage 5

The latency remains the same (it in fact increases in a real-world scenario, as the slowest stage sets the
clock for all five stages), but the throughput increases. In ten clock cycles, a non-pipelined processor can
complete two instructions, whereas a pipelined processor can complete six.
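The cycle counts in this comparison follow from a simple count, sketched here under the idealized assumption
that every stage takes exactly one clock cycle and the pipeline never stalls. The symbols k (number of
stages), n (number of instructions), and T (total clock cycles) are introduced only for this illustration.

```latex
\begin{align*}
T_{\text{non-pipelined}}(n) &= n\,k, &
T_{\text{pipelined}}(n) &= k + (n - 1).
\end{align*}
With $k = 5$ stages and a budget of $10$ clock cycles:
\begin{align*}
n\,k \le 10 &\;\Rightarrow\; n \le 2, \\
k + (n - 1) \le 10 &\;\Rightarrow\; n \le 6.
\end{align*}
```

The first instruction still needs all k cycles in both cases; the gain comes from each later instruction
finishing only one cycle after its predecessor.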
3. Pipelining
The five stages of the MIPS pipeline are incorporated into this processor. The processor block diagram is
given in Figure 3.1. There are 8 main modules present: the five pipeline stage modules, an instruction
memory, a stack, and registers.
Following are the descriptions of each module and its functionality.
Instruction Fetch: This module contains a register with the current PC address. This address is
incremented each clock cycle. The module sends the address to the instruction memory unit and receives the
32-bit instruction. This instruction is sent to the Instruction Decode unit. The current address is also sent.
In the case of a branch instruction, the address can be stored in the stack. Note also the inputs from the
Execute unit. The Jump Enable is a flag that overwrites the existing PC address with the next address. It
is asynchronous, and once it occurs, the Instruction Fetch unit sends the new instructions to the Instruction
Decode unit.
Figure 3.1: Pipelined processor block diagram

Instruction Memory: This memory stores 32-bit instructions. As the PC address is 8 bits long, the
memory can hold 2^8 = 256 instructions (addresses 0 through 255). The memory unit is combinational and selects
the memory contents based on the input address.
Instruction Decode: This unit controls the Execute unit by providing the correct inputs based on the
received instruction. There are three instruction types: Arithmetic instructions, Branch/Jump instructions,
and Load/Store instructions. Each has a different instruction format (Figure 3.2).
Figure 3.2: Format for arithmetic, branching, and data memory instructions
For arithmetic operations involving two registers, data is retrieved from the register unit using the
addresses provided in the instruction. For arithmetic instructions involving a register and an immediate/literal,
only one value is retrieved from the registers. The other value is already present in the instruction. For Branch
instructions, the unit sets the Stack flag, indicating the branch type (branch with return, unconditional
jump, or return to an earlier instruction). For Load/Store instructions, the memory address contained in
the third section of the instruction (Figure 3.2) must not be modified. Further, the memory read/write flag
is set. The ALU opcode is also determined.
Table 3 shows the ISA for this processor as well as derived values. For each instruction, the Decode
unit decides which operation to trigger in the Execute stage. There are five operation flags: aluOP for the
ALU, readOP and writeOP for the memory, and jumpOP and branchOP for the stack (jump and branch
instructions).
Table 3: Pipelined Instruction
Opcode  Type        Instr.  Description               alu  read  write  branch  jump
0       NOP         NOP     nop                       0    0     0      0       0
1       Arithmetic  add     addition                  1
2       Arithmetic  addi    literal addition          1
3       Arithmetic  sub     subtraction               1
4       Arithmetic  subi    literal subtraction       1
5       Arithmetic  mult    multiplication            1
6       Arithmetic  power   power of 2                1
7       Arithmetic  slt     shift left                1
8       Arithmetic  slti    shift left by literal     1
9       Arithmetic  srt     shift right               1
10      Arithmetic  srti    shift right by literal    1
11      Arithmetic  and     bitwise and               1
12      Arithmetic  andi    literal bitwise and       1
13      Arithmetic  or      bitwise or                1
14      Arithmetic  ori     literal bitwise or        1
15      Arithmetic  not     bitwise not               1
16      Arithmetic  nor     bitwise nor               1
17      Arithmetic  nori    literal bitwise nor       1
18      Arithmetic  xor     bitwise xor               1
19      Arithmetic  xori    literal bitwise xor       1
20      Arithmetic  xnor    bitwise xnor              1
21      Arithmetic  xnori   literal bitwise xnor      1
22      Arithmetic  nand    bitwise nand              1
23      Arithmetic  nandi   literal bitwise nand      1
24      Branch      jump    jump to address           0
25      Branch      beq     branch (r1 = r2) (push)   1
26      Branch      bgt     branch (r1 > r2) (push)   1
27      Memory      load    load from mem to reg
28      Memory      loadi   load # into reg
29      Memory      store   store reg to mem
30      Memory      storei  store # in mem
31      Branch      ret     pop from stack
Apart from the values shown, eight entries in the read, write, branch, and jump columns are set to 1.
Registers: The Register unit returns memory contents upon a query. It receives queries from the Instruction
Decode unit and the Write Back unit. For the Write Back queries, the register contents are modified.
Execute: This unit contains the ALU and some extraneous blocks. It performs the arithmetic operations
for arithmetic instructions and comparison operations for the branch instructions. The Execute unit is also
the liaison for the Load/Store instructions; it passes on the data received from the Decode unit without
adjusting them.
Stack: The stack module either pushes the current PC address in case of a branch instruction or pops
the last address in case of a return instruction. This address is subsequently sent by the Execute unit back
to the Fetch unit, along with the jump flag set at HIGH.
Data Memory: For Load/Store instructions, the data memory either retrieves 8-bit data from an address
for the former or stores data in a specified address for the latter. The R/W/Pass flag input determines
whether the memory reads, writes, or simply passes the inputs to the Write Back stage without modification.
The pass operation is for instructions where the data memory is not required.
Write Back: The Write Back unit receives a register address, an 8-bit value for writing back, and
a write-enable flag. The flag determines whether anything should be written back to the registers. It is
necessary for store or branch instructions, where nothing need be written into a register.
3.1 ALU
The ALU performs comparison, arithmetic, shift, and logic operations. These operations were covered in
the previous report. A brief overview is given here.
3.1.1 Logic and Shift Operations
There are several logical operations in Boolean algebra: AND, OR, NOT, NAND, NOR, XOR, and XNOR. Shift
operations include left shift and right shift.
The prototypical module as synthesized by Quartus II is shown in Figure 3.3. Note the four ports that
are present in all logic, shift, and equality modules. The input ports in1 and in2 are each 8-bit wires. The module
itself contains a basic function block: either one of the logic blocks or a left/right shift block. There are two
outputs: the self-explanatory out and the done flag. This flag is set when the module has completed its
operation. This lets the processor know that the ALU's output is correct and that the processor can continue
with the pipeline.
Figure 3.3: Prototypical module for logical and shift operations

For the two shift operations, the left shift and the right shift, the module uses the << and the >>
operators. In a left shift, the number to be shifted is padded with zeros at the least significant bits. The
amount of padding depends on the second operand. For example, the binary number 00010110 shifted left by
3 yields 10110000.
If there are HIGH bits at the more significant positions, they may be lost during the shifting operation. Note
that 011010111 shifted left by 3 yields 011 010111000, where the leading sequence 011 is lost (i.e. the resultant
sequence is 010111000). The right shift follows the same premise, except the shifting direction is reversed.
There is also a power-of-2 operation that calculates, for each input n, the result 2^n. This is
essentially a left shift operation with one of the inputs hardwired as 1. Once this is shifted n times, the result
is the nth power of 2.
In addition to shift and in tandem with logical operations, the ALU has a parity (equality) checker and a comparator.
The former checks whether two inputs are the same. The latter checks whether Operand 1 is greater
than Operand 2. A Less-Than operation is not included, as this can be obtained by reversing
the inputs of the Greater-Than checker.
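The shift behavior described here can be checked with a short simulation. The testbench below is only a sketch:
its module and signal names are assumptions made for illustration and do not correspond to the ALU modules of
this project. It applies the << and >> operators to 8-bit operands and forms the power of 2 as a left shift of
the constant 1.

```verilog
// Hypothetical testbench; module and signal names are illustrative assumptions.
module shift_demo_tb;
    reg [7:0] in1, in2;
    reg [7:0] left, right, pow2;

    initial begin
        in1 = 8'b00010110;
        in2 = 8'd3;
        left  = in1 << in2;   // zeros enter at the least significant bits
        right = in1 >> in2;   // zeros enter at the most significant bits
        pow2  = 8'd1 << in2;  // power of 2: a left shift of the constant 1
        $display("left=%b right=%b pow2=%0d", left, right, pow2);

        in1  = 8'b11010111;
        left = in1 << in2;    // the three most significant bits are shifted out
        $display("left=%b", left);
        $finish;
    end
endmodule
```

The first line of output should read left=10110000 right=00000010 pow2=8. In the second case the HIGH bits at
the top of 11010111 are lost and the 8-bit result is 10111000.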
3.1.2 Arithmetic Operations
There are four basic arithmetic operations in any ALU: addition, subtraction, multiplication, and division.
There are of course more complex operations such as exponentials, logarithms, and roots. These
are implemented through one of the following methods, in order of most expensive to least expensive (and thus
fastest to slowest):
1. Dedicated hardware to compute operation results. This hardware may involve lookup tables and other
hardware-specific optimizations. Often the calculation is completed in a single clock cycle.
2. Pipelined dedicated hardware. While only slightly slower than dedicated single-cycle hardware, it is a
cheaper and more practical alternative.
3. Software approach using extant operations. In this method, there is no dedicated hardware and
programmers must develop algorithms to calculate the operation.
For this ALU, addition, subtraction, and multiplication are covered. Division is omitted due to its inherent
complexity.
Binary addition follows decimal addition, except with only two possible digits. There may be overflow
bits, but these are not important for the ALU in this project, though commercial ALUs do contain an extra
output bit for carries or overflows.
    00011011
  + 00100111
  ----------
    01000010        (3.1)
The simplest method to perform binary subtraction is through 2's complement. The operation is as follows,
where the parenthesized term is the 2's complement of B (B inverted, plus 1):
$A - B = A + (\overline{B} + 1)$        (3.2)
So, computing 00010101 minus 00001111 ($21_{10} - 15_{10}$) requires the 2's complement of the second
operand. 00001111 inverted is 11110000. Adding 1 yields 11110001. Finally, addition yields:
    00010101
  + 11110001
  ----------
  1 00000110        (3.3)
The leading carry bit is not required. The result is 00000110, which is $6_{10}$. So subtraction can be
performed with an adder.
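The worked example can be reproduced in simulation. The following testbench is a sketch with assumed names,
not the subtractor used in this project's ALU; it forms the 2's complement of the second operand and adds it
to the first, keeping only 8 bits so that the carry out is dropped.

```verilog
// Hypothetical testbench; module and signal names are illustrative assumptions.
module twos_complement_sub_tb;
    reg [7:0] a, b;
    reg [7:0] b_neg;   // 2's complement of b
    reg [7:0] diff;    // 8-bit result; the carry out is discarded

    initial begin
        a = 8'b00010101;      // 21
        b = 8'b00001111;      // 15
        b_neg = ~b + 8'd1;    // invert, then add 1
        diff  = a + b_neg;    // the addition performs the subtraction
        $display("b_neg=%b diff=%b (%0d)", b_neg, diff, diff);
        // Cross-check against the built-in subtraction operator
        if (diff !== a - b) $display("MISMATCH");
        $finish;
    end
endmodule
```

The expected output is b_neg=11110001 diff=00000110 (6), matching Equation 3.3 with the carry bit removed.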
Binary multiplication takes a form similar to decimal multiplication. There are, of course, various
optimized methods; however, the approach used for the ALU is a shift-add multiplier that uses the algorithm
given in Section 3.3 of [1]. In short, for each HIGH bit of the multiplicand, the multiplier is shifted by the
position of that multiplicand bit and added to the product.
For this ALU, the multiplier and multiplicand are both 8 bits. The result is a 16-bit output truncated to
8 bits to maintain a standardized output width across all operations. The multiplier operates by shifting for each
bit position n and adding to the product if the nth bit is HIGH.
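A shift-add multiplier of this kind might be sketched as below. This is not the ALU's actual multiplier: the
module names, the purely combinational style, and the choice to keep the low 8 bits when truncating are all
assumptions made for illustration.

```verilog
// Hypothetical sketch; names and the combinational style are assumptions.
module shift_add_mult(
    input  wire [7:0] multiplicand,
    input  wire [7:0] multiplier,
    output reg  [7:0] product
);
    reg [15:0] acc;   // full 16-bit product before truncation
    integer n;

    always @(*) begin
        acc = 16'd0;
        for (n = 0; n < 8; n = n + 1) begin
            // For each HIGH bit n of the multiplicand, add the multiplier shifted by n
            if (multiplicand[n])
                acc = acc + ({8'd0, multiplier} << n);
        end
        product = acc[7:0]; // assumption: truncation keeps the low 8 bits
    end
endmodule

module shift_add_mult_tb;
    reg  [7:0] a, b;
    wire [7:0] p;

    shift_add_mult dut(.multiplicand(a), .multiplier(b), .product(p));

    initial begin
        #1;                       // apply inputs after time 0 to avoid a start-up race
        a = 8'd13; b = 8'd11;
        #1 $display("%0d * %0d = %0d", a, b, p);
        a = 8'd20; b = 8'd20;     // 400 does not fit in 8 bits
        #1 $display("%0d * %0d = %0d (low 8 bits of 400)", a, b, p);
        $finish;
    end
endmodule
```

The first product, 143, fits in 8 bits. The second shows the effect of truncation: 20 * 20 = 400, of which only
the low 8 bits (144) remain.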
Division: Binary division is more difficult and complex than binary multiplication. In addition to the
quotient, additional hardware is needed for the remainder. The algorithm is similar to the
multiplier; the differences are: (i) division also involves left shifts when the subtraction prediction is incorrect, and
(ii) the quotient and remainder are in the same register, but in different locations, i.e. the remainder is HI and
the quotient is LO.
4. Verilog Code for Processor Units
In this section, the Verilog code for each component is elaborated.
4.1 Instruction Fetch
The Verilog code for the Instruction Fetch module is provided in Listing 1.
Listing 1: Verilog code for Instruction Fetch
 1 // Fetch the instruction from Instruction Memory based on current address
 2 module InstructionFetch(clk, pcIns, insReceive, insOut, pcOut, jIn, nextPc);
 3     // nextPC is next instruction from execute; contains a value when
 4     // jump instruction in pipeline. It is read when the jIn jump flag is set
 5     input [7:0] nextPc;
 6     input [31:0] insReceive;
 7     input jIn, clk;
 8     // Outputs are initialized to 0 and are the insOut (instruction),
 9     // pcIns (address for instruction memory), and pcOut (current address)
10     output reg [31:0] insOut = 0;
11     output reg [7:0] pcIns = 0;
12     output reg [7:0] pcOut = 0;
13     // internal memory for current address; incremented each clock
14     reg [7:0] pcNow;
15     // initialize to 0
16     initial begin
17         pcNow = 0;
18     end
19
20     // at clock change or when the instruction is received from memory or
21     // when the current address increments
22     always @(insReceive, pcNow, clk) begin
23         // output the received instruction and the current instruction address
24         insOut <= insReceive;
25         pcOut <= pcNow;
26     end
27
28     // at positive edge of clock or jump flag
29     always @(posedge clk, posedge jIn) begin
30         // if flag is set then reset the current address in internal
31         // memory; this triggers the instr. mem to send a new instruction back
32         if (jIn) begin
33             pcNow <= nextPc;
34             pcIns <= nextPc;
35         end
36         // if flag not set then the current address is incremented
37         else begin
38             pcNow <= pcNow+1;
39             pcIns <= pcNow+1;
40         end
41     end
42
43 endmodule
The IF module sends the current address to the instruction memory and when it receives the instruction,
it sends this instruction out to the Instruction Decode unit (Line 24). It increments its internal address and
stores it across clock cycles (Line 38). However, when a jump flag comes with a new address, the internal
address is replaced (Line 33).
4.2 Instruction Memory
The Verilog code for the Instruction Memory module (titled ProgramCounter) is provided in Listing 2.
Listing 2: Verilog code for Instruction Memory
 1 // Program Counter contains the instructions
 2 module ProgramCounter(pcIn, InsOut);
 3     // inputs include the received address
 4     // output is the instruction
 5     input wire [7:0] pcIn;
 6     output [31:0] InsOut;
 7     // this is the internal memory
 8     // it is a 32-bit, 256-element array
 9     reg [31:0] pcMem [0:255];
10     // counter for memory initialization
11     integer i;
12     // load the memory with blanks, and then add in the instructions
13     initial begin
14         for (i=0; i<256; i=i+1) begin  // plain 256: the literal 8'd256 would truncate to 0
15             pcMem[i] = 32'h00000000;
16         end
17
18         /*
19         Programs go here:
20         pcMem[0] = 32'h1c010502;
21         pcMem[1] = 32'h1b343001;
22         pcMem[2] = 32'h01030403;
23         */
24     end
25     // assign statement is combinational; output is continuous
26     assign InsOut = pcMem[pcIn];
27 endmodule
The memory contains 256 elements (Line 9). The unit initializes with zeros for all instructions; in effect,
NOPs for all instructions. The actual instructions are then added in Lines 19-22. The instructions are in 32-bit
hex format, as per the assembly code specifications. Finally, the input is constantly evaluated and so the
output is combinational.
4.3 Instruction Decode
The Verilog code for the Instruction Decode module is provided in Listing 3.
Listing 3: Verilog code for Instruction Decode
 1 // Decoder sets flags for the Execute and the ALU opcode
 2 module InstructionDecode(clk, insIn, pcIn, pcOut, insOut, opcode, r2Out, r1Out,
 3 rDestOut, r1RegGet, r1Received, r2RegGet, r2Received,
 4 aluOp, readOp, branchOp, jumpOp, writeOp, wb, aluOpcode, stackOp);
 5     // Inputs are the instruction from the Fetch unit
 6     input clk;
 7     input [31:0] insIn;
 8     input [7:0] pcIn;
 9     // Outputs are the flags for the execute stage as well
10     // as register addresses for register retrieve
11     // and the data retrieved from the register itself
12     output reg [7:0] pcOut, opcode, r1Out, r2Out, rDestOut, r1RegGet, r2RegGet;
13     input [7:0] r1Received, r2Received;
14     output reg aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp;
15     output reg [3:0] aluOpcode;
16     output reg [31:0] insOut;
17     // temporary memory units to store the literal value
18     // and the opcode
19     reg [7:0] r1Literal, r2Literal, opcodeReg;
20
21 always @(posedge clk) begin
22     // set up the different variables for output, i.e.
23     // identify the literal values, the addresses
24     // and the opcode
25     insOut <= insIn;
26
27     r1RegGet <= insIn[15:8];
28     r2RegGet <= insIn[7:0];
29
30     r1Literal <= insIn[15:8];
31     r2Literal <= insIn[7:0];
32
33     rDestOut <= insIn[23:16];
34     pcOut <= pcIn;
35
36     // Determine the opcode
37     opcodeReg <= insIn[31:24];
38     opcode <= opcodeReg;
39 end
40
41 // Whenever these temp regs change, then set the flags
42 // for the execute stage
43 always @(opcodeReg or r1Received or r2Received or pcIn) begin
44     stackOp <= 0;
45     case (opcodeReg)
46         8'd0: begin //NOP
47             /* FLAG SET */
48         end
49         8'd1: begin //Add
50             /* FLAG SET */
51         end
52         /* ... */
53     endcase
54 end
55 endmodule
The full Instruction Decode code is provided in Appendix A due to its length; a prototypical version is
provided in this Listing. The Decoder first separates the instruction into its component parts: the
opcode (Line 37), the register addresses or memory addresses (Lines 27-31), and the destination register or
the next instruction address, depending upon the instruction opcode (Line 33).
From Line 43, the flags (aluOP, writeOP, readOP, jumpOP, and branchOP) are set according to Table 3.
4.4 Registers
The Verilog code for the Register module is provided in Listing 4.

Listing 4: Verilog code for Registers
 1 // This module contains the register memory for the processor
 2 module Registers(r1Read, r2Read, wbRead, r1Send, r2Send, wbLiteral, wbFlag);
 3     // There are 5 inputs and two outputs
 4     // The requested r1 and r2 from the instruction
 5     // The write back flag, and the value to be stored there
 6     input [7:0] r1Read, r2Read, wbRead, wbLiteral;
 7     input wbFlag;
 8     output reg [7:0] r1Send, r2Send;
 9     // the register internal memory
10     reg [7:0] regMem [0:255];
11     // When any input changes, then set the outputs to read the register
12     // in the case of a write back request (Line 16), write into register
13     always @(*) begin
14         r1Send = regMem[r1Read];
15         r2Send = regMem[r2Read];
16         if (wbFlag) begin
17             regMem[wbRead] = wbLiteral;
18         end
19     end
20 endmodule
The registers return the values requested by the Instruction Decode unit; the decoder can then decide whether
these values are actually necessary. They are necessary for arithmetic or branch instructions, where variables
from memory are required; however, for return or jump, they are not. For a write back request,
i.e. a query from the Write Back module, the module writes into the registers (Line 17).
4.5 Execute
The Verilog code for the Execute unit is provided in Listing 5.

Listing 5: Verilog code for Execute Stage
// Prototypical Execute module; runs the CPU
module Execute(clk, aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp, aluOpcode,
insIn, pcIn, opcode, r1Val, r2Val, rDest, r3Dest, result, jumpOut,
pcNext, rwPass, stack, memAccess, wbOut, pcNow);
// inputs are the flags from the Instruction decode as well as the
// register values
input clk;
input aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp;
input [3:0] aluOpcode;
input [31:0] insIn;
input [7:0] pcIn, opcode, r1Val, r2Val, rDest;
// Outputs are the result and a few flags:
output reg [7:0] r3Dest, result;
// These flags are for jumping; sent to fetch unit
// rwPass is for memory unit
// memAccess is memory address for load/store
output reg jumpOut, wbOut;
output reg [7:0] pcNext, pcNow;
output reg [1:0] rwPass, stack;
output reg [7:0] memAccess;
wire aluDone;
wire [7:0] aluResult;
// ALU is instantiated
aluMain alu1(.in1(r1Val), .in2(r2Val), .out(aluResult), .done(aluDone),
    .selector(aluOpcode));

    // Default
    r3Dest <= rDest;   // Destination register
    result <= 0;       // Result of execute (for wb, load, store)
    jumpOut <= 0;      // jump flag from execute
    pcNext <= 0;       // next PC for jumping
    rwPass <= 0;       // whether read, write, or pass by
    memAccess <= 0;    // Address for memory access
    stack <= 0;        // STACK NOP
    wbOut <= wb;
    pcNow <= pcIn;
    if (wb) begin
        if (readOp) begin
            // Load from memory into destination register
            memAccess <= r1Val;  // Memory address
            rwPass <= 2'b01;     // Read flag
        end
        else if (aluOp) begin
            // all alu instructions; result is sent to register for writing
            result <= aluResult; // Result of add ... and
            rwPass <= 2'b00;     // pass flag
        end
        else begin
            // load immediate
            result <= r2Val;     // take immediate value for rDest
            rwPass <= 2'b00;     // pass flag
        end
    end
    else begin
        if (jumpOp) begin
            if (stackOp) begin
                // unconditional jump; next PC enabled, and jump enabled
                pcNext <= rDest; // get the address of next pc value
                jumpOut <= 1;    // enable the jump flag for IF
            end
            else begin
                // return to previous value in stack; pass on last stack val
                jumpOut <= 1;    // set jump flag for IF
                stack <= 2'b01;  // pop stack val and send to IF
            end
        end
        else if (writeOp) begin
            // store or store immediate from reg, lit into memory
            rwPass <= 2'b10;     // Write flag
            result <= r2Val;     // Store from this, either imme or reg
            memAccess <= r1Val;  // Memory address
        end
        else if (branchOp) begin
            // either branch on equal, greater than; store address in stack
            if (aluResult == 1) begin
                jumpOut <= 1;
                pcNext <= rDest;
                stack <= 2'b10;  // push current pc into stack
            end
            else begin
                // no jump here, or nothing
            end
        end
        else begin
            // Nop
            // everything is already set up for nop
end
13
89
end
90 end
91 endmodule
The Execute stage determines which downstream units to enable (the Data Memory, Write Back, or Stack)
based on the flags received from the decode unit.
4.6 Stack
The Verilog code for the Stack unit is provided in Listing 14. It is long and so is located in Appendix A.
The Stack is controlled by the Execute unit. The current instruction address is pushed into the stack
when the instruction is a branch (see the code for the Execute unit). When the instruction is a return, the
last-in address is popped. For a jump, the stack is ignored.
4.7 Data Memory
The Verilog code for the Data Memory unit is provided in Listing 6.
Listing 6: Verilog code for Data Memory
 1 // Data Memory simulates main memory and cache
 2 module DataMemory(clk, memAccess, r3Dest, rwPass, result, wbCheck,
 3                   rDestOut, literal, writeFlag);
 4   // memAccess is memory address; other inputs are the final register
 5   // if the write back
 6   input [7:0] memAccess, r3Dest, result;
 7   input wbCheck, clk;
 8   input [1:0] rwPass;
 9   // writeFlag is for write back
10   output reg [7:0] rDestOut, literal;
11   output reg writeFlag;
12
13   reg [7:0] MEMORY [0:255];
14
15   initial begin
16     /* MEM Initialization */
17   end
18
19   always @(posedge clk) begin
20     writeFlag <= wbCheck;
21     rDestOut <= r3Dest;
22     literal <= 0;
23     case (rwPass)
24       2'b00: begin // passby
25         literal <= result;
26       end
27       2'b01: begin // read from mem
28         literal <= MEMORY[memAccess];
29       end
30       2'b10: begin // write
31         MEMORY[memAccess] <= result;
32       end
33       2'b11: begin // passby
34         literal <= result;
35       end
36     endcase
37   end
38
39 endmodule
The Data Memory contains an internal memory array similar to the register file. For load instructions, Line 28 is
triggered. For store, where memory must be written, Line 31 is triggered. If the Data Memory is not used,
the data is passed through with either Line 25 or Line 34.
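The three paths described above (load, store, pass-through) can be exercised with a small self-checking simulation. The following is a minimal sketch, not the project's own test: dataMemSketch is a trimmed, hypothetical stand-in for the Data Memory that keeps only the rwPass encoding, and it assumes the memory is clocked on the positive edge.

```verilog
// Trimmed stand-in for the Data Memory (hypothetical name, same rwPass encoding).
module dataMemSketch(clk, memAccess, rwPass, result, literal);
  input clk;
  input [7:0] memAccess, result;
  input [1:0] rwPass;
  output reg [7:0] literal;
  reg [7:0] MEMORY [0:255];

  always @(posedge clk) begin
    literal <= 0;
    case (rwPass)
      2'b01:   literal <= MEMORY[memAccess];   // load
      2'b10:   MEMORY[memAccess] <= result;    // store
      default: literal <= result;              // 2'b00 and 2'b11: pass through
    endcase
  end
endmodule

// Illustrative testbench: store 42 at address 3, load it back, then pass 7 through.
module dataMemSketch_tb;
  reg clk = 0;
  reg [7:0] memAccess, result;
  reg [1:0] rwPass;
  wire [7:0] literal;

  dataMemSketch dut(.clk(clk), .memAccess(memAccess), .rwPass(rwPass),
                    .result(result), .literal(literal));

  always #10 clk = ~clk;

  initial begin
    memAccess = 8'd3; result = 8'd42; rwPass = 2'b10;   // store
    @(posedge clk); #1;
    rwPass = 2'b01; result = 8'd0;                      // load from the same address
    @(posedge clk); #1;
    $display("load: literal = %0d", literal);           // expected 42
    rwPass = 2'b00; result = 8'd7;                      // pass through
    @(posedge clk); #1;
    $display("pass: literal = %0d", literal);           // expected 7
    $finish;
  end
endmodule
```

Inputs are changed one time unit after each clock edge so that the testbench never races the clocked block inside the memory model.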
4.8 Write Back
The Verilog code for the Write Back stage is provided in Listing 7.
Listing 7: Verilog code for Write Back Stage
// Write back into registers
module WriteBack(clk, RDest, Literal, wFlag, WBreg, RegAddress, RegLit);
  // controlled by propagated write back flag from Execute and passes
  // through Data Memory stage
  input clk, wFlag;
  input [7:0] RDest, Literal;
  // Address and value for register storage
  output reg [7:0] RegAddress, RegLit;
  output reg WBreg;

  // At each clock, send the data to register for writing
  always @(posedge clk) begin
    WBreg <= wFlag;
    RegAddress <= RDest;
    RegLit <= Literal;
  end

endmodule
The Write Back stage takes the write-back flag, which originated in the Execute stage and was propagated
through the Data Memory, and uses it to determine the register operation. The values are simply passed
through to the Registers at a positive clock edge.
4.9 Processor
The Verilog code for the top-level processor module is provided in Listing 8.
Listing 8: Verilog code for Processor
// Main instantiation unit for the processor
module processor(clk);
  input clk;
  // All wires declared here: each wire connects modules
  wire [7:0] ifToPCins;
  wire [7:0] ifPCout;

  wire [31:0] PCtoIFins;
  wire [31:0] IFinsOut;
  wire jIn;

  wire [7:0] RtoIDr1Received, RtoIDr2Received;
  wire [7:0] IDtoRr1RegGet, IDtoRr2RegGet;

  wire IDaluOp, IDreadOp, IDbranchOp, IDjumpOp, IDwriteOp, IDwb, stackOpID;
  wire [3:0] IDaluOpcode;
  wire [31:0] IDinsOut;
  wire [7:0] IDpcOut, IDopcode, IDr1Out, IDr2Out, IDrDestOut;

  wire [7:0] nextPc;
  wire [7:0] currentPCInfo;
  wire [1:0] stackWire;
  wire [7:0] EpcOut;
  wire [7:0] ErDest, Eresult, EMem;
  wire [1:0] ErwPass;
  wire wbOut;

  wire [7:0] MdestOut, Mliteral;
  wire MFlag;

  wire [7:0] wbRead, wbLiteral;
  wire wbFlag;

  ProgramCounter PC1(
    .pcIn(ifToPCins),
    .InsOut(PCtoIFins));

  InstructionFetch IF1(
    .nextPc(nextPc),
    .insReceive(PCtoIFins),
    .insOut(IFinsOut),
    .pcIns(ifToPCins),
    .pcOut(ifPCout),
    .jIn(jIn),
    .clk(clk));

  InstructionDecode ID1(.clk(clk), .insIn(IFinsOut), .pcIn(ifPCout), .pcOut(IDpcOut),
    .insOut(IDinsOut), .opcode(IDopcode), .r2Out(IDr2Out), .r1Out(IDr1Out),
    .rDestOut(IDrDestOut), .r1RegGet(IDtoRr1RegGet), .r1Received(RtoIDr1Received),
    .r2RegGet(IDtoRr2RegGet), .r2Received(RtoIDr2Received), .aluOp(IDaluOp),
    .readOp(IDreadOp), .branchOp(IDbranchOp), .jumpOp(IDjumpOp), .writeOp(IDwriteOp),
    .wb(IDwb), .aluOpcode(IDaluOpcode), .stackOp(stackOpID));

  Registers R1(.r1Read(IDtoRr1RegGet), .r2Read(IDtoRr2RegGet), .wbRead(wbRead),
    .r1Send(RtoIDr1Received), .r2Send(RtoIDr2Received), .wbLiteral(wbLiteral),
    .wbFlag(wbFlag));

  Execute E1(.clk(clk), .aluOp(IDaluOp), .readOp(IDreadOp), .branchOp(IDbranchOp),
    .jumpOp(IDjumpOp), .writeOp(IDwriteOp), .wb(IDwb), .stackOp(stackOpID),
    .aluOpcode(IDaluOpcode), .insIn(IDinsOut), .pcIn(IDpcOut), .opcode(IDopcode),
    .r1Val(IDr1Out), .r2Val(IDr2Out), .rDest(IDrDestOut), .r3Dest(ErDest),
    .result(Eresult), .jumpOut(jIn), .pcNext(EpcOut), .rwPass(ErwPass),
    .stack(stackWire), .memAccess(EMem), .wbOut(wbOut), .pcNow(currentPCInfo));

  Stack S1(.clk(clk), .stackOp(stackWire), .pcNext(EpcOut), .pcOut(nextPc),
    .pcNow(currentPCInfo));

  DataMemory D1(.clk(clk), .memAccess(EMem), .r3Dest(ErDest), .rwPass(ErwPass),
    .result(Eresult), .wbCheck(wbOut), .rDestOut(MdestOut), .literal(Mliteral),
    .writeFlag(MFlag));

  WriteBack W1(.clk(clk), .RDest(MdestOut), .Literal(Mliteral), .wFlag(MFlag),
    .WBreg(wbFlag), .RegAddress(wbRead), .RegLit(wbLiteral));

endmodule
Analyzing this code line by line would be difficult and tedious. It is enough to state that this instantiation
module follows the block diagram of the processor given in Figure 3.1. Each of the units has been instantiated
and wired as the block diagram shows. The processor's only input is the clock. The operations are
self-contained.
5 Testbench Procedures
5.1 Testbench for Processor
The testbench for the processor is provided in Listing 9.

Listing 9: Verilog code for Processor testbench
 1 `timescale 1ns/1ps
 2 module processorTB();
 3   reg clk = 0;
 4
 5   processor U0(.clk(clk));
 6
 7   initial begin
 8     clk = 0;
 9     #80;
10     forever clk = #20 ~clk;
11   end
12
13   initial begin
14     #30000;
15     $stop;
16   end
17
18 endmodule
It is a simple module. There is the instantiation and the clock setting (Line 10). The testbench itself
consists of a delay before the end of the simulation (Line 14). This delay is the computation time - during
this time the processor is computing the results.
5.2 Programming the Processor
The processor is programmed with the ISA provided in Table 3. The three programs discussed are:
Arithmetic program that performs operations
Basic Load/Store program demonstrating memory accesses
8-number bubble sort combining these operations
5.2.1 Arithmetic program
This program performs several arithmetic operations and demonstrates the pipelined architecture of the
processor. The assembly pseudocode is as follows:
Listing 10: Assembly code for Arithmetic Instructions
 1 loadi $01, $0, $5
 2 loadi $02, $0, $10
 3
 4 nop
 5 nop
 6 nop
 7
 8 add $03, $01, $02
 9 sub $04, $02, $01
10 xor $05, $01, $02
11
12 nop
13
14 add $07, $01, $03
15
16 jump $11, $0, $01
17
18 nop
19 nop
20 nop
The two load instructions store the numbers 5 and 10 into registers 1 and 2, respectively. The following
NOPs (Lines 4-6) allow the pipeline to finish this storage.
Next are some arithmetic operations. The format for the instruction is:
INSTR  REG_DEST  R1  R2    (5.1)
where REG_DEST = R1 OPERATOR R2. OPERATOR is determined by the instruction, i.e. addition, subtraction, etc. Note that the NOP at Line 12 is present because the next instruction, the add at Line 14, would otherwise
cause a data hazard: it uses register 3, which is assigned in Line 8. The NOP allows the pipeline to finish writing register 3.
The jump at Line 16 is a loop returning to the same address continuously.
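The hazard condition described above (an instruction reading a register that an earlier, still in-flight instruction has yet to write) can be sketched as a combinational check. This is an illustrative assumption rather than part of the report's design, which resolves hazards with NOPs only. The module name rawHazardCheck and all of its ports are hypothetical, and the sketch assumes, as the three-instruction spacing in the listing suggests, that up to three later stages may hold a pending write. It also ignores that immediate operands are not register reads.

```verilog
// Hypothetical read-after-write hazard check; not part of the report's processor.
module rawHazardCheck(
  input  [7:0] src1, src2,               // source registers of the instruction in decode
  input  [7:0] exDest, memDest, wbDest,  // destination registers of the three older instructions
  input        exWb, memWb, wbWb,        // write-back flags of those instructions
  output       hazard
);
  // A hazard exists if any older instruction will write a register we are about to read.
  assign hazard =
      (exWb  && (exDest  == src1 || exDest  == src2)) ||
      (memWb && (memDest == src1 || memDest == src2)) ||
      (wbWb  && (wbDest  == src1 || wbDest  == src2));
endmodule

module rawHazardCheck_tb;
  reg [7:0] src1, src2, exDest, memDest, wbDest;
  reg exWb, memWb, wbWb;
  wire hazard;

  rawHazardCheck dut(.src1(src1), .src2(src2),
                     .exDest(exDest), .memDest(memDest), .wbDest(wbDest),
                     .exWb(exWb), .memWb(memWb), .wbWb(wbWb), .hazard(hazard));

  initial begin
    // add $07, $01, $03 in decode directly after xor, sub, add $03 (no NOP)
    src1 = 8'd1; src2 = 8'd3;
    exDest = 8'd5;  exWb = 1;    // xor $05
    memDest = 8'd4; memWb = 1;   // sub $04
    wbDest = 8'd3;  wbWb = 1;    // add $03 has not been written yet
    #1 $display("without NOP: hazard = %b", hazard);   // expected 1

    // same instruction with one NOP in front of it: add $03 has left the pipeline
    exDest = 8'd0;  exWb = 0;    // nop
    memDest = 8'd5; memWb = 1;   // xor $05
    wbDest = 8'd4;  wbWb = 1;    // sub $04
    #1 $display("with NOP:    hazard = %b", hazard);   // expected 0
    $finish;
  end
endmodule
```

A signal like this could drive automatic stalling in place of hand-placed NOPs, which is a design choice the report does not make.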
Figure 5.1: Simulation of Arithmetic operations

This was simulated in ModelSim. The results of the simulation are displayed, showing the contents of the
registers in Figure 5.1. The figure is annotated. Note that before the last addition, there is a period when
no instruction is performed. This is the NOP that waits until register 3 is loaded with the answer from
the first addition.
5.2.2 Load/Store Program
Listing 11 shows the load/store program.

Listing 11: Assembly code for Load/Store Instructions
 1 loadi $5, $0, $1
 2 loadi $6, $0, $2
 3 loadi $7, $0, $3
 4 loadi $8, $0, $4
 5
 6 nop
 7 nop
 8 nop
 9
10 store $0, $1, $5
11 store $0, $2, $6
12 storei $0, $3, $1
13 storei $0, $4, $2
14
15 nop
16 nop
17 nop
18
19 load $9, $3, $0
20 load $10, $4, $0
21
22 jump $17, $0, $0
The first four instructions in Lines 1-4 load into registers 5, 6, 7, and 8 the numbers 1, 2, 3, and
4, respectively. The $0 between the two operands designating the destination register and the number is
unused.
The next three NOPs (Lines 6-8) wait for the numbers to be loaded. Then, store stores into memory
address 1 and 2 the information in registers 5 and 6 (i.e. the numbers 1 and 2). Then, storei stores the
numbers 1 and 2 into memory addresses 3 and 4.
Finally, the two load instructions in Lines 19-20 load from the data memory addresses 3 and 4 (i.e. the
numbers in 3 and 4, which are 1 and 2) into the registers 9 and 10.
This is shown in Figure 5.2.
Figure 5.2: Simulation of Load/Store operations
5.2.3 Bubble Sort
Bubble Sort is accomplished by swapping adjacent numbers as one iterates through a list. It is very inefficient but
serves as an example for this processor. The bubble sort pseudocode is given in Listing 12.
Listing 12: Pseudo code for Bubble Sort
 1 Load MEM {1, 2, ..., 8} > REG {1, 2, ..., 8}
 2
 3 Set Counter = 0
 4
 5 SwapCheck:
 6   If Reg1 > Reg2
 7     Goto Swap(Reg1, Reg2)
 8   If Reg2 > Reg3
 9   .
10   .
11   .
12   If Reg7 > Reg8
13
14
15
16
17   If Counter = 0
18     Goto Final
19   Counter = 0;
20   Goto SwapCheck
21
22
23 Swap:
24   Reg10 = Reg1
25   Reg1 = Reg2
26   Reg2 = Reg10
27   Counter = Counter + 1
28   Return
29
30 Final:
31   Store REG {1, 2, ..., 8} > MEM {1, 2, ..., 8}
32 End
Line 1 of the pseudocode loads the set of numbers from memory into registers, which allows
faster access for swap operations. The Counter variable at Line 3 detects whether the sort is finished: during each
iteration, the counter increments for every swap. In the last iteration, when no swaps are needed,
Counter remains 0 and indicates that sorting is complete. Lines 5-20 are the
SwapCheck procedure. For each pair of adjacent registers, SwapCheck determines whether they need to be swapped to
obtain the correct order. If swapping is necessary, the Swap procedure at Lines 23-28 is called.
In the Swap procedure, the two registers are swapped and Counter is incremented to indicate that a swap
occurred. The Swap procedure then returns to the SwapCheck procedure.
Back in the SwapCheck procedure, at Line 17, the Counter variable is checked. If it is not 0, it is reset in
Line 19 and the sorting pass begins again. If it is 0, meaning sorting is complete, the Final procedure at
Lines 30-32 is called and stores the sorted list back in memory.
This is shown in Figure 5.3 and Figure 5.4. In the first, the unordered set is the worst case scenario:
90, 80, 70, 60, 50, 40, 30, 20    (5.2)
In the second figure (Figure 5.4), the unordered set is more realistic:
133435621126590    (5.3)
So, as can be seen, the unordered set has been ordered and restored to memory in both cases.
Note that the sorting in Figure 5.4 completed faster than the worst-case sorting, as expected. While
Bubble Sort is inefficient, this program demonstrates the capabilities of the processor.
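The control flow of the pseudocode can also be checked with a short behavioral model. The following is a simulation-only sketch and not the assembly program that ran on the processor: the module name bubbleSortSketch, the regs array, and tmp (standing in for Reg10) are illustrative names, and the input is the worst-case set from (5.2).

```verilog
// Behavioral, simulation-only model of the swap-counter bubble sort (illustrative names).
module bubbleSortSketch;
  reg [7:0] regs [1:8];   // stands in for registers 1..8
  reg [7:0] tmp;          // plays the role of Reg10 in the Swap procedure
  integer counter, i;

  initial begin
    // Load: worst-case input, in descending order
    regs[1] = 8'd90; regs[2] = 8'd80; regs[3] = 8'd70; regs[4] = 8'd60;
    regs[5] = 8'd50; regs[6] = 8'd40; regs[7] = 8'd30; regs[8] = 8'd20;

    counter = 1;                              // force at least one pass
    while (counter != 0) begin
      counter = 0;                            // reset before each SwapCheck pass
      for (i = 1; i < 8; i = i + 1) begin     // SwapCheck over adjacent pairs
        if (regs[i] > regs[i+1]) begin        // Swap
          tmp       = regs[i];
          regs[i]   = regs[i+1];
          regs[i+1] = tmp;
          counter   = counter + 1;
        end
      end
    end

    // Final: print the sorted values, REG1 = 20 up to REG8 = 90
    for (i = 1; i <= 8; i = i + 1)
      $display("REG%0d = %0d", i, regs[i]);
    $finish;
  end
endmodule
```

The loop ends on the first pass that performs no swap, which mirrors the Counter = 0 test that sends the pseudocode to Final.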
Figure 5.3: Sorting: Worst Case
Figure 5.4: Sorting: Realistic
6 Conclusions
In this project, a MIPS-based pipelined processor was built. The processor's pipelined architecture was
demonstrated with three programs: one performing arithmetic operations, one performing memory access
operations, and one comprehensive sorting program that incorporates these as well as branching.
The designed processor accepts 8-bit data. The Verilog implementation includes registers and data
memory, but no cache. The Instruction Memory is kept separate from Data Memory to prevent structural
hazards. Data and control hazards are avoided with proper placement of NOP commands.
The processor is an arithmetic and logic based processor with 23 such commands. There are 4 memory
access commands and 4 branching commands, as well as the NOP command, bringing the total to 32
commands.
With regard to improvement, the instruction set can be modified to place more focus on branching and
memory commands by reducing the number of logic operations. As it is possible to implement logic with
software, there need not be so much focus on them. Further, a cache can be implemented that exists in place
of the data memory, with the memory as a completely separate element.
Of course, a physical processor can only be built using VLSI. It is possible to take this Verilog implementation and convert it to a VLSI format that can be fabricated.
7 Appendix A
7.1 Instruction Decode
The code for the instruction decode is provided. Note the case statement that performs actual decoding.
The values of the flags are set based on Table 3.
Listing 13: Verilog Code for Instruction Decode
// Instruction Decode: splits the instruction into fields and sets the control flags
module InstructionDecode(clk, insIn, pcIn, pcOut, insOut, opcode,
    r2Out, r1Out, rDestOut, r1RegGet, r1Received, r2RegGet, r2Received,
    aluOp, readOp, branchOp, jumpOp, writeOp, wb, aluOpcode, stackOp);
  input clk;
  input [31:0] insIn;
  input [7:0] pcIn;
  output reg [7:0] pcOut, opcode, r1Out, r2Out, rDestOut, r1RegGet, r2RegGet;
  input [7:0] r1Received, r2Received;
  output reg aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp;
  output reg [3:0] aluOpcode;
  output reg [31:0] insOut;

  reg [7:0] r1Literal, r2Literal, opcodeReg;

  always @(posedge clk) begin
    insOut <= insIn;

    r1RegGet <= insIn[15:8];
    r2RegGet <= insIn[7:0];

    r1Literal <= insIn[15:8];
    r2Literal <= insIn[7:0];

    rDestOut <= insIn[23:16];
    pcOut <= pcIn;
    // insOut = insIn;

    opcodeReg <= insIn[31:24];
    opcode <= opcodeReg;

  end

  always @(opcodeReg or r1Received or r2Received or pcIn) begin
    stackOp <= 0;
    case (opcodeReg)
      8'd0: begin // NOP
        r1Out <= 0;
        r2Out <= 0;
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 0;
        aluOpcode <= 0;
      end
      8'd1: begin // Add
        r1Out <= r1Received;
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 0;
      end
      8'd2: begin // Add immediate
        r2Out <= r2Literal;
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 0;
      end
      8'd3: begin // Subtract
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 1;
      end
      8'd4: begin // Subtract immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 1;
      end
      8'd5: begin // Multiply
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 2;
      end
      8'd6: begin // Power of 2
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 3;
      end
      8'd7: begin // Shift left
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 13;
      end
      8'd8: begin // Shift left immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 13;
      end
      8'd9: begin // Shift right
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 14;
      end
      8'd10: begin // Shift right immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 14;
      end
      8'd11: begin // And
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 4;
      end
      8'd12: begin // And immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 4;
      end
      8'd13: begin // Or
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 5;
      end
      8'd14: begin // Or immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 5;
      end
      8'd15: begin // Not
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 6;
      end
      8'd16: begin // Nor
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 10;
      end
      8'd17: begin // Nor immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 10;
      end
      8'd18: begin // Xor
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 7;
      end
      8'd19: begin // Xor immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 7;
      end
      8'd20: begin // Xnor
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 8;
      end
      8'd21: begin // Xnor immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 8;
      end
      8'd22: begin // Nand
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 9;
      end
      8'd23: begin // Nand immediate
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 9;
      end
      8'd24: begin // Jump: jump to pc address in rDest
        r1Out <= 0;
        r2Out <= 0;
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 1;
        writeOp <= 0;
        wb <= 0;
        aluOpcode <= 0;
        stackOp <= 1;
      end
      8'd25: begin // Branch if equal: compare two registers for
                   // equality
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 1;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 0;
        aluOpcode <= 12;
      end
      8'd26: begin // Branch if greater (for a less-than check,
                   // just switch positions and check greater than)
        aluOp <= 1;
        readOp <= 0;
        branchOp <= 1;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 0;
        aluOpcode <= 11;
      end
      8'd27: begin // Load from mem address in r1 into rDest
        r2Out <= 0;
        aluOp <= 0;
        readOp <= 1;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 0;
      end
      8'd28: begin // Load immediate in r2 into rDest
        r1Out <= 0;
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 1;
        aluOpcode <= 0;
      end
      8'd29: begin // Store from register in r2 into mem address in r1
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 1;
        wb <= 0;
        aluOpcode <= 0;
      end
      8'd30: begin // Store literal in r2 in mem address in r1
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 1;
        wb <= 0;
        aluOpcode <= 0;
      end
      8'd31: begin // Return to previous instr
        r1Out <= 0;
        r2Out <= 0;
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 1;
        writeOp <= 0;
        wb <= 0;
        aluOpcode <= 0;
      end
      default: begin
        r1Out <= 0;
        r2Out <= 0;
        aluOp <= 0;
        readOp <= 0;
        branchOp <= 0;
        jumpOp <= 0;
        writeOp <= 0;
        wb <= 0;
        aluOpcode <= 0;
      end

    endcase

  end

endmodule
7.2 Stack
The code for the stack is provided. The stack pushes an address into its heap when a branch command arrives
and pops from the heap when a return command arrives. In the case of a jump command, the stack passes the
address it receives through without modification.
Listing 14: Verilog Code for Stack
// Stack pushes and pops address based on Execute control
module Stack(clk, stackOp, pcNext, pcOut, pcNow);
  input clk;
  // StackOP controls operation
  input [1:0] stackOp;
  input [7:0] pcNext, pcNow;
  // This is the instruction memory address: either the one from the
  // execute unit or one popped from the stack
  output reg [7:0] pcOut;
  // The stack's heap for address storage
  reg [7:0] stackHeap [0:7];
  // Initialize the Heap to 0
  initial begin
    stackHeap[0] = 0;
    stackHeap[1] = 0;
    stackHeap[2] = 0;
    stackHeap[3] = 0;
    stackHeap[4] = 0;
    stackHeap[5] = 0;
    stackHeap[6] = 0;
    stackHeap[7] = 0;
  end


  always @(stackOp or pcNext) begin
    if (stackOp == 0) begin // pass
      pcOut <= pcNext;
    end
    // pop from stack most recent
    else if (stackOp == 1) begin // pop from stack
      // check the highest slot first
      if (stackHeap[7] > 0) begin
        pcOut <= stackHeap[7];
        stackHeap[7] <= 0;
      end
      else begin
        if (stackHeap[6] > 0) begin
          pcOut <= stackHeap[6];
          stackHeap[6] <= 0;
        end
        else begin
          if (stackHeap[5] > 0) begin
            pcOut <= stackHeap[5];
            stackHeap[5] <= 0;
          end
          else begin
            if (stackHeap[4] > 0) begin
              pcOut <= stackHeap[4];
              stackHeap[4] <= 0;
            end
            else begin
              if (stackHeap[3] > 0) begin
                pcOut <= stackHeap[3];
                stackHeap[3] <= 0;
              end
              else begin
                if (stackHeap[2] > 0) begin
                  pcOut <= stackHeap[2];
                  stackHeap[2] <= 0;
                end
                else begin
                  if (stackHeap[1] > 0) begin
                    pcOut <= stackHeap[1];
                    stackHeap[1] <= 0;
                  end
                  else begin
                    if (stackHeap[0] > 0) begin
                      pcOut <= stackHeap[0];
                      stackHeap[0] <= 0;
                    end
                    else begin
                    end
                  end
                end
              end
            end
          end
        end
      end
    end
    // push into stack: check if there is something in stack already
    else if (stackOp == 2) begin // push into stack
      pcOut <= pcNext;
      if (stackHeap[0] == 0) begin
        stackHeap[0] <= pcNow;
      end
      else begin
        if (stackHeap[1] == 0) begin
          stackHeap[1] <= pcNow;
        end
        else begin
          if (stackHeap[2] == 0) begin
            stackHeap[2] <= pcNow;
          end
          else begin
            if (stackHeap[3] == 0) begin
              stackHeap[3] <= pcNow;
            end
            else begin
              if (stackHeap[4] == 0) begin
                stackHeap[4] <= pcNow;
              end
              else begin
                if (stackHeap[5] == 0) begin
                  stackHeap[5] <= pcNow;
                end
                else begin
                  if (stackHeap[6] == 0) begin
                    stackHeap[6] <= pcNow;
                  end
                  else begin
                    if (stackHeap[7] == 0) begin
                      stackHeap[7] <= pcNow;
                    end
                    else begin
                    end
                  end
                end
              end
            end
          end
        end
      end

    end

    else begin // pass
      pcOut <= pcNext;
    end
  end
endmodule
References
[1] David Patterson and John Hennessy (2012). Computer Organization and Design. Elsevier Inc.