Mpsoc

SoCrates
- A Scalable Multiprocessor System On Chip
Authors
Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

fmci,mnc,rhig@mdh.se
Supervisors
Johan Starner and Joakim Adomat
Examinator
Lennart Lindh
Department of Computer Engineering

Computer Architecture Lab
Malardalen University
Box 883, 721 23 Vaster
as
Abstract
This document is the result of a Master Thesis in Computer Engineering, describing the analysis,
specication and implementation of The rst prototype of Socrates, a congurable, scalable and
predictable platform for System-on-chip Multiprocessor system for real-time applications. The design
time of System-on-a-Chip (SoC) is today rapidly increasing due to high complexity and lack of ecient
tools for development and verication. By combining all the functions into one chip the system
becomes smaller, faster, and less power consuming but increasing the complexity. To decrease the
time-to-market SoCs are entirely or partially build with IP-components. Thanks to SoC, a whole new
domain of products, like small hand held devices, has emerged. The concept has been around a few
years now, but there are still challenges that needs to be resolved. There is a lack of standards for
enabling fast mix and match of cores from dierent vendors. Further needs are new design methods,
tools, and verication techniques. SoC solutions needs special kind of CPUs that consumes less power,
is cheaper, smaller, but still has high-performance requirements. To fulll all these demands, they are
getting more and more complex as the number of transistors are rapidly growing which has led to the
emerging of multiprocessors systems-on-a-chip. Our initial question is to investigate if it is possible
to build these complex multiprocessors systems on a single FPGA and if these solutions can lead
to shorter time-to-market. The consumer demands for cheaper and smaller products makes FPGA
solutions interesting. Our approach is to have multiple processing nodes containing processing unit,
memory and a network interface all together connected on a shared bus. A central in-house developed
hardware real-time unit handles scheduling and synchronization. We have designed and implemented
a MSoC that ts on a single FPGA in only 40 days, which has to our supervisors knowledge not been
accomplished before. Our experience is that a tightly coupled group can produce fast results since
information, new ideas and bug reports propagates immediately.
SoCrates
stands for SoC for Real-Time Systems
Introduction
This report describes the design of the rst prototype of SoCrates, a generic scalable platform generator which creates a synthesizable HDL description of a multiprocessor system. The goal was to build
a predictable multiprocessor system on a single FPGA with mechanisms for prefetching data and an
in-house developed integrated hardware real-time unit.
The report consist of three parts where the rst part, Computer Architecture For System on Chip,
is a state of the art report introducing basic SoC terminology and practice with a deeper analysis in
CPUs, interconnects and memory hierarchies. The purpose of this analysis was to learn about state of
the art techniques on how to design complex multiprocessor SoCs. The design process resulted in part
two, SoCrates - specications, which describes the prototype and all the individual parts functionality
and specic demands. Part three, Socrates -implementation details, describes the implementation on all
parts, how to congure the system, and how to compilie and link the system software. We also present
synthesis results and suggest future work that can be done to improve the system.
SoCrates
-Document index
Document 1 Computer Architecture for System on Chip - A State of the Art Report
1.
2.
3.
4.
5.
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Embedded CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27
Memory System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Document 2 Socrates Specications

1.
2.
3.
4.
5.
6.
7.
8.
9.
System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . .1
CPU Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
CPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .7
Network Interface . . . . . . . . . . . . . . . . . . . . . . . . . . 16
IO Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Arbitration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Boot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29
Memory Wrapper . . . . . . . . . . . . . . . . . . . . . . . . . . . .30
Document 3 Socrates -Implementation details

1.
2.
3.
4.
5.
6.
7.
8.
CPU . . . . . . . . . . . . . . . . . .
Network Interface . . . . . . . . . . .
Arbiter . . . . . . . . . . . . . . . .
Compiling & Linking the System Software
Configuring the Socrates Platform . . .
Current Results . . . . . . . . . . . .
Future work . . . . . . . . . . . . . .
Conclusions . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.1
25
30
31
35
37
40
42
Document 4 Appendix
1.
2.
4.
5.
6.
Demo Application . . . . . . . . . . .
I/O Routines . . . . . . . . . . . . .
Task switch routines . . . . . . . . .
Linker scripts . . . . . . . . . . . .
DATE 2001 Conference, Designers Forum,
. . . . . .
. . . . . .
. . . . . .
. . . . . .
publication
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
. 1
. 3
. 6
. 9
.12
Computer Architecture for System on Chip

- A State of the Art Report
Revision: 1.0
Authors

Supervisors

as
May 20, 2000
Abstract
This state of the art report introduces basic SoC terminology and practice with deeper analysis
in three architectural components: the CPU, the interconnection, and memory hierarchy. A short
historical view is presented before going into todays trends in SoC architecture and development. The
SoC concept is not new, but there are challenges that has to be met to satisfy customer demands for
faster, smaller, cheaper, and less power consuming products today and in the future. This document
the rst of three documents that forms a Master Thesis in Computer Engineering.
II
CONTENTS
Contents
1 Introduction
1.1 What is SoC ? . . . . . . . . . . . . . . . . . . . . . .

1.2 Soc Designs . . . . . . . . . . . . . . . . . . . . . . .
1.2.1 Intellectual Property . . . . . . . . . . . . . .
1.2.2 An Example of a SoC . . . . . . . . . . . . .
1.3 Why SoC? . . . . . . . . . . . . . . . . . . . . . . . .
1.3.1 Motivation . . . . . . . . . . . . . . . . . . .
1.3.2 State of Practice and Trends . . . . . . . . .
1.3.3 Challenges . . . . . . . . . . . . . . . . . . .
1.4 Introduction to Computer System Architecture . . .
1.4.1 Computer System . . . . . . . . . . . . . . .
1.5 Research & Design Clusters . . . . . . . . . . . . . .
1.5.1 Hydra: A next generation microarchitecture .
1.5.2 Self-Test in Embedded Systems (STES) . . .
1.5.3 Socware . . . . . . . . . . . . . . . . . . . . .
1.5.4 The Pittsbourgh Digital Greenhouse . . . . .
1.5.5 Cadence SoC Design Centre . . . . . . . . . .
2 Embedded CPU
2.1 Introduction . . . . . . . . . . . . . . . . . .
2.2 The Building Blocks of an Embedded CPU
2.2.1 Register File . . . . . . . . . . . . .
2.2.2 Arithmetic Logic Unit . . . . . . . .
2.2.3 Control Unit . . . . . . . . . . . . .
2.2.4 Memory Management Unit . . . . .
2.2.5 Cache . . . . . . . . . . . . . . . . .
2.2.6 Pipeline . . . . . . . . . . . . . . . .
2.3 The Microprocessor Evolution . . . . . . . .
2.4 Design Aspects . . . . . . . . . . . . . . . .
2.4.1 Code Density . . . . . . . . . . . . .
2.4.2 Power Consumption . . . . . . . . .
2.4.3 Performance . . . . . . . . . . . . .
2.4.4 Predictability . . . . . . . . . . . . .
2.5 Implementation Aspects . . . . . . . . . . .
2.6 State of Practice . . . . . . . . . . . . . . .
2.6.1 ARM . . . . . . . . . . . . . . . . .
2.6.2 Motorola . . . . . . . . . . . . . . .
2.6.3 MIPS . . . . . . . . . . . . . . . . .
2.6.4 Patriot Scientic . . . . . . . . . . .
2.6.5 AMD . . . . . . . . . . . . . . . . .
2.6.6 Hitachi . . . . . . . . . . . . . . . .
2.6.7 Intel . . . . . . . . . . . . . . . . . .
2.6.8 PowerPC . . . . . . . . . . . . . . .
2.6.9 Sparc . . . . . . . . . . . . . . . . .
2.7 Improving Performance . . . . . . . . . . .
2.7.1 Multiple-issue Processors . . . . . .
2.7.2 Multithreading . . . . . . . . . . . .
2.7.3 Simultaneous Multithreading . . . .
2.7.4 Chip Multiprocessor . . . . . . . . .
2.7.5 Prefetching . . . . . . . . . . . . . .
2.8 Measuring Performance . . . . . . . . . . .
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
1
1
1
2
2
2
3
3
4
4
5
6
6
6
6
7
8
9
9
10
10
10
10
10
12
14
14
14
15
15
15
16
16
17
17
17
17
17
18
18
18
18
18
20
21
22
23
25
III
CONTENTS
2.8.1 Benchmarking
2.8.2 Simulation . .
2.9 Trends and Research .
2.9.1 University . . .
2.9.2 Industry . . . .
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
..
..
..
..
..
3.1 Introduction and basic denitions . . . . . . . . . . . . . . .

3.2 Bus based architectures . . . . . . . . . . . . . . . . . . . .
3.2.1 Arbitration mechanisms . . . . . . . . . . . . . . . .
3.2.2 Synchronous versus asynchronous buses . . . . . . .
3.2.3 Performance metrics . . . . . . . . . . . . . . . . . .
3.2.4 Pipelining and split transactions . . . . . . . . . . .
3.2.5 Direct Memory Access . . . . . . . . . . . . . . . . .
3.2.6 Bus hierarchies . . . . . . . . . . . . . . . . . . . . .
3.2.7 Connecting multiprocessors . . . . . . . . . . . . . .
3.3 Case studies of bus standards with multiprocessor support .
3.3.1 FutureBus+ . . . . . . . . . . . . . . . . . . . . . . .
3.3.2 VME . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3.3 PCI . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.4 Point-to-point interconnections . . . . . . . . . . . . . . . .
3.4.1 Interconnection topologies . . . . . . . . . . . . . . .
3.5 Interconnect performance & scaling . . . . . . . . . . . . . .
3.5.1 Performance measures . . . . . . . . . . . . . . . . .
3.5.2 Shared buses . . . . . . . . . . . . . . . . . . . . . .
3.5.3 Point-to-point architectures . . . . . . . . . . . . . .
3.6 Interconnecting components in a SoC design . . . . . . . . .
3.6.1 VSIA eorts . . . . . . . . . . . . . . . . . . . . . .
3.6.2 Dierences between standard and SoC interconnects
3.7 Case studies of existing SoC-Interconnects . . . . . . . . . .
3.7.1 AMBA 2.0 . . . . . . . . . . . . . . . . . . . . . . .
3.7.2 CoreConnect . . . . . . . . . . . . . . . . . . . . . .
3.7.3 CoreFrame . . . . . . . . . . . . . . . . . . . . . . .
3.7.4 FPIbus . . . . . . . . . . . . . . . . . . . . . . . . .
3.7.5 FISPbus . . . . . . . . . . . . . . . . . . . . . . . . .
3.7.6 IPBus . . . . . . . . . . . . . . . . . . . . . . . . . .
3.7.7 MARBLE . . . . . . . . . . . . . . . . . . . . . . . .
3.7.8 PI-Bus . . . . . . . . . . . . . . . . . . . . . . . . . .
3.7.9 SiliconBackplane . . . . . . . . . . . . . . . . . . . .
3.7.10 WISHBONE Interconnect . . . . . . . . . . . . . . .
3.7.11 Motorola Unied Peripheral Bus . . . . . . . . . . .
3.8 Case studies of SoC multiprocessor interconnects . . . . . .
3.8.1 Hydra . . . . . . . . . . . . . . . . . . . . . . . . . .
3.8.2 Silicon Magic's DVine . . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
..
..
..
..
..
..
..
3 Interconnect
4 Memory System
4.1 Semiconductor memories . . . . . .

4.1.1 ROM . . . . . . . . . . . .
4.1.2 RAM . . . . . . . . . . . .
4.2 Memory hierarchy . . . . . . . . .
4.3 Cache memories . . . . . . . . . .
4.3.1 Cache: the general case . .
4.3.2 The nature of cache misses
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
25
26
26
26
26
27
27
27
28
29
29
29
29
30
31
31
31
31
32
32
33
34
34
35
35
36
36
36
37
37
38
39
40
40
40
40
41
41
42
43
43
43
44
45
45
45
46
48
48
49
49
IV
CONTENTS
4.3.3 Storage strategies . . . . . . . . . . .

4.3.4 Replacement policies . . . . . . . . .
4.3.5 Read policies . . . . . . . . . . . . .
4.3.6 Write policies . . . . . . . . . . . . .
4.3.7 Improve cache performance . . . . .
4.4 MMU . . . . . . . . . . . . . . . . . . . . .
4.5 Multiprocessors architectures . . . . . . . .
4.5.1 Symmetric Multiprocessors . . . . .
4.5.2 Distributed memory . . . . . . . . .
4.5.3 COMA Cache Only Memory Access
4.5.4 Coherence . . . . . . . . . . . . . . .
4.5.5 Coherence through bus-snooping . .
4.5.6 Directory-based coherence . . . . . .
4.6 Hardware-driven prefetching . . . . . . . . .
4.6.1 One-Block-Lookahead: (OBL) . . . .
4.6.2 Stream buer . . . . . . . . . . . . .
4.6.3 Filter buers . . . . . . . . . . . . .
4.6.4 Opcode-driven cache prefetch . . . .
4.6.5 Reference Prediction Table: (RPT) .
4.6.6 Data preloading . . . . . . . . . . .
4.6.7 Prefetching in multiprocessors . . . .
5 Summary
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
50
51
51
52
53
55
56
57
58
58
59
60
62
63
64
64
65
65
65
66
66
68
1 INTRODUCTION
1 Introduction
This State of the Art Report covers computer architecture topics with emphasis on System on Chip (SoC).
The reader is introduced to the basic ideas behind SoC and general computer architecture concepts before
presenting an in-depth analysis of three important SoC components: CPU, Interconnect and Memory
architecture.
1.1 What is SoC ?
SoC stands for System-on-Chip and is a term for putting a complete system on a single piece of silicon.
SoC has become a very popular word in the computer-industry, but very few agree on a general denition
of SoC [19]. There are several alternative names for putting a system on a chip, such as system on
silicon, system-on-a-chip, system-LSI, system-ASIC, and system-level integration (SLI) device [33]. Some
might say a large design automatically makes it a SoC, but that would probably include every existing
design today. A better approach would be to say that a SoC should include dierent structures such as
a CPU-core, embedded memory and peripheral cores. This still is a wide denition which could imply
that any modern processor with a on-chip cache should be included into the SoC-community. Therefore
a more suitable denition of SoC would be:
A complete system on a single piece of silicon, consisting of several types of modules including at least one
processing unit designated for software execution, where the system depends on no or very few external
components in order to execute its task.
1.2 Soc Designs
In the beginning, almost all SoC's were simply integrations of existing board-level designs [20]. This way
of designing a system looses many benets that otherwise could be taken advantage of if the system would
be designed from scratch. Another approach is to use already existing modules, called IP-components,
and to integrate them to a complete system suitable for a single die.
1.2.1 Intellectual Property
When something is protected through patents, copyrights, trademarks or trade secrets it is considered as
a Intellectual Property (IP). Only patents and copyrights is relevant for IP-components [13] (also referred
as macros, cores and Virtual Components (VC) [10]). An IP-component is a pre-implemented, reusable
module, for example a DMA-controller or a CPU-core. There are several companies that makes their
living by building, licensing
and selling IP-components, which the semiconductor companies pays both
fees and royalties for1. There exist three classes of IP-components with dierent properties regarding to
portability and protection characteristics. As the portability decreases through the classes, the protection
will increase.
Soft This class of IP-components have their architecture specied at Register-Transfer Level (RTL),
which are synthetizable. Soft IP's are functionally validated and are very portable and modiable.
Since they are not mapped to a specic technology, the behavior according to area, speed, and
power consumption will be unpredictable. Much work still needs to be done before the component
can be utilized and the end-result is dependent of the used synthesis tools.
Firm The rm class components are in general soft components that have been oorplanned and synthesized into one or several dierent technologies to get better estimations of area, speed, and power
consumption.
1 There are exceptions where one can acquire IP-components without any licensing or royalty fees. More information can
be found at http://www.openip.org/.
1.3 Why SoC?
Hard-IP's are further renement of rm components. They are fully syntesized into mask-level
and physicaly validated. Very little work has to be done in order to implement the functionality
in silicon. Hard IP's are not modiable nor portable, but the prediction of their area, speed, and
power consumption is very accurate.
Hard
1.2.2 An Example of a SoC
A typical SoC consists of a CPU-core, a Digital Signal Processor (DSP), some embedded memory, and a
few peripherals such as DMA, I/O, etc (Figure 1). The CPU could perform several tasks with the assistance of a DSP when needed. The DSP is usually responsible for o-loading the CPU by doing numerical
calculation on the incoming signals from the A/D-converter. The SoC could be built of only third-partyIP-components, or it could be a mixture of IP-components and custom-made solutions. More recently,
there has been eorts to implement a Multiprocessor System on Chip (MSoC) [6], which introduces new
challenges regarding cost, performance, and predictability.
Figure 1: An example of a SoC

1.3 Why SoC?
The rst computer systems consisting of relays and later vacuum tubes, used to occupy whole rooms and
their performance were negligible compared to todays standard workstations. The advent of the transistor
in 1948 enabled engineers to minimize a functional block to an Integrated Circuit (IC). These IC's made it
possible to build complex functions by combining several IC's onto a circuit board. Further development
of process technology increased the number of transistor on each IC, which led to the emerging of systems
on board. After this, there has been a constant battle between semiconductor companies to deliver the
fastest, smallest and cheapest products, resulting in today's multi-billion dollar industry. Even though
the SoC concept has been around for quite some time, it has not really been fully feasible until recent
years due to advances like deep sub-micron CMOS process technology.
1.3.1 Motivation
There are several reasons why SoC is an attractive way to implement a system. Todays rened manufacturing processes makes it possible to combine both logic and memory on a single die, thus decreasing
overall memory access times. Given that the application memory requirement is small enough for the
on-chip embedded memory, memory latency will be reduced due to elimination of data trac between
separate chips. Since there is no need to access the memory on external chips, the number of pins can
also be reduced and the use of on-board buses becomes obsolete. Encapsulation counts for over 50% of
the the overall process cost for chip manufacturing [15]. In comparison to a ordinary system-on-board,
SoC uses one or very few IC's, reducing total encapsulation cost and thereby total manufacturing cost.
These characteristics as well as less power consumption and shorter time-to-market enables smaller,
better, and cheaper products reaching the consumers in an altogether faster rate.
1.3 Why SoC?
1.3.2 State of Practice and Trends
Until now, much of SoC implementation has been about shrinking existing board-level systems onto a
single chip, with no or little consideration to all benets that could be gained from a chip-level design.
Another approach to SoC is to interconnect several dies and place them inside one chip. This kind of
modules are called Multi Chip Modules (MCM). The choice of implementation of the Hydra Multiprocessor
Project was at rst a MCM , which later evolved to a SoC [14, 6].
Today it is too time-consuming for companies to implement a system from scratch. Instead, a faster and
more reliable way is to use own or third party pre-implemented IP-components [3], which makes designing
a whole system more about integrating components rather than designing them. There exist three design
methodologies, each with it's own eciency and cost regarding SoC design [16, 18]. The vendor design
approach, which shifts the design responsibilities from the system designers to the ASIC vendors, can
result in the lowest die cost. But it can also lead to higher engineering costs and longer time-to-market.
A more exible method is the partial integration approach, which divides the responsibilities of the design
more equally. It lets the system designers produce the ASIC design, while the semiconductor vendors
are responsible for the core and integration. This method gives the system designers more control of the
working process in comparison to the vendor method. Yet more exible is the desktop approach which
leaves the semiconductor vendors only to design the core. This reduces time-to-market and requires low
engineering costs. A key property for IP-components in the future are parameterization of soft cores [16].
There is a continuous growth in the demand for \smart products" which is expected to make our lives
better and simpler. Recently, SoC products has begun to emerge on several markets in form of Application
Specic Standard Products (ASSP) 2 or Application Specic Instruction-set Processors (ASIP) 3 :
Set-top-boxes A Set-Top-Box (STB) is a device that makes it possible for television viewers to access
the Internet and also watch digital television (D-TV) broadcasts. The user has access to several
services: weather and trac updates, on-line shopping, sport statistics, news, e-commerce, etc. By
integrating the STB's dierent components into a SoC, it will simplify system design and be a more
competitive product with its shorter time-to-market, be less expensive and less power-consuming.
The Geode SC1400 chip is an example of a SoC used in a STB that meets the demands of delivering
both high-quality DVD video and Internet accessibility [34].
Cell phones A SoC in a cell phone will reduce its size and weight, make it cheaper and less power
consuming.
Home automation Many domestic appliances at home will be "smarter". For example, the refrigerator
will be able to notify its owner when a product is missing and place an order on the Internet.
Hand-held devices A new generation of hand-held devices is coming, that can send and receive email
and faxes, make calls, surf the Web etc. A SoC solution is especially suited for portable applications
such as hand-held PC's, digital cameras, personal digital assistants and other hand-held devices
because its built-in peripheral functions minimizes overall system cost and eliminates the need to
use and congure additional components.
1.3.3 Challenges
One of the emerging challenges is to standardize interfaces of IP-components to make integration and
composition easier. A lot of dierent on-chip bus standards has been created by the dierent design houses
to make it possible to fast integrate IP-components, which has resulted in noncompability caused by the
dierent interfaces. To solve this dilemma the Virtual Socket Interface Alliance (VSIA) was founded to
enable the mix and match of IP-components from multiple sources by proposing a hierarchical solution
2 High integration
3 A eld or mask
chip or chipset for a specic application [59].

programmable processors of which the architecture and instruction set are optimized to a specic
application domain [58].
1.4 Introduction to Computer System Architecture
that enables multiple buses [17]. Still some criticizes VSIA for only addressing simple data ows [11].
More can be read about dierent on-chip bus standards in section ??.
Since the time-to-market is decreasing, the testing and verication of the SoC must be done very fast.
By reusing IP-components it is possible that the test development actually takes longer time than the
work to integrate the dierent functional parts [12]. The fact that the components are from dierent
sources and may have dierent test methodologies complicates the test of the whole system. At the board
level design many of the components has their primary inputs visible which made the testing easier, but
SoC's contain deeply embedded components where there is no or very little possibility to observe signals
directly from one IP-component after manufacturing. Since the on-chip interconnect is inside the chip,
it is also hard to test due to the lack of observability.
As the future is lurking behind the door, integration is not likely to stop with IP-components and
dierent memory technologies, we are also likely to see a variety of analog functionality. Analog blocks
are very layout and process dependent, requires isolation and utilizes dierent voltages and grounds. All
these facts makes them the dicult to integrate in the design [10]. Are the limits to the integration
urge? As the process technologies becomes more sophisticated, transistor switching speed will increase
and the voltage for logical levels will decrease. Dropping the voltages will make the units more sensible
for noise. Analog devices with higher voltage needs can encounter problems working properly in those
environments [17].
Apart from the lack of eective design and test methologies [29] and all the technical problems with
mapping a complex design consisting of several IP-components from dierent design houses onto a particular silicon platform, there are complex business issues dealing with licensing fees and royalty payments
[30].
1.4 Introduction to Computer System Architecture
SoC is about putting a whole system on a single piece of silicon. But what is a system? This section serves
as a introduction to computer system architecture and tries do give the reader a better understanding of
what is actually put onto a SoC.
1.4.1 Computer System
In general, a typical computer system (gure 2) consists of one or more CPU's that executes a program by
fetching instructions and data from memory. To be able to access the memory, the CPU needs some kind
of interface and a connection to it. The interface is usually provided by the Memory Management Unit
(MMU) and the connection is handled by the interconnect. The local interconnect is often implemented
as a bus consisting of a number of electrical wires. Sometimes, the CPU needs assistance in fetching large
amount of data, in order to be eective. This work can be done in parallel with the CPU by the Direct
Memory Access (DMA) component. The system needs some means to communicate with the outside
world. This is provided by the I/O system. We proceed with a closer look at the important components
that comprises a computer system.
CPU The CPU is where the arithmetic, logic, branching and data transfer are implemented[8]. It consists
of registers, a Arithmetic Logic Unit (ALU) for computations and a control unit. The CPU can
be classied as a Complex Instruction Set Computer (CISC), if the instruction set is complex (e.g
has a lot of instructions, several addressing modes, dierent word-length on instructions etc). The
idea behind a Reduced Instruction Set Computer (RISC) is to make use of a limited instruction set
to maximize performance on common instructions by working with a lot of registers while taking
penalties on the load and store instructions. RISC has uniform length on all instructions and
very few addressing modes. This uniformity is the main reason why this approach is suitable for
instruction pipelining, in order to increase performance. There are other architectures that further
increase performance, for example superscalar, VLIW, and vector computers. A machine is called a
n-bit machine if it is operating internally on n-bit data[8]. Today a lot of the embedded processors
are still working with 8 or 16 bit words while the majority of workstations and PC's are 32 or 64
bit machines.
1.5 Research & Design Clusters

Address
Address
Address
Cache
CPU
Data
Main
Memory
Data
DMA
controller
DMA
device
Address
System bus
Data
Data
Figure 2: A typical computer system

The key to an eective computer system is to implement a ecient memory hierarchy,
this due to the latency for memory accesses has become a signicant factor when executing a memory
accessing instruction. In the last decade the gap between memory and CPU speed have been
growing. Memory sub-systems must be built to overcome it's shortcomings towards the processor
which otherwise results in wasted computational time from waiting for memory operations to be
completed.
The memory are often organized in a primary and a secondary part, where the primary memory
usually consists of one or several RAM-circuits. The secondary are for longterm storage like Hard
Disc Drives (HDDs), Disks, Tapes, Optic media, and sometimes FLASH memories. To overbridge
the gap between the memory and CPU, nearly all modern processors has a cache that makes use
of the inherent locality of data and code, that ideally can deliver data to the CPU in just one clock
cycle. Usually there exists several levels of cache between the main memory and CPU, each with
dierent sizes and optimizations. The memory system interfaces to the outside world (e.g processor
and I/O) via the MMU, that has the responsibility to translate addresses and to fetch data from
memory via the the memory bus. In multiprocessor systems there are issues whether the memory
should be local on every node or global.
Interconnect A computer systems internal components needs to communicate in order to perform their
task. To make communication possible between components, a interconnect is usually used. A
interconnect can be designed in variety of ways, called topologies. Examples of topologies are bus,
switch, crossbar, mesh, hypercube, torus, etc. Each topology has its own characteristics concerning
latency, scalability and performance.
I/O System The Input/Output system is the computer systems interface to the outside world, which
enables it to receive input and output results to the user. Examples of I/O devices includes HDDs,
graphic and sound systems, serial ports, parallel ports, keyboard, mouse, etc. The transferring of
I/O data is usually taken care of by the DMA component to o-load the CPU of constant data
transfer.
Memory System
There is a lot of research eort done on computer architecture, which of course is related in some degree
SoCs, since they all are actually computers. Unlike most research areas, SoC research is lead by the
industry and not by the universities. Of those universities that have SoC related research projects, very
few have reached the implementation stage.
1.5.1 Hydra: A next generation microarchitecture
The Stanford Hydra single-chip multiprocessor [6] started out as an MultiChipModule(MCM) in 1994
but evolved in 1997 to become a Chip MultiProcessor(CMP). The project are supervised by Associate
Professor Kunle Olukotun accompanied by Associate Professor Monica S. Lam and Mark Horowitz, also
incorporated in the project are a dozen students. Early development of the project was performed by
Basem A. Nayfeh nowdays a Ph.D. The Hydra projects focus on combining shared-cache multiprocessor architectures, innovative synchronization mechanisms, advanced integrated circuit technology and
parallelizing compiler technology to produce microprocessor cost/performance and parallel processor
programmability The four integrated MIPS-based processors will demonstrate that it is feasible for a
multiprocessor to gain better cost/performance than achieved in wide superscalar architectures for sequential applications. By using MCM, communication bandwidth and latency will be improved resulting
in better parallelism. This makes Hydra a good platform to exploit ne grained parallelism, hence a
parallelizing complier for extracting this sort of parallelism is under development. The project is nanced
by US Defense Advanced Research Projects Agency(DARPA) contracts DABT and MDA.
1.5.2 Self-Test in Embedded Systems (STES)
STES is a co-operational project between ESLAB, the Laboratory for Dependable Computing of Chalmers
University of Technology, the Electronic Design for Production Laboratory of Jonkoping University, the
Ericsson CadLab Research Center, FFV Test Systems, and SAAB Combitech Electronics AB. ESLAB
are responsible of developing a self-test strategy for system-level testing of complex embedded systems.
Which utilizes the BIST(Built In Self Test) functionality at the device, board, and MCM level. Except
the involved commercial participants the project are founded by NUTEK.
1.5.3 Socware
An international Swedish Design Center/cluster has recently been builded that will be in close cooperation with the technical universities in Linkoping/ Norrkoping, Lund and Stockholm/Kista. The
Socware, formerly known as Acreo System Level Integration Center (SLIC), aims to have nearly 40
employees/specialists in the beginning but this number will is grow to 1500 in the near future with a
special research institute located in Norrkoping. The Design Center will serve as an bridge between
the industry and research activity and the universities, enabling research results rapidly converting into
industrial products.
The focus of research and development will be directed to design of system components within digital
media technology. initially special focus will be on applications in mobile radio and broadband networking.
Project is nanced by the government, the municipality of Norrkotoping and other local and regional
agencies. More information can be found in [35].
1.5.4 The Pittsbourgh Digital Greenhouse
The Pittsburgh Digital Greenhouse is a SoC design cluster that focuses on digital video and networking
markets. The non-prot organization is an initiative taken by the U.S government, universities, and
industry that started in June 1999. It involves the Carnegie Mellon University, Penn State University,
University of Pittsbourgh, and several industry members like Sony, Oki, and Cadence.
Some ongoing research activities closely related to SoC are:

Congurable System on a Chip Design Methodologies with a Focus on Network Switching
Architecture and Compiler Power Issues in System on a Chip
This project focuses on development of design tools for hardware/software co-design as those required for next generation switches on the Internet and cryptography.
The program is focused on to create a software system that characterizes the power of the major components of a SoC design and allows the design to be optimized for lowest possible power
consumption.

MediaWorm: A Single Chip Router Architecture with Quality of Service Support
The research has focus on the design, fabrication, and testing of a new high performance switched
network router, called Mediaworm. It is aimed to be used in computer clusters where there are
demands on level Quality of Service (QoS) guarantees.
Lightweight Arithmetic IP: Customizable Computational Cores for Mobile Multimedia Appliances
Focus is made on complexity of multimedia algorithms and development of mathematical software

functions which provides the required level of computational performance at lower power levels.
The long range goal is to have SoC in a wide range of next generation products, from \smart homes"
to hand-held devices that allows the user to surf on the web, send faxes and receive e-mail. Further
goals is to provide venture capital, training and education and to assist start-up companies that uses the
research results and pre-designed chips created by the Digital Greenhouse for use in their products. More
information can be found at [37].
1.5.5 Cadence SoC Design Centre
In February 1998, there was an opening of Cadence Design Centre with the purpose of creating one of
the electronic industry's largest and most advanced SoC design facilities. The centre is located on The
Alba Campus in Livingston, Scotland and is the largest European design centre. The centre oers expertise within the spheres of Digital IC, Multimedia/Wireless, Analogue/Mixed Signal, Datacom/Telecom,
Silicon Technology Services, and Methodology Services. In 1999, The Centre became authorized as the
rst ARM approved design centre, through the ARM Technology Access Program (ATAP). Current research projects conducted at the centre involves a single-chip processor for Internet telephony and audio,
a exible receiver chip suitable for among other things pinpointing location by picking up high-frequency
radio waves transmitted by GPS satellites, and a fully customized wireless Local Area Network (LAN)
environment. There are three main pieces of the center, the Virtual Component Exchange (VCX), the
Institute for System Level Integration (SLI) and The Alba Centre. VCX opened in 1998, which is an
institution dedicated for establishing of structured framework of rules and regulations for inter-company
licensing of IP blocks. Members of VCX include ARM, Motorola, Toshiba, Hitachi, Mentor Graphics,
and Cadence. The SLI institute is an educational institution dedicated to system level integration and
research. The institute was established by four of Scotland's leading universities, Edinburgh, Glasgow,
Heriot Watt and Strathclyde. Finally, the Alba centre is the headquarter of the whole initiative and
provides a central point for information about the venture and assistance for interested rms.
2 EMBEDDED CPU
2 Embedded CPU
There are several dierent interpretations of the term CPU. Some say it is "The Brains of the computer"
or "Where most calculations take place", and that it "Acts as a information nerve center for a computer".
A more concrete denition is given by John L Hennessy and David A Patterson[8]:
Where arithmetic, logic, branching, and data transfer are implemented.
This chapter serves as an introduction to CPU's that are especially suitable for SoC solutions, namely
embedded CPUs. In this case, the term "embedded" does not only refer to how suitable these CPUs are
for embedded systems, or as stand-alone microprocessors, but also to how they are good candidates to be
"embedded" into a SoC. The purpose of this chapter is to look at the possibilities of embedded processors
as a SoC component and what aspects need to be considered when designing and implementing a solution.
Techniques for improving and measuring performance is discussed as well as where the research is today
together with a look at the future of embedded processors.
The chapter begins with an introduction to embedded CPUs that explains some of the factors behind
their popularity. Section 2.2 is a presentation of the building blocks of a modern embedded CPU. Section
2.4 discusses the major factors aecting the design. Section 2.3 looks at which paradigm is currently in
front regarding embedded CPUs. Section 4.3.7 considers options on how to implement an embedded CPU.
Section 2.6 presents case studies of embedded CPUs available in the market today. Section 2.7 shows
several techniques of how to improve the performance. Next, section 2.8 consider how the performance
of a embedded processor could be measured. Finally, section 2.9 looks at where the research is today and
what the trends are in the embedded processor market.
2.1 Introduction
The latest advances in process technology has increased the number of available transistors on a single
die almost to the extent that todays battle between designers is not about how to t it all on a single
piece of silicon, but how to make the most use of it. This evolution has also made it possible for designers
to put a complete processor, together with some or all of its periphal components on a single die, creating
a new class of products, called Application Specic Standard Products (ASSPs). The demand for ASSPs
has also created a new domain of processors, embedded 32-bit CPUs, that are cheap, energy-ecient,
especially designed for solving their domain of tasks.
Before getting into all the wonders of embedded CPU's, some clarifactions should be made about what
they are and what they are not. When CPU's are discussed, the thoughts often goes to the architectures
from Intel, Motorola, Sun, etc. These architectures are mainly y designed for the desktop market and
have dominated it for a long time. In recent years, there has been a increasing demand for CPU's designed
for a specic domain of products. Among those noticing this trend was David Patterson[21]:
Intel specializes in designing microprocessors for the desktop PC, which in ve years may
no longer be the most important type of computer. Its successor may be a personal mobile
computer that integrates the portable computer with a cellular phone, digital camera, and video
game player... Such devices require low-cost, energy-ecient microprocessors, and Intel is far
from a leader in that area.
The question of what the dierence is between a desktop and an embedded processor is still unanswered.
Actually, some embedded platforms arose from desktop platforms (such as MIPS, Sparc, x86), so the
dierence can not be in register organization, the instruction set or the pipelining concept. Instead, the
factors that dierentiate a desktop CPU from an embedded processor will be power consumption, cost,
integrated periphal, interrupt response time, and the amount of on-chip RAM or ROM. The desktop world
values the processing power whereas an embedded processor must do the job for a particular application
at the lowest possible cost[22].

This section serves as an introduction to the components of a modern embedded CPU. Readers that are
familiar with the basics of computer architecture and processor design might skip this section.
A CPU basically consists of three components: register set, ALU, and a control unit. Today, it is often
the case that the CPU includes a on-chip cache and a pipeline, in order to achieve an adequate level of
performance (Figure 3). The following text will give a brief introduction to the components function and
purpose in the CPU.
32-bit Address Bus
ALU Bus
Increment Bus
PC Bus
Address Register
Address
Incrementer
Instruction
Decoder
&
Control Logic
32-bit Registers
Cache
32 x 8 Multiplier
Barrel Shifter
Instruction
32-bit ALU
Pipeline
Write Data Register
32-bit Data Bus
Figure 3: A typical embedded CPU

2.2.1 Register File
The organization of registers or how information is handled inside the computer is part of a machines
Instruction Set Architecture (ISA)[8, 9]. An ISA includes the instruction set, the machine's memory,
and all of the registers that is accessible by the programmer. ISAs are usually divided into three main
categories regarding to how information is stored in the CPU: stack architecture, accumulator architecture,
and general-purpose register (GPR) architecture. These architectures dier in how an operand is handled.
A stack architecture keeps its current operands on top of the stack, while a accumulator architecture keeps
one implicit operand in the accumulator, and a general-purpose register architecture only have explicit
operands which can reside either in memory or registers. Following example shows how the expression A
= B + C would be evaluated in these three architectures.
stack architecture
PUSH C
PUSH B
ADD
POP
A
accumulator architecture general-purpose architecture

LOAD R1,C
LOAD R1,C
ADD
R1,B
LOAD R2,B
STORE A,R1
ADD
R3,R2,R1
STORE A,R3
The machines in the early days used stack architectures and did not need any registers at all. Instead,
the operands are pushed onto the stack and popped o into a memory location. Some advantages was
that space could be saved because the register le was not needed, and no operands were needed during
arithmetic operation. As the memories became slower compared to the CPU's, the stack architecture
10
also became ineective, due to the fact that most time is spent while fetching the operands from memory
and writing them back. This became a major bottleneck, which made the accumulator architecture a
more attractive choice.
The accumulator architecture was a step-up regarding performance by letting the CPU hold one of
the operands in a register. Often, the accumulator machines only had one data accumulator register,
together with the other address registers. They are called accumulators, due to their responsibility to act
as a source of one operand and destination of arithmetic instructions, thus accumulating data. The accumulator machine was a good idea at the time when memories were expensive, because only one address
operand had to be specied, while the other resided in the accumulator. Still, the accumulator machine
has it's drawbacks when evaluating longer expressions, due to the limited amount of accumulator registers.
The GPR machines solved many problems often related to stack and accumulator machines. They
could store variables in registers, thus reducing the number of accesses to main memory. Also, the
compiler could associate the variables of a complex expression in several dierent ways, making it more
exible and ecient for pipelining. A stack machine needs to evaluate the same complex expression from
left to right which might result in unnecessary stalling. Many embedded CPUs are RISC architectures
which means that they have lots of registers (usually about 32).
2.2.2 Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) performs arithmetic and logic functions in the CPU. It is usually
capable of adding, subtracting, comparing, and shifting. The design can range from using simple combinational logic units that does ripple carry addition, shift-and-add multiplication, and a single-bit shifts,
to no-holds-barred units that do fast addition, hardware multiplication, and barrel shifts[9].
2.2.3 Control Unit
The control unit is responsible for generating proper timing and control signals (usually implemented as
a state-machine that performs the machine cycle: fetch, decode, execute, and store to other logical blocks
in order to complete the execution of instructions.
2.2.4 Memory Management Unit
The Memory Management Unit (MMU) is located between the CPU and the main memory and is
responsible for translating virtual addresses into a their corresponding physical address. The physical
address is then presented to the main memory. The MMU can also enforce memory protection when
needed.
2.2.5 Cache
There are few processors today that don't incorporate a cache. The cache acts as a buer between the
CPU and the main memory to reduce access time, taking advantage of the locality of both code and
data. There are usually several levels of cache, each with their own purpose. The rst level is usually
located on-chip, thus together with the CPU. The cache is often separated into a instruction- and a
data-cache. Cache is especially important in a RISC architecture with frequent loads and stores. For
example, Digital's StrongARM chip devotes about 90% of its die area to cache[89]. The reader can learn
more about cache and how it is used in section 4.2.
2.2.6 Pipeline
As with the case of cache, there are very few processors today that doesn't use any kind of pipelining
in order to improve their performance. This section will serve as an introduction to pipelining and
the benets and drawbacks of using it. Pipelining is implementation technique that tries to achieve
11
Instruction Level Parallelism (ILP) by letting multiple instructions overlap in execution. The objective
is to increase throughput, the number of instructions completed at a time. By dividing the execution of
an instructions into several phases, called pipeline stages, an ideal speedup equal to the pipeline depth
could theoretically be achieved. Also, by dividing the pipeline into several stages the workload will be less
each stage, letting the processor run at a higher frequency[8]. Figure 4 shows a typical pipeline together
with it's stages. This particular pipeline has a length of ve and consists of unique pipeline stages, each
with their own purpose.
IF
ID
EX
MEM
WB
Figure 4: A General Pipeline

The Instruction Fetch cycle (IF) consists of fetching the next instruction from memory.
The Instruction Decode cycle (ID) handles the decoding of the instruction and reads the register le
in case one or several of the instruction's operand(s) is a register.

The Execution cycle (EX) evaluates ALU operations or calculates destination address in case of a
branch instruction.
The Memory Access cycle (MEM) is where memory is accessed when needed, or in case of a branch
the new program counter is set to the calculated destination address from the previous pipeline
stage4.
The Write-back cycle (WB) writes the result back to the register le.
The time it takes for an instruction to move from one pipeline stage to another is usually referred to
as a machine cycle. If one stage require several cycles to complete, it could be decomposed into several
smaller stages, resulting in a superpipline. Because instructions need to move at the same time, the length
of a machine cycle is dictated by the slowest stage of the pipeline. The designer challenge is to reduce
the number of machine cycles per instruction (CPI). If one would execute one instruction at a time, the
CPI count would be equal to the pipeline length. The optimal result in a linear pipeline would be to
reach a CPI count of 1.0, which means that a instruction is completed every cycle and that every pipeline
stage is fully utilized. This is not entireably achievable, due to the fact that a program usually consist of
internal dependencies, branches, etc. These pipeline obstacles are usually referred to as hazards and can
cause delays in the pipeline, called stalls. An execution of a program completely without
hazards would
execute its instructions with virtually no delays, resulting in a CPI count close to 1.05. Those hazards
that do cause pipeline stalls are usually classied as: structural, data and control hazards.
structural hazards are caused by resource con icts when the hardware cannot support certain combinations of overlapped execution. It can depend on just having one port to the register le, causing
con icts in the ID and WB stage for register requests. Another source of con ict might be that the
memory is not divided into code and data, thus causing con icts in the IF and MEM stage due to
instruction fetching and memory writing.
data hazards are caused by an instruction being dependent of another instruction in previous pipeline
stage so that execution must be stalled, or else written data can be inconsistent. These instruction
dependencies comes in three avors: Read After Write (RAW), Write After Write (WAW), and
4 In case of an conditional branch, the condition will be evaluated. If the instruction branches, the program counter is
set by previous calculation, otherwise the program counter is incremented to point at the next instruction.
5 The CPI count never reaches the ideal value of 1.0, thus cycles are always lost in the beginning because the pipeline
is initially empty and need to be lled with instructions. By the time the rst instruction reaches the WB phase, several
cycles is lost.
2.3 The Microprocessor Evolution
12
(WAR). RAW hazard are the most common ones and occurs when a write
instruction is followed by a read instruction and both instructions operate on the same register,
causing the one instruction to wait until the write has been issued in the WB stage. This can
be handled by forwarding, thus introducing "shortcuts" in the pipeline, so that instructions can
take part of results before the current instruction reaches the WB stage[8]. WAW hazards cannot
occur in pipelines like the one showed earlier (gure 4). The reason for this is that in order for
a WAW hazard to occur, either the memory stage has to be divided into several stages, making
several simultaneous writes possible, or some mechanism where an instruction can bypass an another
instruction in the pipeline. WAR hazards are rare and happens when a instruction tries to write
to a register read by an instruction that is ahead in the pipeline. As with WAW hazards, WAR
hazards cannot occur in a general pipeline because register contents are always read at the ID stage.
Some pipelines do read register contents late and can create a WAR hazard [8].
control hazards are caused by the instructions that changes the path of execution, called branches.
By the time a branch instruction calculates it's destination address in the EXE stage, instructions
following the branch has reached the IF and ID stage. If the branch was unconditional, the instructions that is in the IF and ID stage has to be removed, because the branch changes the program
counter and the new instructions have to be fetched from a new address, namely the destination
address of the branch. On the other hand, if there was an unconditional branch, the condition
need to be evaluated in order to decide if the branch should be taken or if the program counter
only should be incremented. One way of dealing with this problem is to automatically stall the
pipeline until condition is evaluated. These stalls are issued in the ID stage, where the branch is
rst identied. Also, in order to evaluate the condition of a conditional branch and calculate the
destination address simultaneously, extra logic for condition evaluation is added together with the
ALU in the ID stage. This way, only one stall cycle will be wasted when a branch instruction
occurs.
Most structural hazards can be prevented by adding more ports and dividing the memory into data
and instruction memory segments. The memory can also be improved by adding cache or increasing the
cache area. Data hazards can be handled by letting the compiler reschedule the instructions in order to
reduce the number of dependencies. Control hazards can be reduced by trying to predict the destination
of a branch. The prediction is based on tables storing historical information about whether the same
branch did or did not jump in earlier executions. Such tables, called Branch History Table (BTB) or
Branch Prediction Buer (BPB), are doing just that. Other tables such as Branch Target Buer (BTB)
acts as a cache storing the destination address of many previously executed branches. The interested
reader can continue it's reading in several books and articles addressing dierent branch penalty reduction
techniques[8, 64, 63].
Write After Read
This section serves as a "walk-through" of the dierent phases in microprocessor evolution (gure 5).
As this section may seem unrelevant to embedded processors, the embedded processor design has always
been in uenced by the microprocessor and may continue to do so in the future. The reader who feels
unfamiliar with the principles behind the RISC and CISC paradigms, should reread section 1.4.1 before
proceeding with this text.
In the early days, there was a limiting amount of transistors available for the CPU designer. Usually,
the chips where lled with logic that was seldom used (e.g decoding of seldom used instructions). CISC
computers used microcoding, which made it easier to execute complex instructions. As the years went
by, it became harder for CISC designers to keep up with Moore's law 6 . Building more complex solutions
each year was not enough. Some designers realized that the rule locality of reference is something that
needs to be taken into consideration. It states that A program executes about 90% of its instructions
6 The
capability of microprocessor doubles every 18 month.
13

CISC
instructions take variable time to complete
CISC
RISC
microcoding, more complex instructions
pipeline, simple instructions for speed
RISC/CISC
merging of architectures
Multithreaded Processors
Superscalar/VLIW
Single Chip Multiprocessor
duplicated HW resources (regs, PC, SP)
execute multiple instructions
duplicated processors
Simultaneous Multithreading
any context can execute each cycle
Figure 5: The evolution of microprocessors.

in 10% of it's code.
The RISC designers thought that if they could implement the 10 % of most used
instructions and throw out the other 90%, then there would be lots of free die area left for other ways of
increasing the performance. Some of the performance enhancing techniques are listed below.
Cache Memory references was becoming a serious bottleneck, and a way to reduce the access time is to
use the extra on-chip space for cache. With the on-chip cache, the processor did not need to access
the main memory for all memory references.
Pipelining By breaking down the handling of an instruction into several simpler stages, the processor
is able to run faster, resulting in higher frequency.
More registers When compiling a program into machine code, the handling of variables usually is
taken care of by registers. Sometimes, there are stalls in the pipeline, due to dependencies between
registers (e.g one can not use a register until it is available), which can be avoided by register
renaming. This is possible when increasing the number of registers.
Computers using some or all of these advantages include RISC I and IBM 801 [2]. These enhancements
gave the RISC designers the upper hand for several generations in the 80's and 90's. But when the
number of available transistors on a chip passed the million-mark the number of transistors as a limiting
factor dissapeared. The CISC designers could level the score by introducing more complex solutions that
increased their performance a couple of percent, with little concern to how much die are was used. Even
though the CISC processor was several factors more complex than the corresponding RISC processor,
it was still keeping up with the RISC. Nowadays, the RISC and CISC paradigms are merged together
and uses techniques from both of the original paradigms. Now, when there are 10, 20 million or more
transistors available, the problem the designer is facing is more about making the most use of all the
transistors than how to t it all on one die. A simple processor can now be realized on only a fraction of
the available space. There are limits in the performance gains when increasing the cache size, deepening
the pipeline and increasing the number of registers. So, the question is what to do with the available
space? To gain more performance, new architectures like Multithreading, Simultaneous Multithreading
(SMT), Very Long Word Instruction (VLIW), and Single Chip Multiprocessor (CMP) are emerging.
These architectures will be discussed in section 2.7.
2.4 Design Aspects
14
2.4 Design Aspects
The designers of embedded processors are under market pressure when it comes to producing cheap, low
power-consuming, fast processors[22]. To meet the market demand for a SoC solution, the designers of
an embedded processor need to look at several design aspects, listed below.
2.4.1 Code Density
The size of a program may not be an issue in the desktop world, but is major challenge in embedded
systems. The embedded processor market is highly constrained by power, cost, and size. For control
oriented embedded applications a signicant portion of the nal circuitry is used for instruction memory.
Since the cost of an integrated circuit is strongly related to die size, smaller programs imply cheaper
smaller dies is needed, which in turn means cheaper dies can be used for embedded systems [81, 82].
Thumb and MIPS16 are two approaches that tries to reduce the code density of programs by compressing
the code. Thumb and MIPS16 are subsets of the ARM and MIPS-III architecture. The instructions used
in the subset are either frequently used or does not require full 32-bits or are important to the compiler
for generating small code. The original 32-bit instructions are re-encoded to be 16-bits wide. Thumb and
MIPS16 is reported to achieve code reductions of 30% and 40%, respectively. The 16-bit instructions are
fetched from instruction memory and decoded to equivalent 32-bit instructions that is run as normally
by the core. Both approaches have drawbacks:
Instruction widths are shrunk at the expense of reducing the number of bits used to represent
registers and immediate values
Conditional execution and zero-latency shifts are not possible in Thumb
Floating-point instructions are not available in MIPS16
The number of instructions in a program grows with compression
Thumb code runs 15-20% slower on systems with ideal instruction memories
Both Thumb and MIPS16 are execution-based selection form of selective compression which is a technique that selects procedures to compress according to a procedure execution frequency prole. The other
form is miss-based selection which is invoked only on an instruction cache miss. All performance loss will
occur on a cache miss path. This way, miss-based selection is based on the number of cache misses and
not the number of executed instructions as in execution-based selection. Speedup can be achieved by
letting the procedures with the most cache misses to be in native code.
Jim Turley has a dierent view on the techniques for reducing code density[89]: Claimed advantages
in code density should be considered in light of factors such as compiler optimization (loop unrolling,
procedure inlining, etc), the addressing (32-bit vs. 64-bit integers or pointers), and memory granularity.
Finally, code density does little or nothing to aect the size of data space. Applications working with
large data sets requires much more memory than the executable, thus code reduction does little help
here.
2.4.2 Power Consumption
Many products using embedded processors use batteries as power supply. To preserve as much power as
possible, embedded processors usually operate in three dierent modes: fully operational, standby mode
and clock-o mode [22]. Fully operational means that the clock signal is propagated to the entire processor, and all functional units are available to execute instructions. When the processor is in standby
mode, it is not actually executing a instruction, but the DRAM is still refreshed, register contents is also
available. The processor returns to fully operational mode upon a activity that requires units that are
not available in standby mode, without loosing any information. Finally, in clock-o mode, the system
has to be restarted in order to continue, which almost take as much time as a initial start-up. Power
2.5 Implementation Aspects
15
consumption is often measured as milliwatt per megahertz (mW/MHz).

The simplest way of reducing the power consumption is to reduce the voltage level. Today, CPU core
voltage has been reduced to about 1.8V and is still decreasing. Embedded processors are starting to
incorporate dynamic power management into their design. One example is a pipeline that can shut o
the clock at various logic blocks that is not needed when executing the current instruction[98]. This type
of pipeline is usually referred to as a power-aware pipeline. Also, it is no longer sucient to only measure
the power consumption of the CPU, as it gets integrated with its peripherals in a SoC. Instead, a power
consumption measure of the entire system has to be done.
2.4.3 Performance
Unlike the desktop market, performance isn't everything in the embedded processor market. Instead
factors like price, power consumption is equally important. A typical embedded processor usually executes
about one instruction per cycle. Today, performance is still measured as Million Instructions Per Second
(MIPS) which basically only reveals the amount of instructions executed per second, not if they were
any useful instructions executed. MIPS is not a good way of measuring performance, and section 2.8.1
looks at other alternatives. Sometimes, the usual performance of one executed instruction per cycle for
an embedded processor is not enough and other alternative architectures must be considered in order to
increase the performance. Section 4.3.7 discusses possible alternative architectures.
2.4.4 Predictability
Architectures that supports real-time systems must have the ability to achieve predictability [84]. Predictability is dependent on the Worst Case Execution Time (WCET) which is in turn dictated by the
underlying hardware. Much focus is on improving an architectures performance, and little thought goes
to make it predictable. This has lead to architectures that includes cache, pipeline, virtual storage management, etc, all which has improved the average case execution time, but has worsen the prospects for
predictable real-time performance.
Caches have not been popular in the real-time competing community, due its unpredictable behavior.
This is true for multi-tasking, interrupt driven environments which are common in real-time applications [87]. Here, the individual task execution time can have dier from time to time due to interactions
of real-time tasks and the external environment via the operating system. Preemptions may modify
the cache contents and thereby cause a nondeterministic cache hit ratio resulting in unpredictable task
execution task times.
Pipelines also introduces similar problems to caches concerning worst case execution time. There are
eorts to achieve predictable performance of pipelines without using a cache and without the hazards
associated with them [88]. This approach, called Multiple Active Context System (MACS), uses multiple
processor contexts to achieve increasing performance and predictability. Here, a single pipeline is shared
among a number of threads and context of every thread is stored within the processor. On each cycle,
a single context is selected to issue a single instruction to the pipeline. While this instruction proceeds
through the pipeline, other contexts issue instructions to ll consecutive pipeline stages. Contexts are
selected in a round-robin fashion. A key feature of MACS architecture is that its memory model allows
the programmer to derive theoretical upper bounds on memory access times. The maximum number of
cycles a context will wait for a shared memory request is dictated by the number of contexts, the memory
issue latency, the number of shared memory competing threads, and the number of contexts scheduled
between consecutive threads.
2.5 Implementation Aspects
There are several options available for the designer who wants to integrate an embedded processor into
a SoC. Besides building a processor from "scratch", there are other options available. The rst option
16
2.6 State of Practice
is to acquire the processor core as an hard IP-component7 in form of a specic semiconductor fabrication process and are delivered as mask data. Several hard IP-cores will be examined in section 2.6.
The second option is to acquire the CPU as a rm IP-component which is usually delivered in form
of a netlist. The third and last option is to acquire a soft IP-component in form of VHDL or Verilog
code or to produce a synthesizable core with a parameterizable core generator. There has been several research eorts to develop generators of parameterizable RISC-cores[73, 76]. One conducted at the
university of Hanover has developed a parameterizable core generator that outputs fully synthesizable
VHDL code. The generated core is based on a standard 5 stage pipeline (Figure 4). The designer has
many choices when using the generator (e.g pipeline length, ALU and data width, size of register le, etc).
The generated cores are simple RISC-processors with a parameterizable word and instruction width.
Instruction and data memories are provided as a VHDL template le for simulations, but they are not
suitable for synthesis. Instead, they should be taken from a technology specic library. Since the cores
are based on RISC-principles, the instruction set consists of only few instructions and addressing modes.
A typical 32-bit RISC core with 32 bit data path and 8 32 bit registers can with a 3LM 0.5 micron.
standard-cell library deliver about 100 MHz achievable clock-frequency.
Commercial core generators are also available from Tensilica, ARC, and Triscend[100, 101, 99].
The 4, 8 and 16 bit microprocessors was and still are dominating the embedded control market. In
fact, it was forecasted that eight times more 8-bit than 32-bit CPU's will be shipped during 1999[89].
The 32-bit embedded processor market diers from the desktop market in that there are about 100
vendors and a dozen instruction set architectures to choose from. The thing that makes 32-bit embedded
CPU's attractive is their ability to handle emerging new consumer demands in form of ltering, articial
intelligence, multimedia, still maintaining a low level of power consumption, price, etc. Next will follow
a brief presentation of available embedded processors commonly used today.
2.6.1 ARM
The Advanced RISC Machines (ARM) company is a leading IP provider that licenses RISC processors,
periphals, and system-on-chip designs to international electronics companies. The ARM7 family of processors consists of ARM7TDMI and ARM7TDMI-S processor cores, and the ARM710T, ARM720T and
ARM740T cached processor macrocells.
An ARM7 processor consists of an ARM7TDMI or ARM7TDMI-S (S stands for Synthesizable and
means that it can be acquired as VHDL or Verilog code) core that can be augmented with one of
the available macrocells. The macrocells provides the core with 8KB cache, write buer, and memory
functions. ARM710T also provides a virtual memory support for operating systems such as Linux and
Symbain's EPOC32. ARM720T is a superset of ARM710T and supports WindowsCE.
When writing a 32-bit program for an embedded system there might be a problem to t the entire
program in the on-chip memory. This kind of problem is usually referred to as a code density problem.
In order to address the code size problem ARM has developed Thumb, a new instruction set. Thumb is
an extension to the ARM architecture, containing 36 instruction formats drawn from the standard 32-bit
ARM instruction set that have been re-coded into 16-bit wide opcodes. Upon execution, the Thumb
codes are decompressed by the processor to their real ARM instruction set equivalents, which are then
run on ARM as usual. This gives the designer the benets of running ARM's 32-bit instruction set and
reducing code size by using Thumb.
7 Those
who are not familiar with the dierent layers of IP-components, can read the section
SoC Design.
17
The ARM9 family is a newer and more powerful version of ARM7 and designed for system-on-chip
solutions due to its built-in DSP capabilities. The ARM9E-S solutions are macrocells intended for integration into Application Specic Integrated Circuits (ASICs), Application Specic Standard Products
(ASSPs) and System-on-chip (SoC) products.
CPU core
Die Area
Power
Frequency Performance
ARM7TDMI 1.0 mm2 on 0.25 m 0.6 mW/MHz @ 3.3V 66 MHz 0.9 MIPS/MHz
ARM9E-S 2.7 mm2 on 0.25 m 1.6 mW/MHz @ 2.5V 160 MHz 1.1 MIPS/MHz
2.6.2 Motorola
The Motorola M-CORE microprocessor, introduced in 1997, was targeting the market of analog cellular
phones, digital phones, PDAs, portable GPS systems, automobile braking systems, automobile engine
control, and automotive body electronics. The development of the M-CORE architecture was designed
from the ground up to achieve the lowest milliwatts per MHz. It is a 32-bit processor that has a 16-bit
xed length instruction format, and a 32-bit RISC architecture. The M-CORE minimizes power usage
by utilizing dynamic power management.
Motorola has also developed a modern version of the 68K architecture, the Coldre, which is positioned
between the 68K (low end) and the PowerPC (high end). This architecture is also known as VL-RISC,
because although the core is RISC-like, the instructions are variable length (VL). VL instructions help to
attain higher code density, Coldre has a four-stage pipeline consisting of two subpipelines: a two-stage
instruction prefetch pipeline and two-stage operand execution pipeline.
2.6.3 MIPS
MIPS Technologies designs and licenses embedded 32- and 64-bit intellectual property (IP) and core
technology for digital consumer and embedded systems market. The MIPS32 architecture is a superset
of the previous MIPS I and MIPS II instruction set architectures.
2.6.4 Patriot Scientic
Patriot Scientic Corporation was one of the rst developing a Java microprocessor, the PSC1000. The
PSC1000 is targeted for high performance, low-system cost applications like, network computers, set-top
boxes, cellular phones, Personal Digital Assistants (PDA's) and more. The PSC1000 microprocessor is
a 32-bit RISC processor that oers ability to execute both Java(tm) programs as well as C and FORTH
applications. It oers a unique architecture that is a blend of stack- and register-based designs, which
enables features like 8-bit instructions for reduced code size. The idea behind the PSC1000 is to make
Internet connectivity for low cost devices such as PDA's, set-top cable boxes and "smart" cell phones.
2.6.5 AMD
(AMD)'s 29000K was an early leader which was frequently used in laser printers and network buses. The 29K family comprises three product lines, including three-bus Harvardarchitecture processors, two-bus processors, and a microprocessor with on-chip peripheral support. The
core is built around a simple four-stage pipeline: fetch, decode, execute, and write-back. The 29K has a
triple-ported register le of 192 32-bit registers. In 1995, AMD cancelled all further development of the
29K to concentrate its eorts on x86 chips.
Advanced Micro Devices
2.6.6 Hitachi
Hitachi SuperH (SH) became popular when Sega chose the SH7032 for its Genesis and Saturn video game
consoles. Then, it expanded to cover consumer-electronics markets. Its short, 16-bit instruction word
gives SuperH one of the best code density compared with almost any 32-bit processor. The SH family
2.7 Improving Performance
18
uses a ve-stage pipeline: fetch, decode, execute, memory access, and write-back to register. The CPU
is built around 25 32-bit registers.
2.6.7 Intel
Intel i960 emerged early in the embedded market which made it successful in printer and networking
equipments. The i960 is well supported with development tools. The i960 combines a Von Neumann
architecture with a load/store architecture that centers on a core of 32 32-bit general-purpose registers.
All i960s have multistage pipelines and use resource scoreboarding to track resource usage.
2.6.8 PowerPC
The PowerPC is one of the best-known microprocessor name next to Pentium and is steadily gaining
ground in the embedded space. IBM and Motorola are pursuing dierent strategies with their embedded
PowerPC chips, with the former inviting customer designs and the latter leveraging its massive library
of peripheral I/O logic.
2.6.9 Sparc
Sun's SPARC was the rst workstation processor to be openly licensed and is still popular with some
embedded users. The microSPARC are built around a large multiported register le that breaks down
into small set of global registers for holding global variables and sets overlapping register windows. The
microSPARC's pipeline consists of an instruction-fetch unit, two integer ALUs, a load/store unit, and a
FPU.
Pipelining is a way of achieving a level of parallelism, resulting in a low CPI count. In order to be
even more eective, linear pipelining will not suce and other techniques have to be considered. These
techniques have the ability to execute several instructions at once, resulting in a CPI count below 1.0.
The most popular techniques includes Multiple-issue Processors (such as Very Long Instruction Word
(VLIW) and Superscalar Processors), Multithreading, Simultaneous Multithreading (SMT) and Chip Multiprocessor (CMP). Also, another technique will be discussed that tries to come to terms with the ever
growing memory-CPU speed gap. There is one technique, called prefetching or preloading, that hides
the memory latency by fetching and storing required data or instructions in a buer before it is actually
needed.
2.7.1 Multiple-issue Processors
Although there are techniques that can remedy most of the stalls in an ordinary pipeline, the ideal
result is still only a CPI count of 1.0, thus executing exactly one instruction for every machine cycle.
This performance is not always enough and other ways of achieving a higher level of parallelism need
to be considered. Multiple-issue processors tries to execute several instructions in a machine cycle, thus
achieving a higher rate of Instruction-Level Parallelism (ILP). There are mainly two types of processors
using these techniques, namely Very Long Instruction Word (VLIW) and superscalar processors. Also,
in addition to the two architectures, a third alternative processor, called Multiple Instruction Stream
Computer (MISC) will be discussed.
As the name implies, a VLIW processor issues a very long instruction packet that consists of several
instructions. An example of a instruction packet can be seen in (Figure 6), were there is room for two
integer/branch operations, one oating point operation, and two memory references. In the case of VLIW
processors, the task of nding independent instructions in the code is done by the compiler instead of dynamic hardware as in superscalar processors. Additional hardware is saved because the compiler always
19

Integer
Integer
FP
Memory
Memory
Figure 6: A VLIW instruction packet example.

executes independent instructions, thus eliminating the need for hazard-detecting hardware. The ideal
scenario of a VLIW processor using the example instruction packet would be that the compiler always
nds enough instructions to ll the entire packet at every cycle. Unfourtunatly, this is not the always the
case and techniques that resolves dependencies have to be used. Examples of such techniques are loop
unrolling and software pipelining. Both techniques will be demonstrated in section 2.7.5.
The drawback of VLIW processors, as in all multiple-issue processors, is that they suer from the
natural lack of ILP in programs [8]. There are several functional units in the VLIW processor that
requires several cycles to complete (e.g oating point, memory reference, branch), resulting in a multicycle latency that makes the task of keeping the functional units busy dicult. Another drawback is the
amount of hardware needed in order to both issue and execute multiple instructions per cycle. First, a
duplication of the functional units is needed in order to execute multiple instructions of the same type.
Secondly, an increase of bandwidth is needed to be able to access the memory and the register le, resulting in increased complexity and die area. An example of a successful VLIW processor is Crusoe from
Transmeta[102].
Figure 7: An example of a superscalar architecture. Source: Marc Torrant[97]

In a typical superscalar processor, the hardware might theoretically issue from one to eight instructions in a clock cycle [8]. Here, the issue rate varies, due to limited amount of functional units (e.g three
memory references can not be handled in a single cycle if there are only two memory functional units
available) and instruction dependencies. As explained earlier, the VLIW processor relies on its compiler
to nd independent instructions, thus relying on static issue capability, whereas the superscalar processor
has both static issue capability in form of a compiler and dynamic issue capability maintained by its
hardware. Looking at gure 7, the goal here is nd pairs of independent instructions that can execute in
parallel. Usually there is one or several piplines or pipeline-stages that are dedicated for oating point
operation, while the others are for integer operations. Assuming a superscalar with two issue slots, where
one slot is for oating-point operations, the ideal scenario would be that both pipelines could issues an
instruction each every cycle. This would result in an increase of total instructions executed per cycle,
thus lowering the CPI count below 1.0. There are less dependencies, because the oating-point operation
uses dierent registers than the integer operation. One potential hazard is to issue an integer operation
that is a oating-point load, which may cause a RAW hazard. The pairing of instructions is done by the
compiler and is considered as static scheduling. There are several algorithms that does dynamic scheduling, thus issuing instructions at run-time. Commonly used dynamic scheduling techniques are scoreboard
20
and Tomasulo's algorithm [8].

Finally, there is an alternative multiple-issue processor that tries to take advantage of the benets of
both VLIW processors and superscalar processors. This processor is referred to as Multiple Instruction
Stream Computer (MISC) [105]. It uses multiple asynchronous processing elements to separate a program
into streams that can be executed in parallel. To facilitate low latency intra-MISC communication, a
con ict free message passing system is integrated at the lowest level of processor design.
scheme
2.7.2 Multithreading
A process or thread is an abstract entity that performs tasks8 [1]. The aim of multithreading is to divide
a program into several threads that can switch among each other when a communication event occur.
Hopefully, events with long latencies can then be tolerated by switching to a new thread when they occur.
Multithreading implemented in hardware has several benets:
If the program is divided by the programmer into several threads, then no special software analysis
is needed.
It handles unpredictable situations (such as cache miss, and communication misses) well.
It does not reorder instructions like some of the other techniques, thus keeping the memory consistency.
Its drawbacks are that signicant changes has to be done to the microprocessor architecture, and it
has not been particularly successful in uniprocessor systems. Therefore this technique will be discussed
brie y and give room for another technique that is an extension of multithreading, called simultaneous
multithreading.
ThreadID
5
4
3
2
1
Time
Figure 8: An example of multithreading.

Multithreading begins by executing instructions of a thread. Then, when the thread encounters a longlatency event, it is switched out and lets a new thread start its execution (gure 8). The state of a thread
is called a context, so multithreading can also be described as multiple-context -switching. The state of a
thread consists of processor register contents, the program counter, the stack pointer, and condition codes.
When switching the thread, several actions are possible. A thread might need to ush the pipeline, or if it
suers from a long latency, store its context in memory before switching, and load it back when its needed.
Hardware multithreading is divided into two categories, depending on what event triggered the switch.
In blocked approach, a thread is switched when a long latency event occurs, whereas in the interleaved
approach, the switches occur on predened time boundaries.
8A
task is dened as a piece of work done by a program.
21
Figure 9: The partitioning of issue slots between superscalar,multithreading, and simultaneous multithreading. Source: Dean M. Tullsen [75]
2.7.3 Simultaneous Multithreading
Multiple issue processors has their limitations because they only rely on the parallelism of instructions,
which is inherently limited due to the natural dependencies within a program [74, 75]. A multiprocessor
consisting of several multiple issue-processors is a good solution that tries to combine ILP with Thread
Level Parallelism (TLP), but has limited performance when the threads run out of parallelism. Simultaneous multithreading is a processor design that combines the earlier techniques in order to exploit
parallelism at both instruction-level and thread-level as much as possible. From superscalars it uses the
ability to issue multiple instructions each cycle and as in multithreaded processors it contains hardware
state for several threads. The result is a processor that can issue multiple instructions from multiple
threads each cycle. Also, in addition to that it is capable of parallelism at both instruction and thread
level, it performs well when executing a single thread, because it has all the resources to it self, thus
matching the performance of a ordinary superscalar architecture. On the hardware side, SMT is simply
an add-on component that gives a conventional superscalar architecture the ability to handle multiple
threads.
The dierence between superscalar, multithreading and simultaneous multithreading can be seen in gure 9. Each row represents the issue slots for a single execution cycle. A lled box means that a processor
has found an instruction to execute in that issue at that particular cycle. Empty boxes indicates unused
slots. Figure (a) shows how a conventional superscalar might execute its instructions. A superscalar tries
to execute as many instructions as possible in a program or thread. When it is hindered by dependencies
between instructions, it must stall and wait for them to be resolved, resulting in both horizontal and
vertical waste. Figure (b) shows a multithreaded architecture executing the same program. Here the
architecture still suers form the dependencies between instructions, but can switch threads when long
latency events occur. This results in a similar horizontal waste as in the superscalar architecture, but
an improvement in vertical waste, due to it's ability to tolerate latency. Figure (c) shows how a SMT
architecture can issue several instructions from several threads at each cycle.It achieves ILP by choosing
instructions for several threads. If one thread has a high level of ILP, then it can ll the issue slots,
whereas if there is poor ILP among several threads, they can run together to ll the issue slots. This
results in little waste both vertically and horizontally.
The overall benets of SMT has been discussed, but one serious drawback is that SMT never has been
implemented[97]. The results from evaluating the architecture has been acquired by simulation. There
22
is a positive side to this, due to Digital Corporation that has announced in their roadmap that the next
Alpha processor will be using SMT[131]. Even though SMT haven't been implemented, there are many
research groups:
Washington - Simultaneous Multithreading
Illinois Urbana-Champaign - I-ACOMA
UC Irvine - Multithreaded Superscalar
UC Santa Barbara - Multistreamed Superscalar
Michigan - Simultaneous Multithreading
2.7.4 Chip Multiprocessor
To understand what multiprocessor systems are about, a "standard-taxonomy" among computers originally stated by Flynn is often used. Computers are classied by the way instructions and data is provided
to the system[2]:
SISD Single Instruction Single Data (SISD) computers includes the ordinary computer that decodes a
instruction at a time.
SIMD Single Instruction Multiple Data (SIMD) is the classical form of array processors. Here, several processor units are controlled by a single control unit. The processor units receive the same
instructions and addresses but they operate on dierent data.
MISD Multiple Instruction Multiple Data (MISD) usually referres to pipelined computers. The pipeline
stages can be seen as several instruction streams that ows through a gradually transforming data
stream.
MIMD Multiple Instruction Multiple Data (MIMD) applies to multiprocessor systems, on-chip or not.
The processor units in a multiprocessor system are coupled in order to be able to exchange data
and to synchronize with eachother.
Todays processor design often make use of sophisticated architectural features to try to nd independent instructions within a program or thread. Examples of such techniques are out-of-order execution and
speculative execution of instruction after branches predicted with dynamic hardware branch prediction
techniques. Future performance improvements will require "wider" uniprocessors that can issue more
instructions at a time. This will of course make life harder for people that have to design and verify
these processors. One way of exploiting multiple threads of control is to spread them out over several
simpler processors in a multiprocessor system. A Multiprocessor system implemented on a single die is
often referred to as Chip Multiprocessor (CMP) or Multiprocessor System-on-chip (MSoC). The advantage CMP has over the other architectures, is that it consists of duplicated simple cores. This way, the
designer only needs to verify a single core, which is much easier than verifying a SMT or multiple-issue
processor. Because CMP uses relatively simple single-thread level processor cores, they will not be able
to achieve the same level of ILP as a SMT architecture. Everything isn't simple about CMP, because it
must deal with the same issues as normal multiprocessor systems, namely cache coherence, consistency,
and synchronization.
An example of a CMP project is The Stanford Hydra (see Research section in Introduction for information).
23

2.7.5 Prefetching
Prefetching is a technique for hiding the memory latency that is growing every year, due to the growing
dierences in speed between the CPU and memory. The memory latency is hidden by overlapping it
with useful instructions. This technique can be implemented both in hardware and in software. The
hardware- approach often involves communication with or modications to the cache, therefore it will be
discussed in the section 4.6. Although there are techniques for both instruction and data prefetching,
most focus will be on data prefetching.
Software-directed prefetching is often done during the optimization phase and rely on the compiler to
do static program analysis in order to insert prefetch instructions at selected places in the code. The
strategy is to place the prefetch instruction early enough before the data is needed, so that the entire
latency is being hidden. The instruction can not be placed to early though, because if it stays there
to long, the data might be replaced in the cache just before it is needed. Early prefetches might also
replace currently used data, resulting in a cache pollution. The following example will show how several
prefetching techniques are applied [70].
The following code is presented by the programmer and the aim is to start prefetch the data, in this
case elements of array a, before it is needed. This code will also show how loop-based prefetching works.
for(i = 0; i < N; i++)
sum = a[i] + sum;
This loop sums the elements of an array. Assuming a cache block size of four words, this code segment
will cause a cache miss every fourth iteration. To try to avoid this cache miss a prefetch instruction is
inserted just before the computatation of the sum. The prefetched element is fetched one loop iteration
before it is needed.
for(i = 0; i < N; i++) {
fetch( &a[i+1]);
sum = a[i] + sum;
}
The observant reader can see in the code listed above that there are to many prefetching instructions
issued, because the cache block had a size of four words, thus four consecutive instructions is already
fetched when executing a single prefetch instruction. A solution could be to insert a predicate that
checks whether the loop has iterated four times since the last issued prefetch instruction. This results
into wasted cycles, so another technique, loop unrolling, is considered. With loop unrolling, the loop body
is replicated by the factor equal to the cache block size, in this case four times. The prefetch instruction
will now prefetch only the fourth element of the array, as shown in the code below.
for(i = 0; i < N; i+=4) {
fetch( &a[i+4]);
sum = a[i] + sum;
sum = a[i+1] + sum;
sum = a[i+2] + sum;
sum = a[i+3] + sum;
}
This will still result in unnecessary cache misses, because during the rst loop iteration, the prefetch will
never be used. Also, unnecessary prefetching is issued in the last iteration of the loop. Another technique
showed below, software pipelining, extracts the rst and last iteration of the loop into a prologue that
issues a prefetch and a epilogue that does not issue a prefetch.
fetch( &sum );
fetch( &a[0] );
24

for(i = 0; i < N-4; i+=4) {
fetch( &a[i+4]);
sum = a[i] + sum;
sum = a[i+1] + sum;
sum = a[i+2] + sum;
sum = a[i+3] + sum;
}
for( ; i < N; i++)
sum = a[i] + sum;
Now, all references are covered because a prefetch instruction is issued before the corresponding data
is used. One problem remains though, and that is the issue of the compiler having to be sure that the
prefetch will fetch the data fast enough so that it will be available when it is used. If the time it takes to
prefetch an instruction takes more than one cycle, as in the shown examples above, the prefetch distance
have to be calculated. The prefetch distance, , is calculated by dividing the average cache miss latency,
l, with the estimated number of cycles in the shortest possible execution path through one loop iteration,
s, including any prefetch overhead.
=d e
Assuming an average cache latency of 100 and a minimum loop iteration time of 45, the prefetch
distance will be 3. This means that the prefetch instruction have to be issued three prefetch iterations
earlier, shown in the code below.
l
s
fetch( &sum );
for(i = 0; i < 12; i +=4)
fetch( &a[i] );
for(i = 0; i < N-12; i+=4) {
fetch( &a[i+12]);
sum = a[i] + sum;
sum = a[i+1] + sum;
sum = a[i+2] + sum;
sum = a[i+3] + sum;
}
for( ; i < N; i++)
sum = a[i] + sum;
A natural question is to ask what performance gains can be expected by using these techniques? The
answer is that these techniques are usually restricted to loops in scientic programs. This kind of loops are
common in scientic programs and prefetching can nearly double the performance []. General applications
often use the same data in loops, resulting in high cache utilization, which diminishes the benets of
prefetching. prefetching has also some negative "sideects":
Increased total code size as a result of inserted prefetching instructions.
Executing prefetch instructions (especially to calculate the prefetch address) takes time, thus increasing total execution time.
Prefetching will also introduce new challenges in a multiprocessor environment[67]:
Prefetches increases memory trac.
25
2.8 Measuring Performance
Prefetching of shared data items can increase the coherence trac.

Invalidation misses are not predictable at compile time.
Dynamic task scheduling and migration policies are detrimental to the eciency of prefetching.
In conclusion, a short listing of the benets and drawbacks of each technique is shown below.

Technique
VLIW
Superscalar
Multithreading
SMT
CMP
Benets
Reduced architecture complexity
Can reach high ILP
Hides latencies well
Exploits both TLP and ILP
Simple cores. Good TLP, moderate ILP
Drawbacks
Complex compiler needed. Exploits only ILP.
Complex architecture. Hard to test and verify
Needs changes to architecture
Complex architecture. Not implemented yet
Must consider multiprocessor aspects
2.8 Measuring Performance
A performance improvement is not worth much if it can't be measured. It is also important to measure it
the right way, thus telling the truth about the characteristics of an embedded processor. There are several
options available, each with their advantages and drawbacks. Some of these options will be discussed
here.
2.8.1 Benchmarking
To measure the performance of an embedded processor has been problematic for a long time now. Benchmarks such as Whetstone and Dhrystone does not really give a true picture of a processor's performance.
There is no meaning in measuring a embedded processor's MIPS count, because it does not actually
say if those executed instructions was doing anything (the could be NOPs). As an even more serious
error is then to divide the MIPS count with the processors power consumption, resulting in MIPS/Watt
comparisons that should not be taken seriously. Moreover CISC and RISC architectures are dierent in
terms of the amount of "work" a single instruction performs [78].
To come to terms with this problem, the EDN Embedded Microprocessor Benchmark Consortium (EEMBC
- pronounced "embassy") was founded in April 1997 to develop performance benchmarks for processors
in embedded applications. EEMBC comprises a suit for Automotive/Industrial, Consumer, Networking,
Telecommunications, and Oce Automation and Telecommunications market. These benchmarks target
specic applications that include engine control, digital cameras, printers, cellular phones, modems, and
more. With the assistance of several industry experts, the consortium created 37 individual algorithms
the constitute EEMBC's Version 1.0 suite of benchmarks. Today, EEMBC consists of 30 semiconductor
company members and 3 compiler company members. The annual fee for a new member is $30 000. The
EEMBC Benchmark suite code is available for all board members.
One of the key components developed at the EEMBC is a portable benchmarking test harness that runs
on a variety of host platforms, interfacing to a number of "benchmark target" platforms. The test harness
consists of:
Standard API for benchmark support including File I/O download.
Benchmark loader
Consistent, repeatable execution environment
Framework for fully-automated execution of benchmarks
Standardized reporting, diagnostics, and log les
2.9 Trends and Research
26
The test harness has been ported to little an big endian target boards, 16-64 bit boards. It includes
support for both Z-modem and uuncoding uploads/downloads, and takes up very small amount of memory on the target. An initial look at the rst benchmark results reveals that several results from a specic
processor is available, depending on which compiler is used. Performance is seems to be measured as
iterations/sec (no explanation found) and the code size in Kb. The processor who reaches the most iterations per second and has the smallest code size is on top of the list, according to the benchmark results.
Other benchmark organizations include Berkeley Design Technology (Independent DSP Analysis and
Optimized DSP Software), Nullstone Corporation (Automated Compiler Performance Analysis), Standard Performance Evaluation Corporation (SPEC) (Develops standardized set of relevant benchmarks
and metric for performance evaluation of modern computer systems), Business Application Performance
Computing (BAPC) (develop and distributes a set of objective performance benchmarks based on popular
computer applications and industry standard operating systems).
2.8.2 Simulation
A more inexpensive and exible way of measuring the performance of an embedded processor might be to
make a simulator. Most (or all?) available simulators seem to be custom made to suit a specic purpose.
Simulation is a good way to evaluate dierent options when designing a system, but should not be used
to draw major conclusions concerning performance. After all, simulation is done in software and cannot
give a 100% true picture. An example of a simulator is the SMTSim Multithreading Simulator 9 written
by Dean M. Tullsen which has been used in two of his published articles concerning multithreading.
2.9 Trends and Research
This section takes a look at what might lie ahead for the embedded processor of tomorrow and where
there are research eorts to continue its development.
There has been an increase in DSP functionality among embedded processors in recent years. This is
happening because many embedded systems today often include a CPU and a DSP, where the DSP helps
the CPU with numerical calculations. As the CPUs get faster, the can take over the job of the DSP and
replace it in the future. On the other hand, DSPs are getting faster and more complex by the day and
there are arguments that suggests that the DSP should include "CPU functionality". The future will
show who wins this battle or if there will always exist separated CPUs and DSPs[?].
The steady improvement of process technology indicates that the future embedded processor might
reach a core voltage below 1.0 in the near future, resulting an a very low power consumption. More
companies are also beginning to realize the benets of selling embedded processors as IP-components.
This will likely contribute to a increasing competition and hopefully cheaper products.
2.9.1 University
The Stanford Hydra project has already been discussed in the introduction but will only be mentioned
here as one of the largest research projects conducted on CMP architectures. Other smaller CMP eorts
are conducted at the Wisconsin Multiscalar, Carnegie-Mellon Stampede, and MIT M-Machine.
2.9.2 Industry
ARM has a Technology Access Program (ATAP) where they help several companies or research groups in
their design of embedded processor and SoC's. Companies currently in cooperation with ARM through
ATAP are: Barco Silex, Cadence, Sican, and Wipro Infotech. IBM announced plans to make CMPs in
form of IBM Power4. This is the rst commercially proposed CMP targeted at servers and other systems
that already makes use of conventional multiprocessors. Also, Sun has plans to build a CMP with the
Sun MAJC. It has a shared primary cache and is designed to support Java execution.
9 SMTSIM
is available at http://www-cse.ucsd.edu/ tullsen/smtsim.html.
27
3 INTERCONNECT
3 Interconnect
This chapter focuses on the communication channels that ties the components (CPUs, memory, peripherals etc.) together in a computer system, called the interconnect. This topic is vast and a number of
books and articles have been written about interconnects, so concentration is made on the architectural
part of the interconnect, restricted to single processor computers and small scale parallel computers with
emphasis on SoC. First some basics about the bus based interconnection are presented, before going into
more complex switched interconnection networks. At the end of the chapter, SoC interconnection details
are presented before an overview of existing SoC interconnections.
3.1 Introduction and basic denitions
Since the dierent components in a computer system need to communicate with each other, there is need
for communication channels between them. Communications can be divided into hierarchical layers [33].
The lowest layer, the physical layer deals with the physical wires and drivers. Timing of information at
this level is of key importance. At the next layer, the transfer layer, there is a set of rules called a protocol
that sets up how the transactions are performed. A transaction layer can be introduced to enable pointto-point communications between components. At the top, the application layer, there is no information
of how data is transferred, only that it is sent from the source and delivered to the destination. The
two main architectures for interconnections are shared media and switched media interconnections[8].
Designs using shared media approach are often called bus-based solutions and switched media is referred
as point-to-point designs. In point-to-point architectures the communication channel is always shared
between two devices, in contrast to the bus based architectures where there is possibility for more than
two to share a channel. The gure 10 shows the dierent architectures. Point-to-point interconnections
will be further examined in section 3.4.
B
A
Figure 10: The left gure shows a bus-based system where the devices A, B and C communicates via
a common bus. In the true point-to-point approach shown on the right side, any of the devices can
communicate with another without interference on the communication channels between them.
3.2 Bus based architectures
A bus consists of a number of electrical wires and rules that deals with how transfers of data is done on
the bus. In completely parallel buses each bit used in the protocol has its own dedicated wire in contrast
to completely serial buses where all the information is multiplexed onto a single wire. Multiplexing means
that a line is shared by two or more individual signal sources sending information over the line at dierent
point of times. Which of the sources that are using the line is controlled by a multiplexer. For example,
using multiplexing to form only one bus from two separate 32-bit address and data buses, reduces the
number of signal pins from 64 to 32. In practice most computer buses are parallel, where in some cases
multiplexing is used on a subset of the signals. Transfers on a bus is initiated by a bus master. The
28
master writes or reads data from other units attached to the bus, called slaves. An simple example is
a CPU (acting as a master) reading data from memory (slave). A generic term for masters and slaves
is devices. The sequence of actions, from retaining bus ownership to transferring data and breaking the
connection is called a transaction. Several devices may have the ability to be a master on a bus, but
only one is allowed to be in control of the bus at any time. Therefore there is a need to resolve potential
contention between them via an arbitration process handled by an arbitration unit.
3.2.1 Arbitration mechanisms
There are two major aspects concerning arbitration. One is the algorithm that decides which device will
get the master-ship over the bus. The second one is where the arbitration is done. The basic arbitration
algorithms are explained below.

Round-robin
The round-robin
Priority based arbitration
Hierarchical round robin
Time-shared
A time-shared bus
time-slot, in which
or fair algorithm will serve bus ownership requests sequentially. It behaves like a
FIFO-queue where every device that wants to become a master is put at the end of a queue. The
device at the front of the queue will retain bus ownership and when the transactions of data is
nished the entry will be removed from the queue.
Priority algorithms grants the bus to the device that has been given the highest priority. Depending
on implementation, a lower priority ongoing transaction may be interrupted if a device with higher
priority wants to use the bus. Priorities can be static or dynamic.
A combination of the priority based algorithm and round-robin is the hierarchical round-robin
where several levels of FIFO-based queues exist. If there are any devices waiting for ownership at
the highest level, it will be granted the bus. Otherwise a master is selected from the nearest level
below containing a non-empty queue.
uses arbitration where every device has been given a specic time-interval, a
it will be a bus master. Every device will be mastering the bus at a dierent
point of time, thus eliminating contention. The schedule of ownership is repeated at given rate as
the example shows in g 11.
T
bus mastership
A B X C A B X C A B X C
time
Figure 11: The three devices A, B, and C has been given an own time-slot in which they act as a master.
The slot marked X is an unused slot. The schedule is repeated periodically with the period T.
The centralized arbitration architecture has a single arbitration unit receiving separate request signals
from each master. After requesting the bus, competitive devices must wait for the arbiter to set the
corresponding grant signal to the device before using the bus. By separating the arbitration and transaction buses, arbitration is permitted to occur in parallel with data transfers and thus increasing bus
performance. Distributed arbitration is used in many newer buses. In this solution most the arbitration
circuitry is located at each master. During the arbitration each competitor for the bus sets its own request
line and then listens to the other request lines before taking decision whether master-ship is retained or
not.
29
3.2.2 Synchronous versus asynchronous buses
After a device has gained control over the bus, it can begin to carry out bus cycles. A bus cycle is exchange
of information between master and slaves, including data and timing information. In synchronous bus
architectures the bus includes a clock in the control lines and a protocol that samples the lines at xed
times with respect to the clock [8]. All the devices must obey the same distributed clock which results in
a limitation of the bus length to avoid clock-skewing. The duration of a bus cycle is constrained by the
rate of the clock which often is constrained by the slowest unit attached to the bus [60]. By introducing a
wait protocol decrease of overall performance can be avoided. An asynchron bus does not have a clocked
line at all. They instead use handshaking protocols between master and slave by using timing signals
to indicate the validity for the information. Handshaking protocols in asynchronous system takes full
advantage of the speed of fast responding devices without having to be concerned about slower devices.
Asynchron buses does not have the same restrictions on length since there is no worry for clock skewing
or synchronization problems.
3.2.3 Performance metrics
Bus performance is a function of bus width, bus arbitration, clock speed, bus protocol, and the application
using the bus. Two very common performance measure for buses is bandwidth (also known as throughput )
and latency. Bandwidth is often measured in Megabytes per second (MB/s) and dened as the amount
of data transmitted or received per unit time. When talking about bandwidth people usually mean burst
throughput or peak throughput, which is the theoretical rate at which reads can be performed using the
largest possible burst cycle. A burst transfer is characterized by only sending one address on the bus
followed by multiple data transfers. This mechanism is also called the block transfer mechanism [55] and
is for example implemented in PCI-bus and FutureBus+ (see section 3.3). In addition to throughput,
latency is also an important measure especially for multimedia systems. Bus latency is the response time
from the point of time where a device wants read (or send) data until the rst data is read (or sent).
The fact that CPUs are getting faster quicker than the memories has made the initial latency a dominant
factor in bus usage [55]. More time is spent on waiting for the initial latency than transferring the data
from memory to CPU. After transferring the last data, some buses require time before leaving the bus
in an idle state. This time is referred as turn-around latency. By decreasing the latency throughput on
the bus can be increased signicantly.
3.2.4 Pipelining and split transactions
When an address is transferred between a master and a slave the address bus is occupied until the
slave accepts the address and completes the handshake. The situation is similar for the data bus. By
implementing a latch into the bus interfaces at both sides, decouples the bus from the device. This allows
an address, address packet n to hold by the latch at the target, freeing the initiator of the transaction to
to send the next address, address packet n+1 time. At the same time the initiator interface can send the
address packet n. Using this technique means that the actions are serialized but several requests are in
the air. This pipelining technique is applicable to both data, address and arbitration. The arbitration
unit may respond to a requesting device: \you may use the bus next time it goes idle".
In a split-transaction bus the read request is separated from the data transfer, enabling devices to
make read requests during the initial latency for the rst request. Patterson and Hennessey does not
separate split-transaction buses from pipelined buses [8]. Using split-transactions does not decrease the
initial latency, but it increases the utilization of the bus. A drawback with these two techniques are the
increased complexity.
3.2.5 Direct Memory Access
A Direct Memory Access (DMA) controller allows devices to transfer data from one location to another
without intervention from the CPU. This means that the DMA controller must act as a master on the
bus. The DMA controller has at least four register, which all can be loaded by software from the CPU:
30
the memory address to be written, count of how many bytes to be transferred, the I/O devices number
or its address space, and the direction of the data (from memory to I/O or from I/O to memory). The
DMA controller has two modes of operation:
Using the burst mode means that during the DMA data transfer the CPU can not use the bus. The
CPU will be blocked until the DMA has been nished but data will be moved very fast.
In the cycle stealing mode the DMA is only allowed to make transfers when the CPU is not using
the system bus. Using this method the DMA transfer rate is limited to the bus width per CPU
instruction cycle.
To initiate a DMA transfer the CPU must set up the registers. After DMA has nished the transaction
it interrupts the CPU [8]. In order to prevent the DMA from monopolizing the bus and causing very high
latency to other devices, bus accesses has maximum DMA block size is typically placed on the amount
of data that can be transmitted as a single DMA block. Lahiri, Raghunuthan and Dey [130] observed
that DMA block transfer size can signicantly aect system performance and that the optimal value for
the size depend heavily on the characteristics of the trac seen on the bus.
3.2.6 Bus hierarchies
Choosing only one single bus for the whole system can be a performance bottleneck since only one master
can communicate with one or more slaves at a given point of time. When extending the number of buses
to more than one, parallelism is introduced and therefore potentially higher performance. These kind
of buses are called multisegment buses [60] and the buses in the system are called segments. A system
of n segments could have n masters making n transactions simultaneously. An example of a 2 segment
bus i shown in gure 12. The two segments are interconnected via a segment interconnect, often called
a bridge.
M
S
Segment 1
B
Segment 2
Figure 12: A multisegment bus example. The devices marked with M and S are masters and slaves
respectively. Segments are interconnected via a bridge marked B.
The hierarchical structure of multiple buses within a system is typically organized by bandwidth
demands. An example is the gure 12, where the rst segment could be the system bus connecting
CPU to high performance units. The second segment is a slower bus for less peripherals demanding less
bandwidth. Splitting the bus into several segments has great benets:

Lower power consumption
By splitting the bus, smaller drivers can be used which results in signicantly power consumption [57]. Using a lower clock frequency on the less demanding bandwidth buses, decreases the
consumption even more since power consumption is proportional to the frequency.
3.3 Case studies of bus standards with multiprocessor support

31
Lower noise problems
Reducing the bus wire length reduces the amount of capacitive coupling noise [57] (this type of noise
occurs also sometimes in telephony systems during a call, another call is heard in the background
at low volume).
Hsieh and Pedram [57] found that using a split-architecture, power consumption can be reduced from
16 to 50 percent in comparison with a single bus structure. Instead of together the two buses via a bridge
it is also possible to have them totally separated from each other by connecting them to two dierent
interfaces at the CPU.
3.2.7 Connecting multiprocessors
A multiprocessor system where all the processor nodes are connected to a common bus is called a shared
multiprocessor. The shared bus is a very popular solution to interconnect several processors as
evidenced by the number of commercial multiprocessor systems using this solution [5]. Examples of
shared bus based computers are Silicon Graphics (SGI) Challenge and Sun Enterprise. There are several
reasons for the popularity. First of all, a shared bus is very easy to implement and expand and it makes
a natural medium for broadcasting. This is also a preferable choice for Shared Memory Multiprocessors
(see section 4.5.1), since the bus structure enables implementation of snoopy cache coherence protocols
[5]. Bus snooping is for example implemented in the FutureBus+ described in section 3.3.1. To allow the
system software to build complex locking mechanisms for mutual exclusion, a bus can be extended with
a with a bus locking mechanism in hardware [55].
bus
3.3 Case studies of bus standards with multiprocessor support
In this section three bus standards and their properties are brie y reviewed. As the heading indicates
the buses has some form of support for multiprocessing, which of course does not prevent to be used in
single processor systems. Using standard buses simplies the design and implementation of a complex
system.
3.3.1 FutureBus+
The work on the original IEEE FutureBus standard began already in 1979, which later evolved to the
FutureBus+. FutureBus+ is an architecture, processor, and technology independent standard with no
technology upper limits [55]. As technology improves the asynchronous bus can go faster and faster
since the only physical limitation is speed of light. The specication includes split transactions and a
message passing protocol for ecient multiprocessor communication. Fault tolerance has also been in
mind, resulting in parity checks on all lines and distributed arbitration to reduce the risk of a single
point failure. FutureBus+ basically contains two individual buses. One is a 64-bit wide multiplexed
bus responsible for all address and data transfers. The second bus is an arbitration bus that can be
used parallel with the other bus with hides some of the latency associated with a distributed arbitration
protocol for gaining bus ownership. A bus locking mechanism is also supported. To provide support
for cache coherence FutureBus integrates a MESI cache coherent protocol (see section 4.5.5). Bus-tobus bridges are used to connect FutureBus to other well used buses as VME, Multibus II and Scalable
Coherent Interface (SCI). FutureBus+ was used as a global bus in the implementation of shared memory
multiprocessor DICE [62].
3.3.2 VME
(VME) [55] is a standard bus used in high performance multiprocessor servers

and high end embedded systems. VME systems are typically based on a backplane with one or more CPU
and I/O cards. The asynchronous I/O structure supports up to 21 I/O cards. VME cards is delivered
in three sizes: 3U, 6U, and 9U. The dierence between them is the number of connectors and pins. The
actual bus width is set dynamically during a transaction. VME supports pipelining, data widths from
Versa Module Eurocard
32
3.4 Point-to-point interconnections
16 up to 64 bits, and address widths from 24 up to 64 bits. The burst throughput is 80 MB/s. All
popular microprocessors are supported by VME including Motorola 680X0-series, the SPARC, ALPHA,
and X86-based processors [55].
3.3.3 PCI
The Peripheral Component Interconnect (PCI)[55] is a local bus that solves the compability problem
in a elegant way. By interfacing to PCI instead of the CPU bus, peripherals remain useful even if the
CPU is exchanged. The system can incorporate with an ISA, EISA or MicroChannel bus and adapters
compatible with these buses. In PCI a basic transfer is a burst. Addresses and data are multiplexed onto
the same 32-bit address/data bus. With PCI its possible to build several hierarchies of buses connected
to each other via bridges. Arbitration is done centrally but the arbitration algorithm can be specied by
the user. A master on the bus may remain as owner of the bus as long as no other device request the
bus. This feature is called bus parking. The PCI-bus is used in the Scalable Architecture For Real-time
Applications (SARA), a current research project in scalable parallel systems [90].
Between the two extreme interconnection architectures, the single shared bus10and a fully connected pointto-point interconnect where a node has a link to directly to all other nodes , a number of interconnect
topologies has been examined in the literature. Figure 13 shows an abstract view of a general interconnection network. In this section the unknown content of the interconnection network and some design
aspects will be brie y discussed.
M
INTERCONNECTION NETWORK
Figure 13: An abstract view of an interconnection network. The modules marked M, P and H are
memories, processors and peripherals respectively.
In contrast to the shared bus approach, using switches that allows communication directly from source
to destination makes it possible for several pairs of nodes to communicate simultaneously. The technique
is used in shared memory multiprocessors where the switches routes requests from a processor to one
or several dierent memory modules. Moving from the shared bus architecture to a more complex
interconnection network introduces many new terms and design considerations. The design considerations
concerns the topology, switching strategy, routing algorithm and ow control mechanism:

10 A
Topology
The interconnect structure at the physical level, i.e how the nodes are connected to each other is
called its topology. The topology is a major concern when designing parallel systems.
node consists of either a processor, a memory module, or a switch.

Routing algorithm
Switching strategy
When using the circuit switching strategy
Flow control
33
The routing algorithm decides how data is forwarded in the network. There exist many routing algorithms with dierent properties. For further information refer to Parallel Computer Architecture
[1].
a direct connection is established between two nodes and
the path is reserved until the communication has been nished. This strategy is used telephone
networks where a direct connection is established between the caller and callee. A packed switched
network slices the data to send to packets of data which are individually routed from one node to
another. For further information refer to Parallel Computer Architecture [1].
Flow control is necessary, when two or more data packets tries to move through the same route at
the same time. The ow control mechanism determines when the message will be moved along its
path to the destination. For further information refer to Parallel Computer Architecture [1].
An interconnection network can be reliable or unreliable. In a reliable interconnection network, a
message sent from one node to another is guaranteed to arrive at the destination node in contrast to
unreliable interconnect where messages may be lost and must be retransmitted. A typical message consist
of routing and control information along with the payload consisting of data. The largest research eort
is done on large scale multiprocessors consisting of several hundreds or thousands processors resulting in
less research results considering design of point-to-point solutions suitable for small-scale multiprocessors,
especially those put on a single die.
3.4.1 Interconnection topologies
The collection of nodes in a system communicate via point-to-point links, that typically are unidirectional.
A number of topologies has been proposed and examined in the literature and a number of them are
presented below. Before describing a set of important topologies in static networks, basic terms needs to
be discussed. Two nodes are considered as neighbors if there is a link connecting them and the degree of
a node is the number of neighbors attached to the node. The diameter of a topology is the longest path
between any two nodes.

Fully connected topology
Linear arrays
In a fully connected topology every node has a link to its neighbor. The diameter of such an
interconnect is always xed to one, independent of how many nodes are connected. A major
drawback with fully connected networks is that they scale poorly since the complexity of the crossbar
increases nonlinearly.
Linear arrays are the simplest networks with bidirectional links between the nodes. The bisection
width for linear arrays is 1 link.
Rings
Meshes
Trees
Hypercubes
Rings is an extension of linear arrays with the two edge nodes connected hence forming a ring.
A 2D mesh is a matrix of nodes each with connections to its nearest neighbors.
Trees has the nice property of increasing routing distance logarthmially.
Communication in hypercubes are based on the binary representations of a nodes identity which
leads to a simple routing algorithm. Sending messages in a n node hypercube can be done in n
cycles. Hypercubes are scalable and has been used in several parallel machines.
34
3.5 Interconnect performance & scaling

Torus
A torus is a 2D mesh where the top of the grid has connections to the bottom of the grid. It can
also be visualized as a cylinder.
A fully connected network can be implemented by using a crossbar switch [8]. Using a fully connected
network enable a node to communicate with each other at any given point of time. A network using the
crossbar to connect processors, placed at one edge of the network of switches, and memories on the other
edge is shown in g 14. In order to send information from one side to another the switches are congured
so a path is setup between the edges [124].
P
Figure 14: Crossbar switch. The processors (P) on the left side are connected to the memories (M) at the
bottom via the cross-point switch elements (S). No contention arises while all the processors are trying
to access dierent memories.
An interconnect is characterized by a number of parameters as bandwidth, latency, area cost, power

consumption, and scaling aspects. The interconnection must provide communication between the components and nodes to a reasonable cost. The cost-eectiveness of a network design depends on a number
of factors as the characteristics of the application (frequency of communication, interprocessor communication patterns and message length), the desired speed of processor data transfers and the cost of the
actual interconnection implementation, the number of processors and memory hierarchies in the system
[61]. Because there are so many factors no interconnection network can be considered best for all possible
situations.
An ideal scalable system is a system where performance increases corresponding to the capability of
the added resource. When adding a second CPU to a system with only one CPU, the performance
should ideally double. Scaling does not only concern performance, but also interconnection bandwidth
and latency, and cost of a system.
3.5.1 Performance measures
Patterson and Hennessy [8] dened six performance measures for an interconnection network: bandwidth,
time of ight, transmission time, transport latency, sender overhead and receiver overhead. The total
latency is the time that it takes to transfer a number of bytes from source to destination. An additional
well used measure for dening worst case performance for a parallel systems is bisection bandwidth. It is
calculated by dividing the interconnect into two roughly equal parts, each containing half of the nodes
and then summarizing the bandwidth links that the imaginary dividing line crosses [8].
35
3.5.2 Shared buses
The benets of low cost and easy implementation may be traded against a major drawback with the
shared bus approach: the limited bandwidth that comes from the serialization of all communications on
it. This serialization gives this kind of system poor scalability since all processors must compete for the
single bus. The performance curve attens when there are more than 10 processors11 attached to the bus
[56]. Adding an 11th processor hardly increases performance at all.
As microprocessors become faster and faster, they demand more bandwidth which makes the shared
bus become an even more serious bottleneck in a system. The last ten years indicate that the situation
will only get worse [62]. To reduce the problem there are three solutions:
development of a faster and wider bus
use of smart bus protocols (pipelining)
serve memory requests locally (use of caches/buers)
Pipelining signals on bus can increase the available bandwidth from 2 to n times, depending on a
number of system parameters, where n is the number of processors attached to the bus [5]. Introduction
of caches moves the \knee" of the performance curve to be around 20 or 30 processors [56], but the
technique introduces other diculties as the cache coherence problem[1]. A performance curve comparing
a non-cached bus-based architecture and the same architecture extended with caches is shown in g 15.
Figure 15: Performance knee curve. Using caches with a hit ratio around 90 percent, move the knee
right from being around 10 processors to be settle somewhere near 30 processors (Source: Computer
Architecture [56]).
3.5.3 Point-to-point architectures
This strategy takes the advantage of handling multiple transfers in parallel giving these much higher
aggregate bandwidth. Point-to-point communication is also faster because there is no arbitration process.
A benet is also the electrically simpler interface[8].
Contention still occurs in if more than one processor wants to use the route at the same time. The
application that will run on the system has great in uences the decision of network, since dierent
application has dierent types of bandwidth and communication needs. Expecting a network under a
heavily load and assuming worst case patterns, preferably leads to a high-dimensional network where all
the links are short. If looking at communication patterns where each node is communication with only
11 Where
the knee is located depends of course of the application and the actual implementation of the system.
36
3.6 Interconnecting components in a SoC design
one or two near neighbors, a low dimensional network is preferred since only a few of the dimensions are
used[1].
Assumed locality of data has also in uence the design issues. A performance study of hierarchical ringand mesh connected wormhole routed shared memory processors shows that with little locality meshes
scale better than ring networks caused by the limited bisection bandwidth [125]. For workloads with
some memory access locality, the hierarchical rings outperform meshes by 20-40% for system sizes up
to 128 processors. The study also shows that by putting 1- it size buers in the routers rings perform
better than mesh regardless of the mesh router buer size for systems up to 36 processors.
There are number of tradeos to consider when choosing an point-to-point architecture. For example,
bandwidth of the links may be traded against the complexity of the switches. To use the fully connected
architecture is unrealistic for systems where the number of nodes is large since it needs N*(N-1) number
of links, where N is the number of nodes[61]. This is one reason why almost all multiprocessors uses
topologies between the two extremes.
The table below shows the performance (as bisection bandwidth) and the cost (number of ports per
switch and total number of links) for dierent topologies for a system with 64 nodes:
Measure
Bisection bandwidth
Ports per switch
Total number of links
Source:
Bus
1
N/A
1
Ring
2
3
128
2D Torus
16
5
192
Fully connected
1024
64
256
Computer Architecture, A Quantitative Approach [8]
3.6 Interconnecting components in a SoC design
The evolution in the semiconductor industry has made it possible to for whole systems to t on a single
piece of silicon which moves the interconnection between functional modules from being on a PCB to be
inside a single chip. To meet the time to market requirements, and as earlier mentioned, SoC designers can
use IP-components. In order to eectively use them a design methodology including hardware/software
co-simulation techniques and a design friendly On-Chip Bus (OCB) is required [4]. Interconnecting the
components is not an easy task. Telecom and multimedia applications require low latency and high
bandwidth communications [123] and there is a wide variety of components with dierent interfaces
which implies that glue logic must be used to adapt a component to a specic bus. This is nothing new
since the method with glue logic has successfully been used in the world of PCB [16].
3.6.1 VSIA eorts
To enable the mix and match of IP-components onto a SoC, the Virtual Interface Socket Alliance (VSIA)
tried to specify an on-chip bus standard. The mission appeared soon to be infeasible so focus is now instead
on a standard that separates the bus interface logic from the IP-components internal behavior through
a bus wrapper [80]. A bus wrapper (see gure 16) is a component that is physically located between the
bus interface and the IP-component and used for communication between them. This strategy enables
IP-components to easily to be adapted to dierent designs because changes is only needed into its bus
wrapper. Separating the internal behavior from the bus wrapper can introduce extra latency when the
on-chip bus and the internal bus are very dierent. Lysecky, Vahid and Givarrgis has proposed techniques
to reduce this latency by introducing prefetching in the buers [80]. VSIA is also standardizing the bus
between the bus wrapper and the IP-component, called the Virtual Component (VC) interface.
3.6.2 Dierences between standard and SoC interconnects
Existing buses such as PCI and ISA were designed to connect discrete components at the board level.
At this level a key issue is to minimize the number of signals because pin count directly translates into
package and PCB costs [4]. When moving to on-chip solutions, signal routing consumes silicon area but
does not aect the size or cost if packages, PCB:s and connectors. SoC architectures have a rich set of
3.7 Case studies of existing SoC-Interconnects
37
Figure 16: The buswrapper and the VC interface. Source: www.design-reuse.com

interconnections, sometimes called routing resources, available to them [94]. For this reason the number
of signals used for a bus can be greater than i systems-on-board solutions, indirectly making multiplexing
of address- and data-buses onto a single bus obsolete.
Integration is an major economic factor since o-chip interconnects are 103-104 times more expensive
than on-chip interconnects [65] and as earlier mentioned the power consumption is also much lower in
the on-chip case. In PCB buses, electrical characteristics are often a signicant portion of specications.
This is not the case for on-chip buses since they can be accurately modeled by ASIC-development tools
[77]. Having an dierent set of design constraints and tradeos for on-chip buses, a bus designed for use
on PCB will not provide the best on-chip solution [4].
Standard microcomputer bus slaves usually fully decode a xed address. This requirement can be
obmitted since at the design of a system the address decoders can be tailored to the application without
using the whole address. This speeds up the system and reduces redundant logic [94]. The fact that
the on-chip bus is embedded inside the chip makes them very dicult to test in contrast to standard
microcomputer bus modules that can be tested at card level.
To support a wide variety of IP-components and embedded systems the interconnect must be suciently
exible and robust. The proposed on-chip buses uses a variety of dierent design choices. In the following
section a number of todays existing on-chip buses will be studied.
3.7.1 AMBA 2.0
The Advanced Microcontroller Bus Architecture (AMBA) [?] is an open standard, processor and technology independent on-chip bus specication. The rst version was released in 1995, but here the newer
AMBA Revision 2.0 will be studied. The AMBA architecture consists of a system buses and one peripheral bus shown in gure 17 and explained below:

AMBA System Bus
The system bus has two specications, namely the Advanced High-performance Bus (AHB) and the
Advanced System Bus (ASB).
The Advanced Peripheral Bus (APB)
This bus is aimed for slower general purpose peripherals such as timers, UARTs, interrupt controllers, I/O ports etc. Connection to the main system bus is made via a bridge.
Embedded processors are connected to high-performance peripherals, on-chip memory, and interface
functions via the system bus. The ASB supports multiple masters, pipelining and burst transfers. The
38
AHB specication is extended with split-transactions using separate read and data buses with data width
supported from 32 up to 1024 bits. The peripheral bus is a simple, single master bus controlled by the
APB bridge that connects the buses. The APB has an unpipelined architecture and a low gate count.
ARM Primecell IP-components can be directly attached to the buses without using any glue logic.
Figure 17: A typical AMBA based SoC (Source: ARM webpage [38])
Commercially AMBA has been implemented by several companies in a variety of products, including cell
phones, set top boxes, digital cameras, and general purpose microcontrollers. AMBA was was originally
an ARM bus, but has evolved into a license and royalty-free specication compatible with other CPU
architectures as well. More detailed information can be found in the AMBA bus specication[?] and at
the ARM homepage [38].
3.7.2 CoreConnect
The IBM CoreConnect [54] is an on-chip bus architecture available under a no-fee, royalty free license
agreement12 from IBM. Components that are connected to the bus must be compliant with IBM Blue
Logic designs. The hierarchical architecture provides three synchronous buses (see gure18):
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register (DCR) Bus
The PLB architecture is very similar to the AMBA High Performance Bus with separate read and
write data buses, allowing multiple bus masters, pipelining, split transactions, burst and line transfers.
Its purpose is to provide high performance, low latency, and design exibility when connecting high
bandwidth devices, such as CPU, external memory interfaces and DMA controllers. The data width is
32 or 64 bits wide extendable to 128 and 256 bits. The PLB and OPB have dierent structures and
signals so IP-components attached to the buses has dierent interfaces. The concurrent read and write
transfers yields a maximum bus utilization of two data transfers per clock cycle. Controllable maximum
latency is supported by the architecture by using master latency timers. The PLB devices can increase
bus throughput by using long burst transfers. When the bandwidth on a single PLB bus exceeds the
limits of its capability, a possibility is to place high data rate masters and their target slaves on separate
buses. An IP-component, PLB Cross-Bar Switch (CBS) can be utilized for this purpose as shown in
gure 19. The CBS allows multiple simultaneous data transfers on both PLB buses which uses priorities
to handle multiple requests on a common slave port. A high priority request interrupts an ongoing lower
priority transaction. CoreConnect can be used to build multiprocessors systems.
12 The license agreement includes the PLB arbiter, OPB arbiter and PLB/OPB bridges designs including bus model
toolkits and bus functional compilers for the buses.
39
Figure 18: Coreconnect architecture. Source: CoreConnect Bus Architecture [54].
Figure 19: PLB Crossbar switch. Source: CoreConnect Bus Architecture [54].
The CoreConnect architecture is used in the PowerPC 405GP embedded controller, to connect an
PowerPC 405 CPU core, PCI Bridge, and SDRAM controller.
3.7.3 CoreFrame
PALMCHIPs CoreFrame [40] is a processor and foundry independent integration architecture developed
by PALMCHIP Corporation. Designs targeted are set-top boxes, digital cameras, communications, mass
storage, printing, intelligent I/O, and networking. It is inter-operable with the AMBA peripherals, which
means that peripherals designed for the AMBA bus can be directly attached to the CoreFrame making
the portfolio of IP-components larger. The CoreFrame architecture diers from many of the other on-chip
buses by using point-to-point signals and multiplexing instead of shared three-state lines. It uses a shared
memory architecture with simple protocols to reduce the design and verication time. Some feature are
listed below:
400 MB/s bandwidth at 100 MHz
Support for 128, 64, 32, 16 and 8-bit buses
Unidirectional buses only
Positive edge clocking only
40
A central, shared memory controller

A separate peripheral I/O and DMA buses
Designs can be synthezised to any process. The architecture has been proven in silicon with the
GreenLite HDD-1010 disk controller with an ARM7TDMI core. Further there are implementations of
mobile phone controllers, an Ethernets print server and a general-purpose ARM-based microcontroller
[38].

3.7.4 FPIbus
The Flexible Peripheral Bus (FPI Bus) is an on-chip bus designed for memory and I/O mapped data
transfers and it is a part of the Inneon Technologies TRICORE architecture[91]. The TRICORE 32-bit
microcontroller/DSP architecture is designed to be used in real-time embedded systems. The FPI-bus
connects the Tricore CPU/DPS to memory, other CPUs, external and internal peripherals. Up to 16
master devices are supported by the synchronous bus that uses a exible bus protocol which can be
tailored to specic application needs. Address and data buses are demultiplexed with up to 32 address
bits and 64 data bits. Peak throughput is 800 MB/s at 100 MHz. There is no upper limit to the number
of peripheral connected to the FPI Bus. Arbitration is done centrally by the FPI Bus controller which
supports both single and multiple data transfers.
3.7.5 FISPbus
The FISPbus is available under license from Mentor Graphics and is delivered as fully synthetizable
VHDL-RTL source code with a VHDL test bench including functional test vectors [92]. FISPbus architecture consist of a single bus supporting multiple masters and distributed arbitration, which implies
that all IP-components attached to the bus must contain the FISPbus State Machine. The FISPbus
interface is a generic microprocessor interface specication specially developed for soft IP-components.
Microprocessor cores non-compatible with the architecture can be attached to the bus by using adapted
softcores as a bridge between FISPbus and the microprocessor bus. There exists a number of ready-to-use
soft IP-components that can be directly attached to the bus.
3.7.6 IPBus
The Integrated Devices Technology (IDT) Peripheral Bus (IPBus) [83] is an synchronous high speed onchip bus running at more than 100 MHz and providing processor independence. No license is available so
implementations are still only in products from IDT. It is a multimaster bus that provides DMA support
and features as multiplexed address and data, pipelining and burst capability. The IPBus interface is
a small piece of code that is part of all functional cores. The gate count for the interface adapted to
slaves is about 500 gates and for a master/slave core the implementation can be done under 1000 gates.
Standard design tools works well with the IPBus.
3.7.7 MARBLE
The Manchester Asynchronous Bus for Low Energy (MARBLE)[132] is on-chip bus which provides an
interconnection for asynchronous IP-cores. MARBLE uses pipelining of the arbitration, address and data
cycles. In addition to basic bus functionality it supports bus-bridging to interconnect asynchronous and
synchronous subsystems on the same chip. Arbitration is done centrally using two separate arbiters, one
for data and one for address. All transfers are tagged with an unique identier of the initiator of the
transfer. Figure 20 shows a typical MARBLE system.
If the embedded on-chip memory is not large enough, an external memory bridge can be used for direct
connection to SRAM or DRAM.
41
Figure 20: A typical MARBLE system [132]

3.7.8 PI-Bus
The Peripheral Interconnect Bus (PI-Bus) was developed within the European Union ESPRIT Open
Microprocessor Initiative (OMI) project. It has been incorporated as OMI on-chip bus Standard OMI3243D with the ve companies, ARM, Philips Semiconductors, SGS THOMSON Microelectronics, Siemens,
and Temic/Matra MHS owning the patent rights. Licenses has been available since 1995. The purpose of
the bus is to be used in modular highly integrated microprocessors and is designed for memory mapped
data transfers between its bus agents. A bus agent are on chip functional blocks connecting the cores to
bus. A PI-Bus agent can both act as a bus master and a slave. In order to operate the PI-bus requires
a central bus controller which performs arbitration, address decoding and time-out control. A typical
architecture using PI-Bus is shown in gure 21. The peak transfer rate is 200 MB/s at 50 MHz. More
features of the PI Bus are listed below:
Processor independent
Demultiplexed operation
Clock synchronous
Address and data bus scalable up to 32 bits
Multimaster capability
8, 16, and 32 bit data access
Missing features in the specication are cache coherency support and broadcasts. A toolkit for analyze
and integration of cores to the PI-Bus has been developed at the University of Sussex. The toolkit
contains models of the PI-bus agents, together with test frameworks. Most models are compatible with
the IEEE 1076-1993 VHDL standard. The documentation and Toolkit can be downloaded from the
Sussex University's webpage [85].
3.7.9 SiliconBackplane
The SiliconBackplane [123] is a highly congurable SoC communication system developed by Sonics and is
a part of the Sonics Integration Architecture (SonicsIA). The architecture consists of a pair of proprietary
protocols, an open IP-component interface, and supporting EDA tools. The Open Core Protocol (OCP)
is a point-to-point interface that provides a standard set of data, control and test signal ows enabling
the cores to communicate. SiliconBackplane diers from many of the conventional SoC interconnections
by using only one single bus where all the components are attached via the SiliconBackplane agents. An
42
Figure 21: A typical PIBus architecture [85]
Figure 22: SiliconBackplane architecture. Source: Sonics homepage [122].

agent translates between the OCP and SiliconBackplane protocols. O chip communication is provided
by the Multichip Backplane. Figure 22 shows the architecture of SiliconBackplane.
The bus data width is user congurable to 32, 64 or 128 bits. The bandwidth is 50MB to 4GB/s
and the bus is fully pipelined with interleaved transfers. To guarantee bandwidth to core with realtime constraints the bus uses TDMA bandwidth allocation per core [123]. Bandwidth allocation per
IP-component can be set to a x value at design time or left recongurable, so it can be set during chip
operation. The number of IP-components that can be attached to the bus has no architectural limit. The
SiliconBackplane agents enables to control latency per core and allowing fast and slow IP-components to
co-exist without performance degradation while guaranteeing real-time deadlines.
3.7.10 WISHBONE Interconnect
WISHBONE is a open standard on-chip specication developed by Silicore Corporation [93] and available
on a no cost basis. The standard can be used to interconnect soft, rm and hard IP-cores with any target
architecture (such as FPGA or ASIC devices), and it is independent of development tools. The speed
of the interconnect is limited by semiconductor technology. A nice property with WISHBONE is that
3.8 Case studies of SoC multiprocessor interconnects
43
the interconnection topology supports both shared bused and crossbar switches. Additional feature are
listed below[95]:
Multiprocessing capabilities
Processor independent
Full set of popular data transfer bus protocols
Supports both BIG ENDIAN and LITTLE ENDIAN data ordering
Master/Slave architecture
Arbitration algorithm dened by the user
3.7.11 Motorola Unied Peripheral Bus
Motorola has developed an on-chip CPU core independent peripheral bus, the Unied Peripheral Bus
(IP Bus) [96]. The IP Bus interface Specication, bus functional models and bus monitors are publicly
available. IP Bus supports datawidths from 8 bits up to 64 bits and address width up to 64 bits. It is
aimed to connect peripherals to the local processor bus via a bus bridge.
This section describes the communication systems in two on-chip-multiprocessors.

3.8.1 Hydra
The HYDRA multiprocessor architecture [6] is a research project at Stanford University. HYDRA is
composed by four superscalar processors with individual L1-instructions and data caches. The four L1caches are supported by one unied on-chip L2 cache an o-chip cache and external memory as shown
in gure 23.
Figure 23: Schematic overview of HYDRA. The caches are reading and writing their data via the 256
bit wide read and the 64 bit wide write buses respectively. Both buses uses pipelining and centralized
arbitration that occurs at least a cycle before the use of the bus. L1-caches are connected to the CPU
via a 32-bit wide bus. Source: Stanford University [6]
44
The general-purpose read bus matches the cache line size in both caches in order to allow entire lines
to be transmitted across the chip at once. It is used to move cache lines between all the on-chip caches
and the o-chip interfaces. Hammond and Olutkun found that even though HYDRA uses a shared read
bus solution, it is typically occupied less than 50 percent of the time and thus not a bottleneck in the
system[6]. The write bus has one specic task: to transmit write-through data from the CPUs to the
L2 caches. Since it transfers data from one CPU at a time, it only needs to be 64 bits bits wide (64
bits is the widest CPU instruction). The write bus is not scalable at all for larger number of processors.
Benchmarks indicate that even in the worst cases, the read and write bus slows performance only by a
few percent over a perfect crossbar.
3.8.2 Silicon Magic's DVine
DVine, described in Microprocessor Report[79], is a SoC multiprocessor, consisting of several compute

modules and embedded DRAM. A rst implementation included 6 compute modules, each one consisting
of a RISC-CPU and a vector engine, and 4MB of DRAM. The architecture allows the number of compute
modules to be matched to the need of the specic application.
The on-chip interconnect consists of multiple 128-bit Data Communications Channel (DCC) buses,
that run at the speed of the chips processor cores. The buses are connected to all modules on the chip.
Transfers are managed by a Data-Flow Controller (DFC) unit that also includes central resources, as
timers and semaphore registers. At 200 MHz the total bandwidth is 6.4MB. An additional channel for onchip communication is provided via a 32-bit wide ring-bus. It is used to send control and communication
information between the modules. Peripherals may be connected to the DCC buses, ring bus, or to the
8-bit Congurable I/O (CIO) bus. Even combination of these buses can be used to connect a peripheral
unit. The CIO operates at 54MHz to achieve 423MB/s peak throughput. O-chip communications are
provided through the external 64-bit bus that is connected via an interface unit to the dual DCCs. The
peak bandwidth to the 128-bit wide DRAM banks operating at the core clock rate is 3.2 MB at 200MHz.
Read access to DRAM takes six clock cycles at the best rate.
4 MEMORY SYSTEM
45
4 Memory System
Nothing is stronger than it's weakest part, this also applies to computer systems and the processor/memory speed gap. An arbitrary processor can not execute instructions faster than they can be
obtained from the memory system, which often are the most severe bottleneck in computer systems. The
average performance of the whole system is dominated by parameters concerning the memory system[23].
Todays high performance processors, requires more data per time unit throughput, than memory chips
can provide[24].
The improvement rate in processors are much higher than for memories, CPU performance improves
with 50% every year, whereas memory access time only increase with 5-10%[26]. The divergence between
CPU and memory are known as the cpu-memory speed gap. As processor performance increases, the
number of idle cycles encountered for a continuous memory reference, latency will inherently grow as the
divergence continues [27]. The ideal solution would be for researchers and engineerers to provide technologies for memory chips to scale with the performance of processors. Since no method to achieve this
currently exist, the challenge lies in architectural improvements and well designed hierarchical solutions
[28, 24, 23].
In the folowing sections semiconductor memories will be coved, followed by an introduction to memory
hierarchy. Cache memories, their functionality and improvements that can be made are coverd. A breif
look at memory management units are followed by multiprocessor systems and data prefetching.
4.1 Semiconductor memories
Semiconductor memories can be split it to two separate categories non-volatile and volatile. Non-volatile
memories are able to retain information without power supply, while volatile memories will lose the
information after power is turned o[7]. Both classes shows similar reading characteristics but nonvolatile memories suers severe delays with writings. This makes the non-volatile memories most suitable
for Read Only Memories (ROM), whereas volatile memories are usable for Random Access Memory 13
(RAM), issuing both reads and writes.
4.1.1 ROM
Read Only Memories contains information that has been preprogrammed, and possess dierent levels of
erasability and re programmability, from permanent contents to byte level erasability. ROM's are often
used to store low-level information for the hardware or the operating system. They are also applicable
when stored information only requires infrequent updates in order to maintain functionality like channel
information for a TV or video set[9]. ROM's can also be used to store logic congurations for recongurable hardware(FPGAs). Dierent types of ROM's are classied based upon their programmability and
erasability see table below. The table also contains manufacturing, price, and performance information
of the dierent ROM types.
Masked programmed ROM (ROM): The true Read Only Memory are programmed at manufacture, the customer provides the semiconductor vendor with a specic conguration le, containing
the information to be stored. The le will be interpreted into a photo mask, whereas each binary
one in the data le corresponds to a transistor on the mask. The mask is then used to process the
actual chip, which will be delivered to the costumer after several weeks.
Programmable ROM (PROM): PROM's are also program once circuits, the dierens are that they
are programmed by the user at the user. This enables the manufacturer to mass produce standard
PROM's for arbitrary use. The chip is programmed via a special programming device burner, that
13 The term Random Access generally means that every memory cell can be accessed in equal time regardless of their
physical location [7, 9], which apply to both RAM and ROM. Therefor in this document Random Access are refered to
Read/Write memories RAM.
46
will burn the connection of those transistors that are not to be regarded as ones according to the
pattern of ones and zeros the user want to program.
Erasable PROM (EPROM): As the name indicates, the border between read only and random
access starts here. This are memory circuits that can be programmed and re-programmed several
times. The memory array consists of a special MOS-transistors with a oating gate which threshold
can be altered in order to program ones into the chip[7]. EPROM circuits do not have their die area
covered, instead they have a quartz window which will let ultraviolet light of the right wavelength
through, which will erase the contents of the memory.
FLASH: Flash circuits consists of of a special stacked gate transistor with the upper transistor control
gate behaves like a ordinary transistor whereas the lower one are the oating gate. Flash devices
contains embedded algorithms and functionality to perform the dierent tasks of programming and
erasing. Programming the cell is done by negatively charge it's oating gate which will increase the
threshold of the transistor[25]. Erasing is done by electricly let the negative charged oating gate,
release the electrons.
Electrical Erasable PROM (EEPROM): Implied by the name this memory are to be erased by
the use of electricity. Like the FLASH memory it also make use of a oating gate that electricly are
altered. Unlike Flash EEPROM programming and erasing can only be performed at single bytes
and they are the most expensive of the non-volatile memories[9].
Type
rom
prom
eprom
flash
eeprom
Cost
Programmability
Programingtime
Very inexpensive
Inexpensive
Moderate
Expensive
Very expensive
Once in factory
Once, by user
Many times
Many times
Many times
Weeks
Seconds
Seconds
100s
100s
4.1.2 RAM
Source:
Erase-time
N/A
N/A
20minutes
1second
10ms
computer systems design and architecture [9]
Erasable size
N/A
N/A
Entire chip
Blocks
Byte
Although the read time of ROM's are in the vicinity of nanoseconds, their write time can be found in the
area from minutes down to a hundred micro seconds, which for todays processors are unreasonable slow.
The lack of homogeneity between read and write in non-volatile memories makes them an inadequate
alternative for main memory in computers. Semiconductor memories with uniform read and write time
in the area of nanoseconds, can expressly be made of volatile memories [7].
There exist several types of RAM-circuits
In this section Static RAM (SRAM) and Dynamic RAM (DRAM) are the main concern, only a brief
look at others will also be included.
SRAM: The memory cell in SRAM make use of transistors to store information, the active element
resembles an SR-latch [7]. As long a power is not turned o, stored information will remain intact.
By utilizing transistors to store information, the bit-cell constantly consumes power hence the prex
static. Two types of SRAM cells are currently in use, a four transistor cell (4T) or a six transistors
(6T). Since the transistors in SRAM basicly are the same CMOS process used for ordinary logic.
There exist more advanced processes invoking self-aligned contacts or local interconnects in order
to reduce size. SRAM are often integrated into several logic devices as fast buers, registers, and
on-chip caches for processors.
Some problem have unfortunately appeared concerning stability of the memory cell for 4T memories
when voltage goes below 2,5 volts [31], which 6T cells manage to overcome. For this reason and
the improved process- technologies can produce 6T cells the size of ordinary processed 4T cells are
to be the dominating SRAM cell until new improvements are achieved [31].
47
The primary objective of DRAM is to provide the market with the largest memory capacity
chips, at the lowest cost. This is mainly achieved through process optimization for lower cost,
highest density with the lowest cost/bit and high production volume.
In 1968 the one-transistor DRAM storage emerged, utilizing a capacitor to store bits and make
use of one transistor for cell selection[28]. The amounts of charge in the capacitor represent the
binary states of 1 & 0. It is only possible for the capacitor to retain a charged value for a few
milliseconds, It periodicly need to be refreshed, reload previous contents approximately every 4 to
50 ms[7, 9, 28], refresh are also performed when a value is read. Power consumption in an DRAM
cell is lower than SRAM cell are due to the capacitor, which only consumes power when refresh.
The need to refresh the capacitors requires special logic for refresh administration, which slightly
complicates the design. Enlarges the chip or in some solutions a external refresh chip is used. The
timing overhead forced by the actual refresh should be known, but have insignicant implications
for the user of the circuit, since it's a very small fraction of the whole refresh cycle.
Worse is the latency in cycle time and data rate experienced with DRAM's, which are dealt with
in dierent ways yielding dierent models and special purpose DRAM circuits.
Extended Data Out DRAM (EDO): In EDO, or sometimes called hyper page mode the output buer have been provided with a extra pipeline step in order to improve the data rate
for column addressing. This type of memory gives improved system performance with minor modications to conventional memory controllers[28], due to the preserved asynchronous
interface.
Burst EDO (BEDO): BEDO memories are enhanced EDO RAM to allow much faster access
time, allowing faster busses to be used. This is achieved by combining the pipeline with special
latches(counters). This makes it possible to make four reads or writes in one bus cycle, hence
only every forth byte need to be addressed.
Synchronous DRAM (SDRAM): This type of DRAM employs a synchronous interface, the
circuit is synchronized with the bus clock. The data path is pipelined and data are bursted
out on the bus, this for improved data rate. By interleaving multiple memory banks random
access performance can be improved[32]. SDRAM access time is not measured in nanoseconds
but in MHz.
Double Data Rate SDRAM (DDR SDRAM): DDR SDRAM have much in common with
SDRAM, but DDR SRAM doubles the bandwidth of the memory. The Double Data Rate is
achieved by transferring data on both edges of the clock.
DRDRAM: DRDRAM or Direct Rambus DRAM. The DRDRAM works more like an internal bus
than a memory subsystem, it is based on the Direct Rambus Channel. The Rambus Channel is
a high-speed memory interface that are able to operate 10 times faster than ordinary DRAM
interfaces[126].
Synchronous-Link DRAM (SLDRAM): SLDRAM represent the next step in DRAM evolution[129]
from EDO, SDRAM to DDR and nely SLDRAM. the technique is base on SDRAM and DDR
with the addition of a packetized address/control protocol.
Cached DRAM (CDRAM): With the integration of small amounts of SRAM into DRAM
circuits or by splitting DRAM into disjoint banks, the problem of row access performance has
been addressed. The result are a DRAM circuit with a small integrated cache of fast SRAM
cells[28].
Video RAM (VRAM): An example of a special purpose DRAM, introduced in mid-1980,
developed for graphics applications. The improvement objective of VRAM was not speed but
to gain massive parallelism in data rate. As a special purpose the demand for this circuits are
smaller hence smaller series are manufactured which automaticly makes them more expensive.
FERAM: In ferroelectric RAM consists the the memory element of a ferroelectric capacitor. The
ferroelectric eect is the ability of a material to return an electric polarization in the absence of
DRAM:
48
4.2 Memory hierarchy
an applied electric eld[127]. This property is used to construct memories where the memory
element is a ferroelectric crystal. After the atoms in the crystals has been polarized and the
electric eld been removed the crystal will remain as it is. The polarization of the crystal
constitutes the logical value of 1 or 0.
EDO, SDRAM and VRAM are all standardized by JEDEC14
4.2 Memory hierarchy
When designing the systems memory hierarchy

The principle of locality states that any program over some time interval will reference the same set of
memory addresses repeatedly. This through research of computer program behavior. From the principle
of locality two sub-principles can be derived, spatial and temporal locality. Spatial locality apply to when
an address are referenced, programs are most likely to reference the following or a near address in a short
period of time, whereas temporal locality applies to programs inherent behavior to often re-reference
previously used addresses within a short time. As an indication of locality it is common to say that a
program spends 90% of the time executing only 10% of the program code [8].
The memory system can be sub-divided into primary and secondary parts or further down depending
on how the dierent parts are dened. In this report, primary memory(PM) consists of the volatile
memory that constitute the main working memory, CPU-registers, cache(s) and main RAM. Secondary
are for more long lasting storage, invoking magnetic and optic media or even non-volatile memories (i.e.
FLASH or EEPROM). Our main concern in this report will be to analyze the parts involved in primary
memory. Further division of PM can be made into dierent levels based on the components relative
distance from the CPU, where level 0 or 1 actually is on the processor die, and the highest depending on
the hieracial depth represents main RAM. Size and speed of memories is dierent on each level, faster and
smaller at lower levels, bigger and slower as further away from the processor. Each level in the hierarchy
contains a subset of its closest higher counterpart.
PRIMARY
SECONDARY
CPU
MAIN RAM
REG
CACHE(s)
BLOCK
PAGE/
BLOCK
EXTERNAL
STORAGE
WORD
Level 0
LOW
HIGH
Level 1 - (n-1)
Level n
CAPACITY
SPEED
HIGH
LOW
Figure 24: A memory hierarchy

4.3 Cache memories
Cache memories are small, fast and expensive memories. Small in the context that they only are able
store a fraction of available main memory at any instant time. They are inherently fast due to their
small size and by the use of high performance SRAM technology. Their main function is to store recently
referenced data close to the processor in order to exploit the locality gained from recent referenced
addresses. Caches one or multiple at dierent levels of the memory hierarchy, in dierent sizes, dierent
storage and operational strategies constitutes the levels in the memory hierarchy between CPU and main
memory.
14 The JEDEC Solid State Technology Association (Once known as the Joint Electron Device Engineering Council), is
the semiconductor engineering standardization body of the Electronic Industries Alliance (EIA), a trade association that
represents all areas of the electronics industry.
49
4.3 Cache memories
the general case

A generalized cache can be viewed as several storage arrays lines each containing a contiguous block of
data[50]. Associated with every array are a tag eld (tag) where the corresponding memory address15
stored data are maintained. Depending on how the cache handles writes, lines can by incorporating a
valid bit or bits marked v in gure 25 to be aware of its status in respect to consistency with main
memory. Further can each line store information for replacement polices, r in gure 25 usually some bit
pattern indicating ranking of the block or access frequency.
When the processor issues a memory reference the address is presented to the cache, what then happens
can brie y be summarized to the following: The line in the cache to examine for a hit are decided by
those bits of the main memory address marked set. Depending on the caches write policy, the valid bit
must be checked. Those bits constituting the tag in the main memory address are compared with the
corresponding bits stored in the tag eld or the selected cache line. The remaining bits of the address are
used to locate the desired block out of the blocks held in that cache line. Requested block are read from
the line and returned to the processor. This brief description of a hit describes how the cache locates,
veries and deliver requested data. However if the verication failed we would have encountered a miss.
4.3.1 Cache:
BLOCK ADDRESS
TAG
SET
BLOCK
TAG
DATA
BLOCKS
TAG
DATA
BLOCKS
TAG
DATA
BLOCKS
Figure 25: A Cache line.

4.3.2 The nature of cache misses
There exist three major types of cache misses that occur independent whether the cache is in a single
or multiprocessor system. However a fourth type of misses known as coherence misses are introduced in
multiprocessor systems, which originates from data sharing between processors.
Compulsory misses also referred as cold starts or rst reference. Compulsory misses occurs whenever
program execution begins and the cache is empty(i.e booting). All references are the rst to the
actual memory block and must be obtained form main memory until the programs working set has
been established in the cache. Related to cold start are warm start [2], which occurs when a whole
working set are successively swapped out in multiprogrammed system issuing a task switch.
Capacity misses are the eect of when a programs working set are to large to entirely t in the cache.
Capacity misses are easily reduced by increasing the cache size.
Con ict misses or collision, even though the cache only have a single block stored, capacity misses can
occur. This situation appears when a memory references are mapped to a already occupied cache
entry. The problem of collisions are addressed by increased associativities, see section 4.3.7.
Coherence misses In a cache coherent shared memory multiprocessor system two new type of cache
misses occur. The problem arise when data are spread among the processors who have the data
cached locally. Sharing of data can be divided in two categories.
15 The
tag contains only a subset of the bits constituting the full address, size of the tag depends on cache organization.
4.3 Cache memories
50
True-sharing occurs when a data word produced by one processor is used by another processor.
False-sharing are when independent data words are being used by dierent processors, are

cached in the same line and at least one access are a write.
True sharing misses occur when a processor modies some word in a cache block, resulting in
invalidations among sharers of that block. Later when one of the sharers try to access that
word it will nd it invalidated, hence a cache miss.
False sharing misses will occur whenever a processor writes to a word in a cache block, yielding
cache invalidations of that cache block among the processors sharing it. Any processor trying
to access another word in that cache block will nd it invalidated and results in a cache miss.
4.3.3 Storage strategies
This section focus on the dierent classes of common cache memories. Since much of the work done by
the cache is mapping memory blocks into the memory provided in the cache, todays three common types
of cache memory techniques are named after how this mapping and block placement is performed.
Direct mapped: When a single memory block expressly can be mapped to one and only that one
cache line, that cache is proclaimed to be a direct mapped cache. Mapping of a block are modulo
division between the address of the block and number of cache lines, except from some bits for
block information. The direct mapped cache are benecial due to its simple construction, fast and
small, but in order to obtain good performance it is reliant upon locality in referenced data. Direct
mapped caches also shows great compatibility with processor pipelines, and steps in the cache access
can be built in the pipeline steps[51].
A major disadvantage of the direct mapped cache are its inability to store two memory blocks that
are to be mapped onto the same cache line. If consecutive accesses are made to two or more blocks
mapped to same line, severe trashing will occur switching cache context for that line every access.
Thus, performance improvements will fail.
Fully associative: In a fully associative cache memory blocks can be placed on any arbitrary cache
line. Mapping is only to store the remaining bits of the memory address in the tag-eld16, after
bits for block information has been reduced.
When retrieving cached information a parallel bit-wise comparison between all tags and the memory
address must be performed, a sequential compare of all tags would be to slow. To perform the
parallel address/tag matching, considerably more logic is required than for the direct mapped
cache. This will result in increased die size hence more expensive, which limits its usefulness to
only be feasible in relatively small systems[9]. In comparison to the previously introduced direct
mapped, associative caches are more exible. Trashing is more unlikely to occur and the memory
are better utilized since the cache is \free" to place a block on any line.
Set associative: The set associative cache is a combination of the previous two, in an attempt to
combine attractive properties from both strategies. A set associative cache can be viewed as a
direct mapped cache consisting of several numbers17 of parallel cache blocks forming a matrix.
Each parallel line of blocks are called a set whereas a column in the set is called way, hence the
sometimes used n-way set associative refers to a set associative cache with n-ways of associativity.
Like the direct mapped cache a memory address can only be mapped onto a single set, but blocks
designated for the same set can be stored in dierent ways in a fully associatively manner.
16 See gure 25
17 Usually 2,4 or
8, due to the cost of parallel tag comparison.
4.3 Cache memories
51
4.3.4 Replacement policies
All in this document described cache types except the direct mapped cache18 needs a block replacement
strategy when the cache is full or when all ways are occupied in a set associative cache.
With the objective to maintain a high hit ratio, the choice of replacement algorithm is of great
importance[49].
(LRU) aims to reduce the miss-rate relying upon temporal locality among
cache entry's. In an attempt not to swap out a data block that probably will be referenced in
a short time, information about block accesses history are gather by the cache controller and
maintained in the cache lines. LRU is one of the most popular and implemented algorithm for
block replacement[48, 50, 53]. The scheme is relatively easy to implement, and yield good results
in keeping the miss-rate down[48, 53]. However with increased associativity the probability of the
LRU-line are the best line victimize are declining[53].
Random: The objective of random replacement is to spread the allocation in an uniform way through
out the entire cache. The hardware randomly selects a block to be discarded, overwritten or written
back to the main memory. The random algorithm do not take any previous execution history in
consideration but has similar performance as LRU with somewhat larger caches [8, 9]. random replacement are very easy to implement. Due to the algorithms non-deterministic behaviour a pseudo
random scheme can be used, it is fundamentally random but have a predictable behavior which can
be used for hardware debugging[8].
LRU: Least Recently Used
The cache controller maintains information about the order the dierent blocks arrived in the
cache. That information do not say anything about how the dierent block are being accessed by
the processor.
LFU: Least Frequently Used, to evict the cache line that been used least of the lines residing in the
cache over a nite period of time could be a good approach. The algorithm demands some way to
keep track of time and implementing a clock for that is to expensive.
MRU/notMRU: The Most Recently Used algorithm keeps track of the cache line that was accessed
last of the lines in the cache. The line chosen for replacement are randomly picked by those that
are not Most Recently Used.
Dynamic Exclusion: Dynamic exclusion is a replacement method that tries to decide whether a cache
line shall be replaced or the new entry bypassed the cache and go directly to the processor[128].
This occurs when two references are mapped to the same line. The protocol tries to keep one of
them in the cache and the other one outside. Dynamic exclusion tries to avoid replacing a line
with a line that could degrade performance. A small nite state machine is used to recognize the
common reference patterns, where storing an new reference would reduce performance.
FIFO:
4.3.5 Read policies
Most of the requests from the CPU are read operations (i.e all instruction fetch are reads). When a CPU
issues a read it will stall until the request is fullled, therefore optimize the cache to reduce the latency is
the main objective for read operations[46]. Data blocks can simultaneously be read as the tag comparison
is being performed, depending upon the result of the tag comparison it is a hit or a miss.
Read hit: When the tag check yields a hit for the desired word, the cache will provide the CPU with
requested data from the pre initiated read. Replacement information will be updated in order
according to the rules of the used algorithm.
18 In
direct mapped caches memory blocks can only be mapped onto one specic line, there is no other to chose.
4.3 Cache memories
52
In the case when requested data is not to be found in the cache the premature read of
data will be suspended, and data must be brought in from the next level in the memory hierarchy.
A fast but costly and complex hardware method is to deliver the data to the processor in parallel
as the cache is being updated. A slower but less complex alternative is to provide the processor
with requested data after the cache line has be updated with new data.
Read miss:
4.3.6 Write policies
Only about 7% of the total interaction between CPU and memory are writes[8]. It is a small but not
neglectabel part of the memory/CPU interactions, so nding the best way to enhance performance for
those 7% can be worth the eort of nding a optimal write policy for the actual architecture.
Write hit
When a CPU write results in a cache hit, are the primary objective to reduce the used bandwidth[46].
The amount of used bandwidth to the next level in the memory hierarchy when write hits occur
depends on the policy used by the cache. The choice of write policy directly aects how the caches
handles the coheres problem that comes with writes. For handling write hits two policies exists:
Write-trough or sometimes called store-through caches updates both the cache block and the
memory at next level on every write. During writes to the cache including the nearest lower
level using write-through the processor will encounter a write-stall due to the latency of writing
to lower levels. To overcome this dilemma a write-buer can be installed. During the writethrough the processor issuing its write to the cache as usual but instated of present the write
directly to the lower level the processor write to the write-buer and then continues with its
work. When data has been presented to the write-buer it is responsible for updating the
lower level memory.
The write-through method is easy to implement compared to its counterpart write-back. When
causing write-through the next level in the memory hierarchy always contains the most recent
copy of the data.
A opertunistic approach for direct-mapped caches using write-through is to simultaneously
perform the tag check and write data to next level write-before-hit [46]. In case the access turn
out to be a miss, there is no harm done since that line should have be replaced in any case.
The write-back cache must conrm a hit in order to modify cached data, This is vital in case
of a miss, dirty data can be overwritten by the modication and result in inconsistency. For
set-associative caches no matter write policy used, tag conrmation is always needed before
write[46]. As recently mention a set-associative cache or a any cache using write-back needs to
perform tag-check and write in two steps, which complicates the cache access to be invoked in
the CPU-pipeline. There is a need for interlocks between the write-back step and the memory
step which will increase latency[46], whereas as for a direct mapped write-through cache are
easy to integrate withe the CPU-pipeline
Write-through caches have a better error tolerance than write-back caches[46]. The better
tolerance originates in the fact that write-through caches contains no unique data that might
have been modied.
Write back also called store-in or copy back cache. In contrast to the write through policy, write
back only modies the cached data. By the use of the dirty-bit, that cache line can be marked
as dirty, which is an indication for the cache controller that the line when being replaced must
be written back to the next level. The write-back policy reduces the write trac that leaves
the cache by taking advantage of the temporal and spatial locality in writes[46].
Consecutive writes to block only results in a single write[2] compared to a write for every
modication of the block as for the case with write-through. Since every write issued by the
processor do not results in a write to next memory level, less bandwidth is used which are
desired in multiprocessor systems[8]. When write-back caches are used in a multiprogrammed
environment bursts of writes are common when the processor performs a task switch[2].
4.3 Cache memories
53
Write miss
When a requested write to the cache results in a miss the choice of policy for handling the situation
have a signicant eect on the CPU stall time during the cache miss, as wall as for the rell trac
to the cache. The bandwidth is a big concern for write misses but most of the polices focus on
reducing the latency[46].
Write allocate also referd as fetch on write. A cache invoking write allocate as write miss policy
will load the desired block into the cache when a miss occurs. After the block has been fetched
the cache acts accordingly to a write hit[8]. Jouppi[46] claims that write allocate and fetch on
write are separable. According to Jouppi are the above a denition of fetch on write whereas
write allocate do not fetch data from the level further down the hierarchy. Instead the address
written to are allocated in the cache. With those denitions it is possible to have direct
mapped cache with write allocate without the fetch of data for the refereed address. This will
improve performance compared to non-blocking caches that makes use of buers to handle
write misses, due to a subsequent read of the modied data will yield a hit. In the general
case write allocate are often used along with copy-back caches.
No-write allocate caches treats a write miss by updating the memory in the next level, hence
with a consecutive read data must be fetched from the other level introducing a CPU-stall.
Generally write-through caches uses no-write allocate.
4.3.7 Improve cache performance
From the average memory access formula:

T = Hittime + M issrate M isspenalty
three distinguished areas of optimization can be discovered directly in the parameters.
Hit time The hit time have considerable eect on the whole system. This because hit time has a
considerable implication on the clock frequency, hence aecting the performance of the processor.
Size & Complexity Smaller hardware are inherently faster than larger and more complex constructions. Implicit a small cache exhibits a faster hit time than large caches do. For stand
alone processors it is of great importance that the cache area is small for integration on the
same chip, eliminating o-chip bus latency. Making the cache small do not automaticly yield
reduced hit time, a to complex scheme for storage like set-associative caches can have implications on the hit time due to extra steps to conclude a hit[44]. Small caches means simple
design, hence direct mapped-caches where tag control and data transmission can be overlapped
are preferable for hit time reduction[8].
Eliminating virtual-to-physical address translation Much time during the address probing
is consumed in address translation of virtual addresses to physical addresses. By using virtual
addresses for the cache will eliminate the translation step.
Pipelining & Write-buer By the use of a write-buer for pending writes to update memory in
a lower level and pipelined writes to that buer, hit time issuing writes can be reduced.
Miss rate In order for the system to perform well, it is important to keep the miss ratio as low as
possible[45]. Much of the work done improving cache performance have focused on reducing the
miss rate[8]. For a recapture of the dierent cache misses and their properties the reader are
incuraged to read section 4.3.2.
Increased block size trade-os A short cut to reduced miss rate utilizing spatial locality for
reduction of compulsory misses is to increase the block size. However increasing block size
brings unwanted disadvantages that must be considered and evaluated.
4.3 Cache memories
54
Increasing the block size without changing the total cache size yields fewer blocks in the cache,
which will results in increased rate of con ict misses and even capacity misses will increase.
The total cache size remains, yielding fewer blocks as block size increases generating increased
rate of con ict misses and even capacity misses will increase.
When block size is enlarged requires more bandwidth and increased the latency, hence increased
miss penalty. Obviously there are important tradeos to be considered.
Higher associativity trade-os Increasing associativity of the cache will reduce con ict misses
on the expense of eventual increased hit time.
Miss & Victim caches A technique to reduce con ict misses without increase miss penalty or
aecting the clock speed, are to between the cache and its rell path insert a small fully
associative cache. two solutions based on this architecture are presented.
Miss caching A miss cache are a small, two to ve lines fully associative cache[45] for
insertion between the rst level cache and the level closest under. In case of a cache miss
data from lower lever are both inserted in the normal cache and in the miss cache where
the LRU entry is replaced. In parallel with the probing for address match in the direct
mapped cache, the miss cache also probed for a hit. In case the cache probe yields a miss
and a hit was made in the miss cache then the direct mapped cache can be reloaded with
the actual cache block in the next clock cycle, taking the stored data from the miss cache.
The miss cache is better in reducing data con ict misses than for instruction con ict
misses.
Victim cache A victim cache is an improvement of the previous introduced miss cache[45].
The improvement lies in the reduction of copying data to both the cache and the miss cache
when there is a cache miss, hence waste of cache space. By using another replacement
algorithm, performance of the miss cache can be enhanced. Requested date from a miss
are not longer loaded into the miss cache, but cache lines that the direct mapped cache has
victimize for replacement are, hence the name victim cache. If a miss occurs in the direct
mapped cache and the requested block is stored in the victim cache the corresponding lines
are swapped between the two caches. Further improvements can be achieved by the use of
selective victim caches[51]. In contrast to \simple" victim cache the selective victim cache
can either place a block from a lower level in the direct mapped cache or in the victim
cache. This selective placement are done using a prediction scheme based on previous
references. Those block that are most likely to be accessed in a near future are assign to
the real cache whereas others classied as not likely to be referenced in a certain amount
of time, are to be placed in the victim cache. Prediction parameters are recalculated in
case of a miss in the main cache and requested block are to be found in the victim cache.
Pseudo-associativity After exploring the dierent cache strategies regarding their performance
and drawbacks, one could think that combining the best features from dierent techniques
can become benecial. A cache that possess the fast hit time of direct mapped together
with the low miss rate of set-associative caches. This can be achieved with a special case
of associativity presented to the direct mapped cache, a column-associative cache or pseudoassociative [47]. This special cache reduces con icts by dynamically choosing another location
for the con icting data using a hashing functions. Instead of reside con icting data to another
location within the same set as the case for set-associative caches, the column-associative cache
which fundamentally is direct mapped nds the alternative block place within another set still
in the same cache, hence the name column-associative[47]. The new location is easily found
with a bit- ipping technique, hence ipping the most signicant bit of of the set selection bits
When a reference hits in the cache it performs just like a ordinary direct mapped cache, on
the other hand when the reference miss another set is controlled. This will give the columnassociative cache two dierent hit-times19 which can result in deteriorating performance and
complicated Worst Case Execution Time (WCET) calculation.
19 Regular
hits as an ordinary direct mapped cache and pseudo slower due to the hash-function and double probing.
55
4.4 MMU
Time wasted by processor stalls waiting for cache misses can be reduced with
prefetching, which brings data closer to the processor before needed. Prefetching applies
to instructions as well as for data. The dierent prefetching paradigms and techniques will be
covered in section 4.6.
Compiler optimizations There exist other approaches than those hardware based techniques to
reduce the miss rate. Reduction of miss rate can be achieved with several compiler optimizations. Those techniques are not in the scoop of this analysis, and will therefore be left over for
the intriguent reader to nd elsewhere.
Miss penalty As the processor/memory speed gap continues to increase the relative time measured in
CPU stall cycles for a miss to be handled will also increase. With this in mind, reducing the miss
penalty time are no less signicant than hit time reduction.
Read miss before write miss The use of write-buer in order to allow the processor to overlap writemisses with execution of other instructions, introduces problems when a read instruction is issued
to an previously modied address not yet written back to the lower level, hence the modied value
still remains in the write-buer. With a rather large buer waiting for the writes to nish and than
proceed can waste considerable amount of CPU-cycles, depending the number of pending writes in
the buer. A better technique is to reuse the data residing in the buer in no other con icts are
present
Sub-block placement Sub-block placement is based on the idea to to have larger block residing in the
cache, and to enable invalidate fractions of the whole block hence sub-block. Only a the sub block
that generated the miss is necessary to load from memory.
Fast word access The size of desired data when the CPU issuing memory access instruction are determined by that specic instruction. Byte, word, double word and further up to the size of the
processor registers. That amount of bytes are often in large caches only a subset of the total data
stored in the cache block. This relation between cache blocks and request size can be utilized in
order to reduce miss penalty. Since the processor do not need the whole cache block but just a
fraction of it, lets give the processor its requested portion of the block as soon as it been loaded.
Then carry on loading the rest of the block, this is called early restart. Another promising and
somewhat more aggressive approach is to read requested data rst of all from its memory block
and the when the CPU are satised continue to read the remaining part or parts of the block. This
scheme is known as critical word rst.
Non-blocking loads CPU stalls waiting for a cache miss20 to propagate can to a certain degree be
reduced if the cache featured non-blocking loads. Non-blocking caches enables the instruction which
caused a cache miss to be overlapped by pending instructions, during the miss is treated[43]. The
amount of possible overlap depends on number of independent21 instructions available. If during the
overlapped execution of pending instruction an instruction dependent upon the data being loaded
the processor must stall and wait for the cache miss to terminate. Another cache miss during
overlapped execution will likewise stall the processor.
Prefetching
4.4 MMU
The MMU Memory Management Unit is primary responsible for providing address translation, between
virtual and physical address space. Virtual memory allows program address space to exceed the size of
physical memory present. The size of the program is only restricted by the addressing capacity of the
processor[134]. Since the physical memory is physicly addressed translation between virtual and physical
space must be performed.
20 Miss after issuing a load instruction, hence the name
21 Independent in respect to data being loaded into the
non-blocking loads.
cache during the overlap.
56
4.5 Multiprocessors architectures
A cache that operates on physical addresses that been translated from virtual addresses by the MMU is a
physical indexd cache. A virtually addressed cache operates on virtual addresses. The advantage of having
a virtual addressed cache is that there is no need for address translation for entries residing in the cache.
The drawback is that they must deal with synonym or aliasing problem of recognizing all virtual addresses
that map to the same specic physical address. The placement of the dierent caches in respect to
memory, mmu, and the processor can be seen in gure 26. Figure 26 (a) represents a conguration with a
physical addressed cache, whereas gure 26 (b) show the case for a virtual addressed cache. Except address
translation the mmu is also responsible of detection and processing of missing items22 . The structure
of the address translation unit depends on segmentation and paging of the memory. Segmentation,
subdividing of the address space into logically related groups (segments). Another approach is to divide
the address space into xed-size pages. A method used to reduce the overhead of address translation
encountered in paging systems is to use a translation lockaside buer (TLB)[134], the TLB contains a
associative memory where recently used descriptors to pages are kept.
CPU
MMU
Physical
address
Virtual
address
Physical
Address
Cache
Main
Memory
Physical
address
(a)
Virtual
Address
Cache
CPU
Virtual
address
Physical
address
Main
Memory
MMU
(b)
Figure 26: MMU and cache placement.

All computers can dependent on their level of parallelism in executing instructions and utilizing data, be
divided into four categories, according to Flynn's taxonomy[2]. The four categories: SISD, SIMD, MISD
and MIMD, further information about those dierent abbreviations can be found section 2.7.4 . This
chapter are designated to the Multiple Instruction Multiple Data architectures (MIMD), which essentially
are a multiprocessors.
Further can multiprocessor systems be divided by the use of dierent programming models. A programming model can be described in user level communication primitives of the system[1]. It can be said
to be the model for how communication is managed by the programmer. Implicit through the assembler
instructions generating communication (transparent to the programmer) or via explicit message passing.
In the shared address model is communication managed through a specic shared memory location.
Which hides the abstraction of communication from the programmer who can not tell if he/she is programming on a uniprocessor system utilizing multiprogramming or on a multiprocessor.
Message passing as the name implies (also often called multi computers ) relies on explicit messages for
communication between nodes. In this paradigm each processor has it own local memory only accessible
by that processor[119] The communication is managed via a software layer protocol, this will increase
communication time in comparison to the shared memory approach. On the other hand message passing
solution scales better as more nodes are added[121]. Shared memory multiprocessor can also be reliant
upon message-passing, this is for distributed shared memory multiprocessors described in section 4.5.2.
In the data parallel model, data are processed parallel and communication is for synchronization which
can be either message based or by the use of shared addresses. According to Flynn's taxonomy this model
is a SIMD architecture.
22 Parts
of the program that has not been physically mapped yet.
57
The main concern of this chapter is designated to examine the memory organization used for Multiprocessors. What implication there is in maintaining a shared memory multiprocessor in respect to uniformity, non-uniformity, coherence, and scalability. A few of those problems has already been mentioned
(i.e coherence misses, see section 4.3.2). Even though the basics of cache management and architecture
remains the same as for uniprocessors, it exist other problems concerning cache and memory management
that are specic to multiprocessor architectures that must be addressed.
4.5.1 Symmetric Multiprocessors
Symmetric multiprocessors (SMP) also referd as UMA Unied Memory Access which for all main memory
resides at equal distance to all individual processors. No matter which address a processor issue a reference
to, they are all accessed in equal time independent of physical location in the system. Since much of the
communication in shared memory systems are performed by memory referencing assembler instructions,
the choice of memory organization are a key issue[1]
In general, three types of memory hierarchies are common for shared memory multiprocessor systems.
P1
Pn
Switch
P1
Pn
P1
Pn
Cache
Cache
Cache
Cache
(Interleaved)
Level 1 cache
(Interleaved)
Main memory
Shared cache
Bus
Memory
Interconnection network
I/O
Memory
Bus-based shared memory
Memory
Memory
Dancehall
Figure 27: Shared memory hierarchies

In the shared cache model (gure 27), all processors shares a single cache. The one among
all processors shared cache are placed between the processors and main memory, cache and memory
are tied together via a interconnect. The model features many benical attributes for example, the
shared cache architecture have made the need for cache coherence obsolete, due to that all processors
see the same cache. Communication latency is reduce by the relative closeness to the rst shared
medium, the cache. spatial locality is well exploited and false sharing is eliminated. The actual
size of the cache can be made smaller then for a solution where every processor has its own cache,
due to potential overlapping between the processors working sets. That sharing of working sets is
not only benecial, the shared working set can be destroyed when one processors makes access to
long streams of non- shared data. A shared cache is more complex in its construction than required
for caches that are located private on their processors. Bandwidth can be a sever bottleneck in
this model since connecting several processors to a single device with maintained high performance
requires a vast amount of bandwidth. In order to increase the bandwidth both memory and cache
can be interleaved, hence enabling more processors to concurrently access the memory system. All
in all this approach is a potential candidate for multiprocessor-on-a-chip implementation, when
small systems that do not scales very well is considered fesable[1].
Bus-based In the bus-based architecture the processors shares a common bus. The nodes attached to
the bus except for main memory consists of the processor and its cache hierarchy. This model exhibits moderate scalability due to bandwidth limitation of the bus system. The bus-based approach
has been widely used in small to medium sized multiprocessors and is still the dominant parallel
Shared cache
58
architecture[1]. Coherence in bus-based systems are often obtained through bus snooping protocols
see section 4.5.5
Dance-hall The dance-hall approach shows many similarities to the bus-based model, the most obvious
dierence lies in replacing the bus and make use of a more sophisticated interconnect. The other
dierence is that main memory is divided into small entities which individually connects to the
interconnect. This approach is designed to encounter a higher degree of scalability than the previous
described solutions. Although, the hierarchy is still symmetric in the context of uniform length
between all processors and memory, the actual length between the entity's can limit performance
as the system enlarges.
4.5.2 Distributed memory
The Distributed memory (DSM) architecture is an approach to overcome the limitations in scaling encountered in SMP models, but keep the convenience of a shared address space. This is achieved by using
memory that are physically distributed among the nodes, but logically implements a single shared address
space[121]. A general description of a distributed system can be viewed in gure 28.
The key distinction between this model and a symmetric multiprocessor is that here each processor
node have its own local subset of the total global memory, and communication is done through explicit
message passing. The short distance between CPU and the local memory enables higher speed and low
latency for memory references that can be handled by local memory. Processor nodes are tied together
by a scalable interconnect. This will result in high memory latency for references that do not hit in the
local memory system, hence references must be obtained from another processor's local memory. This
property of non-uniformity in memory access time, gave this approach the name, NUMA Non Uniform
Memory Access [121, 116].
Two well known shared memory multiprocessors are: The Stanford DASH Directory Architecture for
SHared memory and the Stanford FLASH FLexible Architecture for SHared memory, more about those
architectures can be found in[119, 117].
P1
Pn
Cache
Cache
Memory
Memory
Distributed memory
Figure 28: Distributed memory hierarchy.

4.5.3 COMA Cache
Only Memory Access

The key concept of the COMA architecture is to use the local memory of each node in the system as a
cache, in which each block has a tag containing the address of the block and its state[121]. The memory
cache are also referd as Attraction Memory AM. A remote block that is frequently accessed by a processor
can be replicated at the referencing node. The block is transparently replicated by hardware and placed
both in the processor's cache and in the AM. This will increase the chance of data being present locally
at the node[120].
There are three issues in the COMA-architecture to be addressed, block localization, block replacement,
and memory overhead.
59
When main memory AM has the functionality of a cache the block's address is a
global identier and not a physical location. Since a block can exist in other processors local AM,
there needs to be a method to localize remote block when a miss occurs. The processor that missed
a reference in its local AM, must communicate with a directory in order to localize the holder of
a valid copy. In hierarchial COMA architectures[120] the nodes are organized in either a tree or a
ring structure, where each level contains a directory. To obtain a block, traversal of several levels
in the hierarchy is sometimes necessary.
Block replacement Since block in COMA architectures migrates they do not have xed backup location, where to a write-back can take place. A block scheduled for replacement, must even if it is
un-modied and the only remaining copy be replaced and not just overwritten and disappear. The
system must keep track of the last copy of a block and migrate it to another AM when replaced.
Memory overhead There must always remain a amount of un-allocated memory in the AM in order
to manage replication and migration. If non un-allocated memory remains in AM a replacement
has to be done for every new block that is put in the AM.
The main advantage of the COMA-architecture is the ability to capture remote capacity misses as hits
in the local memory[121]. The latency of the normal hierarchial COMA has led alternative approaches
that tries to overcome the problem.
Flat-COMA: The Flat-COMA architecture does not rely on any hierarchy to nd a block, this enables
the architecture to use a high-speed network[120]. The directories are distributed among the nodes.
Memory blocks can still migrate but directory entries remains at the home node. For a miss in AM,
a request goes to the directory responsible for the block, which redirect the request to the block
holder.
Simple-COMA: In simple-COMA (S-COMA) the replacement is software directed, which diers from
the hardware approach used in normal and Flat-COMA.
Multiplexed Simple-COMA: The S-COMA architecture suers from memory fragmentation problem, due to allocation of memory in paged chunks even if the block are much smaller[120]. Leading
to in ated working sets and frequent replacements. This problems are being addressed in Multiplexed Simple-COMA (MS-COMA) by allow multiple virtual pages in a node to map to the same
physical page simultaneous.
Block localization
4.5.4 Coherence
Although coherence is important in uniprocessors with cache(s) in dierent levels of their memory system
the problem escalate when caches are used in multiprocessor systems. The problem arises when dierent
caches have the same memory location and one of them updates that address. The other shares of
that location will see dierent values for the same address. That is also the case for main memory, the
system is said to be inconsistent at this instant time[116]. To invoke caches and not solve the problem
of inconsistency will put a the programmer in a dilemma since the intuitive programming model of
consistent memory is no longer present[121]. The problem is address by having the caches monitoring the
state of their contents in respect to eventual shares, write permission, and invalidation. This motoring and
preservation of state information is administrated by the coherence protocol. Dierent protocols applies
to dierent multiprocessor architectures. Dependent on their interconnect architecture and memory
model (e.g shared or distributed). Another factor that can aect the consistency in the system, is I/O
transfers performed by Direct Memory Access (DMA). The DMA transfer data between some I/O device
and directly to it's dedicated location in memory without involving the CPU. If the DMA transaction
overwrites a location in main memory which still resides in one or several processors cache, inconsistency
will prevail. The problem is analog in the reverse, a DMA transaction can transfer stall values that
resides in main memory, while the correct values are still in the cache. This problem can be addressed
by the use of uncachable locations or by actions taken by the operating system.

4.5.5 Coherence through bus-snooping
60
In SMP's, the bus is a continent device for maintain cache coherence, all processors in the system are
able to observe the ongoing memory transactions[112]. All the caches snoops the bus in order to monitor
the other caches actions on the bus. When a snooping cache discovers a transaction relevant to some of
its own cached blocks, actions must be taken according to the applied coherence protocol. The snoop
control is basically a regular tag control analog to the tag control performed for normal memory accesses
by the processor. Actions will be taken dependent upon what action the other cache did in combination
with the actual state of cached data.
Snoopy protocols are benecial solutions in bus based cache coherent multiprocessors, due to the
inexpensive and speedy broadcast properties of the bus[115]. A major drawback is the increased bus
trac introduced by coherence actions issued on the shared bus. Contention will increase as more
processors issues bus trac, hence, bus-based solutions are only applicable to small and medium size
multiprocessors[115].
The snoopy protocols are classied in to three groups based on the used write policy, write invalidate,
write update, and a combination of both adaptive protocols in an attempt to combine the benets of the
other two.
Write-invalidate protocols This class of snoopy protocols can allow multiple readers of the same data
but only one at a time are allowed to write[115, 114].
All writes done by the processor are propagated on to the main memory bus. Any cache observing
a write to an address corresponding to one of its own cached blocks, it invalidates that block. This
scheme can be implemented using only two states, Valid (I) & Invalid (V), a un-cached line are
often considered as being invalid.
Write-back \invalidate" protocols This class also applies to the rule of multiple reader and a single
writer. For the most cases write-invalidate protocols are impractical due to the vivid bus trac
imposed by the write back policy and reload of previously invalidated lines[114]. By the use of
write-back some of the writes to memory when using write-through will be eliminated if the cache
that modied the data could keep it and write it back when invalidated by other cache or replaced.
To eliminate that transition it becomes necessary to distinguish unmodied blocks from those that
are modied[1]. This can be achieved using a protocol that consists of three states modied (M),
Shared (S), and Invalid (I), (MSI). The modied state indicates that data has been modied and
invalidated by shares, the holder of that block are now considered owner of that block. A cache
is said to be the owner of a block if it on a request for that data must provide that data to the
main memory or sometimes even to other caches[114]. The shared state means that one or several
caches can have that block, but non of the have modied it, hence the cache and main memory are
consistent. Every write to a shared block must be preceded with the invalidation of all other copies
[115]. When a cache have the block in modied state, local reads or write do not generate bus
trac nor invalidations since all other copies are already invalidated. A cache holding a block in
modied state must provide the memory with the updated data in case of another processor writes
or read that address. A disadvantage is the extra bus transactions made on a write hit to data in
shared state even though no other cache may share that block.
This problem is addressed in the Illinois MESI protocol. To avoid the extra transaction the protocol
needs to recognize the sharing status of a cached block[115]. with the introduction of exclusive (E)
state, often called exclusive clean or un-owned [1], data blocks can be indicated as the only valid and
unmodied copy of the block. By the use of dierent states for unshared (exclusive) and shared
copies, the protocol will improve performance in handling private data. This due to avoiding
invalidations on write hits to unmodied blocks with no other shares[115].
Write-update protocols This class of coherence protocols makes use of a distributed write approach
which allows several copies of the same cache block to exist in a writable state simultaneously[115,
114]. The cache that issues a write, broadcasts the word to be written to a shared block to all
61
caches in the system. This type of protocols often makes use of a special bus line or lines so that
a cache dynamically detect the sharing status of a block. The extra bus line are only in use when
a block is shared by more than one cache. When a block no longer is shared among the caches
it is marked as private and the update broadcast is no longer necessary. This makes this type of
protocol use write through for shared writes and write-back when data is private[115].
A typical implementation of a write-update protocol is the Fire y coherence protocol, used in the
NEC Fire y multiprocessor[114]. Coherence is maintained with the use of four states, which are
combinations of the two state bits dirty & shared, no invalid state is needed since the protocol is
update based[1, 114]. When the dirty bit is set, the cached block is modied in respect to main
memory and must be written back to memory in case the block must be replaced. The shared
bit indicates that one or several other caches may have the same block. When this occurs write
through must be used. Reads and writes to unshared memory addresses are satised by the cache,
no involvement of main memory is needed.
The Xerox Dragon multiprocessor workstation is another known implementation using a slightly
dierent approach then in re y but is still a write-update protocol. The most apparent dierence
is the memory update policy. The memory is not updated using a distributed write like the re y
protocol, instead the owner of the block is responsible for writing data back to main memory. This
compels an invalid or more precisely a owned modied state. The Dragon protocol involves four
states, modied (M) already described, shared clean (Sc) state, the block are shared by two or more
caches and main memory might be up-to-date, and exclusive clean (E) has the same semantics as the
corresponding state in the MESI protocol. The last state shared modied (Sm) two or more caches
holds the block which is not consistent with main memory, this cache is the owner of the block and
are responsible for the write-back of the block when that block is scheduled for replacement. A
block may reside in Sm state only in one cache at a time, all other possible shares holds the block
in Sc. The main benet of the dragon protocol compared to the re y is that frequently updates
of main memory due to shared write hits are avoided[115].
Even the Dragon and re y protocols have their short comings, in the Dragon protocol there exist
occasions when the main memory responds to a data request, even though there exist valid copies
in other caches. This situation appears due to only a dirty cache are allowed to respond to a data
requests. In the Fire y protocol clean cache lines can be transfered between caches, but a write to
a shared block will impose excessive memory trac as the main memory must be updated[113].
In an attempt to reduce this unnecessary access to main memory, Takahashi et al. propose the
(CRAC)Coherence solution Reducing memory Access using CCU protocol[113]. The CCU Central
Coherence Unit has two responsibilities, monitors all cache tags and maintain coherence through
cache to cache transfer, it also arbitrates concurrent bus requests from dierent processors and
controls the data transfer between o-chip memory and the on-chip caches. In order to keep
caches and memory consistent the protocol incorporates ve states: Invalid, Clean-Exclusive, DirtyExclusive, Clean-shared, and Dirty-shared. Upon a memory request data can come from any cache
holding a copy, reducing memory trac when copies exists. Writes to shared lines do not result in
an access to main memory, data are instead sent to other caches that have a copy of that line. Only
lines in Dirty-exclusive or Dirty-shared state are responsible for updating memory, which occurs
when the actual line is being replaced. The responsibility of update the memory can be transfered
to other shares of the block, this occurs when a Dirty Shared block must be replaced and another
cache has that line. Data is transfered and set in Dirty state.
Adaptive protocols Apparently non of the described protocol delivers optimal performance across all
types of work loads[115]. A solution that performs better than both pure write-invalidate or writeupdate protocols is to combine the benets from each of them in to a new type of protocols. These
adaptive protocols tries to achieve optimal performance by adaptation of used coherence mechanism
according to observed and predicted data use[115]. This led to a variety of dierent protocols, the
RWB (Read Write Broadcast) and EDWP (Ecient Distributed Write Protocol) among several
others, more information about RWB and EDWP can be found in[103, 104]. This diversity resulted
62
in a need for a standardized cache coherence protocol

The MOESI-protocol introduced by Sweeazey & Smith [106]. The MOESI-class (or model)[115]
protocols consists of ve states representing, Modied (M), Owner (O), Exclusive (E), Shared (S),
and Invalid (I). A large set of existing protocols falls in to this class since the ve states represent
a super set of the other protocols, they can all be described using the MOESI-model[108]. The
Class is very exible, since each cache in the system can employ a dierent method for coherence
and write policy from the class at the same time[115]. The IEEE standard 896 for the future bus
describes its functionality.[109]
4.5.6 Directory-based coherence
For computers with distributed memory bus-based coherence will not deliver enough bandwidth when
the system scales, another approach must be considered. Scalable cache coherence is mainly based upon
using a directory that maintains the state of individual memory blocks and message passing between the
directories to keep the system consistent. This class of multiprocessors are often called Cache Coherent
NUMA or just CC-NUMA architectures[1]. When a processor requests a memory block, it must look up
the state of the block in the directory. Every block of main memory has got a record associated to it,
that contains information of all the caches that currently have a copy of that memory block. A nod in
the system that encounters a cache miss must communicate via the interconnect with the directory that
holds that block. The actual location of the directory can be obtained from the memory address, since
each directory is coupled to its corresponding main memory (see gure 29). The information gained from
the directory determines what must be done in order to obtain a copy of the requested memory block.
This includes communication with the actual holder of the block, sending invalidation to other holders,
and receive acknowledgments from those caches. The requesting node must also when needed inform the
directory of possible changes of the block's state, this is also done using interconnect communication.
Most of the communication through the interconnect are performed by a Communication Assist (CA)
rather then by the processor it self[1]. The key characteristic of directory-based coherence protocols are
the use of a directory, which stores information of the system's global state in regard to coherence between
main memory and the caches[115]. Directory-based cache coherence protocols can either be invalidate
or update based. Invalidate-based protocol requires that the cache to write to a block has exclusive
ownership of that block, and update protocols requires an order preserving network[107].
P1
Pn
Cache
Cache
Memory
Memory
Directory
Directory
Distributed memory
Figure 29: Distributed memory with cache directories assigned to the memory.
The previously example of a directory, can be said to be a Full-map directory. Fullmap directories are in the class of centralized directory-based cache coherence protocol chaiken-90.
They resides in main memory and have entries for every memory block that is cachable[112, 115].
A full-map protocol make use of directory entries with one bit per processor and a dirty bit, where
each bit represents if the block is present or not, in the corresponding processor's cache. The
dirty bit indicates that the cache have write permission to the block, only one processor have write
permission in an instant time. The caches has two bits for state information, valid, not valid, and
Full-map directory
4.6 Hardware-driven prefetching
63
the second bit indicates if the cache has write permission for that block or not. The drawback is
that the protocol do not scale well in respect to memory overhead[112].
Limited directories The motivation for limited directories are the problem with memory overhead
observed in full-map directories[115]. The approach to reduce the overhead, is to restricting the
number of simultaneous shares of a block. This directory replace the presence vector used by
full-map directories with a small number of identiers that points out the shares of the block.
The performance of the limited directory in comparison to the full-map is dependent of amount of
shared data, the number of processors that access each shared location, and the synchronization
method[112].
Chained directories An improvement in scalability is to introduce a chained or distributed directory,
which do not impose any limit to the number of possible shares of a block[115]. The directory is
spread across the individual caches. A linked list is used to maintain control over all possible shares
of that a memory block. The main memory contains a link to last cache to become sharer of the
block, all the caches has a pointer entry used to point out the next sharer of that block. Two types
of chained directories exist single (i.e Stanford Distributed-Directory SDD[111]) or double linked
(i.e The Scalable Coherent Interface SCI[110]).
In the simple single linked version of chained directories, main memory entries has a pointer to the
rst cache23 that have a copy of that actual block[107]. Each of one of all caches in the list contains
a link to the next cache that have a copy of the block, except the last which contains the chain
terminator[115]. The single link chain can introduce extra overhead in replacement of caches lines,
causing extensive invalidations[111, 115, 112].
Instead of only use a single pointer, each list entry contains a forward and a backward pointer, this
approach solves the replacement problem encountered with only single linked lists[115]. Since the
replaced block easily can be dropped by chaining its predecessor and successor24.
The performance of distributed directories are competable to full map directories, whether double
linked are better than single linked chains is debated[111, 112].
Prefetching has proven to be an eective method in order to tolerate memory latency[67]. This is
achieved by letting the CPU overlap data accesses with computations. There exists two broad categories
of prefetching, hardware driven and software assisted. In this section only hardware based techniques
are in focus, software based prefetching are described in section 2.7.5. Except from few minor issues
concerning the hardware originating from software controlled prefetching. The main advantage of hardware driven prefetching is the dynamic handling of prefetch at runtime without compiler assistance as
for software based prefetch[70]. Hardware-based prefetching techniques relies on speculation in order to
predict future references patterns based on information regarding past referenced patterns[70], information which dynamicly is provided to the prefetch hardware during runtime. The additional hardware
for prediction and management of prefetching ought to be simple and not in the critical path of the
processor's cycle-time[1].
Applications that have good caches performance or exhibit irregular reference patterns will not benet
form prefetching, programs that do iterate over long arrays can achieved performance improvements[70].
Systems that exhibits large latencies, are likely to be the most beneted by prefetch. This due to
that stall cycles represent a signicant amount of the total execution time in those systems[70]. When a
system utilizes prefetching memory trac will increase due to prefetch of obsolete data, increased amount
of cache misses due to con icts with established working set, extra invalidations caused by additional
write-sharing, and increased rate of invalidations misses due to prefetch for writes[67].
23 Head of the list.
24 Further information
about this can be found in[110, 111, 112]
64
When prefetching is used the responsibility of the cache will increase when recently prefetched data
must coexist along with the caches current working set. Causing negative eects on the total miss rate,
due to con ict misses between the prefetched data and already cached data[69].
4.6.1 One-Block-Lookahead: (OBL)
The most Simple form of sequential prefetching is variations of one-block-lookahead[71, 70]. Signicant
to all OLB schemes are that they initiates a prefetch of the next consecutive block. This technique is
dierent from just doubling block size since the prefetched block is treated as a separate item regarding
cache replacement. The dierent OBL-schemes are classied depending on what action triggers the
prefetch. Most of the techniques described as OBL can be modied to prefetch more than just one block.
Prefetch always Every reference generates a prefetch of its successive block.
prefetch-on-miss When a memory reference causes a miss in the cache, the referenced block will be
fetched. Then the next block will be prefetched if that block not already resides in the cache. This
simple scheme can reduce the number of misses in a strictly sequential references stream in half[45].
tagged prefetch Using tagged prefetch a tag-bit is associated with every block. The bit is used as an
indication of the blocks status in regard to when prefetch of its successive block is to be issued.
When a block has been prefetched its tag is initiated to zero. Then for every reference to that
block the tag will be set to one, whenever a blocks tag turns from zero to non-zero its successor
block will be prefetched. For a strict sequential reference stream, misses except the rst can be
reduced to zero[45, 70]. Due to the extra overhead and complexity introduced by the tag bit and
its management, tagged prefetch are more expensive to implement.
A severe risk with all OBL-schemes is that the actual prefetch action been issued to late for the memory
system to respond before data is needed by the processor[70]. As a further extension to the OBL paradigm,
several consecutive blocks can be prefetched and buered in a
queue, called stream buer.
fifo
4.6.2 Stream buer
Stream buer (or just buer ) proposed by Jouppi[45] is a cache-line FIFO-queue inserted between two
levels in the memory hierarchy for prefetching of consecutive memory blocks that are to be accessed by
the processor in near future. The stream buer FIFO consists of stacked cache lines containing tag eld,
available bit and storage for the memory block. A tag comparator are assigned to the head entry of the
FIFO, hence in the rst variant of the stream buer only the rst entry are checked for a hit and transfered
up in the hierarchy. Assigned to the last line in the buer are a adder responsible 25of calculating the next
address to prefetched data from, in this case the the last address + one unit stride . When a cache miss
occurs the buer starts to prefetch successive memory blocks starting with the address generating the
cache miss. When the prefetch request are being sent to lower level of the memory system, the tag of the
block to be fetched is written in the stream line and the available bit is set to false. Upon the arrival of
requested block, data is placed in the entry and the available bit is set. In the case of subsequent cache
misses, the head of buer are compared and if the tag matches the referenced address and the available
bit is set data is fetched from the buer in a single cycle. As one entry are moved from the buer and
up the hierarchy, the entries remaining in the buer are shifted towards the head, leaving the tail entry
empty. Based on the previous last entry the following address is calculated and passed on as a prefetch
request. If an access misses both in the cache an the buer, the contents of the buer is ushed and a new
prefetch cycle begins by prefetching the address that caused the ush. Write-backs bypasses the buer
invalidating stale copies that can reside in the buer. In order to handle interleaved streams of data,
multi way streams was introduced enabling prefetching of multiple streams in parallel. When an address
reference miss in the cache, all stream buers are probed for a hit. The one containing the data provides
it, in case of no hit in the streams, the oldest stream is ushed and ready to begin prefetching again.
25 A
unit stride corresponds to the instruction length or the processor, word, double word, quad word or greater.
65
For replacement(e.g ush) are LRU often used. Design parameters for stream buers are the depth of
the buer which are equal to the number of prefetched blocks each stream contains, and the number of
buers used[52]. The optimal depth depend upon the performance of the memory hierarchy, a stream
shall at least be deep enough so the main memory latency are covered.
4.6.3 Filter buers
A problem detected in the previous described buer is wasted bandwidth, fetching useless blocks that
never are referenced. In an attempt to reduce this squander of memory bandwidth Palacharla &
Kessler[52] suggest a scheme that will ltrate isolated references from being fetch and inserted in to
the buer using a allocation lter. The lter prevents the buer to prefetch a block when a reference
miss26 but for the second miss the buer will prefetch the subsequent block to the referred data yielding
miss number two. Block i and block i+1 will be ignored and not prefetched but i+2 down to block i+n.
The management of miss count and reference comparisons are handled by a history buer simply called
lter. Simulations[52] indicates that the proposed lter can reduce the extra waste of bandwidth with as
much as 50%. Further optimizations to the scheme can be made. The use of unit strides can be replaced
with a use of dynamicly calculated strides depending on previous miss addresses. When multiple streams
are in use there are a chance of data overlapping between streams. In order to benet the most of multiple
streams, overlapping must be eliminated. this is done by assigning a tag comparator to to each line in
the buer and compare internal addresses between buers. This modication will result in a much more
complex construction then for the rst purposed original stream buer. The buer can be extended to
not only allow the head element to be transfered but even enable non head entry's to be passed on to the
next level. Hence the time required for loading data that are not at the top of the FIFO will be reduced
since no ushing is required.
4.6.4 Opcode-driven cache prefetch
A instruction op-code based prefetch scheme (IOBP) is proposed by Chi & Lau[68], that performs data
prefetching based on information given by the instruction decode unit. In programs there are predictable
pattern of constant strides in arrays or in pointer references. Those data types are commonly accessed
using a index-displacement addressing mode. Information of this patterns can be extracted from the
instructions used for this kind of references,
or
instructions.
In addition to load or store data, the register used for address calculation will be updated with the
calculated eective address for the reference. Executing a
instruction using index
displacement are as follows:
Rt,(Rx+Disp). The eective address (EA) = (Rx) + Disp, Rt
= (EA) when the execution is nished register Rx will be updated with the value of the eective address,
hence Rx = EA. This is normally used in order facilitate access to successive data. In the prefetch scheme
the procedure is utilized in order to have Rx prepared to calculate the next expected reference, which
now is EA + Disp. That address is sent to the prefetch unit which will perform the actual prefetch.
These actions are repeated for every issued
instruction.
A possible problem with this method arises when indexing with constant stride of 1 and cache blocks
are larger (i.e by a factor 4). Issuing a reference to location A will result in prefetching of location
A+1, which already resides in the cache, hence no prefetch will be issued. This will go on until A+5 is
referenced, which will trigger a prefetch. This can result is stall cycles for the processor which might need
the data faster than the prefetch can provide. A better approach is to still predict next reference based
on the data address but prefetch next data based on cache block addresses. Thus, when address A+1
placed in cache block B is referd cache block B+1 is to be prefetched. To further improve performance
a scheme to prefetch multiple cache blocks can be invoked. The required additional hardware is very
simple, no architectural changes needs to be done there is also no need for new compiler optimizations.
load/store-update
load/store-modify
load/store-update
load/store
load/store-update
26 Miss
in the cache and in the stream buer.
66
4.6.5 Reference Prediction Table: (RPT)
Due to the inability for sequential prefetching techniques to handle strides through nonsecutive memory
blocks, there is need for a technique that takes advantage of both large and small strided array-referencing
patterns. This can be obtained with a special hardware entity which monitors the reference pattern
issued in the processor. Prefetching actions are detected by address comparison between consecutive
load/store instructions. When the prefetch hardware detects a predictable reference pattern generated
by a particular load or store, it will start prefetching for that instruction. The monitoring of memory
accesses, prediction of prefetch address, and calculation of stride is described in the following example:
During successive loop iterations a memory instruction m references the addresses a1, a2 ,a3, up to an.
A prefetch is initiated if a2 - a1= 6= 0, the represents the stride to use for further accesses. The rst
address to be prefetched will be A3 = a2 + , where A3 is the predicted value of address a3. Prediction
will continues until An 6= an.
To record the reference history of the memory instructions during program execution, a special purpose
cache is used reference prediction table (RPT). Each line in that cache contains,
memory instruction
address, previous address accessed by that instruction, stride value if any27, and the state of the cache
entry. The RPT is indexed with the the program counter.
When an instruction rst enters the RPT it is said to be in initial state and its stride is zero, due to
not been executed before. Later that instruction is executed again its state is set to transient and a stride
have been calculated. Transient state are an indication that a reference pattern for that instruction is
emerging. RPT will then issue a prefetch to the address estimated from the instructions stride and its
previous address. The third time that instruction is executed the instruction is promoted to steady state,
which indicates the the stride calculated on previous executions are stable. When an incorrect prefetch
is made for an instruction, that instruction is reset to initial state.
The RPT scheme performs better than sequential schemes does by correctly handle large stride
arrays[70]. There will still be initial misses before the reference pattern is established. At the end
of loop or if irregular, this scheme will produce unnecessary pre-fetches.
4.6.6 Data preloading
Preloading diers from previous prefetch schemes in regard to what information is used to determine
what to be prefetched. All the above described method relies upon information about previous execution
in order to conduct prefetching. In a by Baer & Chen[72, 118] proposed data preloading scheme is based
on predictive information from the instruction stream execution. The architecture is dependent upon a
Branch Prediction Table (BPT) see section 2.2.6 for a more detailed description. A BPT is used to predict
a programs future execution path. Further hardware requirements are a Look-Ahead Program Counter
(LA PC) is used to predict the future of the execution stream, the LA PC is incremented and modied
using information from the BPT. A RPT previously described in section 4.6.5, with a few minor changes
(i.e a new state no fetch is added.) is used. There is also need for a buer that holds addresses of in
progress or outstanding requests, a Outstanding Request List (ORL). The RPT is used in a similar fashion
as before but the LA PC is used for indexing. When the LA PC comes across a load/store instruction
the RPT is searched for that instruction. If that instruction resides in the table, three controls must be
performed in order to determine if a prefetch are to be issued or not. A prefetch will be issued if the
state of the entry is not no fetch and block to be loaded not already in cache or marked in progress in
the ORL. If the control success a load request is issued and the address is stored in ORL.
The scheme has proven quite benecial in environments that exhibits regular access patterns, but performs only moderately when access patterns are more irregular. The advantages gained from prefetching
can however be limited by the memory bandwidth[72].
27 In order to have a stride value it must have been established which happen when an instruction is executed the second
time.

4.6.7 Prefetching in multiprocessors
67
In Multiprocessor systems, preloading of data can originate either form the sender of data sender-initiated
usually in producer consumer situations, or by the receiver of data receiver-initiated. Sender-initiated
preloads often occurs short after data has been produced, whereas receiver-initiated which are the most
frequently used, preloads data before it is actually needed. A key issue for how far ahead a preload
can issued in multiprocessor system is whether the prefetch is binding or non-binding [1]. Binding/nonbinding is in this context applies to how the processor see the data after a prefetch been conducted.
With a binding preload data is bounded at the time of the actual prefetch. No matter if other processors sharing the same block write to it, the the processor issuing the prefetch will see the value as it
was when fetched. The opposite applies for non-binding, where the state of prefetched data are under
constant supervision of the coherence protocol and updated in respect to other processors interventions
to that block. Timing and \luck" are also important in order to make prefetching successful[1]. The
use of explicit prefetch instructions (see section 2.7.5) can in cache coherent systems prematurely issue
a state transition of that block in another cache from exclusive to shared state, which can complicate
the scheme for writes[69]. The use of data preloading in multiprocessor systems can complicate cache
coherence. Further the technique to predict regular patterns in uniprocessor system is not inherently the
same it has to be done in multiprocessors, where loop iterations can be spread over several CPUs. Memory bandwidth can be aected by eventually increased memory trac due to incorrect prefetching [72]
In systems with shared address space without caching of shared data, the need for cache coherence is
of no use. Hence due to the non-caching scheme for shared data, prefetched data is stored in a prefetch
buer from which the processor read the head entry. The buers prefetch depth is adapted to overcome
the latency of the memory system. The actual prefetch are often receiver initiated and the preload
instruction that triggers the the prefetching must be non-blocking and not stall the processor.
In general prefetching in shared memory multiprocessors are a much more complicated task than
prefetching in uniprocessor with no shared memory. This compelled by the greater sensitivity to the
memory sub-system and data sharing hazards[69].The fact that shared data can directly be stored in the
processor's cache, compels the coherence protocol to involve control of data that are being prefetched.
Prefetching also introduce additional negative eects, especially with bus-based shared memory multiprocessor which are very sensitive to increased memory trac due to narrow bandwidth on the bus. This can
make these architectures insensitive to CPU-throughput and totally dependent upon memory throughput which is proportional to the miss rate. Hence, the system's performance will deteriorate using any
prefetch method that increases the miss rate. For the CPU this will result in increased execution time as
the memory systems saturates[69]. Prefetching of shared data in a cache coherent multiprocessors, can
introduce new demands on the entire cache organization. The problem arise when possible future working set of one processor interferes with one or several other processors current working set. Additional
misses will occur when recently prefetched data are invalidated when another processor(s) issues a write
to that block prior to it is being used. Increased miss rate will also occur when data block is prefetched
in exclusive mode, generating invalidations among shares.[69].
5 SUMMARY
68
5 Summary
The evolution of the semiconductor industry has enabled integration of whole systems on a single piece
of silicon. It is the consumer demands for faster, cheaper and smaller products that has forced the
development towards a SoC solution. By combining all the functions into one chip the system becomes
smaller, faster, and less power consuming. To decrease the time-to-market SoCs are entirely or partially
build with IP-components. Thanks to SoC, a whole new domain of products, like small hand held
devices, has emerged. The concept has been around a few years now, but there are still challenges
that needs to be resolved. There is a lack of standards for enabling fast mix and match of cores from
dierent vendors. Further needs are new design methods, tools, and verication techniques. SoC solutions
needs special kind of CPUs that consumes less power, is cheaper, smaller, but still has high-performance
requirements. To fulll all these demands, they are getting more and more complex as the number
of transistors are rapidly growing which has led to the emerging of multiprocessors systems- on-a-chip.
Interconnecting all the IP-cores in a SoC solution diers from the strategy used on systems-on-board.
Since ordinary interconnects have a dierent set of constraints than SoC interconnections, they are
unsuitable for connecting components within a SoC. A number of dierent standards for SoC interconnects
has emerged, all with dierent strategies to solve the hard task of integration. The memory hierarchy has
been a bottleneck in systems for a while. Unfortunately, the gap between the CPU speed and memory
speed continues to grow, leading to larger and larger caches serving memory requests locally. Caches in
multiprocessor systems introduces many diculties that must be solved. The rate of process renement
makes it possible to embed more and more of the needed memory on-chip, which in some cases are large
enough for a target application.
REFERENCES
69
References
[1] David E.Culler, Pal Singh. Jaswinder and Anoop Gupta. Parallel Computer Architecture, A Hardware/software approach, Morgan Kaufmann Inc, San Fransisco California, 1999, ISBN 1-55860-343-3.
[2] Sven Eklund avancerad datorarkitektur, Studentlitteratur, Lund, 1994, ISBN 91-44-47671-X.
[3] global sources. System-on-a-chip sets new rules in the industry
global sources MArs 10, 1999.
http://www.globalsources.com/MAGAZINE/EC/9905/SOCREP.HTM
[4] Bill Cordan, Palmchip Corporation An eecient bus architecture for system-on-chip design Custom
INtegrated Circuits, 1999. Proceedings of the IEEE 1999 pp: 623{626
[5] Sibabrata Ray, Hong Jiang. A recongurable bus structure for multiprocessors with bandwidth reuse,
Journal of Systems Architecture 45, 1999
[6] Hammond Lance, Olukotun Kunle. Considerations in the Design of Hydra: A Multiprocessor-on-aChip Microarchitecture, Stanford Technical Report CSL-TR-98-749, Stanford University, 1998.
[7] Lars-Hugo Hemert Digitala kretsar, Studentlitteratur, Lund, 1996, ISBN 91-44-00099-5.
[8] John L. Hennessy, David A. Patterson Computer Architecture A Quantitative Approach, second edition, Morgan Kaufmann Inc, San Fransisco California, 1996, ISBN 55860-329-8.
[9] Vincent P. Heuring & Harry F. Jordan Computer Systems Design and Architecture, Addison Wwesley,
California, 1997, ISBN 0-8053-4330-X.
[10] Howard Sachs, Mark Birnbaum VSIA Techical Challenges Custom Integrated Circuits, 1999. Proceedings of the IEEE 1999 , 1999 , Page(s): 619 -622
[11] Geert Roosseel, Sonics Inc. Decouple core for proper integration eeTimes Jan 3, 2000.
www.eetimes.com.story/OEG20000103S0048
[12] Jon Turino, SynTest Technologies, Inc. emphDesign for Test and Time to Market - Friends or Foes
Test Conference, 1999. Proceedings. International, 1999 , Page(s): 1098 -1101
[13] Lewis, Je. Intellectual Property (IP) Components, Artisian Components Inc.
http://www.ireste.fr/fdl/vcl/ip/ip.htm
[14] Olukotun Kunle, Bergman Jules, Kun-Yung Chang and Basem Nayfeh. Rationale, Design and Performance of the Hydra Multiprocessor, Stanford Technical Report CSL-TR-94-645, Stanford University,
1994.
[15] RealChip Custom communication Chips.
System-on-Chips,
http://www.realchip.com/Systems-on-Chips/systems-on-chips.html
[16] Rincon Ann Marie, Cherichetti Cory, Monzel James. A, Stauer David, R, Trick Michael, T.
IBM Microelectronics Corp.
Core Design and System-on-a-Chip Integration, IEEE Design & Test of Computers. Volume: 14 4 ,
Oct.-Dec. 1997, pp: 26{35
[17] Rincon Ann Marie, Lee William. R and Slattery Michael.
IBM Microelectronics Corp.
The Changing Landscape of System-on-Chip Design, Custom Integrated Circuits, 1999. Proceedings of
the IEEE 1999, pp: 83{90
REFERENCES
70
[18] IBM Microelectronics Corp & Synopsys, Inc./Logic Modeling Design Environment for System-OnA-Chip Products & solutions Success Stories
http://www.synopsys.com/products/success/soc/soc wp.html
[19] Wilson, Ron. Is SoC really dierent?,
T November 8, 1999. http://www.eetimes.com/story/OEG19991108S0009
[20] Wilson, Ron. The rest of the SoC task,
T October 11, 1999. http://www.eetimes.com/story/OEG19991011S0006
[21] David Patterson. Vulnerable Intel,
The New York Times, June 9, 1998, pp: 44{49
[22] Manfred Schlett. Trends in Embedded-Microprocessor Design,
Computer, August, 1998.
[23] Wulf Wm. A and McKee Sally. A. Hitting the Memory Wall: Implications of the Obvious, Computer
Science Report CS-94-48.
[24] Prince Betty Memory strategies International, USA
Memory in the fast lane, IEEE Spectrum Feb 1994, Vol 31, Issue 2.
PP: 38{41.
[25] Golla C and Ghezzi S.
Flash Memory Architecture, Microelectron Reliab.. Vol 38, No 2, 1998, pp: 179{184.
ee
imes
ee
imes
[26] Abraham S.G, Sugumar R.A, Windheiser D, Rau B. R and Gupta R. redictability of load/store
instruction latencies Microarchitecture, 1993, pp: 139{152.
[27] Boland K and Dollas A, AT&T Global Information Predicting and precluding problems with memory
latency IEEE Micro, Aug 1994, Vol 14, Issue 4, pp: 59{67.
[28] Katayama Y, IBM Research, Tokyo, Japan Trends In Semiconductor Memories IEEE Micro, 1997,
Vol 17, Issue 6, pp: 10{17.
[29] Rao R. Tummala, Vijay K. Madisetti System on Chip or System on Package ? IEEE Design & Test
of Computers Volume: 16 2 , April-June 1999 , Page(s): 48 -56
[30] Mark Dorais, ROHM ELECTRONICS Analyze ASIC Designs To Optimize Integration Levels ELECTRONIC DESIGN ONLINE, August 1999
http://devel.penton.com/ed/Pages/magpages/aug0999/digitech/0809dt3.htm
[31] Lage C, Hayden J.D and Subramanian C. Adv. Products Res. & Dev. Lab., Motorola Inc., Austin,
TX, Advanced SRAM technology-the race between 4T and 6T cells Electron Devices Meeting, 1996.,
International Dec 1996, pp: 271{274.
[32] Takai Y, Nagase M, Kitamura M, Koshikawa Y, Yoshida N, Kobayashi Y, Obara T, Fukuzo Y and
Watanabe H. 250 Mbyte/s Synchronous DRAM Using a 3-Stage-Pipelined Architecture IEEE Journal
of Solid-State Circuits. April 1994, Vol 29, Issue 4, pp: 426{431.
[33] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, Lee Todd.
Surviving the SOC Revolution, A Guide to Platform-Based Design, Kluwer Academic Publishers, 1999,
ISBN 0-7923-8679-5.
[34] National Semiconductor, Geode Products
Geode SC1400 (Information Appliance-on-a-Chip)
www.national.com/appinfo/solutions/0,2062,243,00.html
REFERENCES
71
[35] International Competence Cluster for System-on-a-Chip to be created in Sweden

http:://www.isa.se/default.cfm?page=/regionalinfo/index.htm
[36] Embedded Systems and the Year 2000 Problem
Mark A. Frautschi
Draft of 10 September 1999, Shakespeare and Tao Consulting
[37] The Pittsbourgh Digital Greenhouse
http://www.digitalgreenhouse.com/
[38] www.arm.com
[39] AMBA specication (REV 2.0) ARM Limited, 13 may 1999. www.arm.com
[40] www.palmchip.com
[41] Crawford, S.E & DeMara, R.F. Cache coherence in a multiport memory environment Massively
Parallel Computing Systems, Proceedings. 1994. pp: 632{642.
[42] O'Krafka, B.W & Newton, A.R. An empirical evaluation of two memory-ecient directory methods
Computer Architecture, 1990. Proceedings. 1990. pp: 138{147.
[43] Farkas, K.I, Jouppi, N.P & Chow, P. How useful are non-blocking loads, stream buers and speculative
execution in multiple issue processors? High-Performance Computer Architecture, Proceedings. 1995.
pp: 78{89.
[44] Seznec, A. DASC cache High-Performance Computer Architecture, Proceedings. 1995. pp: 134{143.
[45] Jouppi, N.P. Improving direct-mapped cache performance by the addition of a small fully-associative
cache and prefetch buers Computer Architecture, Proceedings. 1990. pp: 364-373.
[46] Jouppi, N.P. Cache Write Policies and Performance Computer Architecture, Proceedings. 1993. pp:
191-201.
[47] Agarwal, A & Pudar, S.D. Column-associative Caches: A Technique For Reducing The Miss Rate
Computer Architecture, Proceedings, 1993. pp: 179{190.
[48] Pendse, R & Bhagavathula, R. Performance of LRU block replacement algorithm with pre-fetching
Circuits and Systems, Proceedings. 1998. pp: 86{89.
[49] Colagiovanni, L. & Shaout, A. Cache memory replacement policy for a uniprocessor system Electronics Letters, 14 April 1990. Vol. 26 Issue: 8. pp: 509{510.
[50] So, K & Rechtschaen, R.N. Cache operations by MRU change IEEE Transactions on Computers
June 1988. Vol. 37 Issue: 6. pp: 700{709.
[51] Stiliadis, D & Varma, A. Selective victim caching: a method to improve the performance of directmapped caches IEEE Transactions on Computers, May 1997, Vol. 46, Issue: 5 pp: 603{610.
[52] , S & Kessler, R.E. Evaluating stream buers as a secondary cache replacement Computer Architecture, Proceedings 1994. Proceedings, pp: 24{33.
[53] Wong, W.A & Baer, J-L. Modied LRU policies for improving second-level cache High-Performance
Computer Architecture, 2000. HPCA-6. pp 49{60.
[54] CoreConnect Bus Architecture International Business Machines Corporation, 1999
www.chips.ibm.com
[55] Microprocessor system buses: A case study Ehud Finkelstein, Shlomo Weiss Journal of Systems
architecture 45 (1999) pp: 1151{1168
REFERENCES
72
[56] Computer Architecture Computational Science Education Project,

1995
http://csep1.phy.ornl.gov/csep.html
[57] Cheng-Ta Hsieh, Massoud Pedram Architectural Power Optimization by Bus Splitting Date Conference Proceedings 2000, pp: 612{617
[58] Instruction Set Denition and Instruction Selection for ASIPs Johan Van Praet, Gert Gossens, Dirk
Lanneer, Hugo De Man IEEE, High-Level Synthesis, 1994, Proceedings of the Seventh International
Symposium.
[59] Xilinx DSL Modems Glossary http://www.xilinx.com/products/xaw/mdm/dsl/dsl gls.htm
[60] W.K. Dawson, R.W. Dobinson Buses and bus standards Computer Standards & Interfaces 20 (1999),
pp: 201{224
[61] Using the Multistage Cube Network Topology In Parallel Supercomputers Howard Jay Siegel, Watne
G. Nation, Clyde P. Kruskal, Leonard M. Napolitano, JR
Proceedings of the IEEE, VOL 77, NO. 12, December 1989, pp: 1932{1952
[62] Gyungho Lee, Bland W Quattlebaum, Sagyeun Cho, Larry L. Kinney Design of a bus-based shared
memory multiprocessor DICE Microprocessors and Microsystems 22 (1999), pp: 403{411
[63] David J. Lilja Reducing the Branch Penalty in Pipelined Processors Computer, 1988, Volume: 21 7
, July 1988 , pp: 47 -55
[64] Brad Calder, Dirk Grunwald. Fast & Accurate Instruction Fetch and Branch Prediction Computer
Architecture, 1994., Proceedings the 21st Annual International Symposium on , 1994 , pp: 2{11
[65] Colin A. Warwick, Abbas Ourmazd Trends and Limits in Monolithic Integration by Increasing the
Die Area
IEEE Transactions on semiconductor manufacturing, vol 6. No3, august 1993.
[66] Tien-Fu Chen & Jean-Loup Baer. Eective hardware-based data prefetching for high-performance
IEEE Transactions on Computers May 1996, Vol. 44, Issue: 5, pp: 609{623.
[67] Tien-Fu Chen & Baer, J-L. A performance study of software and hardware data prefetching schemes
Computer Architecture, proceedings. April 1994. pp: 223{232.
[68] Chi-Hung Chi & Siu-Chung Lau. Reducing data access penalty using intelligent opcode-driven cache
prefetching Computer Design: VLSI in Computers and Processors, 1995. ICCD. pp: 512{517.
[69] Tullsen, D.M & Eggers, S.J. Limitations Of Cache Prefetching On A Bus-based Multiprocessor Computer Architecture, proceedings May 1993. pp: 278{288.
[70] Vander Wiel, S.P & Lilja, D.J. When caches aren't enough: data prefetching techniques IEEE Computer July 1997. Vol. 30. Issue: 7. pp: 23{30.
[71] Kim, S & Veidenbaum, A.V. Stride-directed prefetching for secondary caches Parallel Processing,
Proceedings 1997. Aug 1997. pp: 314{321.
[72] Jean-Loup Baer and Tien-Fu Chen. An eective on-chip preloading scheme to reduce data access
penalty Proceedings of the 1991 conference on Supercomputing. pp: 176{186.
[73] Mladen Berekovic, Dirk Heistermann, Peter Pirsch A Core Generator For Fully Synthesizable and
Highly Parameterizable RISC-cores for System-on-chip Design
Signal Processing Systems, 1998. SIPS 98. 1998 IEEE Workshop on , 1998 , pp: 561{568
[74] Tullsen, D.M., Eggers, S.J., Levy, H.M. Simultaneous multithreading: Maximizing on-chip parallelism
Computer Architecture, 1995. Proceedings., 22nd Annual International Symposium on, 1995 , pp: 392{
403
REFERENCES
73
[75] Eggers, S.J.; Emer, J.S.; Levy, H.M., Lo, J.L.; Stamm, R.L.; Tullsen, D.M. Simultaneous multithreading: a platform for next-generation processors IEEE Micro, Volume: 17 5 , Sept.-Oct. 1997 , pp:
12{19
[76] Stefan Pees, Martin Vaupel, Vojin Zivojnovic, Heinrich Meyr On Core and More: A Design Perspective for Systems-on-a-chip
Signal Processing Systems, 1997. SIPS 97 - Design and Implementation., 1997 IEEE Workshop on ,
1997 , pp: 60{63
[77] Design Environment for System-On-A-Chip IBM Microelectronics Corp, Synopsys, Inc. , 1997
http://www.synopsys.com/products/success/soc/soc wp.html
[78] Weiss, A.R. The standardization of embedded benchmarking: pitfalls and opportunities Computer
Design, 1999. (ICCD '99). International Conference on, 1999, pp: 492{508
[79] Peter N. Glaskowsky Silicon Magic:DVINE-LY INSPIRED ? Microprocessor report, March 27, 2000
[80] Roman L. Lysecky, Frank Vahid, Tony D. Givargis Techniques for reducing read latency of core bus
wrapper Proceedings of Design, Automation and Test in Europe 2000, pp:84{91
[81] Lefurgy, C.; Bird, P.; Chen, I.-C.; Mudge, T. Improving code density using compression techniques
Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM International Symposium on ,
1997 , pp: 194{203
[82] Lefurgy, C.; Piccininni, E.; Mudge, T. Reducing code size with run-time decompression HighPerformance Computer Architecture, 2000. HPCA-6. Proceedings. Sixth International Symposium on
, 1999, pp: 218{228
[83] IDT Peripheral Bus (IPBus) Intermodule Connection Technology Enables Broad Range of SystemLevel Integration An IDT White Paper www.silicore.net/pdles/idtipbus.pdf
[84] W.A Halang Real-time systems: Another perspective The Journal of Systems and Software. April
1992, pp: 101{108
[85] http://www.sussex.ac.uk/engg/research/vlsi/projects/pibus/
[86] Wade D. Peterson Application Note: WBAN003, Design Philosophy of the WISHBONE SoC Architecture September 7, 1999 www.silicore.net
[87] Lichen Zhang Predictable architecture for real-time systems Information, Communications and Signal
Processing, 1997. ICICS., Proceedings of 1997 International Conference on
Volume: 3 , 1997 , pp: 1761{1765
[88] B. Cogswell and Z. Segall MACS: A predictable architecture for real-time systems IEEE, December
1991, Proceedings of 12th Real-Time Systems Symposium, pp: 296{305
[89] J Turley Evaluating Embedded Processors MICRO Design Resources, Sebastopol, Calif., 1997.
[90] Lennart Lindh, Tommy Klevin Scalable Architecture for Real-time Applications and use of busmonitoring Real-Time Computing Systems and Applications, 1999., pp: 208{211
[91] Tricore Architecture Overview Handbook release version 1.3.0 1999 Inneon Technologies Corp.
www.inneon.com/us/micro/tricore/sub arc.htm
[92] FISPbus Foundation Library Mentor Graphics, 1998 http://www.mentor.com/inventra/cores/catalog/sp bus peripheral
[93] www.silicore.net
[94] Wade D. Peterson Application Note: WBAN003. Design Philosophy of the WISHBONE SoC Architecture September 7, 1999, www.silicore.net
REFERENCES
74
[95] Wishbone Interconnection for Portable IP Cores, Specication Revision A www.silicore.net
[96] Ann Harwood Motorola's Peripheral Interface Standards Embedded Processor Forum, May 1999
http://www.mot.com/SPS/MCORE/downloads/nal-epf.pdf
[97] Marc Torrant
Simultaneous Multithreading Presentation, May 14, 1999
http://www.rit.edu/ mxt8837/thesis/Overview 5 14 99/index.htm
[98] MCORE Reference Manual, Motorola Inc., 1997
[99] Triscend homepage: http://www.triscend.com
[100] Tensilica homepage: http://www.tensilica.com
[101] ARC homepage: http://www.arccores.com
[102] Transmeta homepage: http://www.transmeta.com
[103] Rudolph, L & Segall, Z. Dynamic Decentralized Cache for MIMD Parallel Processors. Proceedings
of 11th ISCA, 1984, pp: 340{347.
[104] A Cache Coherence Approach for Large Multiprocessor Systems. Proceedings of Supercomputing
Conf, IEEE, 1988, pp:337{345.
[105] Tyson, G.; Farrens, M.; Pleszkun, A.R. Misc: A Multiple Instruction Stream Computer. Microarchitecture, 1992. MICRO 25., Proceedings of the 25th Annual International Symposium on , 1992 ,
pp: 193{196
[106] Sweazey, P & Smith, A. J. A class of Compatible Cache Consistencey Protocols and their Support
by the IEEE Futurebus. Proceedings of 13th ISCA, 1986. pp: 414{423.
[107] Glasco, D. B. Design and Analysis of Updtae-Based cache Coherence Protocols for Scalable SharedMemory Multiprocessors. Technical report NO. CSL-TR-96-670. 1995.
[108] Sweazey, P. VLSI support for copyback caching protocols on Futurebus. Computer Design: VLSI in
Computers and Processors, 1988. ICCD '88., Proceedings of the 1988 IEEE International Conference,
1988, pp: 240{246.
[109] Compcon Spring '88. Thirty-Third IEEE Computer Society International Conference, 1988, pp:
505{511. Shared memory systems on the Futurebus
[110] James, D.V, Laundrie, A.T, Gjessing, S, & Sohi, G.S. Distributed-directory scheme: scalable coherent iterface. IEEE Computer, june 1990, Vol. 23. Issue 6. pp: 74{77.
[111] Thapar, M, & Delagi, B. Distributed-directory scheme: Stanford distributed-directory protocol. IEEE
Computer, June 1990, Vol. 23, Issue 6, pp: 78{80.
[112] Chaiken, D, Fields, C, Kurihara, K, & Agarwal, A. Directory-based cache coherence in large-scale
multiprocessors. IEEE Computer June 1990, Vol 23, Issue 6, pp:49{58.
[113] Takahashi, M, Takano, H, Kaneko, E, & Suzuki, S. A shared-bus control mechanism and a cache
coherence protocol for a high-performance on-chip multiprocessor. Proceedimgs of High-Performance
Computer Architecture Symposium 1996, pp: 314{322.
[114] Thacker, C.P, Stewart, L.C, & Satterthwaite, E.H., Jr. Fire y: a multiprocessor workstation. IEEE
Transactions on Computers, Vol. 32 Issue 8, Aug 1988. pp: 909{920.
[115] Tomasevic, M, & Milutinovic, V. Hardware approaches to cache coherence in shared-memory multiprocessors, Part 1. IEEE Micro, Vol. 14 Issue 5, Oct. 1994, pp: 52{59.
REFERENCES
75
[116] Tanenbaum, Andrew. S. Distributed Operating Systems, Prentice-Hall, 1995, ISBN 0-13-143934-0.
[117] Kuskin, J, Ofelt, D, Heinrich, M, Heinlein, J, Simoni, R, Gharachorloo, K, Chapin, J, Nakahira, D,
Baxter, J, Horowitz, M, Gupta, A, Rosenblum, M, & Hennessy, J. The Stanford FLASH multiprocessor.
Proceedings of Computer Architecture, 1994, pp: 302{313.
[118] An eective programmable prefetch engine for on-chip caches. Tien-Fu Chen Proceedings of Microarchitecture, 1995, pp: 237{242.
[119] Lenoski, D, Laudon, J, Gharachorloo, K, Gupta, A, Hennessy, J. The directory-based cache coherence protocol for the DASH multiprocessor. Proceedings of Computer Architecture, 1990. pp: 148{159.
[120] Cache-only memory architectures. Dahlgren, F, & Torrellas, J. IEEE Computer. Vol 32 Issue 6,
June 1999, pp: 72{79.
[121] Hennessy, J, Heinrich, M, & Gupta, A. Cache-coherent distributed shared memory: perspectives
on its development and future challenges. Proceedings of the IEEE Vol. 87 Issue 3 , March 1999, pp:
418{429
[122] www.sonicsinc.com
[123] Drew Wingard, Alex Kurosawa Integration Architecture for System-on-a-Chip Design Custom Integrated Circuits Conference, 1998, Proceedings of the IEEE 1998 pp: 85{88
[124] A. John Anderson Multiple Processing, A systems overview Prentice Hall, 1989, ISBN 0-13-605213-4
[125] Govindan Ravindran, Michel Stumm Performance Comparison of Hierarchical Ring- and Meshconnected Multiprocessor Networks High-Performance Computer Architecture, 1997., Third International Symposium on , 1997 , pp: 58{69
[126] Gasbarro, J.A. The Rambus memory system. International Workshop on, Memory Technology,
Design and Testing, 1995, pp:94{96.
[127] Philofsky, E.M. FRAM-the ultimate memory. IEEE International Nonvolatile Memory Technology
Conference, 1996, pp: 99{105.
[128] McFarling, S. Cache replacement with Dynamic Exclusion ACM 1992.
[129] Gillingham, P. MOSAID Technologies Inc. SLDRAM Architectural and Functional Overview. SLDRAM Consortium 29 Aug 1997.
[130] Kanishka Lahiri, Anand Raghunathan, Sujit Dey Fast Performance Analysis of Bus-Based Systemon-Chip Communication Architectures IEEE/ACM International Conference on Computer-Aided Design, 1999 , pp: 566{572
[131] Alpha Systems - Compaq's commitment to Alpha http://www.compaq.com/alphaserver/news/commit letter.html
[132] W.J. Bainbridge Asynchronous Macrocell Interconnect Using Marble Proceedings of the Fourth
International Symposium on Advanced Research in Asynchronous Circuits and Systems, 1998, pp:
122{132
[133] Eyre, J.; Bier, J. DSP processors hit the mainstream Computer Volume: 31 8 , Aug. 1998 , pp:
51{59
[134] Milenkovic, M. Microprocessor memory Management Units. IEEE Micro. April 1990. pp: 70{85.
Socrates
- Specications
Revision: 0.99
Authors

Supervisors

as
Aug 7, 2000
Abstract
This document contains the specications for the SoCrates Congurable Platform, -a SoC multiprocessor system intended for a single FPGA. This is the second of three documents that forms our
Master Thesis in Computer Engineering.
II
CONTENTS
Contents
1 System Architecture
1.1 Functionality . . . . . . . . . . . . . . . . . . .
1.1.1 Functionality demands . . . . . . . . . .
1.2 Implementation . . . . . . . . . . . . . . . . . .
1.2.1 Electrical Interface . . . . . . . . . . . .
1.2.2 Programming of internal registers . . . .
1.2.3 Explicit demands . . . . . . . . . . . . .
1.2.4 Motivation of design choiches/trade-os
1.2.5 Future work . . . . . . . . . . . . . . . .
1.3 Testing . . . . . . . . . . . . . . . . . . . . . .
1.3.1 How to test . . . . . . . . . . . . . . . .
1.3.2 What to test . . . . . . . . . . . . . . .
2 CPU Node
2.1 Functionality . . . . . . . . . . . . . . . . .
2.1.1 Motivation for the component . . . .
2.1.2 Functionality demands . . . . . . . .
2.1.3 Interactions with other components
2.2 Implementation . . . . . . . . . . . . . . . .
2.2.1 Collaborating components . . . . . .
2.2.2 Electrical Interface . . . . . . . . . .
2.2.3 Future work . . . . . . . . . . . . . .
2.2.4 Testing . . . . . . . . . . . . . . . .
2.2.5 What to test . . . . . . . . . . . . .
2.2.6 Test environment . . . . . . . . . . .
2.2.7 Test methodologies . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3.1 Functionality . . . . . . . . . . . . . . . . . .
3.1.1 Motivation for the component . . . . .
3.1.2 Functionality demands . . . . . . . . .
3.1.3 Interactions with other components .
3.2 Implementation . . . . . . . . . . . . . . . . .
3.2.1 Electrical Interface . . . . . . . . . . .
3.2.2 State machines/pseudo code . . . . . .
3.2.3 Programming of internal registers . . .
3.2.4 Explicit demands . . . . . . . . . . . .
3.2.5 Motivation of design choices/trade-os
3.2.6 Future work . . . . . . . . . . . . . . .
3.3 Testing . . . . . . . . . . . . . . . . . . . . .
3.3.1 How to test . . . . . . . . . . . . . . .
3.3.2 What to test . . . . . . . . . . . . . .
3.3.3 Test environment . . . . . . . . . . . .
3.3.4 Test methodologies . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3 CPU
4 Network Interface
4.1 Functionality . . . . . . . . . . . . . . . . .
4.1.1 Motivation for the component . . . .
4.1.2 Functionality demands . . . . . . . .
4.2 Implementation . . . . . . . . . . . . . . . .
.
.
.
.
.
1
1
2
2
2
2
2
2
3
3
3
4
4
4
4
4
4
5
5
6
6
6
6
7
7
7
12
12
12
13
14
14
14
15
15
15
15
15
15
16
16
16
16
16
16
III
CONTENTS
4.2.1
4.2.2
4.2.3
4.2.4
4.2.5
4.2.6
4.2.7
4.2.8
4.2.9
4.2.10
Events and response actions . . . . . .

Address decoding . . . . . . . . . . . .
Locked accesses . . . . . . . . . . . . .
Internal registers . . . . . . . . . . . .
Memory chip selects . . . . . . . . . .
Electrical Interface . . . . . . . . . . .
Programming of internal registers . . .
Explicit demands . . . . . . . . . . . .
Motivation of design choices/trade-os
Future work . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5.1 Functionality . . . . . . . . . . . . . . . . . . .
5.1.1 Motivation for the component . . . . . .
5.1.3 Interactions with other components . .
5.1.5 Future work . . . . . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5 IO Node
6 Interconnect
6.1 Functionality . . . . . . . . . . . . . . . . . .
6.2 Implementation . . . . . . . . . . . . . . . . .
6.2.2 Bus protocols . . . . . . . . . . . . . .
6.2.3 Read cycle . . . . . . . . . . . . . . .
6.2.4 Write cycle . . . . . . . . . . . . . . .
6.2.7 Future work . . . . . . . . . . . . . . .
6.3 Testing . . . . . . . . . . . . . . . . . . . . .
6.3.2 How to test . . . . . . . . . . . . . . .
6.3.3 What to test . . . . . . . . . . . . . .
7 Arbitration
7.1 Functionality . . . . . . . . . . . . . . . . . .
7.2 Implementation . . . . . . . . . . . . . . . . .
7.2.2 Request queue . . . . . . . . . . . . .
7.2.3 Programming of internal registers . . .
7.2.6 Future work . . . . . . . . . . . . . . .
7.3 Testing . . . . . . . . . . . . . . . . . . . . .
7.3.1 How to test . . . . . . . . . . . . . . .
7.3.2 What to test . . . . . . . . . . . . . .
16
17
18
18
18
19
19
20
20
20
21
21
21
21
21
21
21
22
22
22
22
22
22
22
23
23
23
24
24
24
25
25
25
25
26
26
26
26
26
26
26
26
26
27
27
27
27
27
28
28
IV
CONTENTS
7.3.4 Test methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8 Boot
8.1 Functionality . . . . . . . . . . . . . . . . . . .
8.1.5 Future work . . . . . . . . . . . . . . . .
8.1.6 Test methodologies . . . . . . . . . . . .
9 Memory Wrapper
9.1 Functionality . . . . . . . . . . . . . . . . . . .
9.2 Implementation . . . . . . . . . . . . . . . . . .
9.2.1 Entity/Interface . . . . . . . . . . . . .
9.2.2 Explicit demands . . . . . . . . . . . . .
9.3 Testing . . . . . . . . . . . . . . . . . . . . . .
9.3.1 How to test . . . . . . . . . . . . . . . .
9.3.2 What to test . . . . . . . . . . . . . . .
9.3.3 Test methodologies . . . . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
28
29
29
29
29
29
29
29
29
30
30
30
30
30
30
30
30
31
31
31
31
31
SYSTEM ARCHITECTURE
1
1.1
System Architecture
Functionality
The system contains one or more processing units, a real-time unit, interconnect, and pheriphal components (gure 1). The system can be booted by loading code from an external source via the parallel
port. During the download phase, each node is halted. When the download is completed, the I/O node
"releases" each node by broadcasting a signal telling each node to begin its execution.
A distributed shared no-cache memory allows threads to communicate with each other. The global address space is 32-bits, which means that the memory can be addressed from 00000000h to FFFFFFFFh.
An 32-bit address consists of a 8-bit base which is concatenated with a 24-bit oset. The base is a one-hot
coded unique identication which divides the global address space into several local address spaces. Each
local address space ranges from 000000h to FFFFFFh.
MEM
NI
CPU/DSP
NI
CPU/DSP
MEM
MEM
NI
CPU/DSP
I/O
CPU/DSP
The FPGA that is used is a XILINX XCV1000 that has 1,124,022 gates and 131,072 bits of RAM
organised in 4 Kbit dual-port RAM clusters. The rst version of SoCrates will use two processing
nodes, meaning that the available RAM will be divided into two parts, 65,536 bits each (8192 bytes),
which results in a physical address space from 000000h to 002000h. Each processing node uses the local
memory address space for an exception vector table, TCB address table, a list of TCBs, thread segments
(code, data, stack), Network Interface (NI) registers, I/O registers, and Real Time Unit (RTU) registers
(gure 2).
NI
Interconnect
Arbiter
NI
MEM
CPU/DSP
MEM
CPU/DSP
NI
RTU
MEM
Interrupt lines
External Interrupts
Figure 1: A general gure of the SoCrates system architecture.

External interrupts are handled by the real-time unit which assigns the interrupt service routine to a
specic node and executes it there. A single interrupt line from the real time unit to each node is used
for task switching.
1.1.1 Functionality demands

World domination by year 2005. Complete global monopoly on all past, present and future embedded
systems. So has it been written, so shall it be done.
1.2
Implementation
002100h
(physical limit) 002000h
000000h
002100h
Network Interface
registers
002100h
Network Interface
registers
Network Interface
registers
Shared Data
Shared Data
Shared Data
User Data
User Data
User Data
User Code
User Code
User Code
Exception Code
Exception Code
Exception Code
Real Time Unit

Code
Real Time Unit

Code
Real Time Unit

Code
TCB block #N
TCB block #N
TCB block #N
TCB block #2
TCB block #2
TCB block #2
TCB block #1
TCB block #1
TCB block #1
Exception vector table
000000h
Processing node 1
Processing node 2
000000h
Processing node N
Figure 2: The local memory organisation of the processing nodes.
1.2
Implementation
1.2.1 Electrical Interface

Pins on the FPGA will be used for parallel transactions and external interrupts.
1.2.2 Programming of internal registers

Each processing node needs to access six RTU registers in order to handle taskswitching. Additional
registers are needed for the Network Interface and the I/O node.
1.2.3 Explicit demands

The goal of this version of SoCrates is to complete the implementation of a prototype and no particular
demands has been stated.

The one-hot identication coding of each node makes multicasting and broadcasting easy to implement.
If one bit is set in the 8-bit identication eld, the corresponding node will answer. Several bits set at a
time will result in a multicast and all bits set will cause a broadcast over the system.
1.2.5 Future work

The prefetch functionality can be extended to move larger blocks of data (current version only prefetches
32-bit data). A set of tool for code development and debugging should also be included in a later version.
1.3
1.3
Testing
Testing
1.3.1 How to test

The system is tested in a bottom-up fashion, meaning that the components which the system consists
of are tested individually before the whole system is tested. The complete system will be tested by
executing an test-application which performs some kind of calculation on the processing nodes and gives
some feedback.
1.3.2 What to test

The main goal of the project is to produce the rst prototype of the SoCrates system. The testing of the
system will be performed with a simple user application that performs some kind of calculations on the
processing nodes and feedbacks the result.
CPU NODE
CPU Node
2.1
Functionality
The node serves as a container component for those components that are necessary in order to create the
smallest environment in which the CPU can perform its task. The whole system consists of at least one
such node and other nodes with dierent functionality and responsibilities, see section 1 for the whole
system specication.
2.1.1 Motivation for the component

The node constitutes a computing entity and are genericly instantiated according to the application
demands when the whole system is being synthesized and put in silicon. This enables a system to consist
of a congurable amount of CPU-nodes.

The node contains several intraconnected components that serves as a replaceable entity. It contains
all components that are required for the local processor to carry out its task with as little interconnect
communication as possible.

The node interacts with the interconnect via the Network Interface's external interface, see section 4.2.6.
A single Interrupt Request (IRQ) line from the Real-Time Unit (RTU) is connected to the processor.
2.2
Implementation
The node implementation consists of making instances of the participating components and connecting
them to a functional unit.
2.2.1 Collaborating components

The glue component of the node is the Network Interface (NI), which connects the components and
handles communication with other components in the system.
Network Interface
NI can be viewed as component with three dierent interfaces handling dierent requests and
response from dierent parts of the system.
{ CPU-interface
The CPU-interface handles request/response from/to the processor (see section 4.2.1 or gure 3).
{ MEM-interface
The MEMory-interface handles all strobes that controls the local memory on the node. Addresses originating from a remote access and data to be delivered to a remote node are also
handled, see section 4.2.1 or gure 3 for further information.
{ EXT-interface
The EXTernal-interface is the nodes link to the rest of the system. Outgoing requests and
responses are handled according to section 4.2.1 and gure 3.
CPU
The CPU performs requests to NI and memory when issuing load, stores, swap, and prefetch
instructions. Strobes and control signals are transmitted to NI while address and data are delivered
to data CPU and address CPU, which is available to both memory wrapper and NI.
2.2
Implementation
Memory wrapper
The memory wrapper serves as an interface to the on-chip RAM blocks. This is done to obtain
a general interface between the memory, NI, and CPU independent of how the on-chip memory is
accessed.
Dual-ported memory
reset
clk
reset
clk
mas
berr
mode
rw
ack
ad_strobe
bus_grant
bus_request
data
reset
clk
address
A dual-ported memory is attached to the NI via the memory wrapper and CPU as shown in g 3
to avoid a decrease of available bandwidth when both a remote NI and the local CPU are accessing
the memory. The protocol between CPU and memory is true SRAM zero wait-state. NI inserts
wait-states via nWAit if it can not satisfy zero wait-state transaction.
MEMORY WRAPPER
MEMORY
lock
rw_CPU
rw_NI
cs_CPU
nwait
berr
NETWORK
INTERFACE
trans
cs_NI
mode
address_NI
mas
data_NI
rw
CPU
data_CPU
address_CPU
irq
Figure 3: Node structure. Signals and electrical interfaces are described in section 4.2.6.

Signal-Name
Width
Type
Description
data
GLOBAL DATA WIDTH-1 downto 0

1
1
1
1
1
1
1
1
InOut
InOut
Out
In
InOut
InOut
InOut
InOut
InOut
In
Bidirectional data bus

Bidirectional address bus
Bus request line
Bus grant line
Valid address/data
transaction ackowledged
read or write transaction
Processor mode
Bus error indication
Interrupt request (RTU)
address
bus request
bus grant
ad strobe
ack
rw
mode
berr
irq
2.2.3 Future work

Easily interchangeable interconnect.
2.2
Implementation
2.2.4 Testing
2.2.5 What to test
Internal and external processor initiated transactions must be veried for correct behavior. Concurrent
requests and responses must be simulated.
2.2.6 Test environment

In order to perform a tough test of the node, the minimum test environment must at least consist of a
simulated bus with arbitration and interactions with other nodes.
2.2.7 Test methodologies

A SEAMLESS co-simulation environment veries the basic transfers on the buses, processor respective
NI initiated transfers.
CPU
CPU
3.1
Functionality

One or several CPUs are needed in the SoCrates system to execute user application(s) in form of threads.
The CPU's purpose and responsibility is to execute instructions from the thread that is currently assigned
to it.

The processor is a 32-bit semi-ARM-compatible RISC processor without an instruction pipeline. It should
be ARM compatible in the following way:
By executing ARM code (not Thumb)
By only operating in User and Supervisory mode
The memory format is Big Endian
The instruction set is augmented with a prefetch instruction. The instructions performing multiplication
does not have to be implemented in this version of the processor. The processor consists of several internal
components (gure 4):
A register le consisting of 20 32-bit registers
A 32-bit ALU
A barrel shifter
An address increment unit
A control unit
An exception handler
The register le contains 20 registers, where R0-R14 are general purpose registers (R15 is used as a
program counter, see gure 5). There is a seventeenth register, the Current Processor Status Register
(CPSR), that holds status information. The rest of the registers are banked, according to ARM organization. The gure also shows which registers are visible when operating in either User or Supervisory
mode. The CPSR register is organized in the following way:
Bit(s)
31
30
29
28
27-8
7
u6
5
4-0
Symbol
N
Z
C
V
I
F
T
M4-M0
Description
Negative/Less than
Zero
Carry/Borrow/Extend
Over ow
Reserved
IRQ disable
Fast Interrupt Request (FIQ) disable
State bit
Mode bits
The CSPR is identical to the ARM CSPR register and further explanation of the semantics of each bit
can be read in the ARM7TDMI manual, section 3.8.
3.1
Functionality
Address bus
Address Register
Exception handler
Address
Incrementer
Instruction
Register Bank
18 32-bit registers
2 status registers
Decoder
&
Control Logic
Barrel shifter
32-bit ALU
Data bus
Figure 4: The internal architecture of the CPU

The Arithmetic Logical Unit (ALU) will be able to perform 16 basic operations specied in the ARM
manual as Data Processing functions (see ARM7TDMI manual, section 4.5). The ALU will also be
controlled by a signal that indicates whether the current ALU operation should aect the status ags
(N,Z,C,V) or not. The address increment unit helps the processor to point out the next instruction
without using the ALU. The control unit and exception handler will be explained in the implementation
section.
The supported instruction set is based on the ARM instruction set, containing 32 instructions. Because
multiplication is not supported in this version of the processor, the MLA, MLAL, MUL, and MULL instructions will be omitted. Co-processor specic instructions are also excluded (CDP, LDC, MCR, MRC,
STC). Because the processor only operates in ARM state, the BX instruction that changes the processor
state is not needed and is therefore omitted. The remaining instructions that will be implemented can
be seen in gure 6.
In order to decode an instruction, the organization of each instruction op-code need to be specied.
Every ARM instruction begins with a four bit condition eld (gure 7). This eld gives the compiler the
opportunity to make instructions conditional. There are 15 available conditions and they are described
in the ARM7TDMI manual, chapter 4. Because all supported instructions are ARM-compatible (except
PRF), the detailed information of each instructions op-code will be provided by the ARM7TDMI manual,
chapter 4. The prefetch instruction will be coded as an ARM instruction with a condition code set to
1111b 1 (normally, ARM has reserved the case when the condition eld equals 1111b for future use, but
here it means that a prefetch is issued)(gure 8).
The behaviour of the LDR and STR instructions are special in an ARM core and will therefore be
explained in this section. When the processor issues a read or a write, the bottom two bits in the address
is ignored by the memory controller/system. When a halfword transaction is issued, the least signicant
bit of the address is ignored. On byte accesses, the whole address is used. It is not recommended to put
out a unaligned address on the address bus. When adding oset to the base address results in a unaligned
address (in both word an halfword accesses), the bottom bit(s) (two for word, one for halfword) is used
1b
stands for binary notation
3.1
Functionality
ARM State General Registers and Program Counter

System & User
Supervisor
R0
R0
R1
R1
R2
R2
R3
R3
R4
R4
R5
R5
R6
R6
R7
R7
R8
R8
R9
R9
R10
R10
R11
R11
R12
R12
R13
R13_svc
R14
R14_svc
R15 (PC)
R15 (PC)
ARM State Program Status Registers
CPSR
CPSR
SPSR_svc
= banked register
Figure 5: The register organization of the CPU

to rotate data so that the addressed byte is can be easily accessed (the position depends on the current
endian conguration).
This is an example of how the words, halfwords and bytes are accessed. Consider the following memory
conguration:
Data Word
AA BB CC DD Endian Conguration
Bit:
31
0
Byte Address: 3
2
1
0
Little Endian
Byte Address: 0
1
2
3
Big Endian
This is the expected actions for a little endian system when the two least signicant bits of the address are [10], i.e. not word aligned.
For unaligned data stores:
a) Word store (STR): presented to the data bus: AA BB CC DD
A Word store should generate a word aligned address. The word presented to the data bus is not
modied even if the address is not word aligned. The memory controller should ignore the two least
signicant bits of the data bus.
b) Half word store (STRH): presented to the data bus: CC DD CC DD
Register data bits [15:0] are duplicated accross the data bus. The memory controller should ignore
the least signicant bit of the address.
3.1
10
Functionality
Mnemonic
ADC
ADD
AND
B
BIC
BL
CMN
CMP
EOR
LDM
LDR
MOV
MRS
MSR
MVN
ORR
RSB
RSC
SBC
STM
STR
SUB
SWP
TEQ
TST
PRF
Instruction
Add with carry
Add
AND
Branch
Bit Clear
Branch with Link
Compare Negative
Compare
Exlusive OR
Action
Rd:= Rn + Op2 + Carry
Rd:= Rn + Op2
Rd:= Rn AND Op2
R15:= address
Rd:= Rn AND NOT Op2
R14:= R15, R15:= address
CPSR flags:= Rn + Op2
CPSR flags:= Rn - Op2
Rd:= (Rn AND NOT Op2)
OR (op2 AND NOT Rn)
Stack manipulation (Pop)
Rd:= Rn + Op2 + Carry
Rd:= Op2
Rn:= PSR
Load multiple registers

Load register from memory
Move register or constant
Move PSR status/flags to
register
Move register to PSR
PSR:= Rm
status/flags
Move negative register
Rd:= 0xFFFFFFFF EOR Op2
OR
Rd:= Rn OR Op2
Reverse Subtract
Rd:= Op2 - Rn
Reverse Subtract with Carry Rd:= Op2 - Rn - 1 + Carry
Subtract with Carry
Rd:= Rn - Op2 - 1 + Carry
Store Multiple
Store Multiple
Store register to memory
<address>:= Rd
Subtract
Rd:= Rn - Op2
Swap register with memory Rd:= Rn[n], Rn[n]:=Rm
Test bitwise equality
CPSR flags:= Rn EOR Op2
Test bits
CPSR flags:= Rn AND Op2
Prefetch
Network Interface call
Figure 6: The Instruction set

c) Byte store (STRB): presented to the data bus: DD DD DD DD
Register data bits [7:0] are duplicated on all four byte lanes of the data bus. The memory controller
needs all bits (incl. the two least signicant bits) of the address.
For unaligned data read:
a) Word read (LDR): read into register: CC DD AA BB
The whole word is read, but in the ARM core the bytes are rotated such that the byte which is
addressed is stored on [7:0]. The memory controller should ignore the least signicant bit of the
address.
b) Half word read: (LDRH): read into register 00 00 AA BB
The selected hal word is placed on the bottom [15:0] bits in the register and the remaining bits are
lled with zeros by the core. The memory controller should ignore the least signicant bit of the
address.
c) byte read (LDRB): read into register: 00 00 00 BB
The selected byte is placed on bits [7:0] in the destination register and the remaining bits of the
3.1
11
Functionality
31
28 27
Condition Field
0
Op-code & Operands
Figure 7: A general ARM instruction

31
1
28 27
1
0
Prefetch Address
Figure 8: The prefetch instruction

register lled with zeros by the core. The memory controller needs all bits (incl. the two least
signicant bits) of the address.
In general:
The result of all half word loads or stores (issued as ARM or Thumb instructions) with a non-word aligned
address will be unpredictable.
The endian conguration of the system does not aect the reads or stores if only words are used. It
does only matter when half word and bytes are used. The following (two) tables shows the appropriate
actions when issuing a LDRH and LDRB:
A[1:0] Little Endian (BIGEND=0] Big Endian (BIGEND=1)
00
D[15:0]
D[31:16]
10
D[31:16]
D[15:0]
Endian eects for 16-bit fetches (LDRH)
A[1:0] Little Endian (BIGEND=0] Big Endian (BIGEND=1)
00
D[7:0]
D[31:24]
01
D[15:8]
D[23:16]
10
D[23:16]
D[15:8]
11
D[31:24]
D[7:0]
Endian eects for 8-bit data fetches (LDRB)
Example:
Data Word
Bit:
Byte Address:
Byte Address:
AA
31
3
0
BB
CC
2
1
1
2
DD
0
0
3
Endian Conguration

Little Endian
Big Endian
If last 2 address bits are [00] (aligned transfer):

word read:
Little Endian AA BB CC
Big Endian
AA BB CC
half word read: Little Endian: 00 00 CC
Big Endian:
00 00 AA
byte read:
Little Endian: 00 00 00
Big Endian
00 00 00
DD
DD
DD
BB
DD
AA
3.2
12
Implementation
If last 2 address bits are [10] (unaligned transfer):

word read:
Little Endian CC DD AA
Big Endian
CC DD AA
half word read: Little Endian: 00 00 AA
Big Endian:
00 00 CC
byte read:
Little Endian: 00 00 00
Big Endian
00 00 00
BB
BB
BB
DD
BB
DD

The processor interacts with NI and the local memory by putting an address on the local address bus
according to the local bus protocol. If the address is local, the NI enables Chip Select (CS) at the local
memory and the memory delivers the data. If the address is global, the NI handles the request by putting
the request on the interconnect and nally delivers it to the local processor.
3.2
Implementation

The electrical interface of the CPU consists of the following buses and signals (gure 9):
A[31:0]
D[31:0]
MAS[1:0]
CPU
nM
nRW
nWait
nRESET
ABORT
nIRQ
nMREQ
LOCK
Figure 9: The external interface of the CPU
A[31:0 ] This is the processor address bus.

D[31:0 ] This is the processor data bus.
MAS[1:0 ] The Memory Access Size bus lets the processor indicate which size is intended when accessing
memory. Following options are available:
MAS[1]
0
0
1
1
MAS[0]
0
1
0
1
Memory Access Size

Byte
Half-word
Word
Reserved
nMode This bus indicates in which mode the processor is operating in. A LOW signal indicates supervisor mode and HIGH indicates user mode.
3.2
13
Implementation
nRW This signal indicates whether the processor wants to write or read data. When HIGH, this signal
indicates a write cycle; when LOW, a read cycle.
nWait This signal is used when accessing slow pheripherals, to let the processor wait for a number of
clock cycles. This is acheived by driving nWait LOW. If nWait is not used it must be tied HIGH.
In SoCrates, the nWait signal is used to delay the processor execution to wait for an external
transaction to complete.
nReset This signal triggers a hardware reset of the CPU. A LOW level will cause the instruction being
executed to terminate abnormally. When nReset becomes HIGH for at least one clock cycle, the
processor will re-start from address 0. nReset must remain LOW (and nWait must remain HIGH)
during reset.
ABORT This is an input which allows the memory system to tell the processor that a requested access
is not allowed.
nIRQ Must be taken LOW to interrupt the processor when the appropriate enable is active.
nMREQ When LOW, this signal indicates that the processor requires a memory access.
LOCK When LOCK is HIGH, the processor is performing a \locked\ memory access, and the memory
controller must wait until LOCK goes LOW before allowing another device to access the memory.
There are eight exceptions that are supported by the ARM7TDMI processor. Only a few of these will
be implemented (data abort, reset, undened instruction, SWI, IRQ). The routines are pointed out by
interrupt vectors in the following way (ARM7TDMI):
Address
0x00000000
0x00000004
0x00000008
0x0000000C
0x00000010
0x00000014
0x00000018
0x0000001C
Exception
Reset
Undefined Instruction
Software Interrupt (SWI)
Abort (prefetch)
Abort (data)
Reserved
IRQ
Fast Interrupt Request (FIQ)
When there are multiple exceptions, a xed priority system determines the order in which they are
handled (highest priority rst):
1. Reset
2. Data abort
3. FIQ
4. IRQ
5. Prefetch abort
6. Undened instruction
3.2.2 State machines/pseudo code

The control unit consists of three phases: Instruction Fetch (IF), Instruction Decode (ID), and Execute
(EX). In the IF phase, the instruction is fetched from the memory. It is then decoded in the ID phase.
In the ID phase, every instruction is checked for its condition code. Finally, the instruction is executed
in the EX phase. Additional states will be used for exception handling.
The Instruction Decoder will be implemented as a tree consisting of bitwise comparisons to determine
which instruction is currently fetched (gure 10).
3.2
14
Implementation
31
32-bit ARM instruction
27
Condition Field
Bit 27
Bit 26
Bit 26
Bit 25-24
11
Block Data
Transfer
Co-Proc.
1
BL
Bit 25
other
OK
other
Bit 24
SWI
Address
Alignment
Bit 25
Bit 24-20
other
10X10
Single Data undefined

Transfer instruction
Bit 7, 4
11
other
Bit 6,5
MSR
Data
Processing
Bit 24-20
10010
00
other
other
Bit 24
Bit 23
MULL (A)
SWP
0
Bit 22
Half-word
Immediate
Transfer
10X00
Data
Processing
Half-word
Register
Transfer
Bit 18
MRS
1
BX
MSR
MUL (A)
Figure 10: The Instruction Decoder

No programmable internal registers needed here so far.

No explicit demands has been stated so far.

The processor can be ordered from a company, resulting in high fees and possibly long delivery time. To
avoid this problem, a processor will be implemented in VHDL instead. This solution will suce for the
rst phase of the project and can later be improved or replaced by a commercial processor. The absence
of an instruction pipeline keeps the design simple and makes the processor more predictable. Also, due to
the fact that a multiplier uses large amount of gates, it will not be implemented. The loss of performance
when omitting the pipeline can (partly) be compensated by the low access time to the local memory.
The choice of compatibility with an ARM processor is due to the fact that many existing solutions in the
industry uses ARM as a processor core and they might be reluctant to re-implement the software for a
new processor. The processor has a lot of modes and states that is not needed in the rst version of the
processor and will therfore be a reduced version of a complete ARM processor.
3.3
Testing
15
3.2.6 Future work

The prefetch instruction can be extended to be able to prefetch larger amount of data and the processor
can be made fully compatible with an ARM processor. This means that the processor supports ARM
and Thumb state as well as the seven available modes. Another alternative is to integrate a real ARM
processor in the next stage, which means that changes might have to be made to the local bus interface.
3.3
Testing
3.3.1 How to test

The processors internal components should be individually tested with the help of testbenches.
3.3.2 What to test

The following processor components should be tested :
The processors internal components (ALU, control, address incrementer, register, exception)
The supported instructions
Its external interface

If the local memory can't be provided, a congurable memory model can be used instead. The NI can
also be simulated with the help of a testbench/model.

No methodology yet.
NETWORK INTERFACE
16
Network Interface
4.1
Functionality
The Network Interface (NI) handles communication both locally internal to the node as well as transfers
between dierent nodes.

The component hides the architecture of the system to the processor, since the CPU does not need to
know how many nodes are attached and how they are connected. Further, all CPUs see a single address
space. Whether address references are to be handled locally or at a remote node are hidden and carried
out by NI.

NI must handle the following events:
Local Load/Store
A local Load/Store is a access to a memory that are physically located at the same node as the
requesting CPU.
Load/Store to memory located at an another node
The reference could not be handled locally, thus resulting in a transaction of data on the global
interconnect.
Prefetch of data
The NI must be able to handle prefetch requests from both the local CPU and to serve prefetch
requests from other CPUs. The NI will assume that all prefetches are to non-local memory locations.
Further, at boot of the system, NI must halt the CPU to let the I/O node write to the memories before
starting execution.

NI interacts with the global interconnect, the CPU, local memory via the memory wrapper, and the
arbiter.
4.2
Implementation
As many as possible of the actions should be done in parallel. The local CPU must be able to access
the memory simultaneously with accesses that are coming through the interconnect. All of the doubledirected signals must be Three-stated whenever it is possible to avoid signals to be driven from two or
more sources, possibly resulting in unpredictable behavior.
4.2.1 Events and response actions

When NI detects an access from the local CPU it must decode the address to decide whether it is an
access to local memory, to global memory or a initiation of data-prefetch. If a memory access can not be
done in one clock cycle, as in case of access to memory located at another node, NI must raise the nWait
strobe to CPU. When the memory system replies, NI informs about the successful access by lowering the
nWait strobe.
Events that trig NI are incoming from two sources: the interconnect or the CPU. All possible events
(see gure 4.2.1), and the response actions taken by NI to these events are described below:
4.2
Implementation
17
Local CPU read from local memory
When NI detects a read from local memory, initiated by the local CPU, it controls the chip select
and write-enable strobes to the memory wrapper. The memory responses to the action the next
cycle and delivers the requested data to CPU and at the same time NI resets the strobes to the
memory wrapper.
Local CPU write to local memory
If the access is a write to local memory NI controls the chip select and write-enable to the memory
wrapper. No response signals are sent to either the NI or CPU. NI resets strobes to the memory
wrapper in the same manner as described above.
Local CPU read from remote memory
In a case of a read from remote memory, NI checks if the address matches the contents of the
prefetch address register. In case of a match, and if the valid bit is set, NI can deliver the data to
the local CPU directly from its prefetch register. If the address matches and the valid bit is not
set NI uses the interconnect as described in section 6.2.3. After the interconnect has delivered the
data to the requesting NI, it is forwarded to the local CPU. The cycle after data is forwarded, the
signals to the CPU are resetted.
Local CPU write to remote memory
In this case NI performs a write to remote memory according to the protocol described in section
6.2.4. After completion NI resets the signals to the CPU.
Local CPU data-prefetch
A prefetch of data is initiated by writing the address of data into the prefetch register located in
NI. The following actions to fetch the data is identical to the case of a read from a remote node
(see section 6.2.3) except that the delivered data that comes via the interconnect is written to the
internal prefetch data register instead of forwarding it directly to the local CPU. If NI is busy with
an activity that requires communication via the interconnect and a new remote read, remote write
or prefetch is issued during that time, the new action will be delayed until the current transaction
is nished.
Remote Node read
A remote read comes via the interconnect and is therefore initiated by an another node. NI sets the
address and strobes to the memory wrapper and reads the contents of the memory following cycle.
The response on the interconnect to this request is described in section 6.2.3. Reset of strobes to
the memory wrapper are made one cycle after the memory access. If NI is busy with an activity
that requires communication via the interconnect and a new remote read, remote write or prefetch
is issued during that time, the new action will be delayed until the current transaction is nished.
Remote Node write
If a remote write is requested via the interconnect, NI sets the address, data and strobes to the
memory wrapper. The following cycle data is written to the memory and the response on the
interconnect is described in section 6.2.4. Reset of strobes to the memory wrapper are made one
cycle after the memory access. If NI is busy with an activity that requires communication via the
interconnect and a new remote read, remote write or prefetch is issued during that time, the new
action will be delayed until the current transaction is nished.
4.2.2 Address decoding

When a memory requests is coming from the local CPU, the NI decodes the address and checks whether
it is a local memory access or a global access. This is done by comparing if the NI:s dedicated bit is '1' in
the the 8 upper one-hot coded bits in the address. Comparison with the address located in the prefetch
address register is done only when the address is not local, in other words when there is no match with
4.2
18
Implementation
Local CPU Access
Read
Write
Local Remote
Local Remote
Read
Write
Read
Write
EVENT SOURCE
Prefetch
Remote
Read
EVENT
ACTION TO PERFORM
Remote Node Access

Read
Write
Local
Read
Local
Write
Figure 11: Events that the NI must be able to handle

this nodes identity. Alignment checks are performed each time CPU accesses memory. Bus-error is given
if it is a halfword access and the address is not even or if it is a word access and the address is not a
multiple of four.
To respond to a request coming from the interconnect, the NI looks at a dedicated bit of the on-hot
coded vector in the addresses 8 upper bits. If the dedicated bit is raised to high level NI must response
to the request.
The double-ported memory allows simultaneous reads and writes and dierent combinations of them.
Simultaneous writes (from both remote and local CPU) to the same address in memory are allowed buts
the value will be unpredictable.
4.2.3 Locked accesses

ARM instruction set has a instruction to make sequential accesses to memory without being interleaved
by others. Therefore, if the lock signal goes high, NI must delay other accesses in two ways:
If the access is to local memory the remote node accesses by other nodes must be prevented by
delaying them.
If the access is a remote read or write the global interconnect must be locked by not leaving
the bus mastership until the lock signal is lowered by the CPU. There is no lock signal on the
global interconnect but the protocol described in section 6.2.2 allows bus lock by not releasing the
bus request signal.
4.2.4 Internal registers

After a complete prefetch, the prefetched data is stored in a internal data register in NI. There is also an
one bit validity ag that is set when the data to an prefetched address becomes valid. The size of a data
prefetch is always 32-bits.
4.2.5 Memory chip selects

To make it possible to write a single byte to the memory without changing the contents of other bytes of
the word that the written byte belongs to, the memory must have four separate chip selects. Each one of
them corresponding to a byte of a long-word as shown below. The nodes uses little endian conguration
which means that the memory should be connected as following (where byte 0 is the least signicant byte
and byte 3 the most signicant byte):
Byte 3 of the memory connected to Data[31:24] and chip select[3]
4.2
19
Implementation

External Interface: NI-Interconnect
Signal Name
data
address
bus request
bus grant
ad strobe
ack
rw
mode
berr
Width
31 downto 0
31 downto 0
NOF NODES-1 downto 0
1
1
1
1
1
CPU Interface: NI-CPU

Signal Name
data cpu
address cpu
rw
berr
nwait
lock
trans
mode
Width
31 downto 0
31 downto 0
1
1
1
1
1
1
Type
InOut
In
In
Out
Out
Out
In
In
Type
InOut
Out
Out
In
InOut
InOut
InOut
InOut
InOut
data NI
address NI
rw CPU
rw NI
cs cpu
cs NI
clk
reset
Bus-error -transaction not performed
Indicates read or write mode. 0=read, 1=write

Bus-error -transaction not performed
Transaction time 1 cycle
Bus lock, used by swap instruction
Transaction start
Mode bit. 1=user, 0=super
Width
GLOBAL ADDRESS WIDTH-1 downto 0
3 dpwnto 0
3 downto 0
1
1
Miscellaneous signals: NI
Signal Name
Request for bus mastership

Indicates which node has the bus mastership.
Validity of address and data
Acknowledge for transactions
Description
Memory Interface: NI-Memory Wrapper

Signal Name
Description
Width
1
1
Type
In
In
Type
InOut
Out
In
In
In
In
Description
chip select for local accesses
chip select for accesses from other nodes
Description

Accessible registers in NI and their memory locations are listed in the table below:
Register Name
Prefetch address register
nWaitAlwaysActive
Width
32 bits
1 bit
Memory location
2000h
2004h
Description
Address for the data
CPU halt register
Prefetch of data to the node is initiated by the CPU by writing an address to the prefetch register.
After fetching the data via the interconnect, it is stored in the internal prefetch data register. If a new
address is written to the prefetch address register before the former prefetch has been completed, NI will
still fulll the ongoing prefetch. The new prefetch will start after completion of the ongoing and the
internal prefetch data register will be overwritten with the new fetched data.
At reset of the system nWaitAlwaysActive is set to 1 which causes NI to raise the nWait signal to
the CPU and thus halting it. The nWaitAlwaysActive register is reseted to zero when the boot-sequence
4.2
Implementation
20
has nished by making a broadcast from the I/O-node to all NI:s in CPU nodes. All registers are both
readable and writable from both interconnect and local CPUs.

CPU must be able to access the local memory in just one clock cycle to avoid signicant loss of performance.

The lack of time left to implement results in that no DMA functionality is specied for NI. Alignment
checks are performed in NI (at each node) since comparison of addresses anyway takes place to decide
whether access is local or global. This makes the system to response faster if an alignment check fails,
but it makes the design somewhat larger since the logic is placed at each node compared to the case with
a central alignment checker placed on the shared bus. Also, if changing the interconnect from a shared
bus to point-to-point links the central alignment checker approach is no longer applicable. Due to the
complexity in the ARM-compatible transfer protocols between CPU and memory, simpler protocols are
used with less number of signals.
4.2.10 Future work

To o-load the CPU when making transactions of several data between dierent nodes the NI should
include some kind of DMA functionality. The mode-bit coming from CPU can be used to limit accesses to
registers and memory locations when executing instructions in user mode. When running in supervisormode all registers are accessible. This functionality is not implemented due to lack of time in this version.
Broadcast possibilities for CPUs are not implemented in this version.
IO NODE
5
5.1
21
IO Node
Functionality

The IO node takes care of all external communication including download of the application.

This nodes takes care of code-, and datadownload and contains the UART functionality. All nodes sleep
after reset leaving the IO-node as the master of the global interconnect.
One dedicated UART per processing node is instantiated, copied from the SYSA1 course. If interrupt
driven communication is choosen as the mode of operation, the interrupt lines are connected to the RTU.
The ISR is implemented as an interrupt-thread started by the RTU.

The UARTS are visible to the nodes as memorymapped global registers following the SYSA1 incarnation.
UART1 starts at the node's baseaddress, UART2 has address base + 100h etc.

No buering of outgoing communication is implemented, therefor the node is responsible not to choke
the IO node. In this version of Socrates, a complete contextswitch is performed on each incoming byte
which is rather time-consuming. A special fast interruptmode executing in the currently running thread's
context should be implemented for performance reasons.
5.1.5 Future work

An autodetection of the PROM would be nice. Buering would be nice. A signal DOWNLOAD tells
IO wether to download code or not after a system reset. The DOWNLOAD signal can be implemented
as an external physical pin on the FPGA or a externally accessible control register. An external signal
PROM OTHER tells the IO node whether to fetch the data from the EPROM or the serial/parallell
link. The EPROM contains a entry on which the adress to the nWAIT always active registers and the
corresponding data that should be broadcasted at system startup.
22
INTERCONNECT
Interconnect
6.1
Functionality
The interconnect moves data between two arbitrary nodes via their NIs.

To enable nodes to exchange information they must be connected together via an interconnection network.
For example, communication is needed when two tasks runs on dierent nodes where one nodes produces
data and the other node consumes the data.

It must be possible for any node to read and write data from/to an arbitrary node. All transactions
are reliable. The protocol is prepared to let NIs do a read and a write transaction in sequence without
intervention from other nodes. While holding the bus-request line high no new arbitration will occur,
giving the opportunity for a NI to lock the bus and do multiple transactions without intervention from
other NIs.

The interconnection network interacts with a number of processing nodes, I/O nodes and nodes consisting
of memory via their NI:s. It also interacts with a central arbitration unit.
6.2
Implementation
The interconnection network consists of a shared bus with transaction protocols for reads and writes.
When using a shared bus solution with multiple masters there is need for arbitration to resolve possible contention between several potential masters. Arbitration is done centrally by an arbitration unit
described in section 7.

Signals like the data and address lines are made generic to make changes easy. Every master node
attached to the bus has a bus-request and a bus-grant line between the node and the arbiter. The table
below shows the electrical interface between the interconnection network and the NI.
Signal Name
data
address
bus request
bus grant
rw
ad strobe in
ad strobe out
ack in
ack out
mas
berr
mode
lock
clk
reset
Width
GLOBAL ADDRESS WIDTH-1 downto 0
1
1
1
1
1
1 downto 0
1
1
1
1
1
Type
InOut
InOut
In
Out
InOut
In
Out
In
Out
InOut
InOut
InOut
InOut
In
InOut
Description
Request for bus mastership
Indicates which node has the bus mastership.
Size of transfer
Buserror -transaction not performed
0=user mode, 1=supervisor mode
1=locked transaction, 0=unlocked transaction
global clock
6.2
23
Implementation
6.2.2 Bus protocols

Both read and write protocols are synchronous. All the signals of InOut type must be set to three-state
according to read- and write-cycles described below. To enable a later extension of the system with a
time-out module there is a buserror line that will be set high if there is no response for a transaction
within a specied time. During the read and write cycles mas lines denes the size of the transaction
(byte, word or halfword) according to section 3.2.
6.2.3 Read cycle

To read data from an another node the NI at node x competes for bus mastership by setting the
bus request signal. After retaining the bus mastership by detecting the bus grant signal from the central
arbitration unit, the reading node sets the address, data and rw signals. When the serving node has
delivered the requested data it pulls up the acknowledge signal for one clock cycle. When the source
node reads the data, it lowers the bus request and the address signals and three-states the rw at the same
point of time(5).
clk
bus_request(x)
bus_grant(x)
ad_strobe
address
valid
data
mas
valid
valid
rw
ack
Figure 12: Read cycle. Signals with levels between high and low are considered as three-stated. The
number of cycles between request and grant, n, depends on the arbiter and its arbitration method.
Implementation of NI and the local node sets the number of cycles, m between a request of data and
response of data.
6.2.4 Write cycle

To write data from one node to an another, the source node must compete for the bus by setting the
bus request signal. When the arbiter has set the bus grant, the initiator sets at the next cycle address,
data , ad strobe and rw signals. Depending on the implementation of NI, after a number of cycles, p
6.2
24
Implementation
when the destination nodes NI has fetched the data from the bus it sets the ack signal(4). Detecting the
ack the initiator lowers ad strobe and three-states address, data and rw.
clk
bus_request(x)
bus_grant(x)
ad_strobe
address
valid
data
valid
mas
valid
rw
ack
Figure 13: Write cycle. Signals with levels between high and low are considered as three-stated.

No known throughput or latency demands.

The shared bus approach makes implementation easy and is suitable when the number of nodes is small
as in this version of SoCrates. Broadcasting is also easy to implement since all the processors can monitor
the trac on the bus simultaneously. Three state technique is well known among hardware engineers and
it makes design is easy to implement and also to expand.
6.2.7 Future work

To increase bus utilization future version could be extended to support pipelining, split-transactions and
burst transfers. When extending the number of nodes the shared bus solution may become a serious
bottleneck in the system. Therefore the topology must be changed to some kind of point-to-point interconnect where several transfers can occur simultaneously. Many of the other on-chip buses like AMBA
and WISHBONE uses a shared bus with a multiplexed interconnection making it interesting to study
benets with that approach.
Another possibility is to have two general buses instead of one data and one address bus. Since many
of the transfers on the bus are read transactions, using two general buses two reads could be performed
6.3
Testing
25
in parallel by multiplexing address and data onto the two general buses. This would need a new kind of
arbitration unit and protocol.
6.3
Testing

The initial testing of the component is done together with the central arbiter and a testbench that
emulates two NIs. A more complex test is done with implemented NI:s that are stimulated with a
testbench to perform transactions.
6.3.2 How to test

The tests includes the following actions:
Read transactions from one NI to another in both directions
Write transactions from one NI to another in both directions
Try to initiate several transactions at the same point of time
6.3.3 What to test

The aim of the test is to verify the transaction protocols.
26
ARBITRATION
Arbitration
7.1
Functionality
The arbiter supervises the potential bus masters, leading and dividing the work on the bus.

In a system with shared bus, several components can act as bus master. All entities that can act as a
bus master at any instant time must be guaranteed exclusive right to the bus without intervention from
other components. If two or more components are to be bus master simultaneously the bus are said to be
double driven, yielding con icting signals where the two entities drives dierent logic values. Con icting
drivers results in corrupted signals and eventually destroyed electrical lines.

Components must be mutually excluded. The arbiter must guarantee that there at any instant time only
exist one bus master2 . In an environment with several possible bus masters the arbiter must exhibit
fairness and avoid starvation and deadlock situations. All entities that will use the bus must raise their
bus request signal and wait upon the bus grant signal from the arbiter indicating exclusive ownership of
the shared bus.

NI
The NI of each node interacts with the centralized arbiter. Communication goes both ways according
to the signals bus request & bus grant, in the bus protocol see section 6.2.2.
7.2
Implementation

Signal Name
bus request
bus grant
clk
reset
Width
1
1
Type
In
Out
In
In
Description
7.2.2 Request queue

The arbiter consists of an request queue of size NOF NODES, where the current bus master and pending
requests are held. When the arbiter receives a bus request on one of the request lines, the ID of the
requesting node is placed last in the queue. If two or more request arrives simultaneously, the requests
are placed according to the originating node's xed priority. The lower the node ID the higher priority
in arbitration. A bus grant signal is raised for the node that currently are placed rst in the queue, and
shall remain high until that node lowers the bus request signal. To lower the bus request signal conrms
a completed bus transaction, see section 6.2.2 for exact timing. When a transition conrmation occurs,
the queue must be shifted so entry number two will become the rst entry in the queue.

Obsolete.
2 Has
permission to electrically drive the bus.
7.3
27
Testing
N1
N2
N3
N4
...
Nn
Request
...
priority
...
Response
...
N1
N2
N3
N4
...
Nn
Figure 14:

None.

Complexity vs fairness. The choice of a static priority arbitration queue, in favor of an true round-robin
solution with dynamic priority yielding absolute fairness based on LRU, is the magnitude of less complexity. Both algorithmical and in logic (i.e. hardware space complexity). The one minor disadvantage
of the static priority scheduler exhibits, is the unfairness to a low priority node that always is scheduled
last when two or more contenders issues their request simultaneously.
7.2.6 Future work

No considerations in how well this solution scales with increased number of nodes have been addressed.
For many nodes a more fair scheduling might be necessary to implement. In the case the bus-based
interconnect are replaced with another type, arbitration may become obsolete.
7.3
Testing
7.3.1 How to test

The component can be functionally tested by the use of a testbench, by looking at the handling of
requests, responses, and the expected outcome of the dierent request schemes.
7.3
Testing
28
7.3.2 What to test

Request and response sequence. Ordering of requests when multiple requests are issued in parallel.
Queuing forwarding of pending requests, when the bus-holder lower its bus request signal.

In order to conclude correct behavior of the component, needs at least several simulated Network Interfaces representing the dierent nodes, if no real NI are available.

Test sequences to issue in order to conrm correct behavior and the corresponding results.
Stand alone requests
Every node (NI), issues a single request without intervention from other nodes, the outcome for
correctness is a bus grant signal to that node as long as the NI holds bus request high.
Dierent number of simultaneous requests
Two or more network interfaces issues their bus request signals simultaneous in any arbitrary order.
For correct result, the nodes are to receive their bus grant in order of descending node ID's. Tests
when all nodes requests the bus simultaneous must be performed.
BOOT
8
8.1
29
Boot
Functionality

The boot procedure describes how the nodes gets their program and data.

The component loads the processing nodes memories and after transfer of data is nished the system is
trigged to start with a register write to NI.

The I/O node must be able to map the object code onto the platforms memory. NI's defaults their
nWAIT ALWAYS ACTIVE to an active state which stops the CPU from loading the rst instruction and
thus holds the nodes inactive. IO is responsible to activate the system after a complete download. This is
achieved by making the last data to be transfered to be a broadcast to all NI:s nWAIT ALWAYS ACTIVE
registers. Onehot adress coding emulates the broadcast function.

For simplicity DMA is used to congure the local RAM. The scenario where the node is participating
in the download is abandoned becuse of problems with location of the vector table in both RAM and
ROM. ARM did not provide an easy way to move the vectortable, which is a prerequisite for making a
system reset with optional download. The result would have been booting from ROM, "download always"
and without any possibility to modify the vectortable in run-time. (debug demands) Booting from an
uninitialized RAM is not recommended. ROM is not needed in this version because everything is loaded
into RAM.
8.1.5 Future work

By adding a switch and EPROM the user could select if download will be done from a host computer via
the paralell port or from the added EPROM. The format for the EPROM is 32 bit address + 32 bit data. A
certain adress of the EPROM contains the broadcast adress to deactivate the nWAIT ALWAYS ACTIVE
registers. If an EPROM download is detected from the PROM OTHERS signal, the IO node loads the
contents of the EPROM.
30
MEMORY WRAPPER
Memory Wrapper
9.1
Functionality

To make the design easily portable, technology independent parameterizable memorywrappers are introduced.

The wrapper instantiates technology specic memories and present a standardized behviour and strobe
view to the hierarchy above. ASIC and FPGA memories are seldom controlled in the same manner. A
VHDL simulation model (preferrably FPGA-synthesizable) is provided along with a portmappable layout
in the ASIC case. They should be easily interchangable; multiple architectures for one entity.

Specic to the chosen technology, in Socrates platform 1.0 Virtex RAM/ROM are selected.
9.2
Implementation
Smaller memory primitives are instantiated and merged together with the generate/port map command.
To suit ARM, a special wrapper is implemented which has one CS and 4 WE. Four byte-wide memories
are instantiated. The databusses are threestated when the memory is not adressed and no write enable
is active.
9.2.1 Entity/Interface
generics: address width, data width
External Interface: NI-Interconnect
Signal Name
clk
reset n
we 31 24 NI
we 23 16 NI
we 15 8 NI
we 7 0 NI
we 31 24 CPU
we 23 16 CPU
we 15 8 CPU
we 7 0 CPU
cs NI
cs CPU
address NI
address CPU
data NI
data CPU
Width
31 downto 0
31 downto 0
1
1
1
1
1
1
1
1
1
1
depth-1 downto0
depth-1 downto0
31 downto0
31 downto0
Type
In
In
IN
IN
IN
IN
IN
IN
IN
IN
In
In
InOut
InOut
InOut
InOut
Description
Strobes are active high. Data is bidirectional, direction is controlled by WE. The number of WE is
dened by (address width / 8) because of ARM's demand for separately controllable bytelanes.

The wrapper should not introduce any latencies to the strobes.
9.3
Testing
31

Virtex memories due to the FPGA platform choice. DI and DO is merged into a 3-state bus for simplied
connection to shared busses. The memory is clocked on an inverted clock to get a pseudo-asynchronous
behaviour.
9.3
Testing
9.3.1 How to test

Do-le
9.3.2 What to test

The generics should be tested on small data, small address, big data and big address. Memory writes
read through opposite port.

Testbench
SoCrates
-Implementation details
Authors
Mikael Collin, Mladen Nikitovic and Raimo Haukilahti

Supervisors

as,
Oct 9, 2000
Abstract
This document is the result of a Master Thesis in Computer Engineering, describing the implementation of The First prototype of Socrates, a congurable platform for System-on-chip Multiprocessor
system. It ts on a single million gate FPGA including 2 processing nodes, memory, embedded
software, an IO unit, and a Real Time Hardware Unit. Here, an detailed overview is given of the
implemented CPU-core, a non-pipelined clone of ARM7TDMI processor and the Network Interface.
Description is given how to link an application, congure the system, and how to simulate it, before
the end section where suggestions of future work is given. The document also include preliminary
sythesis results and future work.
CONTENTS
Contents
1 CPU
1.1 Instruction set . . . . . . . . . . . . . . . . . . .

1.2 Register le . . . . . . . . . . . . . . . . . . . . .
1.2.1 Registers . . . . . . . . . . . . . . . . . .
1.2.2 Implementation . . . . . . . . . . . . . . .
1.2.3 Interface . . . . . . . . . . . . . . . . . . .
1.3 Pipeline compensators . . . . . . . . . . . . . . .
1.4 Barrel shifter . . . . . . . . . . . . . . . . . . . .
1.4.1 Interface . . . . . . . . . . . . . . . . . . .
1.4.2 Functions . . . . . . . . . . . . . . . . . .
1.4.3 Functionality . . . . . . . . . . . . . . . .
1.4.4 Stages . . . . . . . . . . . . . . . . . . . .
1.5 ALU . . . . . . . . . . . . . . . . . . . . . . . . .
1.5.1 Electrical Interface . . . . . . . . . . . . .
1.5.2 Operations . . . . . . . . . . . . . . . . .
1.5.3 Flags . . . . . . . . . . . . . . . . . . . . .
1.5.4 Excplicit demands . . . . . . . . . . . . .
1.5.5 Testing . . . . . . . . . . . . . . . . . . .
1.6 PC and result incrementer . . . . . . . . . . . . .
1.7 Exception handling . . . . . . . . . . . . . . . . .
1.7.1 Exception decoder . . . . . . . . . . . . .
1.7.2 External interrupt . . . . . . . . . . . . .
1.7.3 Internal exception . . . . . . . . . . . . .
1.8 Control unit . . . . . . . . . . . . . . . . . . . . .
1.8.1 States . . . . . . . . . . . . . . . . . . . .
1.8.2 Instruction information . . . . . . . . . .
1.9 Execution Time Analysis . . . . . . . . . . . . .
1.9.1 Cycle Times & State Machines . . . . . .
1.9.2 Condition Code False . . . . . . . . . . .
1.9.3 Branch & Branch Linked . . . . . . . . .
1.9.4 Data Processing . . . . . . . . . . . . . .
1.9.5 Block Data Transfer . . . . . . . . . . . .
1.9.6 Single Data Transfer . . . . . . . . . . . .
1.9.7 Single Data Swap . . . . . . . . . . . . . .
1.9.8 Processor Status Register (PSR) Transfer
1.9.9 Software Interrupt (SWI) . . . . . . . . .
1.9.10 Prefetch . . . . . . . . . . . . . . . . . . .
2 Network Interface
2.1
2.2
2.3
2.4
2.5
Address Decoder
Control Unit . .
Prefetch Buer .
Sender . . . . . .
Receiver . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3.1 Electrical Interface . . . .

3.2 Operations . . . . . . . .
3.2.1 Excplicit demands
3.2.2 Testing . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
3 Arbiter
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
1
4
5
5
7
7
7
7
8
9
9
10
11
11
12
12
12
12
12
13
13
13
13
14
16
18
18
19
19
19
20
21
22
23
23
24
25
25
25
26
26
28
30
30
30
30
30
CONTENTS
II
4 Compiling & Linking the System Software
4.1 Description of the Phases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5 Conguring the Socrates Platform

5.1
5.2
5.3
5.4
System setup . . . . . . .
Conguring HW-platform
Simulation . . . . . . . . .
Synthesis . . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
6.1 Developed hardware . . . .

6.2 System software . . . . . . .
6.3 Linker scripts . . . . . . . .
6.3.1 Task-switch routine
6.3.2 I/O primitives . . .
6.4 Demo application . . . . . .
6.5 RTL-simulation . . . . . . .
6.6 Synthesis Results . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
7.1 CPU improvements . . . . . . .

7.1.1 Increase compatibility .
7.1.2 Register le . . . . . . .
7.1.3 Control unit . . . . . . .
7.2 Enhanced prefetch functionality
7.3 Allow process migration . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
6 Current Results
7 Future work
31
31
35
35
35
35
35
37
37
37
38
38
38
38
38
39
40
40
40
40
40
41
41
1 CPU
1 CPU
This chapter is a description of the processor that is implemented in the SoCrates project. The processor
is integrated together with a Network Interface and local memory to form a CPU node. One restriction
put on the processor was that it could execute ARM code, because lots of applications in the embedded
industry has been implemented on an ARM platform, which would make SoCrates more attractive to
potential users. The processor is implemented with similar features as the original ARM together with
some new components. The gure 1 shows a general view of the processor architecture. Here, the
components that the processor consist of will be described in the following sections.
Address
Data
Control
Unit
Register File
A
B
IMM_value
Exception
handler
PC
Pipeline
Compensator
Pipeline
Compensator
Operator
Operator
Barrel
Shifter
C
Amount
C
Operator
C
N
Z
ALU
+4
V
RES
Operator
PC
Operator
+4
PC
Figure 1: The processor architecture.

The chapter begins with a description of which instructions of the original ARM instruction set are
supported by the ARM-clone. The sections after the instruction set section describes each component
that the processor consists of in terms of functionality, electrical interface, and state machines. The
chapter ends with an execution time analysis which gives a worst and best case execution time for each
instruction supported by the ARM clone.
1.1 Instruction set

In this section the subset of the ARM7TDMI instruction set that will be used and implemented in the
ARM-clone is introduced and presented. The reasons for choosing this subset of instruction can be found
in the Socrates specications.
A more detailed and in depth description of the instruction semantics and syntax for using the dierent
instructions except for the PRE prefetch, can be found in the ARM manual.
B(L) The branch and branch linked instructions performs a pc relative jump. The destination address is
generated by adding a oset to the actual program counter. Whether the instruction is an ordinary
branch or a branch liked is determined by the state of bit 24, the L-bit in the op-code gure 2, one
1.1 Instruction set
means linked. A branched linked will save the return address1 in the link register, to be used when
returning from subroutines.
COND
1 L
(a).
Offset
Figure 2:
Data processing The 16 data processing instructions are the instructions that internally process data
among the processor registers. The dierent instructions are dened by the alu operation the
instructions op-code in gure 3 maps to, see section ALU. The second operand is either an immediate
value or a register depending on the I- ag. When operand 2 is a immediate value the operand 2 eld
looks like gure 3a. A 8 bit immediate vale is rotated ror, the amount given by the 5 bits in the eld
Rotate. Two dierent types of register operands exist, register immediate gure 1.1d, and register
register gure 3e. When operand 2 is register immediate the shift amount is a 5 bit immediate value
whereas the shift amount for register register is the lowest ve bits of the register Rs. For registers
as operand 2 any of the ve shifting functions are possible, given by the Fx eld. special awareness
must be taken when the pc register 15 is used, both on the implication of pipeline compensation
and mode changing. The S-bit is also of signicans for the control ow of the program, it decides
whether the status ags should be updated by an instruction or not.
COND
OpCode
Rn
Rd
(a).
Operand 2
Rotate
(b).
Imm
Shift
(c).
Rm
Amount
Rs
Fx
(d).
Fx
(e).
Figure 3:
BDT Block Data Transfer enables transfer of data to or from any number2 of registers, to and from the
main memory.
This instruction is convenient to use for stack operation though it supports any stacking mode, post,
pre, up, and down indexing. The 16-bit wide register list contains all the registers the instruction
shall operate on. Each register that ought to be processed is marked with a one in it's corresponding
bit position, e.g register R5 in bit number 5. At least two registers must be present in the list. The
behavior of the instruction is derived by the ag-bits in the op-code gure 4.

P-bit, pre or post addition of oset to the base address.

U-bit, subtraction or addition of oset to the base.
S-bit, mode transfer or user mode transfer.
W-bit, write-back modify or not.
L-bit, load or store registers.
1 The address
2 Any general
to the instruction after the branch instruction.

purpose register R0-R15.
1.1 Instruction set
Special awareness must be taken whenever the program counter is in the register list. The use of
the S- ag is also important when using the pc. When using user mode transfer3 the base address is
obtained from a banked supervisor register and all the registers in the list are non banked registers.
COND
P U
S W L
Rn
(a).
Register list
Figure 4:
SDT Single Data Transfer, load or store byte or word quantities. The contents of the Rd destination
register is saved on location Rn base register+ a oset value, or Rd is loaded with the contents of
address Rn+oset. The actual instruction and modes are determined by the ag-bits in the op-code
gure 5a.

I-bit, oset is an immediate value or a register..

P-bit, pre or post addition of oset to the base address.
U-bit, subtraction or addition of oset to the base.
S-bit, mode transfer or user mode transfer.
W-bit, write back modify or not.
L-bit, load or store registers.
When the I bit is clear the oset is a 12-bit immediate value shown in gure 5b. Otherwise the
oset is a register like gure 5c, that ought to be shifted in the same way as the data processing
register immediate is done before addition to the base is performed.
COND
P U B W L
Rn
Offset
(a).
Immediate offset
(b).
Rd
Shift
Rm
(c).
Figure 5:
SWP This instruction is primary test & set instruction, designed for implementating software semaphores.
The instruction reads a location given by the base register Rn and stores the contents of register
Rm on address (Rd). Then the previously read value is stored in Rd. The instruction requires
that the dierent accesses to the memory are atomic, therefore the instruction is able to lock the
interconnect during execution.
COND
0 B
Rn
Rd
Rm
(a).
Figure 6:
MRS Transfer status register to a general purpose register. The contents of cpsr or spsr is transfered to
any of R0-R14 even to R15 the PC4 .
3 Only in supervisor mode.

4 What is the semantic of that
action, when is that move adequate?
1.2 Register le
COND
4
0
0 Ps
Rd
(a).
Figure 7:
MSR Transfers the contents of any register including the pc to cpsr or spsr. Two modes do exist, the rst
transfers the whole register contents to the status register see gure 8. The other one only updated
the condition ags-bits of the status register, gure 8a. Two dierent kinds of ag transfer are
possible. Register or immediate, where the register contents is used directly whereas the immediate
value is rotated before transfer.
COND
0 Pd
COND
0 Pd
Rm
(b).
Source operand
Rotate
0
(c).
Imm
0
(a).
Rm
(d).
Figure 8:
SWI Software interrupt, this instruction is used to enter supervisor mode in a controlled manner. A
software trap will occur, put the processor in supervisory mode. The return address is saved in the
link register (R14 supervisor) and the contents of the cpsr is transfered to the saved status register.
The ignored eld of the instructions operation code, gure 9, is ignored by the processor and can
be used to pass information to the interrupt service routine ISR handling the exception.
COND
1 1 1 1
IGNORED
(a).
Figure 9:
PRE The prefetch instruction is an extension to the original ARM7TDMI instruction set, designed and
implemented for the Socrates system on chip architecture. The instruction will issue a non-blocking
read of one 5 data word from any address within the system. Unlike the rest of the instructions the
prefetch instruction can not be executed conditionally since the four ones in the normal condition
eld is the actual op-code for the instruction. A prefetch is issued of the data located on the node
given by the eight bit ID eld and the 20 bit wide local address eld see gure 10. A prefetch can
be viewed as a non-blocking load of eventual remote located data that normally would result in a
processor stall. When the data later on is needed a read to the same address will deliver requested
data.
1.2 Register le
The original ARM7TDMI make use of a ve ported register le, this due to the use of a pipeline and
the fact that some modes of the data processing instruction needs to read three registers at one instant
time. Since no pipelining is used in this clone a "pseudo\ dual ported RAM model ought to perform well
for the purpose. The register le is "pseudo\ dual ported in the meaning that either two register can be
read simultaneously, or a write to a single register can be performed.
5 The
rst prototype only supports prefetch of one 32-bit word, will be enhanced in later versions.
1.2 Register le
1
ID
(a).
Address
Figure 10:
The ARM processor has 16 general purpose registers ranging from zero to fteen whereas the pc is
the sixteenth register R15, and the current processor status register cpsr. There also exist banking
(duplications) of registers that are accessible only in supervisor mode. The banked register are R13 svc,
R14 svc, and the saved processor status register spsr. A table of all general and banked registers can be
found in the Socrates specications cpu section and the ARM7TDMI manual.
1.2.1 Registers
As stated in the Socrates specications document the ARM-clone only supports system/user and supervisor states, therefore except the general registers R0 to R15, and processor status the banked registers
R13, R14 and saved status must be implemented. This makes a total register count of 20. All registers
address, name, usage and implementation can be viewed in the table below.
Register address
-0000
-0001
-0010
-0011
-0100
-0101
-0110
-0111
-1000
-1001
-1010
-1011
-1100
01101
01110
01111
10011
10100
11111
Register name
R0
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
R11
R12
R13
R14
R15
R13 supervisor
R14 supervisor
pc
Function
R0
R1
R2
R3
R4
R5
R6
R7
R8
R9
R10
fp
ip
sp
lr
UNUSED
sp supervisor
lr supervisor
pc
Implementation
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
Ram
signal
signal
signal
The general purpose registers and their banked counterparts.
Register name Function

Implementation
cpsr
current processor status register
signal
spsr
saved processor status register supervisor signal
The processor status register and the banked saved status register.
1.2.2 Implementation
The main part of the register bank for this clone will consist of the ram module wr sync dpram from the
sysa 1 course. The ram module is congured to contain 16 32-bit wide ram cells, which is where the main
parts of the registers will be located. The registers not located in ram will be implemented as signals
within or outside the actual register le. Those registers that are signal but resides inside the bank are
the banked R 13, and R 14. The two processor status registers are both implemented as signals but not
1.2 Register le
invoked in the physical register le unit. This is motivated by the reason that the MSR instruction has
the option to not update the whole register but just the conditional ags. This update of the ags would
have been very hard to realize with the status registers as ram cells. The program counter is also placed
outside the register le as a signal. This due to speed performance reasons.
A wrapper component is built around the ram module where a mapping function translates the incoming signals from the processors control unit into signals interpreted by the ram module. The address of
the dierent register is seen in the tables above. The wrapper and is's contents is viewed in gure 11.
In order to read the contents of any register in the register le the we reg strobe must be cleared and
the address to the desired register must be presented to one of the two address lines for reads. Data will
be ready for use on one of the two out ports the next cycle. Two registers can be read simultaneously.
When issuing a write, only a single register is written. The we reg is set high and the register address is
put on the add write line. The data to write is put on data in. The write is completed in memory after
two cycles this due to the synchronous nature of the ram module.
we_reg
add_read_A
add_read_B
add_write
Mapping function
Dual ported RAM
R0
R8
R1
R9
R13_svc
R2
R10
R14_svc
R3
R11_fp
R4
R12_ip
R5
R13_sp
R6
R14_lr
R7
unused
PC_R15
CPSR
SPSR_svc
Mapping function
data_A
data_B
Figure 11: The processor register le.
data_in
1.3 Pipeline compensators
1.2.3 Interface
Signal Name
clk
we reg
add write
add read A
add read B
data in
data A
data B
Width
1
1
4 downto 0
4 downto 0
4 downto 0
31 downto 0
31 downto 0
31 downto 0
Type
In
In
In
In
In
In
Out
Out
Description
Clock pulse
Write enable strobe
Write address
Read address port A
Read address port B
Write data input
Port A data output
Port B data output
Clk, the clock needed by the synchronous dpram from the sysa 1 course which the register bank
mainly consist of. Writes are synchronous.

We reg, write enable strobe. One enables writes, zero disables writes.
Add write, the register address to which the processor wish to write to. This address is only decoded
when we reg is high.
Add read A, read address for port A, in order to gain access to one of the registers R0-R15 independent of mode and eventual banked registers.

Add read B, same as above but for port B.

Data in, the actual data the processor intend to save in the register indicated by the address on
add write.
Data A, the output resulting from reading the contents of the register corresponding to the address
specied by the signal add read A.
Data B, same at for data A but A replaced by a B.
1.3 Pipeline compensators

The original ARM processor makes use of a three stage pipeline in order to execute the instructions.
Instructions must in some modes take in count that the program counter do not point at the address
from where the actual instruction was fetched. Therefore the clone must compensate for this "prefetch\
of instructions. This compensation is performed by the pipe compensator unit, which adds 0,4,8, up to
12 to the incoming 32-bit value and pass it further. The amount which is added is concluded by a four
bit operator.
1.4 Barrel shifter

The barrel shifter is used to perform shifts and rotation of data to be used as alu argument two. The
barrel shifter shifts the incoming operand according to the specied function and the requested amount,
which is between 0 to 31 bits encoded in a 5 bit wide vector.
1.4.1 Interface
Signal Name
Rm
amount
Op
I rot
cpsr C
B out
Width
31 downto 0
4 downto 0
1 downto 0
1
1
32 downto 0
Type
In
In
In
In
In
Out
Description
Input argument
Shift amount
Operand code
Indicates immediate shift, (no rrx)
Carry bit from Status Register
Result of operation + carry out from shift
1.4 Barrel shifter
Rm, the 32-bit contents of any register except the status registers, or an immediate value specied
by the instruction currently executing.
Amount, the ve bit value giving the amount of steps to be shifted. The amount is specied by an
immediate value or by a register contents supplied by the actual instruction.

Op, gives the function to be performed by the barrel shifter, specied by the instruction.

I rot, immediate shift ag indicates that the amount is specied by an immediate value and rrx are
not to be used.
cpsr C, current state of the carry ag.
B out output result from shifting Rm x-bits along with carry out produced by the shifting process.
The carry produced by the barrel shifter is used as carry in for the alu later on.
1.4.2 Functions
Basically the barrel shifter supports three shifting modes, logical shift left(lsl ) and right(lsr ), and arithmetic shift right(asl ). It also supports two rotates, rotate right(ror ) and rotate right extended(rrx ). The
rrx function is a special case of a ror #0, which is interpreted as rrx.
Logical shift left

Logical shift left performs a shift of the specied amount to the left. The last bit shifted out is
placed in the carry and zeros are shifted in from the right, for more detail see gure 12.
0 0 0
LSL #3
Figure 12: Example of lsl #3.
Logical shift right

Lsr is a re ection of the previously described lsl, see gure 13 for further details.
0 0 0 0
C
LSR #3
Figure 13: Example of lsr #3.
Arithmetic shift right

Arithmetic shift right is similar to lsr but with the dierence that asr preserves any sign bit in the
MSB of the word. The state of bit 31 is shifted in from the left, behold gure 14.
Rotate right
Ror performs a rotate of the bits in the operand, the contents of bit zero is placed in the carry and
bit 31 for a ror #1. No bits are ever shifted out. Figure 15 illustrates the result of an ror #3.
1.4 Barrel shifter
C
ASR #3
Figure 14: Example of asr #3.
C
ROR #3
Figure 15: Example of ror #3.
Rotate right extended

The rrx function is as mention above a special function encoded as ror #0. The barrel shifter
perform a shift of one bit to the right where the out shifted bit is placed in the carry. The previous
state of the carry is placed in position 31 of the operand, see gure 16.
C
C
ROR #0
RRX
Figure 16: Example of ror #0 translated to rrx.

A more specic and detailed specication of the dierent shift and rotate operations can be found in
the ARM7TDMI manual section data processing.
1.4.3 Functionality
The barrel shifter works in a pipelined fashion with three steps stages where the dierent stages except
the last one produce an intermediate value that is passed on to the next stage for further processing.
The result from the last step is the actual output from the barrel shifter and is to be passed on to the
alu. By dividing the possible maximum shift amount into three stages we will reduce the chip area used
compared to have 31 dierent shift modules, one for every possible amount. Still some eectiveness in
speed is obtained.
1.4.4 Stages
In each stage of the pipeline the amount is calculated and the function is evaluated. The incoming
operand whereas it is the unprocessed operand or an intermediate value is passed to the right functional
shifting block according to function and amount. The intermediate value produced denoted stage n in
gure 17, is a result of the shifts carried out by all previous steps.
1. The rst stage of the barrel shifter encodes the amount to be shifted from the two lowest bits of the
amount vector. This yields a shift amount of f0,1,2, or 3g which gives the actual subset of the whole
1.5 ALU
10
shift to be performed by the rst step. A zero amount in those two bits indicates a bypass of the
operand to the next step. A special case emerges when the shift amount is zero and the operator
is ror. If the signal I rot equals zero we have to perform a rotate extended, this is expressly done
in this stage.
2. Shifts of 0,4,8, or 12 bits are performed in the second stage. The amount of bits to shift is taken
from bit 2 & 3 in the amount signal. Likewise the rst step zero amount means bypass to next step.
3. The third and last stage takes the intermediate value from the second stage and perform either a
bypass to the output or a shift by 16 bits before passing it on to the output. This depends whether
the last bit in the amount vector is one or zero, zero means bypass.
OPERAND
SHIFT 16
RESULT
SHIFT 0
Stage 2
SHIFT_AMOUNT
SHIFT 12
SHIFT 8
3
2
SHIFT 4
OPERAND
SHIFT 0
Stage 1
0
SHIFT 3
OPERAND
SHIFT 2
Data
SHIFT 1
SHIFT 0
Figure 17: The barrel-shifter pipeline.
1.5 ALU
The Arithmetic Logic Unit (ALU) performs 16 basic ARM dened data processing operations. The
arguments to ALU comes via the two inputs A and B and the result is delivered at the output OUT and
at the 4 output ag signals. The unit is asynchronous.
1.5 ALU
11
A
c_in
OP
s
c
v
z
n
RES
Figure 18:

Signal Name Width
A
31 downto 0
B
31 downto 0
OP
3 downto 0
S
1 downto 0
c in
1
RES
31 downto 0
c
1
z
1
v
1
n
1
Type
In
In
In
In
In
Out
Out
Out
Out
Out
Description
Input argument 1 (Op1)
Input argument 2 (Op2)
Operand code
Set condition code
Carry bit from Status Register
Result of operation
carry ag
zero ag: set if the result of operation is zero
valid ag: set if the data is not valid
negative ag: set if the result is negative
1.5.2 Operations
Which operation that is performed by the ALU is dened by the input on the OP signals. All operations
and their semantic actions are shown in the table below:
1.6 PC and result incrementer
OP
0000
0001
0010
0011
0100
0101
0110
0111
1000
1001
1010
1011
1100
1101
1110
1111
12
Operation
AND
EOR
SUB
RSB
ADD
ADC
SBC
RSC
TST
TEQ
CMP
CMN
ORR
MOV
BIC
MVN
Description
RES = A AND B
RES = A EOR B
RES = A - B
RES = B - A
RES = A + B
RES = A + B + c
RES = A - B + c - 1
RES = B - A + c - 1
Set condition codes on A AND B
Set condition codes on A EOR B
Set condition codes on A - B
Set condition codes on A + B
RES = A OR B
RES = B
RES = A AND NOT B
RES = NOT B
The TST, TEQ, CMP and CMN operations does all aect the RES signal which means that it is up
to the control unit to not use the result. For further information refer to ARM7TDMI Data Sheet: ARM
Instruction Set.
1.5.3 Flags
The S input signal control if the ags are to be set or not. If the S is 1 the ags are aected and if S is 0
no ags are aected. In the implementation the S ag in the alu is always set and the data path control
desides wether the status register should be updated or not depending on the actual instruction and wich
ags are set in it's op-code.

Alu must be able to deliver the result within one clock-cycle.
1.5.5 Testing
All operations and are tested with all possible combinations of inputs. The verication of the simulation
is made \by hand". By using generics, ALU can be tested for input arguments of size 2.
1.6 PC and result incrementer

The two incrementer units are used to add four to an incoming 32-bit wide signal. The incrementer
unit either add four to the incoming vector or bypass it through, which is decided by a operator signal.
Whenever the processor needs to increment the PC by four, or to perform the last step of the pipeline
compensation described above, the value is fetched from the incrementer units. Otherwise values are
obtained from the incoming vectors of the incrementers.
1.7 Exception handling

The processor must be able to handle abnormal states that might occur, and incoming interrupts. The
interrupts are from the processors point of view an external event beyond the control of the processor,
whereas exceptions are due to undesired circumstances internally or the execution of a SWI instruction.
However all of those cases must be taken care of, in order to restore the system, perform service to
external devices, or simply shut down the processor or the system, so no further damage can be done.
1.8 Control unit
13
All handling of interrupt and exceptions are performed in the exception state, in which a oset vector
from the exception decoder is processed and loaded in to the pc enableing execution to continue from
that point. The address is presented to the address bus and the processor goes to the decode state.
1.7.1 Exception decoder

The exception decoder decodes the dierent exception lines the processor or some external device6 has
raised. The decoder unit will asynchrony response with the address to the specic IRS Interrupt Service
Routine for the incoming exception. The address to all ISRs are located in the exception table which is
described in the Socrates specications document.
Signal Name
Reset
Undened
Software
A prfetch
A data
Reserved
IRQ
FIQ
mask
arm irq
oset
Width
1
1
1
1
1
1
1
1
1
1
2 downto 0
Type
In
In
In
In
In
In
In
In
In
Out
Out
Description
Reset line
Undened instruction line
Software interrupt line SWI
Prefetch abort line (unused)
Data abort line bus error
Reserved for future expansion
Rtu interrupt line
Fast IRQ line (unused)
interrupt mask from status register
Interrupt signal to the processor
Address oset to the ISR
1.7.2 External interrupt

As previously described interrupts are from an external source, and the case for the Socrates platform
it is the RTU that interrupts the processor in order to make it perform a task-switch. When the RTU
lower the IRQ line the arm irq signal goes high and will be detected by the processor when the currently
executing instruction is nished7 and the processor resides in one of the states, fetch or decode. Any
eventual write back is performed. Instead of loading the pc with the address to next instruction and place
it on the address bus the address is saved in the supervisor link register and the cpsr register is saved in
spsr. The supervisor bit in cpsr is set and all interrupts is disabled. A state transition to exception is
done.
1.7.3 Internal exception

Unlike the interrupt any internal exception is taken care of as soon as they occur, the preparation for
entering exception state can be made in any other state, in the same fashion as for external IRQs.
The reason for this is that a bus error or a undened instruction can make the processor behavior
unpredictable and most undened. It is better to abort execution directly and shut down the processor.
The only internal exception that do not halts the processor is the SWI software interrupt which is done
in a controlled way by the programmer. In the rst prototype of the Socrates architecture the SWI
instruction starts to load the address to the software task-switch.
1.8 Control unit

The processor and it's functional units described above are controlled and managed by the control unit.
The control constitutes of a nite state machine responsible for decoding and execution of instructions.
Execution of the dierent instruction is done by enabling the right signals at the right time to those
functional block that are invoked in the execution of a certain instruction. The machine is also totally
responsible for the internal address bus.
6 The RTU on the Socrates platform.
7 Write back of register might be the only
remaining action to perform.
1.8 Control unit
14
The state machine is implement as a mealy machine with synchronous outputs, all assignment of signals
are in the same vhdl process. This may not be the most eective way to implement a state machine for
speed performance but it has been our main technique until this date.
Exception
Start
B_A
B_L
Load
Store
DP
E
B_S
swap
Figure 19: The processor state space.
1.8.1 States
The control machine consists of 12 states, which can be viewed in gure 19. A general description of
what action is performed in each one of the processor states is given in this section. General in the
meaning that every signal assignment are not described but the function of the state in order to execute
the dierent instructions are.
start The start state is a hardware setup state, internal signals are set to their initial values. The pc,
cpsr, and spsr are initiated. All buses are three stated. When the system boot signal8 the rst
instruction is fetched and a state transition to decode is done.
fetch
F The fetch state is for most of the instruction the last state in their execution path. It is also
the state where fetch of the next instruction is made. The last part of an instructions execution is
performed here which is write back of the destination register and eventual updates of the status
ags for data processing instructions.
decode D All instructions comes to this state once in their execution cycle. If the nWait signal is high
execution of the instruction can begin, otherwise the processor halts. First the condition eld of
the op code is evaluated according to the states of the condition ags specied in the ARM7TDMI
manual. If the conditions for the instruction to be executed fail the pc will be updated and the
next instruction fetched, we will remain in the decode state to decode the next instruction.
Now the op-code is being decoded according to the decoding tree presented in the Socrates specication document section CPU. Any op-code not recognized by the decoder will generate a undened
8 The
nWait signal goes high.
1.8 Control unit
15
instruction exception which will be handled by the exception handler. For all instructions that need
to pass information from the op-code to other states along the instructions execution path a inst info
vector is created and initiated with adequate information see section 1.8.2 for further detail. For
the most of the instructions registers are fetched from the register le, except for the eventual use
of immediate values. Dependent of what remains to do in order to complete the execution of the
instruction, the next state diers among the instruction in the instruction set. The MRS instruction
is already nished and can issue a fetch of the next instruction and remain in decode state. Others
like data processing load/store and SWP goes to the execute state. Prefetch, branch and MSR will
proceed to the fetch state.
E For most of the instructions reads of register initiated in the decode state are here passed on
to dierent units in the processor architecture. Values are put on the pipeline compensators, shifted
in the barrel shifter an placed on the two alu ports. Data processing instruction are passed further
down their execution path due to their use of registers, immediate values, one register two registers
up to three registers. The block data transfer instructions are the only instruction that enters this
state several times during their execution. Therefore the execute state for those instructions is
divided in to two parts distinguished by the amount of register to process in the register list. If
the register list is empty all register in the list has been processed and the instruction in nished.
Update of the base register if the W-bit in the op-code is set or post indexing is used, is the only
remaining action to be done. If the list is un empty, the rst register in the list is processed, whether
the instruction is a load or a store is determined and action taken, read of source register for a store
or address output for a load.
execute
DP This state is only used when operand 2 of the data processing instruction is of register
register type. This because the register le only enables two registers to be fetched at once, but
this instruction is in need of accessing three register simultaneously. This is overcome by buering
the contents of one of the registers in the execute state and reuse it in this state with the third
register being fetched from the execute state.
dp extra
load When the load instruction reached the load state the eective address calculation is ready. The
address must then be evaluated according to alignment rules, word aligned for word accesses. For
byte accesses no special alignment is needed. If a word access do not pass the alignment check an
undened instruction exception will occur, the load will be aborted and execution will continue with
the exception handler for undened instruction. If the address is aligned correct, the right address
for the load, depending on the addressing mode pre or post indexing is placed on the address bus.
Any eventual write modify of the base register is performed here, owed by a transition to the fetch
state.
store The store state is identical to the load state in regard to alignment control, write modify of the
base register, and it's next state. Not only the address bus is updated but also the data bus with
the actual value to be written in memory. Special action must be taken when the program counter
is to be stored, it must be pipeline compensated by 12. For the case when the write is of byte size
the lowest byte of the source register is duplicated on to all four bytes of the data bus.
B A This state is exclusively used by the block data transfer instructions in order to be
able to calculate the address of the transfer before they reach the execution state.
block address
B L This state handles write back of registers for the load part of the block data transfer
instruction. Since the actual read from memory was initiated in the decode state control of the
nWait signal must proceed all other actions. After the nWait signal goes high the destination
register is examined. If the program counter is the destination register the pc is update with the
value currently residing on the data bus, then if the S-bit in the instruction was set the contents of
spsr is transfered to cpsr (i.e a mode change). The instruction will then return to the execute state
to process the next load.
block load
1.8 Control unit
16
B S In this state the data bus is presented with the contents of the register to be stored.
A control must be done to determined if the register to be stored is the pc or not. The address to
store the data depends on the indexing modes pre, post, up, or down. This is evaluated and the
right address is put on the address bus. Execution will then proceed to the execute state.
block store
swap This state is a special state for the swp instruction where the previously read test variable is being
temporally stored in the internal swap reg. The store part of the swp instruction is also initiated
by presenting the write address and data on to the busses.
exception When entering this state something unwanted has happened or an interrupt from an external
source has occurred. The pc value is stored in the link register for supervisor mode and the pc
is updated with the address to the interrupt service routine which will handle the exception. The
address is constructed by shifting the oset vector from the exception decoder left two bits, then
add the 8-bit identication eld to the upper 8-bits of the address.The address is also presented to
the address bus for a fetch of the rst instruction of the ISR. The machine is driven to state decode
to begin execution of the next fetched instruction.
1.8.2 Instruction information

Since all instructions unfortunately are impossible to execute in a single cycle and the states of the
processor are made as general as possible, information of what instruction has been decoded and it's
features must be passed down the execution path. This is done by the 29-bit wide internal signal inst info.
The inst info is build when the instruction is being decoded in the decode state. Each instruction that
needs this information has it's own conguration and location of the signicant bits. The only part
common to all instructions are bits 0-2 which contains a code indicating which instruction it is, which is
decoded in all states that is not specialized for a single instruction.
B(L) A picture of the B(L) information can be viewed in gure 20b.

Instruction code 000
L The link bit, branch linked if set.
BDT The information vector is visualized in gure 20g.
Rn The base register.
L Load=1 or store=0.
W Update base register if set.
S Status register transfer (mode shift) or forced user mode.
U Up or down bit.
P Pre or post indexing.
Reg list List of all register to process during execution.
pc Indicates whether the pc is in the list or not.
SDT An example of a SDT inst info vector is given in gure 20c.
Rd Destination register.
L Load=1 or store=0.
W Update base register if set.
B Byte or word quantity.
1.8 Control unit
17
U Up or down bit.
P Pre or post indexing.
I Immediate or register oset.
1/2 Obsolete.
H Obsolete.
S Obsolete.
SWP The conguration of the swp information can be seen in gure 20d.
Rn Base register.
Rm source register.
B Byte or word quantity.
MSR See gure 20e, for the information vector.
r/s Destination register.
R Status register cpsr or spsr.
F Whole register or ag bits only.
I Immediate or register transfer.
PRE See gure 20a.
Instruction code 111. The inst info vector for the prefetch instruction expressly contains the
instruction code, this du to no other information needs to be passed along. The instruction
code is necessary because the instruction must be recognized in the the fetch state.
DP Data processing is the only instruction to utilize the full size of the inst info vector, which can be
seen in gure 20f.

pc Destination register is the program counter.
Rn Base register.
pc The base register is the program counter, used for pipeline compensation.
Rm Source register.
pc Source register is the program counter, used for pipeline compensation.
S Updating the status register or not.
Rs The use of Rs when operand 2 is register.
WB Indicates if Rd is not to updated with the result of the operation. This is relevant for tst, teq,
cmp, and cmn which only aects the status ags.
OP Shift operator when operand 2 is register register.

dp Operand 1 op Operand 2, or just operand 2.
alu op Alu operator when operand 2 i register register.
I Immediate or register operand.
1.9 Execution Time Analysis
18
UNUSED
(a).
(b).
Rd
(c).
Rd
(d).
(e).
UNUSED
UNUSED
Rn
UNUSED
H 1/2
P U
Rm
Rn
I
UNUSED
I
alu_OP
pc
dp
OP
W Rs
S pc
Register list
Rm
pc
R r/s
pc
Rd
(f).
S W L
Rn
(g).
Rn
P U
Figure 20: The inst info vector for the B(L), BDT, SDT, SWP, MSR, PRE, and Data processing instructions.

The processor in the SoCrates system has deliberately been designed to be as predictable as possible. By
predictable, we mean that the execution time of each instruction can be specied as precisely as possible.
The original ARM processor, which the SoCrates processor is modeled after, has both cache and a
pipeline. These components are not included in our ARM-clone for the reasons described above. Instead,
the processor makes use of on-chip memory and that the instructions are placed locally to compensate
the loss of performance when removing the pipeline and cache. To see which of the ARM instructions
are supported in the ARM-clone, see the CPU section (chapter 3) in Socrates - Specications, Revision
0.99. The following section gives an analysis of the best and worst case execution time of each ARM
instruction that is supported by the clone.
1.9.1 Cycle Times & State Machines

This section describes each instruction (or instruction class when possible) with a state machine. That
state machine will be the groundwork for the execution time analysis, which gives both the worst and best
execution time. The reason for why an instruction have a worst and best case execution time (WCET
and BCET) is that the time varies depending on whether an immediate value or register is used as an
operand. The time can vary even more when the instruction involves external bus accesses, which means
that the node must wait for mastership on the interconnect. This section will not fully describe how
an instruction is constructed, because it has already been done in the in section 1.19 . Instead, it will
only point out the dierent modes of an instruction when the instruction time varies between them. One
nal aspect need to be addressed when looking at the worst case execution times. There is a worst case
that can happen on every instruction fetch that must be accounted for in every WCET analysis. When
a processor wants to fetch an instruction from the memory it can be stalled, because it is theoretically
possible (but not so common) that another processor is issuing a blocked transfer to the same memory
content where an instruction was supposed to be fetched. This means that the processor issuing the fetch
must wait until the blocked transaction is completed. This delay can be calculated by looking at the
worst case for Blocked Data Transfer (BDT) instruction. The delay will be denoted as Tf etch in all the
WCET calculations.
9 More
information can be found in the ARM7TDMi manual at http://www.arm.com
19
1.9.2 Condition Code False

Every ARM instruction has a condition eld in the upper four bits (bit 31 to 28) of the 32 bit instruction.
Whenever an instruction has been fetched from memory, it always has to be evaluated before it can
be executed. There are 16 possible condition codes which can make any instruction conditional. If the
condition is true, the instruction will be executed, or else the processor will simply fetch a new instruction.
This is displayed in the state machine as a loop from the decode stage to the decode stage. The loop will
always take one cycle.
Figure 21: State transition(s) for instructions with false condition code.
Instruction Type
CC false
BCET
1 cycle
WCET
Tf etch + 1 cycle
1.9.3 Branch & Branch Linked

The branch and branch linked instruction adds a signed 2's complement 24 bit oset to the program
counter. The two instruction types do not dier in execution time because the writeback to the link
register (in the case of a BL) can be done on the transition from the decode state to the fetch state.
D
Figure 22: State transition(s) for Branch & Branch Linked (B, BL) instructions.
Instruction Type
B
BL
BCET
2 cycles
2 cycles
WCET
Tf etch + 2 cycles
Tf etch + 2 cycles
1.9.4 Data Processing

The data processing instructions are the 16 logical and arithmetical functions provided by the ALU.
This family of instructions can be divided into several small families. First of all, the instructions that
demands only one operand can be separated from the rest (those are MOV and MVN). The one-operand
instructions also dier depending on whether the operand is an immediate value or a register. In case
of an immediate value, the execution only takes 2 cycles, because the processor can calculate the result
from the ALU on the rst cycle, and then write it on the second cycle at the same time it fetches a new
instruction (gure 23a) . In the other case (operand is a register), the processor need one extra cycle to
fetch the operand register from the register le (gure 23b). The other part of the DP family is the ALU
operations that require two operands. The rst operand is always a register, but the second operand can
be a rotated immediate value (gure 23b), a shifted register content (gure 23b), or a register content
shifted by the amount specied in another register (gure 23c). In the rst two cases of the two-operand
data processing, one cycle is required to fetch the second operand register (if needed) and to run the
operand through the barrel shifter. The second cycle is for executing the ALU operation. The last cycle
is needed for writeback and fetching the new instruction. The last and slowest case of the two-operand
data processing instructions takes one cycle extra, because one extra register is needed to calculate the
shifted operand.

D
20
D
(a).
(b).
DP
F
(c).
Figure 23: State transition(s) for Data Processing (DP) instructions.

Instruction Type
One-operand DP (immediate)
One-operand DP (register)
Two-operand DP (rotated immediate)
Two-operand DP (immediate shifted register)
Two-operand DP (register shifted register)
BCET
2 cycles
3 cycles
3 cycles
3 cycles
4 cycles
WCET
Tf etch + 2 cycles
Tf etch + 3 cycles
Tf etch + 3 cycles
Tf etch + 3 cycles
Tf etch + 4 cycles
1.9.5 Block Data Transfer

The blocked data transfer instruction stores or loads register to or from local or global memory. The rst
two cycles in gure 24 is for arithmetic calculation and the cycles E > B S > E and E > B L > E
are repeated for blocked store and load for each register in the register list the need to be processed. When
the instruction is completed, the transition from state E to D is for writeback and to fetch a new instruction. Below is a general formula that displays how the execution time for both blocked store and load.
Ttotal = TD >BA + TBA >E + Tbusr equest + Nregisters (TnW aitS T R + TE

TBS >E + TBL >E ) + TE >D , where
>BS
+ TE
>BL + nW aitL DR +
Nregisters = Number of registers in the register list.

TD >BA = 1 cycle
TBA >E = 1 cycle
Tbusr equest = PNi=0 Ti (execution times of all the nodes (N) in the request queue before you)
TnW aitS T R = 0 cycles if local, 5 cycles for global STR
TnW aitL DR = 0 cycles if local, 7 cycles for global LDR
TE >EB = 1 cycle
TE >EL = 1 cycle
TBS >E = 1 cycle
TBL >E = 1 cycle
TE >D = 1 cycle
The best case execution time occurs when only one register (Nregisters = 1) need to be stored or
loaded locally. In this case, BCET becomes 1 + 1 + 0 + 1 (0 + 1 + 0 + 0 + 1) + 1 = 4 cycles.
The worst case execution time occurs when all 16 registers need to be saved externally, which means
that the interconnect mastership is needed. The worst case can be even worsen if all nodes are executing the same instruction and are before you in the request queue, resulting in that one must
wait until all transactions are done until the bus request is granted. The WCET for STM becomes
1 + 1 + Tbusr equest + 16 (5 + 1 + 0 + 0 + 1 + 0) + 1 = 3 + Tbusr equest + 112 = 115 + Tbusr equest cycles. The
same case for LDM becomes 1 + 1 + Tbusr equest + 16 (0 + 0 + 1 + 7 + 0 + 1) + 1 = 3 + Tbusr equest + 144 =
147 + Tbusr equest cycles.
21
nWait
bus_request + nWait
B_L
D
B_A
E
B_S
Figure 24: State transition(s) for Block Data Transfer (BDT) instructions.
Instruction Type
STM
LDM
BCET
4 cycles
4 cycles
WCET
Tf etch + 115 + Tbusr equest
Tf etch + 147 + Tbusr equest
1.9.6 Single Data Transfer

The single data transfer instruction family consists of a load (LDR) and store (STR) operation. Both
instructions can be executed one cycle faster than in the general case, if the rst operand is the program
counter and the second operand is an immediate value. In the general case, the execution time of both
load and store is described in the following formula.
Ttotal = TD
>E
+ TE
>ST R=LDR
+ TE
>F
+ Tbusr equest + TnW ait + TLDR=ST R
>F
+ TF
>D
, where
TD >E = 1 cycle
TE >ST R=LDRP= 1 cycle
Tbusr equest = Ni=0 Ti (Sum of all execution times of all the nodes (N) in the request queue before you)
TnW ait = 0 cycles if local, 5 cycles for global STR, 7 cycles if global LDR
TLDR=ST R >F = 1 cycle
TF >D = 1 cycle
The best case is described in gure 25, where there is no transition from state E to LDR or STR.
Instead, the there is an direct transition from state E to state F, because the PC isn't located in the
register le, and no cycle is needed to fetch it, as in the general case. In the simple case, the rst cycle calculates the address, the second cycle puts the address on the bus (and data if needed), and the
last cycle writes back the data (when needed) at the same time its fetching a new instruction. The
BCET occurs when the LDR or STR does not need to access the interconnect, which gives a total of
1 + 0 + 1 + 0 + 0 + 0 + 1 = 3cycles.
The worst case occurs when a LDR instruction must access the interconnect and all nodes are issuing a
blocked transfer. This causes the bus request loop to wait until all blocked transactions are done. The
WCET then becomes 1 + 1 + 0 + Tbusr equest + 7 + 1 + 1 = 11 + Tbusr equest cycle. The WCET for STR
is basically same, with the exception that the TnW ait for STR takes 5 cycles, which gives a total of 9 +
Tbusr equest cycles.
Instruction Type
LDR
STR
BCET
3 cycles
3 cycles
WCET
Tf etch + 11 + Tbusr equest cycles
22
bus_request + nWait
Figure 25: State transition(s) for Single Data Transfer instructions (1st operand as PC, 2nd operand
immediate).
bus_request + nWait
Load
D
E
Store
Figure 26: State transition(s) for Single Data Transfer (SDT) instructions (general case).
1.9.7 Single Data Swap

A single data swap (SWP) instruction consists of a read instruction followed by a write instruction.
These instructions must not be interrupted and will therefore result in an locked transfer on both the
interconnect an the local bus. The following formula shows how the execution time is calculated.
Ttotal = TD
>E
+ TE
>swap
+ Tbusr equest + TnW aitr ead + Tswap
>F
+ TnW aitw rite + TF
>D
, where
TD >E = 1 cycle
TE >swap = 1Pcycle
Tbusr equest = Ni=0 Ti (Sum of all execution times of all the nodes (N) in the request queue before you)
TnW aitr ead = 0 cycles if local, 7 cycles if global
Tswap >F = 1 cycle
TnW aitw rite = 0 cycles if local, 5 cycles if global
TF >D = 1 cycle
The best case execution time occurs when the instruction is a local swap, because no accesses to interconnect is needed. This gives a BCET of 1 + 1 + 0 + 0 + 1 + 0 + 1 = 4cycles.
The worst case is when the swap instruction must access the interconnect and all other nodes must issue
a blocked instruction that takes maximum time to execute (which is a blocked load of all 16 registers).
So, the time for Tbusr equest depends on how many nodes that are present in the system. With this in
mind, we can estimate the WCET to 1 + 1 + Tbusr equest + 7 + 1 + 5 + 1 = 16 + Tbusr equest cycles.
Instruction Type
SWP
BCET
4 cycles
WCET
23
bus_request + Nwait
swap
Nwait
Figure 27: State transition(s) for Single Data Swap (SWP) instructions.
1.9.8 Processor Status Register (PSR) Transfer

The PSR transfer instructions makes it possible to transfer the processor status register contents to or
from a general register. The MRS instruction moves contents of the PSR register to a general register
and the MSR moves the contents of a general register to the PSR register.
The MSR instruction always takes 2 cycles, where the rst cycle is for register fetching or immediate
value calculations, and the second cycle is for writing to PSR and fetching the next instruction. The
MRS instruction always takes one cycle, because it is possible to fetch the contents of the PSR register
immediately and write it to the operand (the general register), and at the same time fetch the next
instruction in only one cycle.
Figure 28: State transition(s) for MRS instructions.
Figure 29: State transition(s) for MSR instructions.

Instruction Type
MRS
MSR
BCET
1 cycle
2 cycles
WCET
Tf etch + 1 cycle
Tf etch + 2 cycles
1.9.9 Software Interrupt (SWI)

The Software Interrupt instruction is called to perform a taskswitch in the SoCrates prototype. When a
SWI occurs, the processor fetches the next instruction from the address that is provided by the exception
handler. The next instruction is typically a branch to the exception code for SWI. It takes one cycle to
fetch the branch instruction at the address provided by the exception handler. It takes another cycle
to fetch the rst instruction of the exception code. This makes a total of 2 cycles to execute a SWI
instruction.
Instruction Type
SWI
BCET
2 cycles
WCET
Tf etch + 2 cycles
24
D
exception
Figure 30: State transition(s) for Software Interrupt (SWI) instructions.
1.9.10 Prefetch
This instruction is implemented to make prefetching of data possible. In the prototype, only pre fetching
of 32-bit data is possible. The prefetch instruction operand consists of a 28-bit address (8 bits id + 20 bits
local address) that will be written to the predened address of the Network Interface prefetch register.
After the write, the instruction is nished and data should be available when it is needed. The prefetch
instruction always takes 2 cycles because one cycle is needed to write to the prefetch register at NI, and
one cycle is needed to fetch a new instruction.
D
Figure 31: State transition(s) for Prefetch (PRE) instructions.
Instruction Type
PRE
BCET
2 cycles
WCET
Tf etch + 2 cycles
2 NETWORK INTERFACE
25
2 Network Interface
This section describes the implementation of the Network Interface (NI). The functionality demands for
this component is described in Socrates specications in section 4. To made future modications easy NI
was divided into 5 components shown in gure 32 described below:
Address decoder
Listens to accesses made by the CPU and decides whether it is an local access or an external access
or an prefetch initialization.
Control Unit
The heart of the NI. Keeps the state of the NI in mind and controls Sender, Prefetch buer, the
CPU data bus, and the nWait strobe to stop the CPU whenever needed.
Sender
Handles accesses to the interconnect, and as in the case for the Socrates Prototype where the
interconnect is a shared bus the accesses will follow the bus protocol described in the Socrates 1.0
Specications.
Receiver
Listens if there is accesses to memory located at this node or to the nWaitAlwaysRegister. Receiver
controls the strobes to the memory wrapper. In case of a read, the fetched data from local memory
is delivered to the access initiator via the interconnect.
Prefetch Buer
This is a cache-like buer but the contents data within the buer is totally controlled by the local
CPU by initiating prefetches. A prefetch init is done by making a write to a specic address.
Dividing the component into a separate sender and receiver and a control unit with a clear interface
makes it easy to adapt NI if the bus protocol is changed. In this case there is only need for modications
in the sender and receiver components. Even a change to a complete dierent interconnect type will
only have aect on the sender and receiver with minor modications for the rest of NI. In the following
sections the 5 subcomponents implementation will be described.
2.1 Address Decoder

This is component is totally asynchronous to fulll the demands for one-cycle local accesses. If CPU
performs a locked local access, all other accesses that comes via the interconnect must be delayed until
access has nished. Therefore receiver is informed about an local locked access by setting the signal
local lock high. The control Unit is always informed about the access type (local access, remote read,
remote write, or prefetch init). If the local memory is locked by an another node the address decoder
will delay the access from the local CPU until the lock is released.
2.2 Control Unit

If control unit receives a remote read request from Address Decoder it accesses the prefetch buer and if
there data has earlier been prefetched and it is valid it will be delivered to the CPU. Otherwise control
unit sends an remote read request to sender. Write request coming from address decoder are directly
forwarded to sender. Prefetch inits does not stall the CPU and therefore nWait always stays high during
these accesses that are forwarded as remote reads to the sender. When prefetched data arrives from
sender it will be delivered to Prefetch Buer. If there is an another remote access that has to use the
sender during a prefetch init, and data could not be fetched from Prefetch Buer then the remote access
will be delayed until sender has nished the prefetch.
The data bus to the CPU, nWait and berr is controlled asynchronously while the rest of functionality
is done by a synchronous state machine. A general description of the FSM shown in gure 33 is given
below:
2.3 Prefetch Buer
data_NI(31:0)
address_NI(11:0)
cs_NI
address
RECEIVER
berr_i_in
bus_request
SENDER
rw_i
lock_i
data
PREFETCH_BUFFER
ADDRESS_DECODER
Figure 32:
ack_in
lock
nwait
berr
data_CPU(31:0)
trans
adderss_CPU(31:0)
rw
mas
26
PREFETCH A prefetch is ongoing on the interconnect. Waiting for the prefetch to nish. Additional accesses before completion of prefetch will be delayed until sender has prefetched data.
RW An access has been started via the interconnect. Waiting for acknowledge.
request_address(31:0)
prefetch_address(31:0)
complete
data_out(31:0) hit
CONTROL_UNIT
data
prefetched_data update
bus_grant
send_fcn
ad_strobe_out
sender_address
rw_NI(3:0)
cs_CPU(3:0)
rw_CPU
prefetch_init
mas_i
data_write
rw_i
nWaitAlways
IDLE Waiting for actions from Address Decoder.
remote_read
address
data_read
mas_i
remote_write
ack_out
local_lock
This synchronous component is trigged by the Control Unit by setting an access type (remote read or
remote write). After a completed access the control unit is informed and in case of a load the data is
delivered to control unit. A general description of the states in the FSM shown in gure 34 is given below:
2.4 Sender
Control unit can read this buer within one clock cycle since reads are performed asynchronous. Writes
are handled synchronously to avoid latches. A data line in the buer consists of: 32 bit address, 32-bit
data and a valid bit. Prefetch buer can only store one word but the dividing of prefetch buer into a
own subcomponent makes it easy to increase the size. For more info refer to Socrates 1.0 Specications
document.
2.3 Prefetch Buer

berr_decoder
local_access
lock_i
berr_sender
berr_i_out
ad_strobe_in
remote_lock
2.4 Sender
27
cmd=REMOTE_WRITE
send_fcn <= INTERCONNECT_WRITE
addr_sender <= address_CPU
data_write <= data_CPU
RW
cmd = REMOTE_READ and HIT=0
send_fcn <= INTERCONNECT_READ
cmd = REMOTE_READ and HIT=1
COMPLETE = 1
send_fcn <= INTERCONNECT_IDLE

IDLE
complete = 1 and cmd=REMOTE_READ
send_fcn = INTERCONNECT_READ
addr_sender = address_CPU
complete = 1 and cmd=CU_IDLE
OR
complete = 1 and cmd=REMOTE_READ
and (hit = 1 or update = 1)
complete = 1 and cmd = REMOTE_WRITE

send_fcn = INTERCONNECT_WRITE
addr_sender = address_CPU
data_write = data_CPU
send_fcn <= INTERCONNECT_IDLE
cmd = PREFETCH_INIT
addr_sender<=data_CPU
PREFETCH
complete = 1 and cmd=PREFETCH_INIT

addr_sender <= data_CPU
Figure 33: The Finite State Machine for the Control Unit
2.5 Receiver
28
bus_request <= 1;
IDLE
bus_request <= 1;
lock = 0
complete = 0;
lock_i <=Z;
data <= (others =>Z);
ARBITRATE_R
ARBITRATE_W
COMPLETE_STATE
bus_grant = 1
ad_strobe_out <= 1;
address<=addr_from_cu;
data <= (others => Z);
mas_i <= mas;
rw_i <= READ;
lock_i <= lock;
ack = 1 AND berr_i = 0

ad_strobe_out <= 0;
rw_i <= Z;
mas_i <= (others => Z);
address <= (others => Z);
complete <= 1;
data_to_cu <= data;
bus_request <= lock;
lock_i <= lock;
lock = 1
ack = 1 AND berr_i =0l

ad_strobe_out = 0;
rw_i <= Z;
mas_i <= (others => Z);
address<=(others => Z);
complete<=1;
bus_request <= lock;
lock_i <= lock;
bus_grant = 1
ad_strobe_out <= 1;
address <= addr_from_cu;
data <= data_from_cu;
mas_i <= mas;
rw_i <= rw_i;
lock_i <= lock;
complete <= 0;
berr_sender<= 0;
BUS_LOCKED
WRITE_INIT
READ_INIT
ad_strobe_out <= 1;
address<=addr_from_CU;
data<=data_from_CU;
mas_i<=mas;
rw_i<=WRITE;
ad_strobe_out <= 1;
address<=addr_from_CU;
data<=(others=>Z);
mas_i<=mas;
rw_i<=READ;
berr_i = 1
berr_sender <= 1;
bus_request <= 0;
berr_i = 1
berr_sender <= 1;
bus_request <= 0;
SET_BERR_SENDER
Figure 34:

IDLE Waiting for actions from Control Unit

ARBITRATE R Waiting for bus grant before starting a read access.
ARBITRATE W Waiting for bus grant before starting a write access.
READ INIT Waiting for access acknowledge for the read access.
WRITE INIT Waiting for access acknowledge for the write access.
COMPLETE STATE Access nished. Three state all shared signals on the interconnect.
BUS LOCKED The CPU has requested an locked access. No release of bus.
2.5 Receiver
The receiver is implemented with one synchronous FSM and parallel VHDL for detecting node accesses
and controlling byte lanes to memory wrapper. The FSM in gure 35 is brie y described below.
RECEIVER IDLE Listening for accesses on the interconnect whose addresses matches this nodes
ID-eld.
2.5 Receiver
29
SET_BUSERROR
berr_i <= Z;
data_NI <= (others => Z);
ack <= Z;
berr_i <= Z;
remote_lock <= 0;
if NODE_access AND mas_i = RESERVED

berr_i<=1;
ack <= 1;
local_lock=1
local_lock =0 and WRITE_req=1

data_NI <= data;
ack <= 1;
cs_NI <= 1;
berr_i<= 0;
local_lock = 0 and READ_req=1 and nWAIT_access = 1

berr <= 1;
ack <= 1;
RECEIVER_IDLE
ack <= 0;
berr_i <= 0;
data_NI<=(others =>Z);
cs_NI <= 0;
local_lock = 0 and
lock_i = 0
data<=(others =>Z);
ack <= 0;
berr_i <= 0;
local_lock = 0 and WRITE_req=1

and nWAIT_access=1
nWaitAlways <= data(0);
berr_i <= 0;
ack <= 1;
WRITE_TO_LOCAL
local_lock = 0 and READ_req=1

address_NI <= address;
local_lock=0 and WRITE_req=1
cs_NI <= 1;
berr_i <= 0;
data_NI <= data;
data<=data_NI;
ack <= 1;
ack <= 1;
cs_NI <= 1;
berr_i <= 0;
local_lock =0 and WRITE_req=1

data_NI <= data; lock=1
ack <= 1;
ack <= 0;
cs_NI <= 1;
berr_i <= 0;
berr_i<= 0;
cs_NI <= 0;
MEM_LOCKED
if NODE_access AND mas_i = RESERVED

berr_i<=1;
ack <= 1;
THREE_STATE_DATA
cs_NI <= 0;
READ_FROM_LOCAL
local_lock = 0 and READ_req=1

cs_NI <= 1;
berr_i <= 0;
lock_i=1
data<=(others =>Z);
ack <= 0;
berr_i <= 0;
Figure 35:

WRITE TO LOCAL Write access detected, setting strobes to memory wrapper to perform a write.

MEM LOCKED Locked transaction. Local memory locked by a remote node.
READ FROM LOCAL Read access detected, setting strobes to memory wrapper to perform a
read.
THREE STATE DATA Read access has been performed. Data bus on the interconnect is threestated.
SET BUSERROR Invalid access, for example invalid access size.
3 ARBITER
30
3 Arbiter
The Arbiter unit shares resources among several potential busmasters in a round-robin fashion. The
implementation is done generically, so it can be congured to handle arbitrary number of nodes. It is
implemented as an asynchronous unit.
3.1 Electrical Interface

Each node in the system has a request and response line to/from the arbiter unit (gure 36). Arbitration
begins when a node issues a request. The arbiter unit then looks at the internal queue, and decides if
it will give the requesting node mastership or to insert it last in the internal FIFO-queue of requesting
nodes.
grant(NOF_NODES1)
request(NOF_NODES1)
Arbiter Unit
grant(1)
request(1)
grant(0)
request(0)
Figure 36: The electrical interface of the arbiter unit.
3.2 Operations
The arbiter uses an internal FIFO-queue to store pending requests in a round-robin fashion. If a node
does a request, and no one is currently master, then it gets mastership until it lowers its request line.
If there were several pending nodes at the time of request, the requesting node will be inserted last in
the FIFO-queue. IF several nodes does a request simultaneously, the node with the lowest ID is inserted
in the queue rst, then the node with second lowest ID, and so on... When a current master lowers its
request line it will automatically pass on its mastership to the node that is rst in the FIFO-queue.

No explicit demands has been stated other then that it should be fair and reasonably fast (respond within
one tick).
3.2.2 Testing
The arbiter unit has been tested completely with one and two nodes, and more randomly with three
nodes. Each test has been done with do-les.
4 COMPILING & LINKING THE SYSTEM SOFTWARE
31
4 Compiling & Linking the System Software

The SoCrates platform is a 32-bit multiprocessor system where each node has its own local address space.
The local address space is addressed with the lower 24 bits of an 32-bit address. The rest of the bits (31
to 24) is an one-hot coded identication eld that separates the local address spaces from eachother 10 .
This separation of address spaces puts constraints on how the user application is compiled and linked.
Instead of letting the compiler compile and link the software, each code, data, and stack segment need to
be mapped to specic address spaces. This is accomplished by assigning code, data, and stack segments
to sections that are mapped to address spaces specied by linker scripts. This means that an executable
le is generated by several phases: compilation of user applications and linking them with precompiled
libraries. The executable le needs to be further processed by programs that adds information that is
needed by the SoCrates system. This makes a total of six phases before a executable le has been created
that it can be dowloaded onto the SoCrates hardware (gure 37).
NodeX.c
Phase 1
armcoffgcc c nodeX.c
NodeX.o
Phase 2
IO.o Ose_ker.o crt0_nodeX.o NodeX.map
armcoffld o nodeX.srec nodeX.o ...

NodeX.srec
Phase 3
srec_cat nodeX.srec > nodeX.wsrec

NodeX.wsrec
Phase 4
srec_cat nodeX.wsrec nodeY.wsrec ... > nodes.srec

Nodes.srec
Phase 5
sed /S5/d nodes.srec > nodes.noS5

sed /S7/d nodes.noS5 > nodes.noS5S7
Nodes.noS5S7
Phase 6
cat nodes.noS5S7 nWaitAlways.SREC > nodes.final

Nodes.final
Figure 37: The phases needed to create an executable le in the SoCrates system.
4.1 Description of the Phases

In order to create an executable le for the SoCrates system a compiler and linker is needed. The
programs used in this project are gcc and ld from Free Software Foundation (FSF) 11 . Also, the compiler
and linker need to be congured for an ARM architecture because the SoCrates system uses ARM cores
in the processor nodes 12 . The next phase after conguring the compiler is to start compiling the user
10 The exact conguration of the local memory in the SoCrates system can be viewed in the chapter 1 of the SoCrates
-Specicatios Revision 0.99 document
11 more information about gcc and ld can be found at http://www.gnu.org
12 see http://gcc.gnu.org for documentation on how to create a crosscompiler
32
applications. Usually, the code for the threads on a node can be implemented in a single le. An example
of a typical user application can be seen below.
#include "Ose_ker.h"
#include "io.h"
int running_thread
__attribute__ ((section (".shared"))) = 0;
#define STACK_SIZE
100
char ctrl_stack[STACK_SIZE];
void ctrl_thread(void)
{
while(1)
{
outstring("C\n");
}
}
int thread_main(void)
{
uart_setup(NO_COM_INT);
ose_init(1);
ose_thread_create(1, 1, ose_READY, 0, ctrl_thread, ctrl_stack, STACK_SIZE, 1);
on_tsw();
}
while(1) { }
The application shows how a main function ( thread main) calls Real Time Unit (RTU) functions that
initializes and creates threads for scheduling during execution. The threads also call io-functions that
makes it possible to send messages to a terminal. The supporting libraries are precompiled and can
be used by including their header le (Ose ker.h and io.h). A thread may need to communicate with
another thread(s) that resides on a dierent node. A common way of communication is to use variables
or semaphores. In order to make the variables \visible" to both threads, they need to be placed in an predened section that every node can access. With the attribute directive, one can assign variables and
functions to special secitions that can later be placed in the local memory by the linker script (more information about the attribute directive can be obtained from the gcc manual at http://www.gnu.org).
The user application is simply compiled in the following way (phase 1 of gure 37). The -c ag tells the
compiler only to compile and not to link the application. The output from this compilation is a object
le. Every node will produce its own object le that will be linked together in the next phase.
In the next phase, the precompiled libraries (or object les) are linked together with the user application
to produce an executable le. The precompiled les are:
Filename
io.o
Ose_ker.o
crt0_nodeX.o
Routines
uart_setup, outchar, getchar, getstring, outstring
init, create_thread, on_tsw, etc...
C runtime 0 functions for node X + taskswitch code
These les are linked together with the object les of user applications to produce a executable le in
srecord format (phase 2 of gure 37). During linking phase, the linker needs a script that shows where
33
all sections are placed in the local memory of each node. The following le is a typical linker script that
maps user code and data, RTU routines, startup routines, and IO routines to dierent memory areas.
OUTPUT_FORMAT("coff-arm-little");
ENTRY(_thread_main)
MEMORY
{
reset_vector
undef_vector
swi_vector
pref_abort
data_abort
reserved
irq_vector
fiq_vector
common_vars
rtu
boot
tsw_code
swi
code
data
shared_at_node1
ni_registers
io_registers
}
SECTIONS
{
.common
.reset
.swi
.irq
.tsw
.rtu
.code
.data
.bss
.rdata
.shared
.init
.io_regs1
.io_regs2
.io_regs3
.io_regs4
.io_code
}
(NOLOAD)
(NOLOAD)
(NOLOAD)
(NOLOAD)
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
{
{
{
{
{
{
{
{
{
{
{
{
{
{
{
{
{
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
0x01000000,
0x01000004,
0x01000008,
0x0100000c,
0x01000010,
0x01000014,
0x01000018,
0x0100001c,
0x01000020,
0x01000824,
0x01000e28,
0x01000f2c,
0x01001030,
0x01001134,
0x01001638,
0x01001b3c,
0x01002000,
0x80000000,
*(COMMON) }
*(.reset) }
*(.swi) }
*(.irq) }
*(.tswitch) }
Ose_ker.o }
*(.text) }
*(.data) }
*(.bss) }
*(.rdata) }
*(.shared) }
*(.init) }
*(.io_reg1) }
*(.io_reg2) }
*(.io_reg3) }
*(.io_reg4) }
io.o(.text) }
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
0x4
0x4
0x4
0x4
0x4
0x4
0x4
0x4
0x800
0x600
0x100
0x100
0x100
0x500
0x500
0x100
0x100
0x10
common_vars
reset_vector
swi_vector
irq_vector
tsw_code
rtu
code
data
data
data
shared_at_node1
boot
io_registers
io_registers
io_registers
io_registers
code
The script begins with the directive OUTPUT FORMAT which tells what the output format should be
from the linker. There are several options available, and they dier depending on which architecture the
linker is congured for. When executing objdump -i, the program will provide information about which
output formats are avaiable. This particular script uses the directive \arm-co-little" which produces a
34
binary le in little endian format. This binary le can later be converted to srecord format with the objcpy
command. Another way is to use \arm-srec" as output format directive. The next directive informs the
linker where the start function is when not using starting from a main function. We chose to always start
our applications from the function thread main and will therefore specify thread main in the ENTRY
directive.
The next directive is a MEMORY body that provides the linker with information about where each
section should reside in the local address space. Here's how a general section could be described:
section_name : org = start_address, LENGTH = area_size
The author of the linker script chooses what the section name, start address, and area size should be.
After the MEMORY body, one must connect the sections produced by the linker to the user dened
sections. This is done in the SECTIONS body, where a general line is written in the following way:
.real_section_name : { chosen_sections } > mapped_section
The real section name species which section is addressed. This could be user dened sections with the
attribute directive or the linker specied sections (e.g .text, .data, .bss, etc). The chosen section eld
describes which of the sections are chosen (in case there are several sections with the same name). For
instance, it is possible to map all the .text sections to a single section by specifying the chosen section eld
as *(.text). The last eld, mapped section, species to which user-dened section this section should be
mapped to (which means that mapped section must be mapped to some section name in the MEMORY
body). The linker script makes it possible to map all the sections according to the SoCrates memory structure (more information on where each section is mapped can be obtained by looking at the test application
linker scripts, node1.x and node2.x, or the System chapther in the SoCrates specication). Also, it is possible to force the linker to report the result of the mapping by specifying the -Map ag. There are many more
options available when writing linker scripts, but the aim of this section is not to describe all the possible
script directives. Instead, the interested reader can nd more information about how to write linker scripts
at http://www.cygnus.com/pubs/gnupro/5 ut/b Usingld/ldLinker scripts.html#Concepts. The linker produces an output le with the name specied after the -o directive.
The executable le must still go through some changes before it can be used by the SoCrates hardware. First, The linker produces a byte-oriented srecord le, which means that each srecord line can have
an non-word sized number of bytes. The IO node in the SoCrates system handles downloading of the
executable le, wants word-sized srecord lines. To convert byte-sized srecords to word-sized srecords, we
use the SRECORD package, which is a collection of programs that can manipulate srecord les. One of
the programs in the package, srec cat, takes a srecord le as input and produces a word-sized srecord le
(phase 3 in gure 37) 13 .
The next phase simply concatenates each nodes word sized srecord les onto one srecord le that can
be downloaded by the IO node (phase 4 in gure 37). In the next phase, the srecord le is stripped
from srecord lines that do not contain data (srecord lines of type S5 and S7). This is accomplished by
letting sed delete the unwanted lines (phase 5 in gure 37). Finally, one srecord line is added to the nal
executable le. This line allows (when it is written to memory) the initially stalled processors to begin
their execution of the downloaded srecord lines (the code and data) (phase 6 in gure 37). The nal
product is an executable le in srecord format that the IO node can download to each processor.
13 More
information on the Srecord package can be obtained from http://www.canb.auug.org.au/ millerp/
5 CONFIGURING THE SOCRATES PLATFORM
35
5 Conguring the Socrates Platform

5.1 System setup
Under development of Socrates the team used CVS to make sharing of les easy. Before checking out
the needed les the environment variable that points out the le repository must be set. Also, an environment variable that points the path to the implementation directory is needed by some les. Add the
rows below to .bashrc or type them at the command line.
export CVSROOT=/proj/socrates/cvs-rep
export SOCRATES implementation=[path to your implementation directory]
Go to the directory where you wish to put the whole system and enter:
cvs checkout implementation
You will now have a directory structure as shown in gure 38. Some additional les exist which are
not included in the directory structure here.
5.2 Conguring HW-platform

The system is nearly generic, which means that the user of the RTL-code can easily aect system parameters, as number of CPU-nodes, node-identity of RTU, IO-node, and the CPU-nodes, and base-addresses.
Today it is possible to generate a arbitrary number of CPU-nodes in the system but there is still some
work to do in this area. For example, the RTU is still static and is for example not adapted for a specic
number of CPU-nodes. The system parameters are set in the le system generics.vhd which is located in
the SYSTEM directory and further information can be found in comments in the le. It would be nice
in the future to have a graphical environment for conguring the Socrates platform.
5.3 Simulation
RTL-simulation can be done by using modelsim. To compile the system go to the implementation
directory and type: compile. Compile the SW as described in section 4 and make test-bench-readable
data of the software by typing loadsystem in the arm-code directory. Terminal emulation can be done
to see the output from the IO-node. To start the terminal emulators go the the arm-code directory and
write disp. Now two Unix-terminal that emulates VT100 should emerge. These terminals are connected
to the two UARTs in the IO-node. Start modelsim (or any other RTL-simulator) and load the entity
SUPERBENCH RTU to simulate the whole system.
5.4 Synthesis
Synthesis time is about 2 hours for Leonardo with optimization and half-an out for place and route14 .
A complete system including two CPU-nodes, RTU, I/O-node and arbiter occupies only 58% of the
VIRTEX 1000 FPGA. Pin placement is controlled by editing the Socrates.ucf le.
14 Using
a Sun Enterprise 450 or equivalent
5.4 Synthesis
36
SYSTEM_RTU.vhd
CPU_NODES.vhd
SUPERBENCH_RTU.vhd
system_generics.vhd
generic_or.vhd
terminal1.vhd
terminal2.vhd
terminal1.so
terminal2.so
terminal1.c
terminal2.c
socrates.ucf
modelsim.ini
Makefile
-top entity of the SYSTEM

-generates a number of CPU-nodes
-instantiates system+testbench for simulation
-package for system parameters
-generic or-grind (used for signal on the interconnect)
-entity + c-interface for a VT100 terminal for node1
-entity + c-interface for a VT100 terminal for node2
-terminal emulation for node1 (object code)
-terminal emulation for node2 (object code)
-code for terminal at node1
-code for terminal at node2
-pin placement for VIRTEX 1000 FPGA
-settings for modelsim
-compilation of VHDL-files for simulation
node1.c
node2.c
crt0_node1.S
crt0_node2.S
node1.x
node2.x
Ose_ker.c
Ose_ker.h
Ose_io.h
Ose_tcb.h
my_terminal.c
io.h
io.o
io.c
nWaitAlways.SREC
socload*
loadsystem*
srec_cat*
myterm*
disp*
Makefile
-Application for node1

-Application for node2
-Start up and task-switch for node1
-Start up and task-switch for node2
-memory map settings for node1
-memory map settings for node2
-Software interface to RTU
-Header file for Ose_ker.c
-Definitions for RTU interface
-TCB description
-code for creating a unix-terminal
-I/O routine declarations and definitions
-compiled I/O library
-I/O routines for UARTs
-SRECORD row that trigs 2-nodes
-creates test.bin from a S-rec input
-creates a downloadable file
-SRECORD concatenator and aligner
-creates an unixterminal (separate process)
-creates two VT100 terminal emulators
-compiles VHDL files for simulation
cpu_node/
CPU_NODE.vhd
Makefile
-port maps components to a CPU-node

RTU/
rtu_node_top.vhd
-VHDL source for RTU generated by RENOIR
SYSTEM
arm-code/
implementation/
src/
wrapper/
interconnect/
cpu-node/
CPU/
wr_sync_dpram_32.vhd
dpram2048x32_sim_distram.vhd*
dpram2048x32.vhd*
dpram2048x8.vhd*
Makefile
-8 kbit DPram (4 byte lanes)

wrapper entity for simulation (instantiates a wr_sync_ram)
-wrapper entity for synthesis (instantiates 4 bytelanes)
-an empty entity
arbiter.vhd
NI.vhd
address_decoder.vhd
prefetch_buffer.vhd
control_unit.vhd
receiver.vhd
sender.vhd
README
Makefile
-source for arbiter

-top entity for NI
-component of NI
-component of NI
-component of NI
-component of NI
-description of files in this directory
CPU_NODE.vhd
cc.do
Makefile
-top entity for the CPU-node. Connects internal components.

-compiles files needed for a CPU-node
CPU.vhd
-top entity for the BLACKARM cpu. Connects internal components.

-component of CPU
-component of CPU
-component of CPU
alu.vhd
barrel.vhd
exception.vhd
inc4.vhd
pipe_compensator.vhd
reg_bank.vhd
trippelport.vhd
wr_sync_dpram.vhd
Makefile
-component of CPU
-component of CPU
-component of CPU
-component of CPU
-component of CPU
sim.do*
socload/
socload.c
slsim/
IO/
socport/
src/
src/
io.vhd
uart.vhd
-converts a SRECORD to parallelport data
slsim*
slsim.c
-converts parallelport format to testbench readable format

-source fiel to socload sim
control.vhd
parstim.vhd
piso.vhd
promrdr.vhd
socport.vhd
soctest.vhd
-submodule of io-node
-io-node top entity, instantiates uarts

-serial sender/reciever to a VT100 terminal
Filerna.txt*
Figure 38: Directory structure and le descriptions for the Socrates Platform
6 CURRENT RESULTS
37
SYSTEM
CPU
MEMORY
CPU-NODE
RTU
NI
SHARED BUS
ARBITER
UART 2
MEMORY
UART 1
I/O-NODE
NI
CPU
CPU-NODE
Figure 39: The generated hardware architecture.
6 Current Results
The goal from the beginning of the Socrates project was to have a multiprocessor system on a single FPGA
running a real-time demo application communicating with a host via VT100-terminals. However, the lack
of time and problems with integration of the real-time kernel due to its compiler specic15 API, delayed
the actual integration, verication and software development by two weeks. Also, the lack of details in
the RTU-documentation increased the system debugging time. This resulted that our ambitious goals
had to be re-visioned to only include RTL-simulation of the system since we kept it important to have
the RTU integrated.
6.1 Developed hardware

Within this thesis we have developed a CPU-clone of the popular ARM7TDMI that executes a subset of
the instruction set supported by the original processor. We added an unique prefetch instruction to enable
use of compilers that supports data prefetching. Also, the pipeline is removed to increase predictability.
Further a Network Interface responsible for on-chip system communication and prefetch functionality was
developed. A CPU-clone connected to the Network Interface and dual-ported block ram constitutes a
CPU-node. These nodes are connected onto a shared bus mastered by a central arbiter unit. An I/O-unit
connected to the bus handles external communication and system download and startup. All these parts
are individually veried with RTL-simulation. An overview of the developed system hardware can be
seen in gure 39.
6.2 System software

To hide hardware details from the user, system software is necessary. By having ready-to-use routines
interfacing the actual hardware the user can concentrate the programming at a higher abstraction level.
In this subsection we describe those features except the RTU routines which are documented in Booster
RTU - Hardware Functional Specication.
15 Microtec
6.3 Linker scripts
38
6.3 Linker scripts

Since each processor has its own local memory, there is need to map the application to dierent memory
locations. To accomplish this linker scripts were developed (see Appendix D: Linker scripts or section
Compiling & Linking the System Software for further details).
6.3.1 Task-switch routine

This routine written in assembler performs a task switch either initiated by the Real-Time kernel or by
the user thread. Every processor has its own task switch in local memory. The ID of the next thread to
execute is fetched from the Real-Time Kernel unit using a well dened protocol. For further details, read
Appendix C: Task switch routines.
6.3.2 I/O primitives

To enable external communication text management primitives has been created to hide the UART
programming details from the application programmer. These routines are:
uart setup() Initiate the control register.

outchar() Send a character to a VT100 terminal.
outstring() Send a string to a VT100 terminal.
getchar() Get a character from user.
getstring() Get a string from user.
For further details, see Appendix B: I/O routines.
6.4 Demo application

All applications running on the Socrates platform utilizes thread-level parallelism in order to perform
its task. The demo-application made to validate some of the system functionality was divided into
threads to be representative for a more complex application. Our test program consists of three threads
mapped onto two CPU-nodes. A control thread is placed on CPU1 in the system whereas thread 2
and 3 are located on CPU2 for timesharing execution. The thread consists of a innite loop that prints
the string \CONTROL" to the simulated16 VT100 text terminal1. The control thread was intended to
have supervision of the other two threads, but lack of time made this goal unachievable. On the other
processor the two threads prints out a string \THREAD n" n=f1,2g to VT100 text terminal2. After
the string has been written the thread suspends it self by performing the yield system call causing the
task-switch routine on that processor to be avowed. An illustration of the simulation environment used
to test the system and the demo application can be viewed in gure 40. For complete code listing, please
read Appendix A: Demo Application.
6.5 RTL-simulation
The whole system was simulated at the RTL-level, using MentorGraphicsT M ModelSim. The demoapplication was compiled, linked with GCC cross compiler for the ARM architecture and downloaded
to a memory-model via the IO-node. It was successfully executed writing messages to the VT100 text
terminals performing task-switches for 7 seconds of simulated time lasting over a couple of days. Figure
39 shows a schematic picture of the simulated HW-components.
16 A
C add-on to Modelsim environment
6.6 Synthesis Results
39
Simulated VT100 terminals
Host Software
THREAD 1
THREAD 2
THREAD 1
THREAD 2
.
.
.
THREAD 2
THREAD 1
CONTROL
CONTROL
CONTROL
.
.
.
.
CONTROL
CPU 1
CPU 2
Thread 1
Threrad 0
Thread 2
RTU
Simulated Hardware
Figure 40: The test environment, simulated hardware with software threads executing on dierent processors interacting with the RTU and printing strings to the simulated software host.
6.6 Synthesis Results

The prototype (which can be seen in gure 39) with 2 CPU-nodes, RTU, IO-node and arbiter was
synthesized using MentorGraphicsT M Leonardo. The target technology is a XILINX VIRTEX 1000
FPGA. The synthesis was optimized for area. Results for the dierent components can be viewed in the
table below:
Unit
SYSTEM
CPU-NODE
CPU
RTU
NI
IO
ARBITER
Gates
653,349
41,929
33,834
5,873
6,625
0,271
Slices17
7,243
2,414
1,925
0,352
0,318
0,020
Percent of total slices

58 %
19 %
15 %
2%
2%
1%
Maximum frequency
14.7 MHZ
18 MHz
29.6 MHz
50 MHz
64.9 MHz
-
7 FUTURE WORK
40
7 Future work
The objective of the implementation phase of this master thesis has been to specify, develop and implement
the rst prototype of the Socrates scalable system on a chip. A processor clone able to execute a subset of
the ARM7TDMI instruction set has been developed. The processor clone also features an addition to the
original instruction set, a prefetch instruction has been designed and implemented. A network interface,
handling internal as well as external bus transactions is also developed. The system also features a Realtime Unit RTU capable of handling several processors. The interface between hardware and software is
handled by kernel routines. For software development a cross compiler18 has been used, which will enable
us to create software in a serious manner, it also gives us the possibility to reuse other peoples c-code.
The linker scripts for linking the code on to the system is also developed and in use. However as rst
stated this is the very rst prototype and things do not become perfect at once. This also apllies to this
prototype, both on a system and architectural level as well as in component design and implementation.
The following sections will deal with the eventual short comings exist and how to solve them.
7.1 CPU improvements

The main reason for choosing a ARM processor for the Socrates project was it's wide industrial usage.
During the development of the ARM-clone we discovered that the processor was immense in it's complexity. This complexity makes the processor hard to fully understand therefore implicitly not very easy
to implement and verify. The question that arises is whether we shall continue to use this clone, buy a
real ARM processor or search for other alternatives. If the project sta decides to keep the ARM-clone,
improvements and eorts to increase the compatibility with the real ARM processor must be done. The
remainder of the section is dedicated to explore some improvement areas of the processor clone.
7.1.1 Increase compatibility

The clone described in this document only have the ability to execute a subset of the original ARM7TDMI
instruction set. Those instruction that are implemented might not be sucient for all code the compiler
produce. In order to be sure that all compiled code is executable on the clone all of the instructions must
be implemented. This the rst version of the clone do not support all possible modes the original does,
which also might be necessary to support, motivated by the reuse of already existing applications which
otherwise might be reduced, or even none at all.
7.1.2 Register le
The use of the sysa 1 dpram module might save us some CLBs on the FPGA, but the module is inecient
when it comes to speed. There must exist other modules worth examine and evaluate if they can be used
instead of the module currently in use. If no better alternative can be found the only reaming option is
to design an new ram module within the project, in vhdl or layout. For the instruction19 that need to
obtain three register at once, the number of ports needs to be increased.
7.1.3 Control unit

The control unit as mentioned earlier in this document may not be implemented in the most eective
way. A better approach would be to divide the unit in to two processes, where one of them synchronously
handles state transitions. the other process asynchronously assigns the out signals depending on the
actual state the processor currently resides in.
18 A c-compiler, gcc.
19 Data processing with
operand 2 as register register.
41

The prefetch functionality in this the rst version is restricted to only prefetch data of one word20 . For
this kind of data prefetch to become useful, the amount of data that is prefetched must be increased
and in some way controlled by the prefetch instruction. This improvement will aect both the processor
and the network interface. Today the prefetch instruction is handled by the processor as any other
instruction. The prefetch instruction could be managed by a prefetch unit snooping the instruction
stream, re leaving the processor from handling that extra instruction. A device like that would also be
preferable if a commercial processor is to be used in the future. Why restrict the prefetch to data, when
instructions can be prefetched as well? Let the system prefetch the next task to be executed, or prepare
for execution of the next task by starting to prefetch data for that task. Allow several prefetches to be
issued concurrently.
7.3 Allow process migration

In this Socrates prototype tasks are restricted only to execute on the home node, no migration of processes
between the processors is possible. In order to build a highly dynamic platform for real time systems
process migration can be highly desirable.
20 Four
bytes for the ARM processor.
Conclusions
We have shown that it is possible to develop (including a whole CPU core) a multiprocessor SoC that
ts on a single FPGA in very short time. By placing memory on-chip and reducing complexity in the
CPU core to gain system predictability, we take advantage of the benets that come with developing
systems on a single chip. When working closely in a small group of 3-5 people, information, new ideas,
and bug-reports during verication, are propagated immediately. A platform concept gives rapid design
time and keeps the designers focused on debugging the application and not the platform itself. One of
the biggest challenges remaining to be solved is the actual verication of the platform because the SOCs
all have embedded components that are hard to debug. Another issue is how to make software development easier to integrate together with the platform. Although our ambitions was to have a complete
mulitprocessor system running on a FPGA the nal stage of the master thesis ended at RTL-simulation
of a demo-application running on the target system.
SoCrates - Appendix
CONTENTS
II
Contents
1
Appendix A: Demo Application
Appendix B: I/O Routines
Appendix C: Task swich routines
Appendix D: Linker scripts
1.1 Application code for node No. 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2 Application code for node No. 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1 I/O header le . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 I/O source code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.1 Task switch routine for node No. 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 Task switch routine for node No. 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
1
2
3
4
6
8
4.1 Linker script for node No. 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.2 Linker script for node No. 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
9
11
Appendix E: Date 2000 Conference, Designers Forum, publication
13
APPENDIX A: DEMO APPLICATION
Appendix A: Demo Application
The demo application consists of two source les: node1.c and node2.c. The rst node (node1) creates
and executes a control thread that simply prints out messages to the terminal. The other node (node2)
creates and executes two threads where each thread prints a message and lets the other thread take over
the execution. Every node has a thread main function that creates and starts up threads.
1.1 Application code for node No. 1

/************************************************************************/
/*
File
: node1.c
*/
/*
Authors
: Mladen Nikitovic, Raimo Haukilahti, Mikael Collin */
/*
Date
:
*/
/*
Description : Application code for node 1
*/
/************************************************************************/
#include "io.h"
int running_thread
int busy_thread

#define CTRL_ID
#define CTRL_PRIO
#define STACK_SIZE
1
1
100
char ctrl_stack[STACK_SIZE];
void ctrl_thread(void)
{
while(1)
{
outstring("Control \n");
}
}
{
ose_init(1);
ose_thread_create(1, 1, ose_READY, 0, ctrl_thread, ctrl_stack, STACK_SIZE,
1);
on_tsw();
while(1) {
}
1.2
Application code for node No. 2
1.2 Application code for node No. 2

/************************************************************************/
/*
File
: node2.c
*/
/*
Authors
: Mladen Nikitovic, Raimo Haukilahti, Mikael Collin */
/*
Date
:
*/
/*
Description : Application code for node 2
*/
/************************************************************************/
#include "io.h"
int running_thread __attribute__ ((section (".shared")))=0;
int busy_thread __attribute__ ((section (".shared")))=0;
#define STACK_SIZE 100
char t1_stack[STACK_SIZE];
char t2_stack[STACK_SIZE];
void thread_1(void)
{
while(1) {
outstring("task 1\n");
ose_thread_yield();
}
}
void thread_2(void)
{
while(1) {
outstring("task 2\n");
ose_thread_yield();
}
}
{
ose_init(2);
ose_thread_create(2, 0, ose_READY, 0, thread_1, t1_stack, STACK_SIZE,2);
ose_thread_create(3, 0, ose_READY, 0, thread_2, t2_stack, STACK_SIZE,2);
on_tsw();
while(1) {
}
APPENDIX B: I/O ROUTINES
Appendix B: I/O Routines
The I/O routines are communicating with an UART (9600 baud) in polled mode. This means that each
character is received or transmitted by reading/writing to the UART control registers. Each node that
need to write messages to a terminal need to initialize its UART with uart setup. After initialization,
sending or receiving messages can be done with one of the follwing routines listed in the header le.
2.1 I/O header le

/************************************************************************/
/*
File
: io.h
*/
/*
Author(s)
: Mladen Nikitovic
*/
/*
Date
:
*/
/*
Description : I/O routine(s) header file
*/
/************************************************************************/
extern
extern
extern
extern
unsigned
unsigned
unsigned
unsigned
#define
#define
#define
#define
#define
extern
*/
extern
extern
extern
extern
char
char
char
char
UART_TX;
UART_RX;
UART_CR;
UART_SRG;
COM1_INT_REC
COM2_INT_REC
COM1_INT_SEND
COM2_INT_SEND
NO_COM_INT
1
16
2
32
0
/* UART registers */
/* Control register constants */
void uart_setup(unsigned char setup);

void outchar(unsigned char);
void outstring(unsigned char*);
unsigned char getchar(void);
void getstring(unsigned char*);
/* I/O routine declarations
2.2
I/O source code
2.2 I/O source code

/************************************************************************/
/*
File
: io.c
*/
/*
Author(s)
: Mladen Nikitovic
*/
/*
Date
:
*/
/*
Description : I/O routine(s) source file
*/
/************************************************************************/
unsigned
unsigned
unsigned
unsigned
char
char
char
char
UART_TX
UART_RX
UART_CR
UART_SRG
__attribute__
__attribute__
__attribute__
__attribute__
((section(".io_reg1")));
#define ASCII_CR 13
void uart_setup(unsigned char setup)
{
UART_CR = setup;
}
void outchar(unsigned char c)
{
while( (UART_SRG & (unsigned char)32) != 32)
;
}
UART_TX = c;
void outstring(unsigned char *ptr)

{
while(*(ptr) != '\0')
{
while( (UART_SRG & (unsigned char) 32) != 32)
;
UART_TX = *(ptr++);
unsigned char getchar(void)

{
while( (UART_SRG & (unsigned char)16) != 16)
;
}
return UART_RX;
void getstring(unsigned char *ptr)

{
do
{
2.2
I/O source code
while( (UART_SRG & (unsigned char) 16) != 16)

;
*(ptr++) = UART_RX;
} while(UART_RX != ASCII_CR);
}
*ptr = (unsigned char) '\0';
APPENDIX C: TASK SWICH ROUTINES
Appendix C: Task swich routines
The task switch routine must save the context of the current running thread, then fetch the identiaction
of the next thread, and nally restore the context of the new thread. The dierence between switching
threads at node 1 and node 2 is where the EXECTHREAD is located in memory.
3.1 Task switch routine for node No. 1

/************************************************************************/
/*
File
: task_sw1.S
*/
/*
Authors
: Mikael Collin, Raimo Haukilahti
*/
/*
Date
:
*/
/*
Description : Task switch routine for node 1
*/
/************************************************************************/
.section .tswitch
.code 32
.align 0
.extern _thread_main
.extern _ose_get_next_thread
.extern _task_sw
_task_sw:
/* NOTE: here we are in SUPERVISOR MODE, irq disabled (HW) */
str LR, [SP, #-4]! /* store supervisor_LR on stack (r14_svc==PC) */
mov LR,#4352
/* Move address to EXECTHREAD to supervisor_LR */
add LR,LR,#32
/* (2 op-codes required) */
add LR,LR,#16777216 /* Add ID address to EXECTHREAD to supervisor_LR*/
ldr LR,[LR]
/* Get the pointer at EXECTHREAD to svc_LR */
add LR,LR, #20
/*** SAVE CONTEXT ***/
stmia LR,{R0-R14}^
/* Write R0-R14 to the TCB of EXECUTING thread */
mrs R1,spsr
/* restore old processor status */
str R1,[LR, #80]
/* Saves old CPSR in TCB */
ldr R1,[SP, #4]!
/* Read old PC from stack */
str R1,[LR, #84]
/* store old PC in TCB */
/*** context saved ***/
bl _ose_get_next_thread
/* EXECTHREAD is updated from RTU-reg */
/*** RESTORE NEW THREADS CONTEXT ***/

/* move the NEW EXECTHREAD to LR */
mov LR,#4352
add LR,LR,#32
ldr LR,[LR]
add LR,LR, #20
/* the address to the next running threads TCB is now in LR */
ldr R1,[LR, #80]
/* save the new status reg in R1 */
3.1
Task switch routine for node No. 1
msr spsr, R1
/* save the status reg
ldmia LR,{R0-R14}^
/* restore registers
ldr LR, [LR, #84] /* restore pc from tcb
adds PC, PC, #0
/* Force mode change,
in saved status reg

from tcb */
(to LR) */
SUPER->USER mode */
*/
3.2
Task switch routine for node No. 2
3.2 Task switch routine for node No. 2

/************************************************************************/
/*
File
: task_sw2.S
*/
/*
Authors
: Mikael Collin, Raimo Haukilahti
*/
/*
Date
:
*/
/*
Description : Task switch routine for node 2
*/
/************************************************************************/
.section .tswitch
.code 32
.align 0
.extern _thread_main
.extern _ose_get_next_thread
_task_sw:
/* NOTE: here we are in SUPERVISOR MODE, irq disabled (HW) */
str LR, [SP, #-4]! /* store supervisor_LR on stack (r14_svc==PC) */
mov LR,#4352
add LR,LR,#32
ldr LR,[LR]
add LR,LR, #20
/*** SAVE CONTEXT ***/
stmia LR,{R0-R14}^
/* Write R0-R14 to the TCB of EXECUTING thread */
mrs R1,spsr
/* restore old processor status */
str R1,[LR, #80]
/* Saves old CPSR in TCB */
ldr R1,[SP, #4]!
/* Read old PC from stack */
str R1,[LR, #84]
/* store old PC in TCB */
/*** context saved ***/
bl _ose_get_next_thread /* EXECTHREAD is updated from RTU-reg */
/*** RESTORE NEW THREADS CONTEXT ***/
/* move the NEW EXECTHREAD to LR */
mov LR,#4352
add LR,LR,#32
ldr LR,[LR]
add LR,LR, #20
/* the address to the next running threads
ldr R1,[LR, #80]
/* save the new status
msr spsr, R1
/* save the status reg
ldmia LR,{R0-R14}^
/* restore registers
ldr LR, [LR, #84] /* restore pc from tcb
adds PC, PC, #0
/* Force mode change,
TCB is now in LR */
reg in R1 */
in saved status reg
from tcb */
(to LR) */
SUPER->USER mode */
*/
APPENDIX D: LINKER SCRIPTS
Appendix D: Linker scripts
A detailed explanation of the meaning of each row in the linker script can be found in the Compiling
Linking the System Software section.
4.1 Linker script for node No. 1

/************************************************************************/
/*
File
: node1.x
*/
/*
Authors
: Raimo Haukilahti, Mladen Nikitovic
*/
/*
Date
:
*/
/*
Description : Linker script for node No. 1
*/
/************************************************************************/
ENTRY(_thread_main)
MEMORY
{
reset_vector
undef_vector
swi_vector
pref_abort
data_abort
reserved
irq_vector
fiq_vector
common_vars
rtu
boot
tsw_code
swi
code
data
shared_at_node1
ni_registers
io_registers
}
SECTIONS
{
.common
.reset
.swi
.irq
.tsw
.rtu
.code
.data
.bss
.rdata
.shared
.init
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
{
{
{
{
{
{
{
{
{
{
{
{
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
0x01000000,
0x01000004,
0x01000008,
0x0100000c,
0x01000010,
0x01000014,
0x01000018,
0x0100001c,
0x01000020,
0x01000824,
0x01000e28,
0x01000f2c,
0x01001030,
0x01001134,
0x01001638,
0x01001b3c,
0x01002000,
0x80000000,
*(COMMON) }
*(.reset) }
*(.swi) }
*(.irq) }
*(.tswitch) }
Ose_ker.o }
*(.text) }
*(.data) }
*(.bss) }
*(.rdata) }
*(.shared) }
*(.init) }
>
>
>
>
>
>
>
>
>
>
>
>
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
0x4
0x4
0x4
0x4
0x4
0x4
0x4
0x4
0x800
0x600
0x100
0x100
0x100
0x500
0x500
0x100
0x100
0x10
common_vars
reset_vector
swi_vector
irq_vector
tsw_code
rtu
code
data
data
data
shared_at_node1
boot
&
4.1
10
Linker script for node No. 1
.io_regs1
.io_regs2
.io_regs3
.io_regs4
.io_code
(NOLOAD)
(NOLOAD)
(NOLOAD)
(NOLOAD)
:
:
:
:
:
{
{
{
{
{
*(.io_reg1)
*(.io_reg2)
*(.io_reg3)
*(.io_reg4)
io.o(.text)
}
}
}
}
}
>
>
>
>
>
io_registers
io_registers
io_registers
io_registers
code
4.2
11
4.2 Linker script for node No. 2

/************************************************************************/
/*
File
: node2.x
*/
/*
Authors
: Raimo Haukilahti, Mladen Nikitovic
*/
/*
Date
:
*/
/*
Description : Linker script for node No. 2
*/
/************************************************************************/
ENTRY(_thread_main)
MEMORY
{
reset_vector
undef_vector
swi_vector
pref_abort
data_abort
reserved
irq_vector
fiq_vector
common_vars
rtu
boot
tsw_code
swi
code
data
shared_at_node1
ni_registers
io_registers
}
SECTIONS
{
.common
.reset
.swi
.irq
.tsw
.rtu
.code
.data
.bss
.rdata
.shared
.init
.io_regs1
.io_regs2
.io_regs3
.io_regs4
.io_code
(NOLOAD)
(NOLOAD)
(NOLOAD)
(NOLOAD)
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
{
{
{
{
{
{
{
{
{
{
{
{
{
{
{
{
{
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
org
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
0x02000000,
0x02000004,
0x02000008,
0x0200000c,
0x02000010,
0x02000014,
0x02000018,
0x0200001c,
0x02000020,
0x02000824,
0x02000e28,
0x02000f2c,
0x02001030,
0x02001134,
0x02001638,
0x01001b3c,
0x02002000,
0x80000100,
*(COMMON) }
*(.reset) }
*(.swi) }
*(.irq) }
*(.tswitch) }
Ose_ker.o }
*(.text) }
*(.data) }
*(.bss) }
*(.rdata) }
*(.shared) }
*(.init) }
*(.io_reg1) }
*(.io_reg2) }
*(.io_reg3) }
*(.io_reg4) }
io.o(.text) }
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
LENGTH
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
=
0x4
0x4
0x4
0x4
0x4
0x4
0x4
0x4
0x800
0x600
0x100
0x100
0x100
0x500
0x500
0x100
0x100
0x10
common_vars
reset_vector
swi_vector
irq_vector
tsw_code
rtu
code
data
data
data
shared_at_node1
boot
io_registers
io_registers
io_registers
io_registers
code
4.2
12
APPENDIX E: DATE 2001 CONFERENCE, DESIGNERS FORUM, PUBLICATION
13
Appendix E: Date 2001 Conference, Designers Forum, publication
This article, which is a summary of the Socrates projet, has been accepted at the Date 2001 Conferance.
As we are currently waiting for instructions and feedback from the revision cometee, we cannot provide
the camera-ready version.
SoCrates
- A Multiprocessor SoC in 40 days
Mikael Collin, Raimo Haukilahti, Mladen Nikitovic and Joakim Adomat
Department of Computer Engineering, Malardalen Real-Time Research Center (MRTC)
Malardalen University, P.O Box 883, S-721 23 Vasteras, Sweden
fmci, rhi, mnc, jatg@mdh.se
Abstract
The design time of System-on-a-Chip (SoC) is today
rapidly increasing due to high complexity and lack of efficient tools for development and verification. The article describes the design and implementation of a Multiprocessor
SoC (MSoC) conducted by three master students. We propose a generic platform generator as way to reduce timeto-market and verification time. With the project, we have
shown that it is possible in a short time to develop a MSoC
that fits on a single FPGA.
1. Introduction
This paper describes the design of the first prototype of
Socrates, a generic scalable platform generator that creates
a synthesizable HDL description of a multiprocessor system. The platform is a result of a master thesis by three
students. The goal was to build a predictable multiprocessor system with mechanisms for data prefetching on a single
FPGA. This means that all development has to be done in a
very short time and all the software and hardware must fit
on a single FPGA.
2. Motivation
Design time including verification, has become one of
the largest challenges in SoC design. The productivity gap
shows the problem with developing complex SoC. To reduce this gap new design methods are needed. In order to
meet the demands of fast verification and time-to-market,
the system needs to be designed at a higher abstraction
level. This can be obtained by using a platform generator, where the individual components are already verified.
This means that the platform can instantly be tested and
SoCrates stands for SoC for real-time systems.
verified at system-level, which reduces the overall development time. To decrease design time we propose a parameterised system generator. SoCrates can easily scale to a
large number (1-20) of processing nodes and adapt the remaining components to given parameters at compile time.
3. System Overview
Socrates is a distributed shared memory (DSM) multiprocessor, with non-uniform memory access time. The architecture is based on a shared bus 1 on to which nodes are
connected (figure 1). A typical node contains local memory,
a network interface, and a CPU or DSP, where the CPU is an
ARM7 [5] clone. There exists a version with a DSP node,
but due to area limitations, no DSPs are included in this
version [3]. Software applications are divided into threads
and distributed onto the processors. Scheduling of threads
is managed and controlled by a Real-Time Unit (RTU) [2].
Interprocess communication is performed through shared
variables, whereas communication between hardware devices are made via memory mapped registers. In order to
compensate for the non-uniform memory latency, the system supports a prefetch functionality. Remotely located
data words are possible to fetch in advance before the point
in execution where data is required. I/O is managed by a
centralised node, which is responsible for code download
and external communication.
3.1. Hardware Description

The choice of an ARM architecture is motivated by
its popularity and wide industrial use. Since no university license for the processor source could be obtained and
the costs for ordering the processor as a hard IP was too
high, the remaining option was to implement an ARM7
clone. For complexity reasons the clone only supports ARM
1 Not the most attractive solution but commonly used for small multiprocessor systems [1].
MEM
NI
CPU/DSP
NI
CPU/DSP
MEM
MEM
NI
CPU/DSP
I/O
CPU/DSP
3.3. Software Development Flow

NI
Interconnect
Arbiter
NI
MEM
CPU/DSP
MEM
CPU/DSP
NI
RTU
MEM
Interrupt lines
To make software development easier there is a need for

a well-defined flow, from application writing to download
of linked object code. Each nodes code, stack, and data
must be mapped to their local address spaces. Therefore,
each node is compiled and linked separately before final
concatenation as shown in figure 2. A software interface
is supported for the RTU and for I/O handling. The SW
flow is based on GNU tools [4].
APPLICATION
External Interrupts
Figure 1. The architecture of the SoCrates

hardware platform
node1.c
int main(void)
node2.c
int main(void)
createThread(...);
createThread(...);
createThread(...);
createThread(...);
.
.
.
.
.
.
n1Thread1.c n1Thread2.c
state, and system/user and supervisor mode. The cache and

pipeline are removed to further simplify the design, minimise area, and to increase predictability. Hence, the processor is fully predictable at the instruction level although the
performance is decreased. The processor executes a subset of the ARM instruction set where multiplication, coprocessor instructions, and the state changing instruction
are removed. In addition, the instruction set is augmented
with a prefetch instruction.
The memory on each node is a dual-ported, zero wait-state
Xilinx block RAM. Memory accesses are made transparent
through a wrapper module, which makes it easier to change
between different target technologies.
Both internal and external communication is managed by a
Network Interface (NI). The NI encapsulates the complexity of communication and address translation, acting as a
simple MMU. The eight most significant bits of the system
addresses is a one-hot coded node identification field, the
remaining 24 bits constitutes the local address.
All nodes have their NIs connected to the shared bus mastered by a round robin arbiter. The system supports unicast, multicast, and broadcast transfers. For atomic transfers locked accesses can be utilised.
3.2. Real-Time Kernel
n2Thread1.c n2Thread2.c
gcc cross-compiler
gcc cross-compiler
io.o
OSkernel.o applicationNode1.o node1map.x
io.o
OSkernel.o applicationNode2.o node2map.x
node1.SREC
node1.SREC
application.SREC
Figure 2. Flow from source to downloadable

object file.
3.4. Tools and FPGA-board

The platform for the physical implementation is a XILINX [6] VIRTEX 1000 FPGA board (Figure 3). The FPGA
has 16 Kb of block RAM that is shared among the processor
nodes. A standard PC parallel port is used for code download and text terminals are used for application communication. In simulation, all physical I/O is modelled by C-code
add-ons to Modelsim [7].
4. Problems
The hardware implemented real-time kernel performs
thread scheduling and synchronisation. RTU communication with the nodes is implemented with a memory mapped
register-based handshake scheme. Interrupts are sent out
upon context switch and the nodes fetches their next thread
ID. The kernel is parameterised with respect to the maximum number of threads, priorities and CPUs.
Despite fast workstations, time for simulation and verification is clearly a problem. Run-times for several days were
not unusual, especially for back-annotated data. System
simulation on pre place-and-route timing simulation proved
to be useful as Place and Route took several days for the
final implementation.
7. Conclusions
Figure 3. A photo of the FPGA board.
5. Current Results
A demo-application that runs on two CPU-nodes and
utilises the RTU and I/O nodes has been successfully implemented. The whole system including hardware and software fits on a single FPGA. The developed linker scripts
maps the user software to local address spaces and the system is parameterised by using generics. Figure 4 shows synthesis results of each hardware component.
Component
System
CPU node
RTU
CPU core
NI
Used gates
653 349
47 897
33 834
41 929
5 873
% of used FPGA gates

58
21
15
19
2
Figure 4. Synthesis results of SoCrates components.
6. Future Work
Design for Test (DFT) is assumed to be implemented
with standard tools and design flows. This can be accomplished due to our all synthesizable approach. Our scalability relies on a perfect interconnect. Today, a shared bus
is used, which must be improved in the future with pointto-point/crossbar options in the generator. Software targeting is an important area and was solved by letting the user
partition the application at thread level. This can be done
automatically with respect to thread communication (locality) and code/data size. Interprocess communication HW
support will be integrated from existing research projects
at MRTC. With automatic generation of synthesis scripts, a
pushbutton design flow seems within reach. Virtual memory and dynamic memory allocation HW support will be
added.
This project shows that it is possible to develop (including a whole CPU core) a multiprocessor SoC that fits on
a single FPGA in very short time. By working closely
in a small group of 3-5 people, information as new ideas
and bug-reports during verification, are propagated immediately. A platform concept gives rapid design time and keeps
the designers focused on debugging the application and not
the platform itself. An all synthesizable approach is far
from feasible in most projects but it is a convenient way of
getting rapid results. A system view is crucial when designing SoC. All parts interact and focusing on narrow HW or
SW issues does not give a satisfying solution. The platform
concept will be dominating until IP companies have solved
todays problems with reusability and interoperability.
References
[1] David E.Culler, et al., Parallel Computer Architecture,
A Hardware/software approach, Morgan Kaufmann Inc,
San Fransisco California, 1999, ISBN 1-55860-343-3.
[2] L Lindh, et al., From Single to Multiprocessor RealTime Kernels in Hardware, IEEE Real-Time Technology
and Applications Symposium, Chicago, USA, May 15 17, 1995
[3] Carnegie-Mellon Low Power Group
http://www.ece.cmu.edu/ lowpower/benchmarks.html
[4] The GNU project
www.gnu.org
[5] ARM Ltd.
www.arm.com
[6] Xilinx Inc.
www.xilinx.com
[7] Mentor Graphics Corporation
www.mentor.com/modelsim

Mpsoc

Încărcat de

Informații document

Titlu original

Drepturi de autor

Formate disponibile

Partajați acest document

Partajați sau inserați document

Opțiuni de partajare

Vi se pare util acest document?

Este necorespunzător acest conținut?

Drepturi de autor:

Formate disponibile

Mpsoc

Încărcat de

Drepturi de autor:

Formate disponibile

SoCrates

- A Scalable Multiprocessor System On Chip

Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

Johan Starner and Joakim Adomat

Department of Computer Engineering

stands for SoC for Real-Time Systems

Document 2 Socrates Speci cations

Document 3 Socrates -Implementation details

Computer Architecture for System on Chip

Mikael Collin, Mladen Nikitovic, and Raimo Haukilahti

Johan Starner and Joakim Adomat

1.1 What is SoC ? . . . . . . . . . . . . . . . . . . . . . .

3.1 Introduction and basic de nitions . . . . . . . . . . . . . . .

4.1 Semiconductor memories . . . . . .

4.3.3 Storage strategies . . . . . . . . . . .

1.2 Soc Designs

1.3 Why SoC?

1.2.2 An Example of a SoC

Figure 1: An example of a SoC

1.3 Why SoC?

1.3.2 State of Practice and Trends

chip or chipset for a speci c application [59].

1.4 Introduction to Computer System Architecture

1.5 Research & Design Clusters

Figure 2: A typical computer system

1.5 Research & Design Clusters

1.5 Research & Design Clusters

1.5.1 Hydra: A next generation microarchitecture

Architecture and Compiler Power Issues in System on a Chip

1.5 Research & Design Clusters

MediaWorm: A Single Chip Router Architecture with Quality of Service Support

Focus is made on complexity of multimedia algorithms and development of mathematical software

2.2 The Building Blocks of an Embedded CPU

Write Data Register

32-bit Data Bus

Figure 3: A typical embedded CPU

accumulator architecture general-purpose architecture

2.2 The Building Blocks of an Embedded CPU

2.2 The Building Blocks of an Embedded CPU

Figure 4: A General Pipeline

in case one or several of the instruction's operand(s) is a register.

2.3 The Microprocessor Evolution

2.3 The Microprocessor Evolution

capability of microprocessor doubles every 18 month.

2.3 The Microprocessor Evolution

microcoding, more complex instructions

pipeline, simple instructions for speed

Single Chip Multiprocessor

duplicated HW resources (regs, PC, SP)

execute multiple instructions

Figure 5: The evolution of microprocessors.

2.4 Design Aspects

2.4 Design Aspects

2.5 Implementation Aspects

consumption is often measured as milliwatt per megahertz (mW/MHz).

2.6 State of Practice

2.6 State of Practice

Advanced Micro Devices

2.7 Improving Performance

2.7 Improving Performance

Figure 6: A VLIW instruction packet example.

Figure 7: An example of a superscalar architecture. Source: Marc Torrant[97]

2.7 Improving Performance

Johan Starner and Joakim Adomat

Document 2 Socrates Specications

Johan Starner and Joakim Adomat

3.1 Introduction and basic denitions . . . . . . . . . . . . . . .

chip or chipset for a specic application [59].

task is dened as a piece of work done by a program.