Authors
Supervisors
Examiner
Lennart Lindh
Abstract
This document is the result of a Master's thesis in Computer Engineering, describing the analysis,
specification and implementation of the first prototype of SoCrates, a configurable, scalable and
predictable platform for System-on-Chip multiprocessor systems for real-time applications. The design
time of a System-on-a-Chip (SoC) is rapidly increasing today due to high complexity and a lack of efficient
tools for development and verification. By combining all functions into one chip, the system
becomes smaller, faster, and less power consuming, but its complexity increases. To decrease
time-to-market, SoCs are entirely or partially built from IP-components. Thanks to SoC, a whole new
domain of products, such as small hand-held devices, has emerged. The concept has been around for a few
years now, but there are still challenges that need to be resolved. There is a lack of standards
enabling fast mix and match of cores from different vendors. Further needs are new design methods,
tools, and verification techniques. SoC solutions need special kinds of CPUs that consume less power
and are cheaper and smaller, yet still meet high-performance requirements. To fulfill all these demands,
they are becoming more and more complex, and the rapidly growing number of transistors has led to the
emergence of multiprocessor systems-on-a-chip. Our initial question is whether it is possible
to build these complex multiprocessor systems on a single FPGA and whether such solutions can lead
to shorter time-to-market. Consumer demand for cheaper and smaller products makes FPGA
solutions interesting. Our approach is to have multiple processing nodes, each containing a processing unit,
memory and a network interface, all connected on a shared bus. A central, in-house developed
hardware real-time unit handles scheduling and synchronization. We have designed and implemented
an MSoC that fits on a single FPGA in only 40 days, which to our supervisors' knowledge has not been
accomplished before. Our experience is that a tightly coupled group can produce results fast, since
information, new ideas and bug reports propagate immediately.
SoCrates
Introduction
This report describes the design of the first prototype of SoCrates, a generic, scalable platform generator which creates a synthesizable HDL description of a multiprocessor system. The goal was to build
a predictable multiprocessor system on a single FPGA with mechanisms for prefetching data and an
in-house developed integrated hardware real-time unit.
The report consists of three parts. The first part, Computer Architecture for System on Chip,
is a state-of-the-art report introducing basic SoC terminology and practice, with a deeper analysis of
CPUs, interconnects and memory hierarchies. The purpose of this analysis was to learn about state-of-the-art
techniques for designing complex multiprocessor SoCs. The design process resulted in part
two, SoCrates - Specifications, which describes the prototype and the functionality and specific demands
of all the individual parts. Part three, SoCrates - Implementation Details, describes the implementation of all
parts, how to configure the system, and how to compile and link the system software. We also present
synthesis results and suggest future work that can be done to improve the system.
SoCrates - Document index
Document 1 Computer Architecture for System on Chip - A State of the Art Report
1. Introduction
2. Embedded CPU
3. Interconnect
4. Memory System
5. Summary

Document 2 SoCrates - Specifications

1. System Architecture
2. CPU Node
3. CPU
4. Network Interface
5. IO Node
6. Interconnect
7. Arbitration
8. Boot
9. Memory Wrapper
Document 3 SoCrates - Implementation Details

1. CPU
2. Network Interface
3. Arbiter
4. Compiling & Linking the System Software
5. Configuring the SoCrates Platform
6. Current Results
7. Future work
8. Conclusions
Document 4 Appendix
1. Demo Application
2. I/O Routines
4. Task switch routines
5. Linker scripts
6. DATE 2001 Conference, Designers Forum, publication
Abstract
This state-of-the-art report introduces basic SoC terminology and practice, with a deeper analysis
of three architectural components: the CPU, the interconnect, and the memory hierarchy. A short
historical view is presented before going into today's trends in SoC architecture and development. The
SoC concept is not new, but there are challenges that have to be met to satisfy customer demands for
faster, smaller, cheaper, and less power consuming products, today and in the future. This document is
the first of three documents that form a Master's thesis in Computer Engineering.
Contents
1 Introduction
2 Embedded CPU
2.1 Introduction
2.2 The Building Blocks of an Embedded CPU
2.2.1 Register File
2.2.2 Arithmetic Logic Unit
2.2.3 Control Unit
2.2.4 Memory Management Unit
2.2.5 Cache
2.2.6 Pipeline
2.3 The Microprocessor Evolution
2.4 Design Aspects
2.4.1 Code Density
2.4.2 Power Consumption
2.4.3 Performance
2.4.4 Predictability
2.5 Implementation Aspects
2.6 State of Practice
2.6.1 ARM
2.6.2 Motorola
2.6.3 MIPS
2.6.4 Patriot Scientific
2.6.5 AMD
2.6.6 Hitachi
2.6.7 Intel
2.6.8 PowerPC
2.6.9 Sparc
2.7 Improving Performance
2.7.1 Multiple-issue Processors
2.7.2 Multithreading
2.7.3 Simultaneous Multithreading
2.7.4 Chip Multiprocessor
2.7.5 Prefetching
2.8 Measuring Performance
2.8.1 Benchmarking
2.8.2 Simulation
2.9 Trends and Research
2.9.1 University
2.9.2 Industry
3 Interconnect
4 Memory System
1 Introduction
This State of the Art Report covers computer architecture topics with emphasis on System on Chip (SoC).
The reader is introduced to the basic ideas behind SoC and general computer architecture concepts before
presenting an in-depth analysis of three important SoC components: CPU, Interconnect and Memory
architecture.
1.1 What is SoC?
SoC stands for System-on-Chip and is a term for putting a complete system on a single piece of silicon.
SoC has become a very popular word in the computer industry, but very few agree on a general definition
of SoC [19]. There are several alternative names for putting a system on a chip, such as system on
silicon, system-on-a-chip, system-LSI, system-ASIC, and system-level integration (SLI) device [33]. Some
might say a large design automatically makes it a SoC, but that would probably include every existing
design today. A better approach would be to say that a SoC should include different structures such as
a CPU-core, embedded memory and peripheral cores. This is still a wide definition, which could imply
that any modern processor with an on-chip cache should be included in the SoC community. Therefore
a more suitable definition of SoC would be:
A complete system on a single piece of silicon, consisting of several types of modules including at least one
processing unit designated for software execution, where the system depends on no or very few external
components in order to execute its task.
In the beginning, almost all SoC's were simply integrations of existing board-level designs [20]. This way
of designing a system loses many benefits that could otherwise be taken advantage of if the system were
designed from scratch. Another approach is to use already existing modules, called IP-components,
and to integrate them into a complete system suitable for a single die.
1.2.1 Intellectual Property
When something is protected through patents, copyrights, trademarks or trade secrets, it is considered
Intellectual Property (IP). Only patents and copyrights are relevant for IP-components [13] (also referred
to as macros, cores and Virtual Components (VC) [10]). An IP-component is a pre-implemented, reusable
module, for example a DMA-controller or a CPU-core. There are several companies that make their
living by building, licensing and selling IP-components, for which the semiconductor companies pay both
fees and royalties.1 There exist three classes of IP-components with different properties regarding
portability and protection characteristics. As the portability decreases through the classes, the protection
increases.
Soft This class of IP-components has its architecture specified at Register-Transfer Level (RTL),
which is synthesizable. Soft IP's are functionally validated and are very portable and modifiable.
Since they are not mapped to a specific technology, their behavior in terms of area, speed, and
power consumption will be unpredictable. Much work still needs to be done before the component
can be utilized, and the end result is dependent on the synthesis tools used.
Firm The firm class components are in general soft components that have been floorplanned and
synthesized into one or several different technologies to get better estimations of area, speed, and power
consumption.
1 There are exceptions where one can acquire IP-components without any licensing or royalty fees. More information can
be found at http://www.openip.org/.
Hard Hard IP's are further refinements of firm components. They are fully synthesized into mask-level
and physically validated. Very little work has to be done in order to implement the functionality
in silicon. Hard IP's are neither modifiable nor portable, but the prediction of their area, speed, and
power consumption is very accurate.
A typical SoC consists of a CPU-core, a Digital Signal Processor (DSP), some embedded memory, and a
few peripherals such as DMA, I/O, etc. (Figure 1). The CPU can perform several tasks with the assistance of a DSP when needed. The DSP is usually responsible for off-loading the CPU by doing numerical
calculations on the incoming signals from the A/D-converter. The SoC could be built of only third-party
IP-components, or it could be a mixture of IP-components and custom-made solutions. More recently,
there have been efforts to implement a Multiprocessor System on Chip (MSoC) [6], which introduces new
challenges regarding cost, performance, and predictability.
The first computer systems, consisting of relays and later vacuum tubes, used to occupy whole rooms, and
their performance was negligible compared to today's standard workstations. The advent of the transistor
in 1948 enabled engineers to shrink a functional block into an Integrated Circuit (IC). These IC's made it
possible to build complex functions by combining several IC's on a circuit board. Further development
of process technology increased the number of transistors on each IC, which led to the emergence of systems
on board. Since then, there has been a constant battle between semiconductor companies to deliver the
fastest, smallest and cheapest products, resulting in today's multi-billion dollar industry. Even though
the SoC concept has been around for quite some time, it has not really been fully feasible until recent
years, due to advances like deep sub-micron CMOS process technology.
1.3.1 Motivation
There are several reasons why SoC is an attractive way to implement a system. Today's refined manufacturing processes make it possible to combine both logic and memory on a single die, thus decreasing
overall memory access times. Given that the application's memory requirement is small enough for the
on-chip embedded memory, memory latency will be reduced due to the elimination of data traffic between
separate chips. Since there is no need to access memory on external chips, the number of pins can
also be reduced and the use of on-board buses becomes obsolete. Encapsulation accounts for over 50% of
the overall process cost in chip manufacturing [15]. In comparison to an ordinary system-on-board, a
SoC uses one or very few IC's, reducing total encapsulation cost and thereby total manufacturing cost.
These characteristics, as well as lower power consumption and shorter time-to-market, enable smaller,
better, and cheaper products reaching the consumers at an altogether faster rate.
Until now, much of SoC implementation has been about shrinking existing board-level systems onto a
single chip, with little or no consideration of the benefits that could be gained from a chip-level design.
Another approach to SoC is to interconnect several dies and place them inside one package. Such
modules are called Multi-Chip Modules (MCM). The Hydra Multiprocessor Project at first chose an
MCM implementation, which later evolved into a SoC [14, 6].
Today it is too time-consuming for companies to implement a system from scratch. Instead, a faster and
more reliable way is to use in-house or third-party pre-implemented IP-components [3], which makes designing
a whole system more about integrating components than designing them. There exist three design
methodologies, each with its own efficiency and cost regarding SoC design [16, 18]. The vendor design
approach, which shifts the design responsibilities from the system designers to the ASIC vendors, can
result in the lowest die cost, but it can also lead to higher engineering costs and longer time-to-market.
A more flexible method is the partial integration approach, which divides the responsibilities of the design
more equally. It lets the system designers produce the ASIC design, while the semiconductor vendors
are responsible for the core and the integration. This method gives the system designers more control of the
working process in comparison to the vendor method. Yet more flexible is the desktop approach, which
leaves the semiconductor vendors to design only the core. This reduces time-to-market and requires low
engineering costs. A key property for IP-components in the future is the parameterization of soft cores [16].
There is a continuous growth in the demand for "smart products" which are expected to make our lives
better and simpler. Recently, SoC products have begun to emerge on several markets in the form of Application
Specific Standard Products (ASSP)2 or Application Specific Instruction-set Processors (ASIP)3:
Set-top-boxes A Set-Top-Box (STB) is a device that makes it possible for television viewers to access
the Internet and also watch digital television (D-TV) broadcasts. The user has access to several
services: weather and traffic updates, on-line shopping, sport statistics, news, e-commerce, etc. Integrating
the STB's different components into a SoC simplifies system design and makes for a more
competitive product, with shorter time-to-market, lower cost and lower power consumption.
The Geode SC1400 chip is an example of a SoC used in an STB that meets the demands of delivering
both high-quality DVD video and Internet accessibility [34].
Cell phones A SoC in a cell phone will reduce its size and weight, and make it cheaper and less power
consuming.
Home automation Many domestic appliances at home will become "smarter". For example, the refrigerator
will be able to notify its owner when a product is missing and place an order on the Internet.
Hand-held devices A new generation of hand-held devices is coming that can send and receive email
and faxes, make calls, surf the Web, etc. A SoC solution is especially suited for portable applications
such as hand-held PC's, digital cameras, personal digital assistants and other hand-held devices,
because its built-in peripheral functions minimize overall system cost and eliminate the need to
use and configure additional components.
1.3.3 Challenges
One of the emerging challenges is to standardize the interfaces of IP-components to make integration and
composition easier. A lot of different on-chip bus standards have been created by the different design houses
to enable fast integration of IP-components, which has resulted in incompatibility caused by the
differing interfaces. To solve this dilemma, the Virtual Socket Interface Alliance (VSIA) was founded to
enable the mix and match of IP-components from multiple sources by proposing a hierarchical solution
that enables multiple buses [17]. Still, some criticize VSIA for only addressing simple data
flows [11]. More can be read about different on-chip bus standards in section ??.
2 High integration
3 A field or mask
Since time-to-market is decreasing, the testing and verification of the SoC must be done very fast.
When reusing IP-components it is possible that the test development actually takes longer than the
work to integrate the different functional parts [12]. The fact that the components come from different
sources and may have different test methodologies complicates testing the whole system. In a board-level
design many of the components have their primary inputs visible, which makes testing easier, but
SoC's contain deeply embedded components where there is little or no possibility to observe signals
directly from an IP-component after manufacturing. Since the on-chip interconnect is inside the chip,
it is also hard to test, due to the lack of observability.
As the future is lurking behind the door, integration is not likely to stop with IP-components and
different memory technologies; we are also likely to see a variety of analog functionality. Analog blocks
are very layout and process dependent, require isolation, and utilize different voltages and grounds. All
these facts make them difficult to integrate in the design [10]. Are there limits to the integration
urge? As process technologies become more sophisticated, transistor switching speed will increase
and the voltage for logical levels will decrease. Dropping the voltages will make the units more sensitive
to noise. Analog devices with higher voltage needs can encounter problems working properly in those
environments [17].
Apart from the lack of effective design and test methodologies [29] and all the technical problems with
mapping a complex design consisting of several IP-components from different design houses onto a particular silicon platform, there are complex business issues dealing with licensing fees and royalty payments
[30].
1.4 Introduction to Computer System Architecture
SoC is about putting a whole system on a single piece of silicon. But what is a system? This section serves
as an introduction to computer system architecture and tries to give the reader a better understanding of
what is actually put onto a SoC.
1.4.1 Computer System
In general, a typical computer system (Figure 2) consists of one or more CPU's that execute a program by
fetching instructions and data from memory. To be able to access the memory, the CPU needs some kind
of interface and a connection to it. The interface is usually provided by the Memory Management Unit
(MMU) and the connection is handled by the interconnect. The local interconnect is often implemented
as a bus consisting of a number of electrical wires. Sometimes the CPU needs assistance in fetching large
amounts of data in order to be effective. This work can be done in parallel with the CPU by the Direct
Memory Access (DMA) component. The system also needs some means to communicate with the outside
world; this is provided by the I/O system. We proceed with a closer look at the important components
that comprise a computer system.
CPU The CPU is where the arithmetic, logic, branching and data transfer are implemented [8]. It consists
of registers, an Arithmetic Logic Unit (ALU) for computations and a control unit. The CPU can
be classified as a Complex Instruction Set Computer (CISC) if the instruction set is complex (e.g.
it has a lot of instructions, several addressing modes, different instruction lengths, etc.). The
idea behind a Reduced Instruction Set Computer (RISC) is to make use of a limited instruction set
to maximize performance on common instructions by working with a lot of registers, while taking
penalties on the load and store instructions. RISC has a uniform length for all instructions and
very few addressing modes. This uniformity is the main reason why this approach is suitable for
instruction pipelining, in order to increase performance. There are other architectures that further
increase performance, for example superscalar, VLIW, and vector computers. A machine is called an
n-bit machine if it operates internally on n-bit data [8]. Today a lot of embedded processors
still work with 8- or 16-bit words, while the majority of workstations and PC's are 32- or 64-bit
machines.
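The pipelining advantage of a uniform instruction length can be shown in a few lines of code. The encoding below is entirely made up for this sketch (it is not the SoCrates or any real ISA): with fixed 32-bit instructions the address of the next instruction is known before decoding starts, whereas a variable-length encoding must examine the opcode first.

```python
# Invented 32-bit encoding: [31:26 opcode][25:21 rd][20:16 rs1][15:11 rs2]
def decode_fixed(word):
    return {"opcode": (word >> 26) & 0x3F,
            "rd": (word >> 21) & 0x1F,
            "rs1": (word >> 16) & 0x1F,
            "rs2": (word >> 11) & 0x1F}

def fetch_fixed(memory, pc):
    # The next PC is pc + 4 regardless of the instruction, so the fetch of
    # instruction i+1 can overlap the decode of instruction i.
    word = int.from_bytes(memory[pc:pc + 4], "big")
    return decode_fixed(word), pc + 4

def fetch_variable(memory, pc):
    # CISC-style: the length depends on the opcode, so the next PC is
    # only known after (partially) decoding the current instruction.
    length = 2 if memory[pc] < 0x80 else 4  # invented length rule
    return memory[pc:pc + length], pc + length
```

The dependence chain in `fetch_variable` (computing the next `pc` needs the current opcode) is exactly the serialization between fetch and decode that fixed-width RISC encodings avoid.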
(Figure 2: A typical computer system - a CPU with cache, a DMA controller and DMA device, and the main memory, connected by address and data lines over the system bus; the cache and main memory together form the memory system.)
There is a lot of research effort spent on computer architecture, which of course is related to some degree
to SoCs, since they all are actually computers. Unlike most research areas, SoC research is led by
industry and not by the universities. Of those universities that have SoC-related research projects, very
few have reached the implementation stage.
The Stanford Hydra single-chip multiprocessor [6] started out as a Multi-Chip Module (MCM) in 1994
but evolved in 1997 to become a Chip MultiProcessor (CMP). The project is supervised by Associate
Professor Kunle Olukotun, accompanied by Associate Professor Monica S. Lam and Mark Horowitz; a
dozen students are also involved in the project. Early development of the project was performed by
Basem A. Nayfeh, nowadays a Ph.D. The Hydra project focuses on combining shared-cache multiprocessor architectures, innovative synchronization mechanisms, advanced integrated circuit technology and
parallelizing compiler technology to produce microprocessor cost/performance and parallel processor
programmability. The four integrated MIPS-based processors will demonstrate that it is feasible for a
multiprocessor to achieve better cost/performance than wide superscalar architectures on sequential
applications. By using an MCM, communication bandwidth and latency will be improved, resulting
in better parallelism. This makes Hydra a good platform to exploit fine-grained parallelism, hence a
parallelizing compiler for extracting this sort of parallelism is under development. The project is financed
by US Defense Advanced Research Projects Agency (DARPA) contracts DABT and MDA.
1.5.2 Self-Test in Embedded Systems (STES)
STES is a co-operational project between ESLAB, the Laboratory for Dependable Computing of Chalmers
University of Technology, the Electronic Design for Production Laboratory of Jonkoping University, the
Ericsson CadLab Research Center, FFV Test Systems, and SAAB Combitech Electronics AB. ESLAB
is responsible for developing a self-test strategy for system-level testing of complex embedded systems,
which utilizes the BIST (Built-In Self-Test) functionality at the device, board, and MCM level. Apart from
the involved commercial participants, the project is funded by NUTEK.
1.5.3 Socware
An international Swedish design center/cluster has recently been established in close cooperation with the technical universities in Linkoping/Norrkoping, Lund and Stockholm/Kista.
Socware, formerly known as the Acreo System Level Integration Center (SLIC), aims to have nearly 40
employees/specialists in the beginning, but this number is expected to grow to 1500 in the near future, with a
special research institute located in Norrkoping. The design center will serve as a bridge between
industry and the research activity of the universities, enabling research results to be rapidly converted into
industrial products.
The focus of research and development will be directed to the design of system components within digital
media technology. Initially, special focus will be on applications in mobile radio and broadband networking.
The project is financed by the government, the municipality of Norrkoping and other local and regional
agencies. More information can be found in [35].
1.5.4 The Pittsburgh Digital Greenhouse
The Pittsburgh Digital Greenhouse is a SoC design cluster that focuses on the digital video and networking
markets. The non-profit organization is an initiative taken by the U.S. government, universities, and
industry that started in June 1999. It involves Carnegie Mellon University, Penn State University, the
University of Pittsburgh, and several industry members like Sony, Oki, and Cadence.
Some ongoing research activities closely related to SoC are:
Configurable System on a Chip Design Methodologies with a Focus on Network Switching
This project focuses on the development of design tools for hardware/software co-design, such as those required for next-generation switches on the Internet and cryptography.
The program is focused on creating a software system that characterizes the power of the major components of a SoC design and allows the design to be optimized for the lowest possible power
consumption.
The research focuses on the design, fabrication, and testing of a new high-performance switched
network router, called Mediaworm. It is aimed to be used in computer clusters where there are
demands for Quality of Service (QoS) guarantees.
Lightweight Arithmetic IP: Customizable Computational Cores for Mobile Multimedia Appliances
In February 1998, there was an opening of the Cadence Design Centre with the purpose of creating one of
the electronics industry's largest and most advanced SoC design facilities. The centre is located on The
Alba Campus in Livingston, Scotland and is the largest European design centre. The centre offers expertise within the spheres of Digital IC, Multimedia/Wireless, Analogue/Mixed Signal, Datacom/Telecom,
Silicon Technology Services, and Methodology Services. In 1999, the centre became authorized as the
first ARM approved design centre, through the ARM Technology Access Program (ATAP). Current research projects conducted at the centre involve a single-chip processor for Internet telephony and audio,
a flexible receiver chip suitable for, among other things, pinpointing location by picking up high-frequency
radio waves transmitted by GPS satellites, and a fully customized wireless Local Area Network (LAN)
environment. There are three main pieces of the center: the Virtual Component Exchange (VCX), the
Institute for System Level Integration (SLI) and The Alba Centre. VCX opened in 1998 and is an
institution dedicated to establishing a structured framework of rules and regulations for inter-company
licensing of IP blocks. Members of VCX include ARM, Motorola, Toshiba, Hitachi, Mentor Graphics,
and Cadence. The SLI institute is an educational institution dedicated to system-level integration and
research. The institute was established by four of Scotland's leading universities: Edinburgh, Glasgow,
Heriot Watt and Strathclyde. Finally, the Alba centre is the headquarters of the whole initiative and
provides a central point for information about the venture and assistance for interested firms.
2 Embedded CPU
There are several different interpretations of the term CPU. Some say it is "The Brains of the computer"
or "Where most calculations take place", and that it "Acts as an information nerve center for a computer".
A more concrete definition is given by John L. Hennessy and David A. Patterson [8]:
Where arithmetic, logic, branching, and data transfer are implemented.
This chapter serves as an introduction to CPUs that are especially suitable for SoC solutions, namely
embedded CPUs. In this case, the term "embedded" does not only refer to how suitable these CPUs are
for embedded systems, or as stand-alone microprocessors, but also to how they are good candidates to be
"embedded" into a SoC. The purpose of this chapter is to look at the possibilities of embedded processors
as SoC components and what aspects need to be considered when designing and implementing a solution.
Techniques for improving and measuring performance are discussed, as well as where the research is today,
together with a look at the future of embedded processors.
The chapter begins with an introduction to embedded CPUs that explains some of the factors behind
their popularity. Section 2.2 is a presentation of the building blocks of a modern embedded CPU. Section
2.3 looks at which paradigm is currently in front regarding embedded CPUs. Section 2.4 discusses the
major factors affecting the design. Section 2.5 considers options on how to implement an embedded CPU.
Section 2.6 presents case studies of embedded CPUs available in the market today. Section 2.7 shows
several techniques for improving the performance. Next, section 2.8 considers how the performance
of an embedded processor could be measured. Finally, section 2.9 looks at where the research is today and
what the trends are in the embedded processor market.
2.1 Introduction
The latest advances in process technology have increased the number of available transistors on a single
die almost to the extent that today's battle between designers is not about how to fit it all on a single
piece of silicon, but how to make the most use of it. This evolution has also made it possible for designers
to put a complete processor, together with some or all of its peripheral components, on a single die, creating
a new class of products called Application Specific Standard Products (ASSPs). The demand for ASSPs
has also created a new domain of processors, embedded 32-bit CPUs, that are cheap, energy-efficient,
and especially designed for solving their domain of tasks.
Before getting into all the wonders of embedded CPUs, some clarifications should be made about what
they are and what they are not. When CPUs are discussed, the thoughts often go to the architectures
from Intel, Motorola, Sun, etc. These architectures are mainly designed for the desktop market and
have dominated it for a long time. In recent years, there has been an increasing demand for CPUs designed
for a specific domain of products. Among those noticing this trend was David Patterson [21]:
Intel specializes in designing microprocessors for the desktop PC, which in five years may
no longer be the most important type of computer. Its successor may be a personal mobile
computer that integrates the portable computer with a cellular phone, digital camera, and video
game player... Such devices require low-cost, energy-efficient microprocessors, and Intel is far
from a leader in that area.
The question of what the difference is between a desktop and an embedded processor is still unanswered.
Actually, some embedded platforms arose from desktop platforms (such as MIPS, Sparc, x86), so the
difference cannot be in register organization, the instruction set, or the pipelining concept. Instead, the
factors that differentiate a desktop CPU from an embedded processor are power consumption, cost,
integrated peripherals, interrupt response time, and the amount of on-chip RAM or ROM. The desktop world
values processing power, whereas an embedded processor must do the job for a particular application
at the lowest possible cost [22].
2.2 Building Blocks of an Embedded CPU
This section serves as an introduction to the components of a modern embedded CPU. Readers who are
familiar with the basics of computer architecture and processor design might skip this section.
A CPU basically consists of three components: a register set, an ALU, and a control unit. Today, it is often
the case that the CPU includes an on-chip cache and a pipeline in order to achieve an adequate level of
performance (Figure 3). The following text will give a brief introduction to each component's function and
purpose in the CPU.
[Figure 3: Block diagram of a typical embedded CPU core, showing the 32-bit address bus, ALU bus, increment bus, and PC bus connecting the address register, address incrementer, instruction decoder and control logic, 32-bit register file, cache, 32 x 8 multiplier, barrel shifter, instruction pipeline, and 32-bit ALU.]
2.2.1 Register Set
The organization of registers, or how information is handled inside the computer, is part of a machine's
Instruction Set Architecture (ISA) [8, 9]. An ISA includes the instruction set, the machine's memory,
and all of the registers that are accessible by the programmer. ISAs are usually divided into three main
categories according to how information is stored in the CPU: stack architecture, accumulator architecture,
and general-purpose register (GPR) architecture. These architectures differ in how an operand is handled.
A stack architecture keeps its current operands on top of the stack, while an accumulator architecture keeps
one implicit operand in the accumulator, and a general-purpose register architecture has only explicit
operands, which can reside either in memory or in registers. The following example shows how the expression A
= B + C would be evaluated in these three architectures.
stack architecture
PUSH C
PUSH B
ADD
POP A
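The stack listing above can be contrasted with the accumulator and GPR styles. The following sketch is not from the thesis; the memory values and register numbering are invented for illustration, and each assignment mirrors one machine instruction (shown in the comments):

```python
# Sketch contrasting the three ISA styles on A = B + C.
# Memory contents and register numbering are invented for illustration.
memory = {"A": 0, "B": 2, "C": 3}

# Stack architecture: operands live on an implicit stack.
stack = []
stack.append(memory["C"])                # PUSH C
stack.append(memory["B"])                # PUSH B
stack.append(stack.pop() + stack.pop())  # ADD
memory["A"] = stack.pop()                # POP A

# Accumulator architecture: one implicit operand in the accumulator.
acc = memory["B"]                        # LOAD B
acc = acc + memory["C"]                  # ADD C
memory["A"] = acc                        # STORE A

# GPR architecture: all operands are explicit registers.
regs = [0] * 4
regs[1] = memory["B"]                    # LOAD  R1, B
regs[2] = memory["C"]                    # LOAD  R2, C
regs[3] = regs[1] + regs[2]              # ADD   R3, R1, R2
memory["A"] = regs[3]                    # STORE A, R3

print(memory["A"])  # 5
```

Note how only the GPR version names all of its operands explicitly, which is what gives the compiler freedom to reorder and reassociate expressions.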
The machines in the early days used stack architectures and did not need any registers at all. Instead,
the operands are pushed onto the stack and popped off into a memory location. Some advantages were
that space could be saved because the register file was not needed, and no operands needed to be specified
during arithmetic operations. As the memories became slower compared to the CPUs, the stack architecture
also became ineffective, due to the fact that most time is spent fetching the operands from memory
and writing them back. This became a major bottleneck, which made the accumulator architecture a
more attractive choice.
The accumulator architecture was a step up regarding performance, letting the CPU hold one of
the operands in a register. Often, the accumulator machines only had one data accumulator register,
together with the other address registers. They are called accumulators due to their responsibility to act
as a source of one operand and the destination of arithmetic instructions, thus accumulating data. The
accumulator machine was a good idea at the time when memories were expensive, because only one address
operand had to be specified, while the other resided in the accumulator. Still, the accumulator machine
has its drawbacks when evaluating longer expressions, due to the limited number of accumulator registers.
The GPR machines solved many problems often related to stack and accumulator machines. They
could store variables in registers, thus reducing the number of accesses to main memory. Also, the
compiler could associate the variables of a complex expression in several different ways, making it more
flexible and efficient for pipelining. A stack machine needs to evaluate the same complex expression from
left to right, which might result in unnecessary stalling. Many embedded CPUs are RISC architectures,
which means that they have lots of registers (usually about 32).
2.2.2 Arithmetic Logic Unit
The Arithmetic Logic Unit (ALU) performs arithmetic and logic functions in the CPU. It is usually
capable of adding, subtracting, comparing, and shifting. The design can range from simple combinational
logic units that do ripple-carry addition, shift-and-add multiplication, and single-bit shifts,
to no-holds-barred units that do fast addition, hardware multiplication, and barrel shifts [9].
2.2.3 Control Unit
The control unit is responsible for generating proper timing and control signals (usually implemented as
a state machine that performs the machine cycle: fetch, decode, execute, and store) to other logical blocks
in order to complete the execution of instructions.
2.2.4 Memory Management Unit
The Memory Management Unit (MMU) is located between the CPU and the main memory and is
responsible for translating virtual addresses into their corresponding physical addresses. The physical
address is then presented to the main memory. The MMU can also enforce memory protection when
needed.
2.2.5 Cache
There are few processors today that don't incorporate a cache. The cache acts as a buffer between the
CPU and the main memory to reduce access time, taking advantage of the locality of both code and
data. There are usually several levels of cache, each with their own purpose. The first level is usually
located on-chip, thus together with the CPU. The cache is often separated into an instruction cache and a
data cache. Cache is especially important in a RISC architecture with frequent loads and stores. For
example, Digital's StrongARM chip devotes about 90% of its die area to cache [89]. The reader can learn
more about cache and how it is used in section 4.2.
2.2.6 Pipeline
As in the case of caches, there are very few processors today that don't use some kind of pipelining
in order to improve their performance. This section will serve as an introduction to pipelining and
the benefits and drawbacks of using it. Pipelining is an implementation technique that tries to achieve
Instruction Level Parallelism (ILP) by letting multiple instructions overlap in execution. The objective
is to increase throughput, the number of instructions completed per unit of time. By dividing the execution of
an instruction into several phases, called pipeline stages, an ideal speedup equal to the pipeline depth
could theoretically be achieved. Also, by dividing the pipeline into several stages, the workload per stage
will be less, letting the processor run at a higher frequency [8]. Figure 4 shows a typical pipeline together
with its stages. This particular pipeline has a length of five and consists of unique pipeline stages, each
with their own purpose.
[Figure 4: A typical five-stage pipeline: IF (instruction fetch), ID (instruction decode), EX (execute), MEM (memory access), WB (write-back).]
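The ideal speedup of pipelining can be illustrated with a small calculation. This is a sketch under the usual idealized assumptions (no stalls, one instruction issued per cycle); the instruction count is invented:

```python
# Sketch of the ideal pipeline speedup (assumes no stalls and one
# instruction issued per cycle; the numbers are illustrative).

def pipeline_cycles(n_instructions: int, depth: int) -> int:
    """Cycles to run n instructions on an ideal pipeline of the given depth:
    depth cycles to fill, then one instruction completes per cycle."""
    return depth + (n_instructions - 1)

def speedup(n_instructions: int, depth: int) -> float:
    """Pipelined vs. unpipelined execution time."""
    unpipelined = n_instructions * depth
    return unpipelined / pipeline_cycles(n_instructions, depth)

print(round(speedup(1000, 5), 2))  # 4.98, approaching the ideal speedup of 5
```

As the instruction count grows, the fill cost of the pipeline is amortized and the speedup approaches the pipeline depth.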
Data hazards arise when instructions depend on each other's results, and come in three forms: Read
After Write (RAW), Write After Write (WAW), and Write After Read (WAR). RAW hazards are the most
common ones and occur when a write instruction is followed by a read instruction and both instructions
operate on the same register, causing the later instruction to wait until the write has been issued in the
WB stage. This can be handled by forwarding, thus introducing "shortcuts" in the pipeline, so that
instructions can make use of results before the current instruction reaches the WB stage [8]. WAW hazards
cannot occur in pipelines like the one shown earlier (figure 4). The reason for this is that in order for
a WAW hazard to occur, either the memory stage has to be divided into several stages, making
several simultaneous writes possible, or there must be some mechanism where an instruction can bypass another
instruction in the pipeline. WAR hazards are rare and happen when an instruction tries to write
to a register read by an instruction that is ahead in the pipeline. As with WAW hazards, WAR
hazards cannot occur in a general pipeline because register contents are always read at the ID stage.
Some pipelines do read register contents late and can create a WAR hazard [8].
Control hazards are caused by the instructions that change the path of execution, called branches.
By the time a branch instruction calculates its destination address in the EXE stage, instructions
following the branch have reached the IF and ID stages. If the branch was unconditional, the instructions
in the IF and ID stages have to be removed, because the branch changes the program
counter and the new instructions have to be fetched from a new address, namely the destination
address of the branch. On the other hand, if there was a conditional branch, the condition
needs to be evaluated in order to decide if the branch should be taken or if the program counter
should only be incremented. One way of dealing with this problem is to automatically stall the
pipeline until the condition is evaluated. These stalls are issued in the ID stage, where the branch is
first identified. Also, in order to evaluate the condition of a conditional branch and calculate the
destination address simultaneously, extra logic for condition evaluation is added together with the
ALU in the ID stage. This way, only one stall cycle will be wasted when a branch instruction
occurs.
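The RAW case described above can be sketched mechanically. In the hypothetical model below, the instruction encoding is invented, and the two-instruction hazard window stands in for a five-stage pipeline without forwarding:

```python
# Hypothetical sketch of RAW-hazard detection. Instructions are encoded as
# (destination_register, [source_registers]); the 2-instruction hazard window
# models a 5-stage pipeline without forwarding.

def raw_hazards(instructions):
    """Return (i, j) pairs where instruction j reads a register written
    by a recent earlier instruction i still in the pipeline."""
    hazards = []
    for j, (_dest, sources) in enumerate(instructions):
        for i in range(max(0, j - 2), j):
            if instructions[i][0] in sources:
                hazards.append((i, j))
    return hazards

program = [
    ("r1", ["r2", "r3"]),  # ADD r1, r2, r3
    ("r4", ["r1", "r5"]),  # SUB r4, r1, r5  <- reads r1 too early: RAW
    ("r6", ["r7", "r8"]),  # ADD r6, r7, r8  (independent)
]
print(raw_hazards(program))  # [(0, 1)]
```

This is essentially the check a compiler performs when rescheduling instructions to reduce dependencies.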
Most structural hazards can be prevented by adding more ports and dividing the memory into data
and instruction memory segments. The memory can also be improved by adding cache or increasing the
cache area. Data hazards can be handled by letting the compiler reschedule the instructions in order to
reduce the number of dependencies. Control hazards can be reduced by trying to predict the destination
of a branch. The prediction is based on tables storing historical information about whether the same
branch did or did not jump in earlier executions. Such tables are called the Branch History Table (BHT)
or Branch Prediction Buffer (BPB). Other tables, such as the Branch Target Buffer (BTB), act as a cache
storing the destination addresses of many previously executed branches. The interested reader can
continue reading in several books and articles addressing different branch penalty reduction
techniques [8, 64, 63].
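A BPB entry is commonly a small saturating counter kept per branch. The sketch below is illustrative only, not a specific design from the thesis; it shows a 2-bit counter tracking a branch that is always taken:

```python
# Illustrative 2-bit saturating-counter predictor, the kind of entry a
# Branch Prediction Buffer stores per branch (not a design from the thesis).

class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 predict not taken; 2,3 predict taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

predictor = TwoBitPredictor()
outcomes = [True] * 6  # a branch that is always taken
correct = 0
for taken in outcomes:
    if predictor.predict() == taken:
        correct += 1
    predictor.update(taken)

# Two cold-start mispredictions, then the counter saturates at "taken".
print(correct, "of", len(outcomes))  # 4 of 6
```

The two bits give the predictor hysteresis: a single deviation from the pattern does not immediately flip the prediction.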
This section serves as a "walk-through" of the different phases in microprocessor evolution (figure 5).
While this section may seem irrelevant to embedded processors, embedded processor design has always
been influenced by the microprocessor and may continue to be so in the future. The reader who feels
unfamiliar with the principles behind the RISC and CISC paradigms should reread section 1.4.1 before
proceeding with this text.
In the early days, there was a limited number of transistors available for the CPU designer. Usually,
the chips were filled with logic that was seldom used (e.g. decoding of seldom-used instructions). CISC
computers used microcoding, which made it easier to execute complex instructions. As the years went
by, it became harder for CISC designers to keep up with Moore's law. Building more complex solutions
each year was not enough. Some designers realized that the rule of locality of reference is something that
needs to be taken into consideration. It states that a program executes about 90% of its instructions
in 10% of its code.
[Figure 5: Phases of microprocessor evolution: CISC, RISC, merging of the RISC/CISC architectures, superscalar/VLIW, multithreaded processors, single-chip multiprocessors (duplicated processors), and simultaneous multithreading (any context can execute each cycle).]
The RISC designers thought that if they could implement the 10% of most-used
instructions and throw out the other 90%, then there would be lots of free die area left for other ways of
increasing the performance. Some of the performance-enhancing techniques are listed below.
Cache Memory references were becoming a serious bottleneck, and a way to reduce the access time is to
use the extra on-chip space for cache. With the on-chip cache, the processor did not need to access
the main memory for all memory references.
Pipelining By breaking down the handling of an instruction into several simpler stages, the processor
is able to run faster, resulting in a higher frequency.
More registers When compiling a program into machine code, the handling of variables is usually
taken care of by registers. Sometimes, there are stalls in the pipeline due to dependencies between
registers (e.g. one cannot use a register until it is available), which can be avoided by register
renaming. This is possible when increasing the number of registers.
Computers using some or all of these advantages include RISC I and IBM 801 [2]. These enhancements
gave the RISC designers the upper hand for several generations in the 80's and 90's. But when the
number of available transistors on a chip passed the million mark, the number of transistors as a limiting
factor disappeared. The CISC designers could level the score by introducing more complex solutions that
increased their performance a couple of percent, with little concern for how much die area was used. Even
though the CISC processor was several factors more complex than the corresponding RISC processor,
it was still keeping up with the RISC. Nowadays, the RISC and CISC paradigms are merged together
and use techniques from both of the original paradigms. Now, when there are 10, 20 million or more
transistors available, the problem the designer is facing is more about making the most use of all the
transistors than how to fit it all on one die. A simple processor can now be realized on only a fraction of
the available space. There are limits to the performance gains from increasing the cache size, deepening
the pipeline, and increasing the number of registers. So, the question is what to do with the available
space? To gain more performance, new architectures like Multithreading, Simultaneous Multithreading
(SMT), Very Long Instruction Word (VLIW), and Single-Chip Multiprocessor (CMP) are emerging.
These architectures will be discussed in section 2.7.
2.4 Design Aspects
The designers of embedded processors are under market pressure when it comes to producing cheap, low
power-consuming, fast processors [22]. To meet the market demand for a SoC solution, the designers of
an embedded processor need to look at several design aspects, listed below.
2.4.1 Code Density
The size of a program may not be an issue in the desktop world, but it is a major challenge in embedded
systems. The embedded processor market is highly constrained by power, cost, and size. For control-oriented
embedded applications, a significant portion of the final circuitry is used for instruction memory.
Since the cost of an integrated circuit is strongly related to die size, smaller programs imply that smaller,
and therefore cheaper, dies can be used for embedded systems [81, 82].
Thumb and MIPS16 are two approaches that try to reduce the code size of programs by compressing
the code. Thumb and MIPS16 are subsets of the ARM and MIPS-III architectures, respectively. The
instructions used in the subset are either frequently used, do not require the full 32 bits, or are important
to the compiler for generating small code. The original 32-bit instructions are re-encoded to be 16 bits
wide. Thumb and MIPS16 are reported to achieve code reductions of 30% and 40%, respectively. The
16-bit instructions are fetched from instruction memory and decoded to equivalent 32-bit instructions
that are run as normal by the core. Both approaches have drawbacks:
Instruction widths are shrunk at the expense of reducing the number of bits used to represent
registers and immediate values
Conditional execution and zero-latency shifts are not possible in Thumb
Floating-point instructions are not available in MIPS16
The number of instructions in a program grows with compression
Thumb code runs 15-20% slower on systems with ideal instruction memories
Both Thumb and MIPS16 are execution-based forms of selective compression, a technique that selects
procedures to compress according to a procedure execution frequency profile. The other form is miss-based
selection, in which decompression is invoked only on an instruction cache miss, so all performance loss
occurs on the cache miss path. This way, miss-based selection is based on the number of cache misses and
not the number of executed instructions, as in execution-based selection. Speedup can be achieved by
letting the procedures with the most cache misses remain in native code.
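Execution-based selection can be sketched as follows. This is a hypothetical model: the compression ratio, procedure sizes, and execution counts are all invented, and a real scheme would decide per procedure based on a measured profile:

```python
# Hypothetical sketch of execution-based selective compression: keep the
# hottest procedures in native code and compress the rest. The compression
# ratio and the procedure sizes/counts below are invented.

COMPRESSION_RATIO = 0.6  # assumed: 16-bit re-encoding ~60% of original size

def selective_compress(procedures, hot_fraction=0.1):
    """procedures: list of (size_bytes, exec_count). The most frequently
    executed hot_fraction stay native; the rest are compressed."""
    ranked = sorted(procedures, key=lambda p: p[1], reverse=True)
    n_hot = max(1, int(len(ranked) * hot_fraction))
    return sum(size if i < n_hot else int(size * COMPRESSION_RATIO)
               for i, (size, _count) in enumerate(ranked))

procs = [(1000, 90)] + [(1000, c) for c in (5, 3, 1, 1, 0, 0, 0, 0, 0)]
print(sum(s for s, _ in procs), "->", selective_compress(procs))  # 10000 -> 6400
```

Because the hot procedure dominates execution, nearly all the size reduction comes from cold code that contributes little to run time.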
Jim Turley has a different view on the techniques for reducing code size [89]: claimed advantages
in code density should be considered in light of factors such as compiler optimization (loop unrolling,
procedure inlining, etc.), the addressing (32-bit vs. 64-bit integers or pointers), and memory granularity.
Finally, code density does little or nothing to affect the size of the data space. Applications working with
large data sets require much more memory than the executable, so code reduction is of little help
here.
2.4.2 Power Consumption
Many products using embedded processors use batteries as a power supply. To preserve as much power as
possible, embedded processors usually operate in three different modes: fully operational, standby mode,
and clock-off mode [22]. Fully operational means that the clock signal is propagated to the entire processor,
and all functional units are available to execute instructions. When the processor is in standby
mode, it is not actually executing an instruction, but the DRAM is still refreshed and register contents are
still available. The processor returns to fully operational mode, without losing any information, upon an
activity that requires units that are not available in standby mode. Finally, in clock-off mode, the system
has to be restarted in order to continue, which takes almost as much time as an initial start-up.
2.4.3 Performance
Unlike the desktop market, performance isn't everything in the embedded processor market. Instead,
factors like price and power consumption are equally important. A typical embedded processor usually executes
about one instruction per cycle. Today, performance is still measured in Million Instructions Per Second
(MIPS), which basically only reveals the number of instructions executed per second, not whether
any useful instructions were executed. MIPS is not a good way of measuring performance, and section 2.8.1
looks at other alternatives. Sometimes, the usual performance of one executed instruction per cycle for
an embedded processor is not enough, and other alternative architectures must be considered in order to
increase the performance. Section 2.7 discusses possible alternative architectures.
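The MIPS figure is simply clock rate divided by cycles per instruction (CPI), scaled to millions; the clock rates and CPI values below are invented for illustration:

```python
# Sketch of the MIPS metric: clock rate divided by cycles per instruction
# (CPI), in millions. The clock rates and CPIs below are invented.

def mips(clock_hz: float, cpi: float) -> float:
    return clock_hz / (cpi * 1e6)

# An embedded processor executing about one instruction per cycle:
print(mips(66e6, 1.0))  # 66.0
# The same clock, but a CPI of 1.5 due to stalls:
print(mips(66e6, 1.5))  # 44.0
```

The calculation makes the metric's weakness obvious: it counts instructions issued, saying nothing about how much useful work each instruction performs.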
2.4.4 Predictability
Architectures that support real-time systems must have the ability to achieve predictability [84].
Predictability depends on the Worst Case Execution Time (WCET), which is in turn dictated by the
underlying hardware. Much focus is on improving an architecture's performance, and little thought goes
into making it predictable. This has led to architectures that include caches, pipelines, virtual storage
management, etc., all of which have improved the average-case execution time, but have worsened the
prospects for predictable real-time performance.
Caches have not been popular in the real-time computing community, due to their unpredictable behavior.
This is true for multi-tasking, interrupt-driven environments, which are common in real-time
applications [87]. Here, the individual task execution time can differ from time to time due to interactions
of real-time tasks and the external environment via the operating system. Preemptions may modify
the cache contents and thereby cause a nondeterministic cache hit ratio, resulting in unpredictable task
execution times.
Pipelines also introduce problems similar to those of caches concerning worst case execution time. There are
efforts to achieve predictable performance of pipelines without using a cache and without the hazards
associated with them [88]. This approach, called the Multiple Active Context System (MACS), uses multiple
processor contexts to achieve increased performance and predictability. Here, a single pipeline is shared
among a number of threads, and the context of every thread is stored within the processor. On each cycle,
a single context is selected to issue a single instruction to the pipeline. While this instruction proceeds
through the pipeline, other contexts issue instructions to fill consecutive pipeline stages. Contexts are
selected in a round-robin fashion. A key feature of the MACS architecture is that its memory model allows
the programmer to derive theoretical upper bounds on memory access times. The maximum number of
cycles a context will wait for a shared memory request is dictated by the number of contexts, the memory
issue latency, the number of threads competing for shared memory, and the number of contexts scheduled
between consecutive threads.
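The round-robin selection described above can be sketched with a toy model (this is not the MACS implementation; context counts are invented). The predictability claim shows up as a fixed bound on the gap between a context's issue slots:

```python
# Toy model of MACS-style round-robin context interleaving (not the actual
# MACS implementation): each cycle, one context issues one instruction.

from collections import deque

def schedule(n_contexts: int, n_cycles: int):
    """Return the context selected on each cycle, round-robin."""
    contexts = deque(range(n_contexts))
    issued = []
    for _ in range(n_cycles):
        issued.append(contexts[0])
        contexts.rotate(-1)
    return issued

# With 4 contexts, each context issues exactly every 4th cycle, so the gap
# between a context's issue slots -- and hence its wait -- is bounded.
print(schedule(4, 8))  # [0, 1, 2, 3, 0, 1, 2, 3]
```

Because the schedule is fixed rather than demand-driven, a worst-case bound on each context's progress can be derived statically, which is exactly what a WCET analysis needs.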
2.5 Implementation Aspects
There are several options available for the designer who wants to integrate an embedded processor into
a SoC. Besides building a processor from "scratch", there are other options available. The first option
is to acquire the processor core as a hard IP-component7 in the form of a specific semiconductor
fabrication process, delivered as mask data. Several hard IP-cores will be examined in section 2.6.
The second option is to acquire the CPU as a firm IP-component, which is usually delivered in the form
of a netlist. The third and last option is to acquire a soft IP-component in the form of VHDL or Verilog
code, or to produce a synthesizable core with a parameterizable core generator. There have been several
research efforts to develop generators of parameterizable RISC cores [73, 76]. One, conducted at the
University of Hanover, has developed a parameterizable core generator that outputs fully synthesizable
VHDL code. The generated core is based on a standard 5-stage pipeline (Figure 4). The designer has
many choices when using the generator (e.g. pipeline length, ALU and data width, size of register file, etc.).
The generated cores are simple RISC processors with a parameterizable word and instruction width.
Instruction and data memories are provided as a VHDL template file for simulations, but they are not
suitable for synthesis. Instead, they should be taken from a technology-specific library. Since the cores
are based on RISC principles, the instruction set consists of only a few instructions and addressing modes.
A typical 32-bit RISC core with a 32-bit data path and eight 32-bit registers can, with a 3LM 0.5-micron
standard-cell library, deliver an achievable clock frequency of about 100 MHz.
Commercial core generators are also available from Tensilica, ARC, and Triscend [100, 101, 99].
2.6 State of Practice
The 4-, 8- and 16-bit microprocessors were and still are dominating the embedded control market. In
fact, it was forecasted that eight times more 8-bit than 32-bit CPUs would be shipped during 1999 [89].
The 32-bit embedded processor market differs from the desktop market in that there are about 100
vendors and a dozen instruction set architectures to choose from. The thing that makes 32-bit embedded
CPUs attractive is their ability to handle emerging new consumer demands in the form of filtering, artificial
intelligence, and multimedia, while still maintaining a low level of power consumption, price, etc. Next
follows a brief presentation of embedded processors commonly used today.
2.6.1 ARM
The Advanced RISC Machines (ARM) company is a leading IP provider that licenses RISC processors,
peripherals, and system-on-chip designs to international electronics companies. The ARM7 family of
processors consists of the ARM7TDMI and ARM7TDMI-S processor cores, and the ARM710T, ARM720T and
ARM740T cached processor macrocells.
An ARM7 processor consists of an ARM7TDMI or ARM7TDMI-S (S stands for Synthesizable and
means that it can be acquired as VHDL or Verilog code) core that can be augmented with one of
the available macrocells. The macrocells provide the core with an 8KB cache, a write buffer, and memory
functions. The ARM710T also provides virtual memory support for operating systems such as Linux and
Symbian's EPOC32. The ARM720T is a superset of the ARM710T and supports Windows CE.
When writing a 32-bit program for an embedded system, there might be a problem fitting the entire
program in the on-chip memory. This kind of problem is usually referred to as a code density problem.
In order to address the code size problem, ARM has developed Thumb, a new instruction set. Thumb is
an extension to the ARM architecture, containing 36 instruction formats drawn from the standard 32-bit
ARM instruction set that have been re-coded into 16-bit wide opcodes. Upon execution, the Thumb
codes are decompressed by the processor to their real ARM instruction set equivalents, which are then
run on the ARM as usual. This gives the designer the benefit of running ARM's 32-bit instruction set
while reducing code size by using Thumb.
7 Those who are not familiar with the different layers of IP-components can read the section SoC Design.
The ARM9 family is a newer and more powerful version of the ARM7, designed for system-on-chip
solutions due to its built-in DSP capabilities. The ARM9E-S solutions are macrocells intended for integration
into Application Specific Integrated Circuits (ASICs), Application Specific Standard Products
(ASSPs) and System-on-chip (SoC) products.
CPU core    Die Area            Power              Frequency  Performance
ARM7TDMI    1.0 mm2 on 0.25 µm  0.6 mW/MHz @ 3.3V  66 MHz     0.9 MIPS/MHz
ARM9E-S     2.7 mm2 on 0.25 µm  1.6 mW/MHz @ 2.5V  160 MHz    1.1 MIPS/MHz
2.6.2 Motorola
The Motorola M-CORE microprocessor, introduced in 1997, was targeting the market of analog cellular
phones, digital phones, PDAs, portable GPS systems, automobile braking systems, automobile engine
control, and automotive body electronics. The M-CORE architecture was designed from the ground up
to achieve the lowest milliwatts per MHz. It is a 32-bit processor that has a 16-bit fixed-length instruction
format and a 32-bit RISC architecture. The M-CORE minimizes power usage by utilizing dynamic power
management.
Motorola has also developed a modern version of the 68K architecture, the ColdFire, which is positioned
between the 68K (low end) and the PowerPC (high end). This architecture is also known as VL-RISC,
because although the core is RISC-like, the instructions are variable length (VL). VL instructions help to
attain higher code density. The ColdFire has a four-stage pipeline consisting of two subpipelines: a two-stage
instruction prefetch pipeline and a two-stage operand execution pipeline.
2.6.3 MIPS
MIPS Technologies designs and licenses embedded 32- and 64-bit intellectual property (IP) and core
technology for the digital consumer and embedded systems markets. The MIPS32 architecture is a superset
of the previous MIPS I and MIPS II instruction set architectures.
2.6.4 Patriot Scientific
Patriot Scientific Corporation was one of the first to develop a Java microprocessor, the PSC1000. The
PSC1000 is targeted at high-performance, low-system-cost applications like network computers, set-top
boxes, cellular phones, Personal Digital Assistants (PDAs), and more. The PSC1000 microprocessor is
a 32-bit RISC processor that offers the ability to execute Java(tm) programs as well as C and FORTH
applications. It offers a unique architecture that is a blend of stack- and register-based designs, which
enables features like 8-bit instructions for reduced code size. The idea behind the PSC1000 is to enable
Internet connectivity for low-cost devices such as PDAs, set-top cable boxes, and "smart" cell phones.
2.6.5 AMD
Advanced Micro Devices' (AMD) 29K was an early leader, frequently used in laser printers and networking
equipment. The 29K family comprises three product lines: three-bus Harvard-architecture processors,
two-bus processors, and a microprocessor with on-chip peripheral support. The core is built around a
simple four-stage pipeline: fetch, decode, execute, and write-back. The 29K has a triple-ported register
file of 192 32-bit registers. In 1995, AMD cancelled all further development of the 29K to concentrate its
efforts on x86 chips.
2.6.6 Hitachi
The Hitachi SuperH (SH) became popular when Sega chose the SH7032 for its Genesis and Saturn video game
consoles; it then expanded to cover consumer-electronics markets. Its short, 16-bit instruction word
gives the SuperH some of the best code density of almost any 32-bit processor. The SH family
uses a five-stage pipeline: fetch, decode, execute, memory access, and write-back to register. The CPU
is built around 25 32-bit registers.
2.6.7 Intel
The Intel i960 emerged early in the embedded market, which made it successful in printer and networking
equipment. The i960 is well supported with development tools. It combines a von Neumann
architecture with a load/store architecture centered on a core of 32 32-bit general-purpose registers.
All i960s have multistage pipelines and use resource scoreboarding to track resource usage.
2.6.8 PowerPC
The PowerPC is one of the best-known microprocessor names next to Pentium and is steadily gaining
ground in the embedded space. IBM and Motorola are pursuing different strategies with their embedded
PowerPC chips, with the former inviting customer designs and the latter leveraging its massive library
of peripheral I/O logic.
2.6.9 SPARC
Sun's SPARC was the first workstation processor to be openly licensed and is still popular with some
embedded users. The microSPARC is built around a large multiported register file that breaks down
into a small set of global registers for holding global variables and sets of overlapping register windows. The
microSPARC's pipeline consists of an instruction-fetch unit, two integer ALUs, a load/store unit, and an
FPU.
2.7 Improving Performance
Pipelining is a way of achieving a level of parallelism, resulting in a low CPI count. To be even more
effective, linear pipelining will not suffice and other techniques have to be considered. These
techniques have the ability to execute several instructions at once, resulting in a CPI count below 1.0.
The most popular techniques include multiple-issue processors (such as Very Long Instruction Word
(VLIW) and superscalar processors), multithreading, Simultaneous Multithreading (SMT), and the Chip
Multiprocessor (CMP). In addition, a technique called prefetching, or preloading, will be discussed; it tries
to come to terms with the ever-growing memory-CPU speed gap by fetching and storing required data or
instructions in a buffer before they are actually needed, thus hiding the memory latency.
2.7.1 Multiple-issue Processors
Although there are techniques that can remedy most of the stalls in an ordinary pipeline, the ideal
result is still only a CPI count of 1.0, that is, executing exactly one instruction every machine cycle.
This performance is not always enough, and other ways of achieving a higher level of parallelism need
to be considered. Multiple-issue processors try to execute several instructions per machine cycle, thus
achieving a higher rate of Instruction-Level Parallelism (ILP). There are mainly two types of processors
using these techniques, namely Very Long Instruction Word (VLIW) and superscalar processors. In
addition to these two architectures, a third alternative, called the Multiple Instruction Stream
Computer (MISC), will be discussed.
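The CPI arithmetic above can be made concrete with a small sketch (the function name is ours, purely illustrative): a processor that sustains more than one instruction per cycle has a CPI below 1.0.

```c
#include <assert.h>

/* CPI = total cycles / instructions executed. A multiple-issue processor
   that sustains k instructions per cycle reaches CPI = 1/k, i.e. below
   1.0 for k > 1, whereas an ideal linear pipeline bottoms out at 1.0. */
static double cpi(long cycles, long instructions)
{
    return (double)cycles / (double)instructions;
}
```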
As the name implies, a VLIW processor issues a very long instruction packet that consists of several
instructions. An example of an instruction packet can be seen in Figure 6, where there is room for two
integer/branch operations, one floating-point operation, and two memory references. In the case of VLIW
processors, the task of finding independent instructions in the code is done by the compiler instead of by
dynamic hardware as in superscalar processors. Additional hardware is saved because the compiler always
performs this scheduling statically, before the program runs.
Figure 6: A VLIW instruction packet with slots for integer/branch, floating-point, and memory operations.
2.7.2 Multithreading
A process or thread is an abstract entity that performs tasks [1]. The aim of multithreading is to divide
a program into several threads that can switch among each other when a communication event occurs.
Hopefully, events with long latencies can then be tolerated by switching to a new thread when they occur.
Multithreading implemented in hardware has several benefits:
- If the program is divided by the programmer into several threads, no special software analysis is needed.
- It handles unpredictable situations (such as cache misses and communication misses) well.
- It does not reorder instructions like some of the other techniques, thus preserving memory consistency.
Its drawbacks are that significant changes have to be made to the microprocessor architecture, and that it
has not been particularly successful in uniprocessor systems. Therefore this technique is only discussed
briefly, giving room for another technique that is an extension of multithreading, called simultaneous
multithreading.
Figure 9: The partitioning of issue slots between superscalar, multithreading, and simultaneous multithreading. Source: Dean M. Tullsen [75]
2.7.3 Simultaneous Multithreading
Multiple-issue processors have their limitations because they rely only on the parallelism of instructions,
which is inherently limited due to the natural dependencies within a program [74, 75]. A multiprocessor
consisting of several multiple-issue processors is a good solution that tries to combine ILP with Thread-Level
Parallelism (TLP), but has limited performance when the threads run out of parallelism. Simultaneous
multithreading is a processor design that combines the earlier techniques in order to exploit
parallelism at both instruction level and thread level as much as possible. From superscalars it takes the
ability to issue multiple instructions each cycle, and as in multithreaded processors it contains hardware
state for several threads. The result is a processor that can issue multiple instructions from multiple
threads each cycle. In addition to being capable of parallelism at both instruction and thread
level, it performs well when executing a single thread, because that thread has all the resources to itself,
thus matching the performance of an ordinary superscalar architecture. On the hardware side, SMT is simply
an add-on component that gives a conventional superscalar architecture the ability to handle multiple
threads.
The difference between superscalar, multithreading, and simultaneous multithreading can be seen in
Figure 9. Each row represents the issue slots for a single execution cycle. A filled box means that the
processor has found an instruction to execute in that issue slot at that particular cycle; empty boxes
indicate unused slots. Figure (a) shows how a conventional superscalar might execute its instructions.
A superscalar tries to execute as many instructions as possible in a program or thread. When it is
hindered by dependencies between instructions, it must stall and wait for them to be resolved, resulting
in both horizontal and vertical waste. Figure (b) shows a multithreaded architecture executing the same
program. Here the architecture still suffers from the dependencies between instructions, but can switch
threads when long-latency events occur. This results in a similar horizontal waste as in the superscalar
architecture, but an improvement in vertical waste, due to its ability to tolerate latency. Figure (c) shows
how an SMT architecture can issue several instructions from several threads in each cycle. It achieves
ILP by choosing instructions from several threads. If one thread has a high level of ILP, it can fill the
issue slots on its own, whereas if there is poor ILP among several threads, they can run together to fill
the issue slots. This results in little waste both vertically and horizontally.
The overall benefits of SMT have been discussed, but one serious drawback is that SMT has never been
implemented [97]. The results from evaluating the architecture have been acquired by simulation. There
is a positive side to this: Digital Corporation has announced in its roadmap that the next Alpha
processor will use SMT [131]. Even though SMT hasn't been implemented, there are many research
groups:
- Washington - Simultaneous Multithreading
- Illinois Urbana-Champaign - I-ACOMA
- UC Irvine - Multithreaded Superscalar
- UC Santa Barbara - Multistreamed Superscalar
- Michigan - Simultaneous Multithreading
2.7.4 Chip Multiprocessor
To understand what multiprocessor systems are about, a "standard taxonomy" of computers originally
stated by Flynn is often used. Computers are classified by the way instructions and data are provided
to the system [2]:
SISD Single Instruction Single Data (SISD) computers include the ordinary computer that decodes one
instruction at a time.
SIMD Single Instruction Multiple Data (SIMD) is the classical form of array processors. Here, several
processor units are controlled by a single control unit. The processor units receive the same
instructions and addresses but they operate on different data.
MISD Multiple Instruction Single Data (MISD) usually refers to pipelined computers. The pipeline
stages can be seen as several instruction streams that flow through a gradually transforming data
stream.
MIMD Multiple Instruction Multiple Data (MIMD) applies to multiprocessor systems, on-chip or not.
The processor units in a multiprocessor system are coupled in order to be able to exchange data
and to synchronize with each other.
Today's processor designs often make use of sophisticated architectural features to try to find independent
instructions within a program or thread. Examples of such techniques are out-of-order execution and
speculative execution of instructions after branches predicted with dynamic hardware branch-prediction
techniques. Future performance improvements will require "wider" uniprocessors that can issue more
instructions at a time. This will of course make life harder for the people who have to design and verify
these processors. One way of exploiting multiple threads of control is to spread them out over several
simpler processors in a multiprocessor system. A multiprocessor system implemented on a single die is
often referred to as a Chip Multiprocessor (CMP) or Multiprocessor System-on-Chip (MSoC). The
advantage CMP has over the other architectures is that it consists of duplicated simple cores. This way, the
designer only needs to verify a single core, which is much easier than verifying an SMT or multiple-issue
processor. Because CMP uses relatively simple single-thread processor cores, it will not be able
to achieve the same level of ILP as an SMT architecture. Not everything is simple about CMP, because it
must deal with the same issues as normal multiprocessor systems, namely cache coherence, consistency,
and synchronization.
An example of a CMP project is the Stanford Hydra (see the Research section in the Introduction for information).
2.7.5 Prefetching
Prefetching is a technique for hiding the memory latency that grows every year due to the growing
difference in speed between the CPU and memory. The memory latency is hidden by overlapping it
with useful instructions. This technique can be implemented both in hardware and in software. The
hardware approach often involves communication with or modifications to the cache, therefore it will be
discussed in section 4.6. Although there are techniques for both instruction and data prefetching,
most focus will be on data prefetching.
Software-directed prefetching is often done during the optimization phase and relies on the compiler to
do static program analysis in order to insert prefetch instructions at selected places in the code. The
strategy is to place the prefetch instruction early enough before the data is needed, so that the entire
latency is hidden. The instruction cannot be placed too early though, because if the prefetched data
stays in the cache too long, it might be replaced just before it is needed. Early prefetches might also
replace currently used data, resulting in cache pollution. The following example shows how several
prefetching techniques are applied [70].
The following code is written by the programmer, and the aim is to start prefetching the data, in this
case elements of the array a, before it is needed. This code will also show how loop-based prefetching works.
for(i = 0; i < N; i++)
    sum = a[i] + sum;
This loop sums the elements of an array. Assuming a cache block size of four words, this code segment
will cause a cache miss every fourth iteration. To try to avoid this cache miss, a prefetch instruction is
inserted just before the computation of the sum. The prefetched element is fetched one loop iteration
before it is needed.
for(i = 0; i < N; i++) {
    fetch( &a[i+1]);
    sum = a[i] + sum;
}
The observant reader can see in the code listed above that too many prefetch instructions are issued:
because the cache block has a size of four words, four consecutive elements are already fetched by a
single prefetch instruction. A solution could be to insert a predicate that checks whether the loop has
iterated four times since the last issued prefetch instruction. This results in wasted cycles, so another
technique, loop unrolling, is considered. With loop unrolling, the loop body is replicated by a factor
equal to the cache block size, in this case four times. The prefetch instruction will now prefetch only
every fourth element of the array, as shown in the code below.
for(i = 0; i < N; i+=4) {
    fetch( &a[i+4]);
    sum = a[i] + sum;
    sum = a[i+1] + sum;
    sum = a[i+2] + sum;
    sum = a[i+3] + sum;
}
This will still result in unnecessary cache misses, because during the first loop iteration the prefetch
issued has not yet taken effect. Also, an unnecessary prefetch is issued in the last iteration of the loop.
Another technique, shown below, software pipelining, extracts the first and last iterations of the loop
into a prologue that issues a prefetch and an epilogue that does not issue a prefetch.
fetch( &sum );
fetch( &a[0] );
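The listing above appears truncated in this copy. A self-contained sketch of the complete software-pipelined version (reconstructed to follow the structure of the prefetch-distance example later in this section; `fetch` is a stand-in macro here, since a real prefetch would be a compiler builtin) might look like:

```c
/* fetch() stands in for a data-prefetch instruction (a no-op in this
   sketch; a real compiler would emit an actual prefetch). */
#define fetch(p) ((void)(p))

static int sum_with_prefetch(const int *a, int N)
{
    int sum = 0, i;
    fetch(&sum);
    fetch(&a[0]);
    /* main loop: prefetch one cache block (4 words) ahead */
    for (i = 0; i + 4 <= N; i += 4) {
        fetch(&a[i+4]);
        sum += a[i] + a[i+1] + a[i+2] + a[i+3];
    }
    for ( ; i < N; i++)   /* epilogue: no prefetch */
        sum += a[i];
    return sum;
}
```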
Now, all references are covered because a prefetch instruction is issued before the corresponding data
is used. One problem remains though: the compiler has to be sure that the prefetch fetches the data
fast enough for it to be available when it is used. If a prefetch takes more than one cycle to complete,
as in the examples shown above, the prefetch distance has to be calculated. The prefetch distance, δ,
is calculated by dividing the average cache-miss latency, l, by the estimated number of cycles in the
shortest possible execution path through one loop iteration, s, including any prefetch overhead:

δ = ⌈l / s⌉

Assuming an average cache-miss latency of 100 cycles and a minimum loop iteration time of 45 cycles,
the prefetch distance will be ⌈100/45⌉ = 3. This means that the prefetch instruction has to be issued
three loop iterations earlier, as shown in the code below.
fetch( &sum );
for(i = 0; i < 12; i += 4)
    fetch( &a[i] );
for(i = 0; i < N-12; i += 4) {
    fetch( &a[i+12]);
    sum = a[i] + sum;
    sum = a[i+1] + sum;
    sum = a[i+2] + sum;
    sum = a[i+3] + sum;
}
for( ; i < N; i++)
    sum = a[i] + sum;
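The prefetch-distance calculation reduces to an integer ceiling division, which can be sketched as follows (the function name is illustrative, not from the original text):

```c
#include <assert.h>

/* delta = ceil(l / s): l is the average cache-miss latency, s the number
   of cycles in the shortest path through one loop iteration. */
static int prefetch_distance(int l, int s)
{
    return (l + s - 1) / s;   /* integer ceiling division */
}
```

With the values from the text, l = 100 and s = 45 give a distance of 3.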
A natural question is what performance gains can be expected by using these techniques. These
techniques are usually restricted to loops, which are common in scientific programs, where prefetching
can nearly double the performance []. General applications often reuse the same data in loops, resulting
in high cache utilization, which diminishes the benefits of prefetching. Prefetching also has some
negative side effects:
- Increased total code size as a result of the inserted prefetch instructions.
- Executing prefetch instructions (especially calculating the prefetch address) takes time, thus increasing total execution time.
Prefetching will also introduce new challenges in a multiprocessor environment [67]:
- Prefetches increase memory traffic.
Technique        Benefits                               Drawbacks
VLIW             Reduced architecture complexity        Complex compiler needed. Exploits only ILP.
Superscalar      Can reach high ILP                     Complex architecture. Hard to test and verify.
Multithreading   Hides latencies well                   Needs changes to architecture.
SMT              Exploits both TLP and ILP              Complex architecture. Not implemented yet.
CMP              Simple cores. Good TLP, moderate ILP   Must consider multiprocessor aspects.
2.8 Measuring Performance
A performance improvement is not worth much if it can't be measured. It is also important to measure
it the right way, so that the measurement tells the truth about the characteristics of an embedded
processor. There are several options available, each with its advantages and drawbacks. Some of these
options are discussed here.
2.8.1 Benchmarking
Measuring the performance of an embedded processor has been problematic for a long time. Benchmarks
such as Whetstone and Dhrystone do not really give a true picture of a processor's performance.
There is no meaning in measuring an embedded processor's MIPS count, because it does not actually
say whether those executed instructions were doing anything (they could be NOPs). An even more serious
error is to divide the MIPS count by the processor's power consumption, resulting in MIPS/Watt
comparisons that should not be taken seriously. Moreover, CISC and RISC architectures differ in
terms of the amount of "work" a single instruction performs [78].
To come to terms with this problem, the EDN Embedded Microprocessor Benchmark Consortium (EEMBC,
pronounced "embassy") was founded in April 1997 to develop performance benchmarks for processors
in embedded applications. EEMBC comprises suites for the Automotive/Industrial, Consumer, Networking,
Telecommunications, and Office Automation markets. These benchmarks target
specific applications that include engine control, digital cameras, printers, cellular phones, modems, and
more. With the assistance of several industry experts, the consortium created 37 individual algorithms
that constitute EEMBC's Version 1.0 suite of benchmarks. Today, EEMBC consists of 30 semiconductor
company members and 3 compiler company members. The annual fee for a new member is $30,000. The
EEMBC benchmark suite code is available to all board members.
One of the key components developed at EEMBC is a portable benchmarking test harness that runs
on a variety of host platforms, interfacing to a number of "benchmark target" platforms. The test harness
consists of:
- A standard API for benchmark support, including file I/O download.
- A benchmark loader.
- A consistent, repeatable execution environment.
- A framework for fully automated execution of benchmarks.
- Standardized reporting, diagnostics, and log files.
The test harness has been ported to little- and big-endian target boards, from 16- to 64-bit. It includes
support for both Z-modem and uuencoding uploads/downloads, and takes up a very small amount of
memory on the target. An initial look at the first benchmark results reveals that several results for a
specific processor are available, depending on which compiler is used. Performance seems to be measured
as iterations/sec (no explanation found) and the code size in KB. The processor that reaches the most
iterations per second and has the smallest code size is at the top of the list, according to the benchmark
results.
Other benchmark organizations include Berkeley Design Technology (independent DSP analysis and
optimized DSP software), Nullstone Corporation (automated compiler performance analysis), the Standard
Performance Evaluation Corporation (SPEC), which develops standardized sets of relevant benchmarks
and metrics for performance evaluation of modern computer systems, and the Business Applications
Performance Corporation (BAPCo), which develops and distributes a set of objective performance
benchmarks based on popular computer applications and industry-standard operating systems.
2.8.2 Simulation
A more inexpensive and flexible way of measuring the performance of an embedded processor might be
to build a simulator. Most (or all?) available simulators seem to be custom made to suit a specific purpose.
Simulation is a good way to evaluate different options when designing a system, but should not be used
to draw major conclusions concerning performance. After all, simulation is done in software and cannot
give a 100% true picture. An example of a simulator is the SMTSim multithreading simulator written
by Dean M. Tullsen, which has been used in two of his published articles concerning multithreading.
2.9 Trends and Research
This section takes a look at what might lie ahead for the embedded processor of tomorrow and where
research efforts continue its development.
There has been an increase in DSP functionality among embedded processors in recent years. This is
happening because many embedded systems today include both a CPU and a DSP, where the DSP helps
the CPU with numerical calculations. As CPUs get faster, they may take over the job of the DSP and
replace it in the future. On the other hand, DSPs are getting faster and more complex by the day, and
there are arguments suggesting that the DSP should include "CPU functionality". The future will
show who wins this battle, or whether there will always be separate CPUs and DSPs [?].
The steady improvement of process technology indicates that future embedded processors might
reach a core voltage below 1.0 V in the near future, resulting in very low power consumption. More
companies are also beginning to realize the benefits of selling embedded processors as IP components.
This will likely contribute to increasing competition and, hopefully, cheaper products.
2.9.1 University
The Stanford Hydra project has already been discussed in the introduction and is only mentioned
here as one of the largest research projects conducted on CMP architectures. Other, smaller CMP efforts
are the Wisconsin Multiscalar, Carnegie Mellon STAMPede, and MIT M-Machine projects.
2.9.2 Industry
ARM has a Technology Access Program (ATAP) in which it helps several companies and research groups
with their designs of embedded processors and SoCs. Companies currently cooperating with ARM through
ATAP are Barco Silex, Cadence, Sican, and Wipro Infotech. IBM has announced plans to make CMPs in
the form of the IBM Power4. This is the first commercially proposed CMP targeted at servers and other
systems that already make use of conventional multiprocessors. Sun also has plans to build a CMP, the
Sun MAJC. It has a shared primary cache and is designed to support Java execution.
3 Interconnect
This chapter focuses on the communication channels that tie the components (CPUs, memories, peripherals,
etc.) of a computer system together, called the interconnect. This topic is vast and a number of
books and articles have been written about interconnects, so we concentrate on the architectural
part of the interconnect, restricted to single-processor computers and small-scale parallel computers, with
emphasis on SoC. First, some basics about bus-based interconnections are presented, before going into
more complex switched interconnection networks. At the end of the chapter, SoC interconnection details
are presented, followed by an overview of existing SoC interconnections.
3.1 Introduction and basic definitions
Since the different components in a computer system need to communicate with each other, there is a need
for communication channels between them. Communication can be divided into hierarchical layers [33].
The lowest layer, the physical layer, deals with the physical wires and drivers. Timing of information at
this level is of key importance. At the next layer, the transfer layer, there is a set of rules called a protocol
that sets up how transactions are performed. A transaction layer can be introduced to enable point-to-point
communication between components. At the top, the application layer, there is no information
about how data is transferred, only that it is sent from the source and delivered to the destination. The
two main architectures for interconnections are shared-media and switched-media interconnections [8].
Designs using the shared-media approach are often called bus-based solutions, and switched media is
referred to as point-to-point designs. In point-to-point architectures the communication channel is always
shared between exactly two devices, in contrast to bus-based architectures where more than two devices
may share a channel. Figure 10 shows the different architectures. Point-to-point interconnections
will be further examined in section 3.4.
Figure 10: The left figure shows a bus-based system where the devices A, B, and C communicate via
a common bus. In the true point-to-point approach shown on the right side, any of the devices can
communicate with another without interference on the communication channels between them.
3.2 Bus based architectures
A bus consists of a number of electrical wires and rules that govern how transfers of data are done on
the bus. In a completely parallel bus each bit used in the protocol has its own dedicated wire, in contrast
to a completely serial bus where all the information is multiplexed onto a single wire. Multiplexing means
that a line is shared by two or more individual signal sources sending information over the line at different
points in time. Which source is using the line is controlled by a multiplexer. For example,
using multiplexing to form one bus from two separate 32-bit address and data buses reduces the
number of signal pins from 64 to 32. In practice, most computer buses are parallel, and in some cases
multiplexing is used on a subset of the signals. Transfers on a bus are initiated by a bus master. The
master writes or reads data from other units attached to the bus, called slaves. A simple example is
a CPU (acting as a master) reading data from memory (the slave). A generic term for masters and slaves
is devices. The sequence of actions, from obtaining bus ownership to transferring data and breaking the
connection, is called a transaction. Several devices may have the ability to be a master on a bus, but
only one is allowed to be in control of the bus at any time. Therefore there is a need to resolve potential
contention between them via an arbitration process handled by an arbitration unit.
3.2.1 Arbitration mechanisms
There are two major aspects of arbitration. One is the algorithm that decides which device will
get mastership over the bus. The second is where the arbitration is done. The basic arbitration
algorithms are explained below.
Round-robin The round-robin or fair algorithm serves bus ownership requests sequentially. It behaves like a
FIFO queue where every device that wants to become a master is put at the end of the queue. The
device at the front of the queue obtains bus ownership, and when its data transactions are
finished the entry is removed from the queue.
Priority Priority algorithms grant the bus to the device that has been given the highest priority. Depending
on the implementation, an ongoing lower-priority transaction may be interrupted if a device with higher
priority wants to use the bus. Priorities can be static or dynamic.
Hierarchical round-robin A combination of the priority-based algorithm and round-robin is the hierarchical round-robin,
where several levels of FIFO-based queues exist. If any device is waiting for ownership at
the highest level, it is granted the bus. Otherwise a master is selected from the nearest level
below containing a non-empty queue.
Time-shared A time-shared bus uses arbitration where every device has been given a specific time-interval, a
time-slot, in which it is the bus master. Every device masters the bus at a different
point in time, thus eliminating contention. The schedule of ownership is repeated at a given rate, as
the example in Figure 11 shows.
Figure 11: The three devices A, B, and C have each been given their own time-slot in which they act as a master.
The slot marked X is an unused slot. The schedule is repeated periodically with period T.
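The round-robin algorithm described above can be sketched as a small FIFO of requesters (a software model with illustrative names; a real arbiter is of course hardware, this only captures the queueing behaviour):

```c
#include <assert.h>

#define MAX_DEVICES 8

/* A minimal round-robin (fair) arbiter: requests are queued FIFO,
   and the device at the head of the queue is granted the bus. */
struct arbiter {
    int queue[MAX_DEVICES];
    int head, tail, count;
};

static void arb_request(struct arbiter *a, int device)
{
    a->queue[a->tail] = device;
    a->tail = (a->tail + 1) % MAX_DEVICES;
    a->count++;
}

/* Returns the device granted mastership, or -1 if no requests are pending. */
static int arb_grant(struct arbiter *a)
{
    if (a->count == 0)
        return -1;
    int device = a->queue[a->head];
    a->head = (a->head + 1) % MAX_DEVICES;
    a->count--;
    return device;
}
```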
The centralized arbitration architecture has a single arbitration unit receiving separate request signals
from each master. After requesting the bus, competing devices must wait for the arbiter to set the
corresponding grant signal before using the bus. By separating the arbitration and transaction buses,
arbitration is permitted to occur in parallel with data transfers, thus increasing bus performance.
Distributed arbitration is used in many newer buses. In this solution most of the arbitration
circuitry is located at each master. During arbitration, each competitor for the bus sets its own request
line and then listens to the other request lines before deciding whether mastership is obtained or not.
3.2.2 Bus cycles
After a device has gained control over the bus, it can begin to carry out bus cycles. A bus cycle is an exchange
of information between master and slaves, including data and timing information. In synchronous bus
architectures the bus includes a clock in the control lines and a protocol that samples the lines at fixed
times with respect to the clock [8]. All the devices must obey the same distributed clock, which results in
a limitation of the bus length to avoid clock skew. The duration of a bus cycle is constrained by the
rate of the clock, which is often constrained by the slowest unit attached to the bus [60]. By introducing
a wait protocol, a decrease of overall performance can be avoided. An asynchronous bus does not have a
clock line at all. Instead, handshaking protocols between master and slave use timing signals
to indicate the validity of the information. Handshaking protocols in asynchronous systems take full
advantage of the speed of fast-responding devices without having to be concerned about slower devices.
Asynchronous buses do not have the same restrictions on length, since there is no worry about clock skew
or synchronization problems.
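The asynchronous handshake described above can be modelled in software as a four-phase (return-to-zero) protocol; the struct fields below stand in for the req/ack wires and data lines, and all names are our own:

```c
#include <stdbool.h>

/* Four-phase handshake: the master raises req with valid data, the slave
   latches the data and raises ack, then both lines return to zero before
   the next transfer can begin. */
struct handshake {
    bool req, ack;
    int data, latched;
};

static void master_put(struct handshake *h, int value)
{
    h->data = value;   /* drive the data lines            */
    h->req = true;     /* signal that the data is valid   */
}

static void slave_take(struct handshake *h)
{
    if (h->req && !h->ack) {
        h->latched = h->data;  /* capture while valid */
        h->ack = true;         /* acknowledge         */
    }
}

static void master_release(struct handshake *h) { if (h->ack) h->req = false; }
static void slave_release(struct handshake *h)  { if (!h->req) h->ack = false; }
```

Because each step waits on the other side's signal rather than a clock edge, a fast slave completes the cycle quickly while a slow one simply delays its ack, which is exactly the property described above.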
3.2.3 Performance metrics
Bus performance is a function of bus width, bus arbitration, clock speed, bus protocol, and the application
using the bus. Two very common performance measures for buses are bandwidth (also known as throughput)
and latency. Bandwidth is often measured in megabytes per second (MB/s) and defined as the amount
of data transmitted or received per unit time. When talking about bandwidth, people usually mean burst
throughput or peak throughput, which is the theoretical rate at which reads can be performed using the
largest possible burst cycle. A burst transfer is characterized by sending only one address on the bus
followed by multiple data transfers. This mechanism is also called the block transfer mechanism [55] and
is for example implemented in the PCI bus and FutureBus+ (see section 3.3). In addition to throughput,
latency is an important measure, especially for multimedia systems. Bus latency is the response time
from the point in time where a device wants to read (or send) data until the first data is read (or sent).
The fact that CPUs are getting faster more quickly than memories has made the initial latency a dominant
factor in bus usage [55]. More time is spent waiting out the initial latency than transferring the data
from memory to CPU. After transferring the last data, some buses require time before leaving the bus
in an idle state. This time is referred to as turn-around latency. By decreasing the latency, throughput on
the bus can be increased significantly.
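The interplay between initial latency and burst length can be made concrete with a small sketch. The numbers below (4-byte bus, 100 MHz, 10-cycle latency) are illustrative assumptions, not figures from the text:

```python
def effective_bandwidth(bus_width_bytes, clock_mhz, initial_latency_cycles, burst_len):
    """Effective throughput (MB/s) of one burst transfer: the initial
    latency is paid once, then one bus-width word moves per cycle."""
    total_cycles = initial_latency_cycles + burst_len
    bytes_moved = bus_width_bytes * burst_len
    return bytes_moved * clock_mhz / total_cycles

# Peak (burst) throughput: a 4-byte bus at 100 MHz with no latency.
peak = effective_bandwidth(4, 100, 0, 16)

# With a 10-cycle initial latency, a short burst spends most of its time
# waiting, while a long burst amortizes the latency over many data cycles.
short_burst = effective_bandwidth(4, 100, 10, 4)
long_burst = effective_bandwidth(4, 100, 10, 64)
```

As the burst length grows, the effective rate approaches the peak rate, which is why block transfers pay off on latency-dominated buses.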
3.2.4 Pipelining and split transactions
When an address is transferred between a master and a slave, the address bus is occupied until the
slave accepts the address and completes the handshake. The situation is similar for the data bus. By
implementing a latch in the bus interfaces at both sides, the bus is decoupled from the device. This allows
address packet n to be held by the latch at the target, freeing the initiator of the transaction to
send the next address, packet n+1, immediately. Using this technique the actions are still serialized, but several
requests are in flight at the same time. This pipelining technique is applicable to data, address and arbitration
alike. The arbitration unit may respond to a requesting device: "you may use the bus next time it goes idle".
In a split-transaction bus the read request is separated from the data transfer, enabling devices to
make read requests during the initial latency of the first request. Patterson and Hennessy do not
separate split-transaction buses from pipelined buses [8]. Using split transactions does not decrease the
initial latency, but it does increase the utilization of the bus. A drawback with these two techniques is the
increased complexity.
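The utilization gain from split transactions can be sketched with a simple cycle count. The model is idealized (it assumes the slave can overlap all outstanding requests) and the numbers are our own assumptions:

```python
def total_cycles(n_reads, latency, data_cycles, split=False):
    """Cycles to complete n reads. Without split transactions the bus is
    held during each read's initial latency; with them, later requests are
    issued while earlier ones are still waiting, so only the first
    latency is exposed on the bus (idealized overlap)."""
    if split:
        return latency + n_reads * data_cycles
    return n_reads * (latency + data_cycles)

# 8 reads, 10-cycle initial latency, 4 data cycles each.
serial = total_cycles(8, 10, 4)               # bus held through every latency
overlapped = total_cycles(8, 10, 4, split=True)  # latencies hidden after the first
```

The per-read latency is unchanged in both cases, matching the text's point that split transactions improve utilization, not initial latency.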
3.2.5 Direct Memory Access
A Direct Memory Access (DMA) controller allows devices to transfer data from one location to another
without intervention from the CPU. This means that the DMA controller must act as a master on the
bus. The DMA controller has at least four registers, all of which can be loaded by software from the CPU:
the memory address to be written, a count of how many bytes to transfer, the I/O device's number
or its address space, and the direction of the data (from memory to I/O or from I/O to memory). The
DMA controller has two modes of operation:

In burst mode the CPU cannot use the bus during the DMA data transfer. The
CPU is blocked until the DMA has finished, but data is moved very fast.

In cycle stealing mode the DMA is only allowed to make transfers when the CPU is not using
the system bus. Using this method the DMA transfer rate is limited to the bus width per CPU
instruction cycle.

To initiate a DMA transfer the CPU must set up the registers. After the DMA has finished the transaction
it interrupts the CPU [8]. In order to prevent the DMA from monopolizing the bus and causing very high
latency to other devices, a limit is typically placed on the amount
of data that can be transmitted as a single DMA block. Lahiri, Raghunathan and Dey [130] observed
that the DMA block transfer size can significantly affect system performance and that the optimal value for
the size depends heavily on the characteristics of the traffic seen on the bus.
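The four registers and two transfer modes described above can be sketched as a toy model. The class and register names are ours, not those of any real device:

```python
class DMAController:
    """Toy model of a DMA controller with the four registers named in the
    text. burst() moves everything in one bus tenure; steal_cycle() moves
    at most one word, and only when the CPU is off the bus."""

    def __init__(self):
        self.mem_addr = 0       # memory address to read/write
        self.count = 0          # bytes left to transfer
        self.device = 0         # I/O device number (or address space)
        self.to_memory = True   # direction: I/O -> memory if True

    def program(self, mem_addr, count, device, to_memory):
        """The CPU loads all four registers before starting a transfer."""
        self.mem_addr, self.count = mem_addr, count
        self.device, self.to_memory = device, to_memory

    def burst(self):
        """Burst mode: the CPU is locked off the bus; all bytes move at once."""
        moved, self.count = self.count, 0
        return moved

    def steal_cycle(self, bus_free, word_bytes=4):
        """Cycle stealing: transfer one word only when the bus is free."""
        if not bus_free or self.count == 0:
            return 0
        moved = min(word_bytes, self.count)
        self.count -= moved
        return moved
```

A real controller would also raise an interrupt when `count` reaches zero, as the text describes.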
3.2.6 Bus hierarchies
Choosing a single bus for the whole system can create a performance bottleneck, since only one master
can communicate with one or more slaves at a given point in time. When extending the number of buses
beyond one, parallelism is introduced and therefore potentially higher performance. These kinds
of buses are called multisegment buses [60] and the individual buses in the system are called segments. A system
of n segments could have n masters making n transactions simultaneously. An example of a 2-segment
bus is shown in figure 12. The two segments are interconnected via a segment interconnect, often called
a bridge.
Figure 12: A multisegment bus example. The devices marked with M and S are masters and slaves
respectively. Segments are interconnected via a bridge marked B.
The hierarchical structure of multiple buses within a system is typically organized by bandwidth
demands. An example is figure 12, where the first segment could be the system bus connecting the
CPU to high performance units. The second segment is a slower bus for peripherals demanding less
bandwidth. Splitting the bus into several segments has great benefits:

By splitting the bus, smaller drivers can be used, which results in significantly lower power consumption [57]. Using a lower clock frequency on the buses with lower bandwidth demands decreases the
consumption even more, since power consumption is proportional to the frequency.
Reducing the bus wire length reduces the amount of capacitive coupling noise [57] (this type of noise
also occurs sometimes in telephony systems when, during a call, another call is heard in the background
at low volume).
Hsieh and Pedram [57] found that using a split architecture, power consumption can be reduced by
16 to 50 percent in comparison with a single bus structure. Instead of connecting the two buses via a bridge,
it is also possible to have them totally separated from each other by connecting them to two different
interfaces at the CPU.
3.2.7 Connecting multiprocessors
A multiprocessor system where all the processor nodes are connected to a common bus is called a shared
bus multiprocessor. The shared bus is a very popular solution for interconnecting several processors, as
evidenced by the number of commercial multiprocessor systems using this solution [5]. Examples of
shared bus based computers are the Silicon Graphics (SGI) Challenge and the Sun Enterprise. There are several
reasons for this popularity. First of all, a shared bus is very easy to implement and expand, and it makes
a natural medium for broadcasting. This is also a preferable choice for Shared Memory Multiprocessors
(see section 4.5.1), since the bus structure enables implementation of snoopy cache coherence protocols
[5]. Bus snooping is for example implemented in FutureBus+, described in section 3.3.1. To allow the
system software to build complex locking mechanisms for mutual exclusion, a bus can be extended with
a bus locking mechanism in hardware [55].
3.3 Standard buses with multiprocessor support
In this section three bus standards and their properties are briefly reviewed. As the heading indicates,
the buses have some form of support for multiprocessing, which of course does not prevent them from being
used in single processor systems. Using standard buses simplifies the design and implementation of a
complex system.
3.3.1 FutureBus+
The work on the original IEEE FutureBus standard began as early as 1979; it later evolved into
FutureBus+. FutureBus+ is an architecture, processor, and technology independent standard with no
technological upper limits [55]. As technology improves, the asynchronous bus can go faster and faster,
since the only physical limitation is the speed of light. The specification includes split transactions and a
message passing protocol for efficient multiprocessor communication. Fault tolerance has also been kept in
mind, resulting in parity checks on all lines and distributed arbitration to reduce the risk of a single
point of failure. FutureBus+ basically contains two individual buses. One is a 64-bit wide multiplexed
bus responsible for all address and data transfers. The second bus is an arbitration bus that can be
used in parallel with the other bus, which hides some of the latency associated with a distributed arbitration
protocol for gaining bus ownership. A bus locking mechanism is also supported. To provide support
for cache coherence, FutureBus+ integrates a MESI cache coherence protocol (see section 4.5.5). Bus-to-bus
bridges are used to connect FutureBus+ to other widely used buses such as VME, Multibus II and Scalable
Coherent Interface (SCI). FutureBus+ was used as the global bus in the implementation of the shared memory
multiprocessor DICE [62].
3.3.2 VME
The VMEbus supports data widths from 16 up to 64 bits, and address widths from 24 up to 64 bits. The
burst throughput is 80 MB/s. All
popular microprocessors are supported by VME, including the Motorola 680X0 series, SPARC, ALPHA,
and x86-based processors [55].
3.3.3 PCI
The Peripheral Component Interconnect (PCI) [55] is a local bus that solves the compatibility problem
in an elegant way. By interfacing to PCI instead of the CPU bus, peripherals remain useful even if the
CPU is exchanged. The system can also incorporate an ISA, EISA or MicroChannel bus and adapters
compatible with these buses. In PCI the basic transfer is a burst. Addresses and data are multiplexed onto
the same 32-bit address/data bus. With PCI it is possible to build hierarchies of buses connected
to each other via bridges. Arbitration is done centrally, but the arbitration algorithm can be specified by
the user. A master may remain owner of the bus as long as no other device requests the
bus. This feature is called bus parking. The PCI bus is used in the Scalable Architecture For Real-time
Applications (SARA), a current research project in scalable parallel systems [90].
3.4 Point-to-point interconnections
Between the two extreme interconnection architectures, the single shared bus and a fully connected
point-to-point interconnect where a node has a direct link to all other nodes10, a number of interconnect
topologies have been examined in the literature. Figure 13 shows an abstract view of a general
interconnection network. In this section the contents of the interconnection network and some design
aspects will be briefly discussed.
Figure 13: An abstract view of an interconnection network. The modules marked M, P and H are
memories, processors and peripherals respectively.
In contrast to the shared bus approach, using switches that allow communication directly from source
to destination makes it possible for several pairs of nodes to communicate simultaneously. The technique
is used in shared memory multiprocessors, where the switches route requests from a processor to one
of several different memory modules. Moving from the shared bus architecture to a more complex
interconnection network introduces many new terms and design considerations. These considerations
concern the topology, switching strategy, routing algorithm and flow control mechanism:

Topology
The interconnect structure at the physical level, i.e. how the nodes are connected to each other, is
called its topology. The topology is a major concern when designing parallel systems.

Routing algorithm
The routing algorithm decides how data is forwarded in the network. There exist many routing
algorithms with different properties. For further information refer to Parallel Computer Architecture [1].

Switching strategy
When using the circuit switching strategy, a direct connection is established between two nodes and
the path is reserved until the communication has finished. This strategy is used in telephone
networks, where a direct connection is established between the caller and callee. A packet switched
network slices the data to be sent into packets which are individually routed from one node to
another. For further information refer to Parallel Computer Architecture [1].

Flow control
Flow control is necessary when two or more data packets try to move through the same route at
the same time. The flow control mechanism determines when a message will be moved along its
path to the destination. For further information refer to Parallel Computer Architecture [1].

10 A node consists of either a processor, a memory module, or a switch.
An interconnection network can be reliable or unreliable. In a reliable interconnection network, a
message sent from one node to another is guaranteed to arrive at the destination node, in contrast to an
unreliable interconnect, where messages may be lost and must be retransmitted. A typical message consists
of routing and control information along with the payload consisting of data. The largest research effort
has been spent on large scale multiprocessors consisting of several hundreds or thousands of processors,
leaving fewer research results on the design of point-to-point solutions suitable for small-scale multiprocessors,
especially those put on a single die.
3.4.1 Interconnection topologies
The collection of nodes in a system communicate via point-to-point links, which typically are unidirectional.
A number of topologies have been proposed and examined in the literature, and a number of them are
presented below. Before describing a set of important topologies in static networks, some basic terms need to
be discussed. Two nodes are considered neighbors if there is a link connecting them, and the degree of
a node is the number of neighbors attached to the node. The diameter of a topology is the longest shortest
path between any two nodes.
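These terms can be made concrete with two small helper functions; the example sizes below are our own choices for illustration:

```python
def ring_diameter(n):
    """Diameter of a bidirectional n-node ring: the farthest node is
    halfway around, so the longest shortest path is n // 2 hops."""
    return n // 2

def mesh_diameter(rows, cols):
    """Diameter of a 2D mesh: the worst case walks the full width plus
    the full height, from one corner to the opposite corner."""
    return (rows - 1) + (cols - 1)

# Every node in a ring has degree 2; an interior node in a 2D mesh has
# degree 4. An 8-node ring has diameter 4; a 4x4 mesh has diameter 6.
ring_d = ring_diameter(8)
mesh_d = mesh_diameter(4, 4)
```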
Fully connected
In a fully connected topology every node has a link to every other node. The diameter of such an
interconnect is always fixed to one, independent of how many nodes are connected. A major
drawback with fully connected networks is that they scale poorly, since the complexity of the
crossbar increases nonlinearly.

Linear arrays
Linear arrays are the simplest networks, with bidirectional links between the nodes. The bisection
width for linear arrays is 1 link.

Rings
Rings are an extension of linear arrays with the two edge nodes connected, hence forming a ring.

Meshes
A 2D mesh is a matrix of nodes, each with connections to its nearest neighbors.

Trees
Trees have the nice property of increasing routing distance only logarithmically.

Hypercubes
Communication in hypercubes is based on the binary representation of a node's identity, which
leads to a simple routing algorithm. Sending a message in an n-dimensional hypercube can be done
in n cycles. Hypercubes are scalable and have been used in several parallel machines.
Torus
A torus is a 2D mesh where the edges of the grid have wraparound connections to the opposite edges.
With wraparound in one dimension only, it can also be visualized as a cylinder.
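The hypercube routing algorithm mentioned above is simple enough to sketch: a message corrects one differing address bit per hop, so it needs at most one hop per dimension. This is a minimal illustration, not taken from any particular machine:

```python
def hypercube_route(src, dst, dim):
    """Route a message in a 2**dim-node hypercube by flipping one
    differing bit of the node identity per hop (dimension-order routing).
    Returns the list of nodes visited, including source and destination."""
    path = [src]
    node = src
    for bit in range(dim):
        if (node ^ dst) & (1 << bit):
            node ^= 1 << bit   # traverse the link along this dimension
            path.append(node)
    return path

# In a 3-dimensional (8-node) cube, node 000 reaches node 111 in 3 hops,
# one per dimension, matching the "n cycles" bound in the text.
route = hypercube_route(0b000, 0b111, 3)
```

The number of hops equals the number of bits in which source and destination differ, which is at most the dimension of the cube.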
A fully connected network can be implemented by using a crossbar switch [8]. Using a fully connected
network enables any node to communicate with any other node at any given point in time. A network using a
crossbar to connect processors, placed at one edge of the network of switches, and memories at the other
edge is shown in figure 14. In order to send information from one side to the other, the switches are
configured so that a path is set up between the edges [124].
Figure 14: Crossbar switch. The processors (P) on the left side are connected to the memories (M) at the
bottom via the cross-point switch elements (S). No contention arises as long as the processors are trying
to access different memories.
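The contention property stated in the caption can be sketched with a toy arbitration function. The fixed-priority (lowest index wins) policy is our own simplifying assumption:

```python
def crossbar_connect(requests, n_memories):
    """Toy crossbar arbitration for one cycle: requests[i] is the memory
    module processor i wants. Requests to distinct modules all succeed;
    when two processors target the same module, the lower-indexed
    processor wins and the other is blocked for this cycle."""
    granted = {}
    blocked = []
    for proc, mem in enumerate(requests):
        assert 0 <= mem < n_memories
        if mem in granted.values():
            blocked.append(proc)   # contention on a shared memory module
        else:
            granted[proc] = mem
    return granted, blocked

# Four processors, four distinct modules: no contention, all proceed.
g1, b1 = crossbar_connect([0, 1, 2, 3], 4)
# Processors 0 and 2 both target module 0: processor 2 is blocked.
g2, b2 = crossbar_connect([0, 1, 0, 3], 4)
```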
3.5 Interconnect performance & scaling
Patterson and Hennessy [8] defined six performance measures for an interconnection network: bandwidth,
time of flight, transmission time, transport latency, sender overhead and receiver overhead. The total
latency is the time that it takes to transfer a number of bytes from source to destination. An additional
widely used measure for defining the worst case performance of a parallel system is bisection bandwidth. It is
calculated by dividing the interconnect into two roughly equal parts, each containing half of the nodes,
and then summing the bandwidth of the links that the imaginary dividing line crosses [8].
The benefits of low cost and easy implementation may be traded against a major drawback with the
shared bus approach: the limited bandwidth that comes from the serialization of all communication on
it. This serialization gives this kind of system poor scalability, since all processors must compete for the
single bus. The performance curve flattens when there are more than 10 processors11 attached to the bus
[56]. Adding an 11th processor hardly increases performance at all.
As microprocessors become faster and faster, they demand more bandwidth, which makes the shared
bus an even more serious bottleneck in the system. The trend of the last ten years indicates that the situation
will only get worse [62]. To reduce the problem there are three solutions:
development of a faster and wider bus
use of smart bus protocols (pipelining)
serve memory requests locally (use of caches/buers)
Pipelining signals on the bus can increase the available bandwidth from 2 to n times, depending on a
number of system parameters, where n is the number of processors attached to the bus [5]. The introduction
of caches moves the "knee" of the performance curve to around 20 or 30 processors [56], but the
technique introduces other difficulties, such as the cache coherence problem [1]. A performance curve comparing
a non-cached bus-based architecture and the same architecture extended with caches is shown in figure 15.
Figure 15: Performance knee curve. Using caches with a hit ratio around 90 percent moves the knee
from around 10 processors to somewhere near 30 processors (Source: Computer
Architecture [56]).
3.5.3 Point-to-point architectures
This strategy takes advantage of handling multiple transfers in parallel, giving these networks a much higher
aggregate bandwidth. Point-to-point communication is also faster because there is no arbitration process.
An additional benefit is the electrically simpler interface [8].
Contention still occurs if more than one processor wants to use the same route at the same time. The
application that will run on the system greatly influences the choice of network, since different
applications have different bandwidth and communication needs. Expecting a network under
heavy load and assuming worst case traffic patterns preferably leads to a high-dimensional network where all
the links are short. If looking at communication patterns where each node communicates with only
11 Where the knee is located depends, of course, on the application and the actual implementation of the system.
one or two near neighbors, a low-dimensional network is preferred, since only a few of the dimensions are
used [1].
The assumed locality of data also influences the design. A performance study of hierarchical ring- and
mesh-connected wormhole-routed shared memory multiprocessors shows that with little locality, meshes
scale better than ring networks because of the rings' limited bisection bandwidth [125]. For workloads with
some memory access locality, the hierarchical rings outperform meshes by 20-40% for system sizes up
to 128 processors. The study also shows that with 1-flit buffers in the routers, rings perform
better than meshes regardless of the mesh router buffer size for systems up to 36 processors.
There are a number of tradeoffs to consider when choosing a point-to-point architecture. For example,
the bandwidth of the links may be traded against the complexity of the switches. Using the fully connected
architecture is unrealistic for systems where the number of nodes is large, since it needs N*(N-1) links,
where N is the number of nodes [61]. This is one reason why almost all multiprocessors use
topologies between the two extremes.
The table below shows the performance (as bisection bandwidth) and the cost (number of ports per
switch and total number of links) for dierent topologies for a system with 64 nodes:
Measure                  Bus    Ring   2D Torus   Fully connected
Bisection bandwidth      1      2      16         1024
Ports per switch         N/A    3      5          64
Total number of links    1      128    192        256
Source:
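The bisection-bandwidth row of the table can be reproduced from the cut-counting rule described earlier (one unit of bandwidth per link assumed, our simplification):

```python
def bisection_links(topology, n):
    """Links cut when splitting n nodes into two equal halves, for the
    topologies in the 64-node table (unit bandwidth per link assumed)."""
    if topology == "bus":
        return 1                      # one shared medium crosses the cut
    if topology == "ring":
        return 2                      # the cut severs the ring in two places
    if topology == "2d_torus":
        side = int(n ** 0.5)          # an n-node torus is side x side
        return 2 * side               # wraparound doubles the mesh's cut
    if topology == "fully_connected":
        return (n // 2) * (n // 2)    # every cross-half pair has a link
    raise ValueError(topology)

values = {t: bisection_links(t, 64)
          for t in ("bus", "ring", "2d_torus", "fully_connected")}
```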
3.6 On-chip interconnects
The evolution in the semiconductor industry has made it possible for whole systems to fit on a single
piece of silicon, which moves the interconnections between functional modules from a PCB to the
inside of a single chip. To meet time-to-market requirements, and as mentioned earlier, SoC designers can
use IP-components. In order to use them effectively, a design methodology including hardware/software
co-simulation techniques and a design-friendly On-Chip Bus (OCB) is required [4]. Interconnecting the
components is not an easy task. Telecom and multimedia applications require low latency and high
bandwidth communication [123], and there is a wide variety of components with different interfaces,
which implies that glue logic must be used to adapt a component to a specific bus. This is nothing new,
since the method with glue logic has been used successfully in the world of PCBs [16].
3.6.1 VSIA eorts
To enable the mix and match of IP-components in a SoC, the Virtual Socket Interface Alliance (VSIA)
tried to specify an on-chip bus standard. The mission soon appeared to be infeasible, so focus is now instead
on a standard that separates the bus interface logic from the IP-component's internal behavior through
a bus wrapper [80]. A bus wrapper (see figure 16) is a component that is physically located between the
bus interface and the IP-component and is used for communication between them. This strategy enables
IP-components to be easily adapted to different designs, because changes are only needed in the bus
wrapper. Separating the internal behavior from the bus wrapper can introduce extra latency when the
on-chip bus and the internal bus are very different. Lysecky, Vahid and Givargis have proposed techniques
to reduce this latency by introducing prefetching in the buffers [80]. VSIA is also standardizing the bus
between the bus wrapper and the IP-component, called the Virtual Component (VC) interface.
3.6.2 Dierences between standard and SoC interconnects
Existing buses such as PCI and ISA were designed to connect discrete components at the board level.
At this level a key issue is to minimize the number of signals, because pin count directly translates into
package and PCB costs [4]. When moving to on-chip solutions, signal routing consumes silicon area but
does not affect the size or cost of packages, PCBs and connectors. SoC architectures therefore have a rich
set of wires available for the interconnect.

3.7 On-chip buses
To support a wide variety of IP-components and embedded systems, the interconnect must be sufficiently
flexible and robust. The proposed on-chip buses use a variety of different design choices. In the following
sections a number of today's existing on-chip buses will be studied.
3.7.1 AMBA 2.0
The Advanced Microcontroller Bus Architecture (AMBA) [?] is an open-standard, processor and technology
independent on-chip bus specification. The first version was released in 1995, but here the newer
AMBA Revision 2.0 will be studied. The AMBA architecture consists of a system bus and one peripheral
bus, shown in figure 17 and explained below:

The system bus has two specifications, namely the Advanced High-performance Bus (AHB) and the
Advanced System Bus (ASB).

The Advanced Peripheral Bus (APB) is aimed at slower general purpose peripherals such as timers,
UARTs, interrupt controllers, I/O ports etc. Connection to the main system bus is made via a bridge.

Embedded processors are connected to high-performance peripherals, on-chip memory, and interface
functions via the system bus. The ASB supports multiple masters, pipelining and burst transfers. The
AHB specification extends this with split transactions, using separate read and write data buses with data
widths supported from 32 up to 1024 bits. The peripheral bus is a simple, single-master bus controlled by the
APB bridge that connects the buses. The APB has an unpipelined architecture and a low gate count.
ARM PrimeCell IP-components can be attached directly to the buses without using any glue logic.
Figure 17: A typical AMBA based SoC (Source: ARM webpage [38])
Commercially, AMBA has been implemented by several companies in a variety of products, including cell
phones, set top boxes, digital cameras, and general purpose microcontrollers. AMBA was originally
an ARM bus, but has evolved into a license- and royalty-free specification compatible with other CPU
architectures as well. More detailed information can be found in the AMBA bus specification [?] and at
the ARM homepage [38].
3.7.2 CoreConnect
The IBM CoreConnect [54] is an on-chip bus architecture available under a no-fee, royalty-free license
agreement12 from IBM. Components that are connected to the bus must be compliant with IBM Blue
Logic designs. The hierarchical architecture provides three synchronous buses (see figure 18):

Processor Local Bus (PLB)

On-Chip Peripheral Bus (OPB)

Device Control Register (DCR) Bus

The PLB architecture is very similar to the AMBA Advanced High-performance Bus, with separate read
and write data buses, allowing multiple bus masters, pipelining, split transactions, burst and line transfers.
Its purpose is to provide high performance, low latency, and design flexibility when connecting high
bandwidth devices, such as CPUs, external memory interfaces and DMA controllers. The data width is
32 or 64 bits, extendable to 128 and 256 bits. The PLB and OPB have different structures and
signals, so IP-components attached to the two buses have different interfaces. The concurrent read and write
transfers yield a maximum bus utilization of two data transfers per clock cycle. Controllable maximum
latency is supported by the architecture through master latency timers. The PLB devices can increase
bus throughput by using long burst transfers. When the bandwidth on a single PLB bus exceeds the
limits of its capability, one possibility is to place high data rate masters and their target slaves on separate
buses. An IP-component, the PLB Cross-Bar Switch (CBS), can be utilized for this purpose as shown in
figure 19. The CBS allows multiple simultaneous data transfers on both PLB buses and uses priorities
to handle multiple requests on a common slave port. A high priority request interrupts an ongoing lower
priority transaction. CoreConnect can be used to build multiprocessor systems.
12 The license agreement includes the PLB arbiter, OPB arbiter and PLB/OPB bridge designs, including bus model
toolkits and bus functional compilers for the buses.
Figure 19: PLB Crossbar switch. Source: CoreConnect Bus Architecture [54].
The CoreConnect architecture is used in the PowerPC 405GP embedded controller, connecting a
PowerPC 405 CPU core, PCI bridge, and SDRAM controller.
3.7.3 CoreFrame
PALMCHIP's CoreFrame [40] is a processor and foundry independent integration architecture developed
by the PALMCHIP Corporation. Targeted designs are set-top boxes, digital cameras, communications, mass
storage, printing, intelligent I/O, and networking. It is interoperable with AMBA peripherals, which
means that peripherals designed for the AMBA bus can be attached directly to CoreFrame, making
the portfolio of available IP-components larger. The CoreFrame architecture differs from many of the other on-chip
buses by using point-to-point signals and multiplexing instead of shared three-state lines. It uses a shared
memory architecture with simple protocols to reduce design and verification time. Some features are
listed below:
400 MB/s bandwidth at 100 MHz
Support for 128, 64, 32, 16 and 8-bit buses
Unidirectional buses only
Positive edge clocking only
3.7.4 FPIbus
The Flexible Peripheral Interconnect Bus (FPI Bus) is an on-chip bus designed for memory and I/O mapped
data transfers, and it is a part of the Infineon Technologies TriCore architecture [91]. The TriCore 32-bit
microcontroller/DSP architecture is designed to be used in real-time embedded systems. The FPI Bus
connects the TriCore CPU/DSP to memory, other CPUs, and external and internal peripherals. Up to 16
master devices are supported by the synchronous bus, which uses a flexible bus protocol that can be
tailored to specific application needs. Address and data buses are demultiplexed, with up to 32 address
bits and 64 data bits. Peak throughput is 800 MB/s at 100 MHz. There is no upper limit on the number
of peripherals connected to the FPI Bus. Arbitration is done centrally by the FPI Bus controller, which
supports both single and multiple data transfers.
3.7.5 FISPbus
The FISPbus is available under license from Mentor Graphics and is delivered as fully synthesizable
VHDL RTL source code with a VHDL test bench including functional test vectors [92]. The FISPbus
architecture consists of a single bus supporting multiple masters and distributed arbitration, which implies
that all IP-components attached to the bus must contain the FISPbus state machine. The FISPbus
interface is a generic microprocessor interface specification specially developed for soft IP-components.
Microprocessor cores that are not compatible with the architecture can be attached to the bus by using adapted
softcores as a bridge between the FISPbus and the microprocessor bus. There exist a number of ready-to-use
soft IP-components that can be attached directly to the bus.
3.7.6 IPBus
The Integrated Device Technology (IDT) Peripheral Bus (IPBus) [83] is a synchronous high-speed on-chip
bus running at more than 100 MHz and providing processor independence. No license is available, so
implementations are still found only in products from IDT. It is a multimaster bus that provides DMA support
and features such as multiplexed address and data, pipelining and burst capability. The IPBus interface is
a small piece of code that is part of all functional cores. The gate count for the interface adapted to
slaves is about 500 gates, and for a master/slave core the implementation can be done in under 1000 gates.
Standard design tools work well with the IPBus.
3.7.7 MARBLE
The Manchester Asynchronous Bus for Low Energy (MARBLE) [132] is an on-chip bus which provides an
interconnect for asynchronous IP-cores. MARBLE uses pipelining of the arbitration, address and data
cycles. In addition to basic bus functionality, it supports bus bridging to interconnect asynchronous and
synchronous subsystems on the same chip. Arbitration is done centrally using two separate arbiters, one
for data and one for addresses. All transfers are tagged with a unique identifier of the initiator of the
transfer. Figure 20 shows a typical MARBLE system.
If the embedded on-chip memory is not large enough, an external memory bridge can be used for direct
connection to SRAM or DRAM.
3.7.8 PI-Bus
The Peripheral Interconnect Bus (PI-Bus) was developed within the European Union ESPRIT Open
Microprocessor Initiative (OMI) project. It has been incorporated as the OMI on-chip bus standard OMI 324,
with the five companies ARM, Philips Semiconductors, SGS-THOMSON Microelectronics, Siemens,
and Temic/Matra MHS owning the patent rights. Licenses have been available since 1995. The bus is intended
for use in modular, highly integrated microprocessors and is designed for memory mapped
data transfers between its bus agents. Bus agents are on-chip functional blocks connecting the cores to the
bus. A PI-Bus agent can act as both a bus master and a slave. In order to operate, the PI-Bus requires
a central bus controller which performs arbitration, address decoding and time-out control. A typical
architecture using the PI-Bus is shown in figure 21. The peak transfer rate is 200 MB/s at 50 MHz. More
features of the PI-Bus are listed below:
Processor independent
Demultiplexed operation
Clock synchronous
Address and data bus scalable up to 32 bits
Multimaster capability
8, 16, and 32 bit data access
Missing features in the specification are cache coherency support and broadcasts. A toolkit for analysis
and integration of cores with the PI-Bus has been developed at the University of Sussex. The toolkit
contains models of the PI-Bus agents, together with test frameworks. Most models are compatible with
the IEEE 1076-1993 VHDL standard. The documentation and toolkit can be downloaded from
Sussex University's webpage [85].
3.7.9 SiliconBackplane
The SiliconBackplane [123] is a highly configurable SoC communication system developed by Sonics and is
a part of the Sonics Integration Architecture (SonicsIA). The architecture consists of a pair of proprietary
protocols, an open IP-component interface, and supporting EDA tools. The Open Core Protocol (OCP)
is a point-to-point interface that provides a standard set of data, control and test signal flows enabling
the cores to communicate. SiliconBackplane differs from many of the conventional SoC interconnects
by using only one single bus where all the components are attached via SiliconBackplane agents.
3.7.10 WISHBONE
WISHBONE is an open standard on-chip interconnect specification developed by the Silicore Corporation [93]
and available at no cost. The standard can be used to interconnect soft, firm and hard IP-cores with any target
architecture (such as FPGA or ASIC devices), and it is independent of development tools. The speed
of the interconnect is limited only by the semiconductor technology. A nice property of WISHBONE is that
the interconnection topology supports both shared buses and crossbar switches. Additional features are
listed below [95]:
Multiprocessing capabilities
Processor independent
Full set of popular data transfer bus protocols
Supports both BIG ENDIAN and LITTLE ENDIAN data ordering
Master/Slave architecture
Arbitration algorithm defined by the user
3.7.11 Motorola Unified Peripheral Bus
Motorola has developed an on-chip, CPU-core independent peripheral bus, the Unified Peripheral Bus
(IP Bus) [96]. The IP Bus interface specification, bus functional models and bus monitors are publicly
available. IP Bus supports data widths from 8 bits up to 64 bits and address widths up to 64 bits. It is
aimed at connecting peripherals to the local processor bus via a bus bridge.
3.8 Case studies of SoC multiprocessor interconnects
3.8.1 HYDRA
The HYDRA multiprocessor architecture [6] is a research project at Stanford University. HYDRA is
composed of four superscalar processors with individual L1 instruction and data caches. The four L1
caches are backed by one unified on-chip L2 cache, an off-chip cache and external memory as shown
in figure 23.
Figure 23: Schematic overview of HYDRA. The caches read and write their data via the 256-bit wide
read bus and the 64-bit wide write bus respectively. Both buses use pipelining and centralized
arbitration that occurs at least a cycle before the use of the bus. The L1 caches are connected to the CPU
via a 32-bit wide bus. Source: Stanford University [6]
The general-purpose read bus matches the cache line size of both caches in order to allow entire lines
to be transmitted across the chip at once. It is used to move cache lines between all the on-chip caches
and the off-chip interfaces. Hammond and Olukotun found that even though HYDRA uses a shared read
bus solution, it is typically occupied less than 50 percent of the time and thus not a bottleneck in the
system [6]. The write bus has one specific task: to transmit write-through data from the CPUs to the
L2 cache. Since it transfers data from one CPU at a time, it only needs to be 64 bits wide (64
bits is the widest CPU instruction). The write bus does not scale at all to larger numbers of processors.
Benchmarks indicate that even in the worst cases, the read and write buses slow performance only by a
few percent compared to a perfect crossbar.
3.8.2 Silicon Magic's DVine
4 Memory System
Nothing is stronger than its weakest part; this also applies to computer systems and the processor/memory speed gap. A processor cannot execute instructions faster than they can be
obtained from the memory system, which often is the most severe bottleneck in a computer system. The
average performance of the whole system is dominated by parameters concerning the memory system [23].
Today's high-performance processors require more data throughput per time unit than memory chips
can provide [24].
The rate of improvement in processors is much higher than for memories: CPU performance improves
by 50% every year, whereas memory access time improves by only 5-10% [26]. The divergence between
CPU and memory is known as the CPU-memory speed gap. As processor performance increases, the
number of idle cycles encountered for a continuous memory reference, the latency, will inherently grow as the
divergence continues [27]. The ideal solution would be for researchers and engineers to provide technologies for memory chips to scale with the performance of processors. Since no method to achieve this
currently exists, the challenge lies in architectural improvements and well-designed hierarchical solutions
[28, 24, 23].
In the following sections semiconductor memories will be covered, followed by an introduction to the memory
hierarchy. Cache memories, their functionality and the improvements that can be made are covered. A brief
look at memory management units is followed by multiprocessor systems and data prefetching.
4.1 Semiconductor memories
Semiconductor memories can be split into two separate categories: non-volatile and volatile. Non-volatile
memories are able to retain information without a power supply, while volatile memories will lose the
information after power is turned off [7]. Both classes show similar reading characteristics, but non-volatile
memories suffer severe delays on writes. This makes non-volatile memories most suitable
for Read Only Memories (ROM), whereas volatile memories are usable for Random Access Memory 13
(RAM), issuing both reads and writes.
4.1.1 ROM
Read Only Memories contain information that has been preprogrammed, and possess different levels of
erasability and reprogrammability, from permanent contents to byte-level erasability. ROMs are often
used to store low-level information for the hardware or the operating system. They are also applicable
when stored information only requires infrequent updates in order to maintain functionality, like channel
information for a TV or video set [9]. ROMs can also be used to store logic configurations for reconfigurable hardware (FPGAs). Different types of ROMs are classified based upon their programmability and
erasability, see the table below. The table also contains manufacturing, price, and performance information
for the different ROM types.
Masked programmed ROM (ROM): The true Read Only Memories are programmed at manufacture; the customer provides the semiconductor vendor with a specific configuration file containing
the information to be stored. The file is interpreted into a photo mask, where each binary
one in the data file corresponds to a transistor on the mask. The mask is then used to process the
actual chip, which is delivered to the customer after several weeks.
Programmable ROM (PROM): PROMs are also program-once circuits; the difference is that they
are programmed by the user. This enables the manufacturer to mass-produce standard
PROMs for arbitrary use. The chip is programmed via a special programming device, a burner, that
will burn the connections of those transistors that are not to be regarded as ones according to the
pattern of ones and zeros the user wants to program.
13 The term Random Access generally means that every memory cell can be accessed in equal time regardless of its
physical location [7, 9], which applies to both RAM and ROM. Therefore, in this document Random Access refers to
Read/Write memories, RAM.
Erasable PROM (EPROM): As the name indicates, the border between read-only and random
access starts here. These are memory circuits that can be programmed and re-programmed several
times. The memory array consists of special MOS transistors with a floating gate whose threshold
can be altered in order to program ones into the chip [7]. EPROM circuits do not have their die area
covered; instead they have a quartz window which will let ultraviolet light of the right wavelength
through, which will erase the contents of the memory.
FLASH: Flash circuits consist of a special stacked-gate transistor where the upper control gate
behaves like an ordinary transistor whereas the lower one is the floating gate. Flash devices
contain embedded algorithms and functionality to perform the different tasks of programming and
erasing. Programming the cell is done by negatively charging its floating gate, which will increase the
threshold of the transistor [25]. Erasing is done by electrically letting the negatively charged floating gate
release the electrons.
Electrically Erasable PROM (EEPROM): As implied by the name, this memory is erased by
the use of electricity. Like FLASH memory it also makes use of a floating gate that is electrically
altered. Unlike Flash, EEPROM programming and erasing can only be performed on single bytes,
and they are the most expensive of the non-volatile memories [9].
Type     Cost              Programmability    Programming time   Erase time    Erasable size
ROM      Very inexpensive  Once, in factory   Weeks              N/A           N/A
PROM     Inexpensive       Once, by user      Seconds            N/A           N/A
EPROM    Moderate          Many times         Seconds            20 minutes    Entire chip
FLASH    Expensive         Many times         100 µs             1 second      Blocks
EEPROM   Very expensive    Many times         100 µs             10 ms         Byte
Source: Computer Systems Design and Architecture [9]
4.1.2 RAM
Although the read time of ROMs is in the vicinity of nanoseconds, their write time ranges from minutes
down to a hundred microseconds, which for today's processors is unreasonably slow.
The lack of homogeneity between read and write in non-volatile memories makes them an inadequate
alternative for main memory in computers. Semiconductor memories with uniform read and write times
in the area of nanoseconds can, however, be made of volatile memories [7].
There exist several types of RAM circuits. In this section Static RAM (SRAM) and Dynamic RAM
(DRAM) are the main concern; only a brief look at the others will be included.
SRAM: The memory cell in SRAM makes use of transistors to store information; the active element
resembles an SR-latch [7]. As long as power is not turned off, stored information will remain intact.
By utilizing transistors to store information, the bit cell constantly consumes power, hence the prefix
static. Two types of SRAM cells are currently in use: a four-transistor cell (4T) and a six-transistor
cell (6T). The transistors in SRAM are basically made in the same CMOS process used for ordinary
logic; there also exist more advanced processes invoking self-aligned contacts or local interconnects in
order to reduce size. SRAM is often integrated into logic devices as fast buffers, registers, and
on-chip caches for processors.
Some problems have unfortunately appeared concerning the stability of the 4T memory cell
when the voltage goes below 2.5 volts [31], which 6T cells manage to overcome. For this reason, and
since improved process technologies can produce 6T cells the size of ordinary 4T cells, the 6T cell is
expected to be the dominating SRAM cell until new improvements are achieved [31].
DRAM: The primary objective of DRAM is to provide the market with the largest memory capacity
chips at the lowest cost. This is mainly achieved through process optimization for lower cost,
highest density with the lowest cost/bit, and high production volume.
In 1968 the one-transistor DRAM storage cell emerged, utilizing a capacitor to store bits and
one transistor for cell selection [28]. The amount of charge in the capacitor represents the
binary states 1 and 0. It is only possible for the capacitor to retain a charged value for a few
milliseconds, so it periodically needs to be refreshed, reloading its previous contents, approximately every 4 to
50 ms [7, 9, 28]; a refresh is also performed when a value is read. Power consumption in a DRAM
cell is lower than in an SRAM cell due to the capacitor, which only consumes power during refresh.
The need to refresh the capacitors requires special logic for refresh administration, which slightly
complicates the design and enlarges the chip; in some solutions an external refresh chip is used. The
timing overhead forced by the actual refresh should be known, but it has insignificant implications
for the user of the circuit, since it is a very small fraction of the whole refresh cycle.
Worse are the latencies in cycle time and data rate experienced with DRAMs, which are dealt with
in different ways, yielding different models and special-purpose DRAM circuits.
Extended Data Out DRAM (EDO): In EDO, sometimes called hyper page mode, the output buffer has been provided with an extra pipeline step in order to improve the data rate
for column addressing. This type of memory gives improved system performance with minor modifications to conventional memory controllers [28], due to the preserved asynchronous
interface.
Burst EDO (BEDO): BEDO memories are enhanced EDO RAM that allows much faster access
times, allowing faster buses to be used. This is achieved by combining the pipeline with special
latches (counters). This makes it possible to make four reads or writes in one bus cycle, hence
only every fourth byte needs to be addressed.
Synchronous DRAM (SDRAM): This type of DRAM employs a synchronous interface; the
circuit is synchronized with the bus clock. The data path is pipelined and data is bursted
out on the bus for improved data rate. By interleaving multiple memory banks, random
access performance can be improved [32]. SDRAM access time is not measured in nanoseconds
but in MHz.
Double Data Rate SDRAM (DDR SDRAM): DDR SDRAM has much in common with
SDRAM, but DDR SDRAM doubles the bandwidth of the memory. The double data rate is
achieved by transferring data on both edges of the clock.
DRDRAM: DRDRAM, or Direct Rambus DRAM, works more like an internal bus
than a memory subsystem; it is based on the Direct Rambus Channel. The Rambus Channel is
a high-speed memory interface that is able to operate 10 times faster than ordinary DRAM
interfaces [126].
Synchronous-Link DRAM (SLDRAM): SLDRAM represents the next step in DRAM evolution [129],
from EDO and SDRAM to DDR and finally SLDRAM. The technique is based on SDRAM and DDR
with the addition of a packetized address/control protocol.
Cached DRAM (CDRAM): With the integration of small amounts of SRAM into DRAM
circuits, or by splitting the DRAM into disjoint banks, the problem of row access performance has
been addressed. The result is a DRAM circuit with a small integrated cache of fast SRAM
cells [28].
Video RAM (VRAM): An example of a special-purpose DRAM, introduced in the mid-1980s,
developed for graphics applications. The improvement objective of VRAM was not speed but
massive parallelism in data rate. Being special purpose, the demand for these circuits is
smaller, hence smaller series are manufactured, which automatically makes them more expensive.
FERAM: In ferroelectric RAM the memory element consists of a ferroelectric capacitor. The
ferroelectric effect is the ability of a material to retain an electric polarization in the absence of
an applied electric field [127]. This property is used to construct memories where the memory
element is a ferroelectric crystal. After the atoms in the crystal have been polarized and the
electric field has been removed, the crystal will remain as it is. The polarization of the crystal
constitutes the logical value of 1 or 0.
EDO, SDRAM and VRAM are all standardized by JEDEC14.
4.2 Memory hierarchy
[Figure: The memory hierarchy, from CPU registers (level 0) through one or more caches (levels 1 to n-1) and main RAM to secondary/external storage (level n). Transfers occur as words, blocks and pages/blocks. Speed is high and capacity low near the CPU; capacity increases and speed decreases toward the lower levels.]
Cache memories are small, fast and expensive memories. Small in the sense that they are only able to
store a fraction of the available main memory at any instant. They are inherently fast due to their
small size and the use of high-performance SRAM technology. Their main function is to store recently
referenced data close to the processor in order to exploit the locality gained from recently referenced
addresses. Caches, one or multiple at different levels, in different sizes and with different storage and
operational strategies, constitute the levels of the memory hierarchy between the CPU and main
memory.
14 The JEDEC Solid State Technology Association (Once known as the Joint Electron Device Engineering Council), is
the semiconductor engineering standardization body of the Electronic Industries Alliance (EIA), a trade association that
represents all areas of the electronics industry.
[Figure: Set-associative cache organization. The block address is divided into TAG, SET and BLOCK fields; each set contains several ways, each holding a tag and its data blocks.]
There exist three major types of cache misses that occur independent of whether the cache is in a single-
or multiprocessor system. A fourth type of miss, known as coherence misses, is introduced in
multiprocessor systems and originates from data sharing between processors.
Compulsory misses are also referred to as cold starts or first references. Compulsory misses occur whenever
program execution begins and the cache is empty (i.e. booting). All references are the first to the
actual memory block and must be obtained from main memory until the program's working set has
been established in the cache. Related to the cold start is the warm start [2], which occurs when a whole
working set is successively swapped out in a multiprogrammed system issuing a task switch.
Capacity misses are the effect of a program's working set being too large to fit entirely in the cache.
Capacity misses are easily reduced by increasing the cache size.
Conflict misses, or collisions, can occur even though the cache only has a single block stored. This
situation appears when a memory reference is mapped to an already occupied cache
entry. The problem of collisions is addressed by increased associativity, see section 4.3.7.
Coherence misses: In a cache-coherent shared-memory multiprocessor system two new types of cache
misses occur. The problem arises when data are spread among processors that have the data
cached locally. Sharing of data can be divided into two categories.
15 The tag contains only a subset of the bits constituting the full address; the size of the tag depends on the cache organization.
True sharing occurs when a data word produced by one processor is used by another processor.
False sharing occurs when independent data words used by different processors are
cached in the same line and at least one access is a write.
True sharing misses occur when a processor modifies some word in a cache block, resulting in
invalidations among the sharers of that block. Later, when one of the sharers tries to access that
word, it will find it invalidated, hence a cache miss.
False sharing misses occur whenever a processor writes to a word in a cache block, yielding
invalidations of that cache block among the processors sharing it. Any processor trying
to access another word in that cache block will find it invalidated, resulting in a cache miss.
4.3.3 Storage strategies
This section focuses on the different classes of common cache memories. Since much of the work done by
the cache is mapping memory blocks into the memory provided in the cache, today's three common types
of cache memory techniques are named after how this mapping and block placement is performed.
Direct mapped: When a memory block can be mapped to one and only one cache line, that
cache is said to be direct mapped. The mapping of a block is a modulo division between the address
of the block and the number of cache lines, excepting some bits for block information. The direct
mapped cache is beneficial due to its simple construction, fast and small, but in order to obtain good
performance it is reliant upon locality in the referenced data. Direct mapped caches also show great
compatibility with processor pipelines, and steps in the cache access can be built into the pipeline
stages [51].
A major disadvantage of the direct mapped cache is its inability to store two memory blocks that
map onto the same cache line. If consecutive accesses are made to two or more blocks
mapped to the same line, severe thrashing will occur, switching cache context for that line on every
access. Thus, performance improvements will fail.
Fully associative: In a fully associative cache, memory blocks can be placed on any arbitrary cache
line. The mapping simply stores the remaining bits of the memory address in the tag field16, after
the bits for block information have been removed.
When retrieving cached information, a parallel bit-wise comparison between all tags and the memory
address must be performed; a sequential compare of all tags would be too slow. To perform the
parallel address/tag matching, considerably more logic is required than for the direct mapped
cache. This results in increased die size, hence higher cost, which limits its usefulness to
relatively small systems [9]. In comparison to the previously introduced direct mapped cache,
associative caches are more flexible. Thrashing is more unlikely to occur and the memory
is better utilized since the cache is "free" to place a block on any line.
Set associative: The set associative cache is a combination of the previous two, in an attempt to
combine attractive properties from both strategies. A set associative cache can be viewed as a
direct mapped cache consisting of several17 parallel cache blocks forming a matrix.
Each parallel line of blocks is called a set, whereas a column in the set is called a way; hence the
sometimes used n-way set associative refers to a set associative cache with n ways of associativity.
Like the direct mapped cache, a memory address can only be mapped onto a single set, but blocks
designated for the same set can be stored in different ways in a fully associative manner.
16 See figure 25
17 Usually 2,4 or
All cache types described in this document except the direct mapped cache18 need a block replacement
strategy when the cache is full, or when all ways are occupied in a set associative cache.
With the objective to maintain a high hit ratio, the choice of replacement algorithm is of great
importance [49].
LRU: Least Recently Used (LRU) aims to reduce the miss rate by relying upon temporal locality among
cache entries. In an attempt not to swap out a data block that will probably be referenced within
a short time, information about the block access history is gathered by the cache controller and
maintained in the cache lines. LRU is one of the most popular and most implemented algorithms for
block replacement [48, 50, 53]. The scheme is relatively easy to implement, and yields good results
in keeping the miss rate down [48, 53]. However, with increased associativity, the probability that the
LRU line is the best line to victimize declines [53].
Random: The objective of random replacement is to spread the allocations in a uniform way throughout
the entire cache. The hardware randomly selects a block to be discarded, overwritten or written
back to the main memory. The random algorithm does not take any previous execution history into
consideration, but has performance similar to LRU for somewhat larger caches [8, 9]. Random
replacement is very easy to implement. Due to the algorithm's non-deterministic behaviour a pseudo-random
scheme can be used; it is fundamentally random but has a predictable behavior which can
be used for hardware debugging [8].
FIFO: The cache controller maintains information about the order in which the different blocks arrived
in the cache, and the oldest block is evicted first. That information does not say anything about how
the different blocks are being accessed by the processor.
LFU: Least Frequently Used. To evict the cache line that has been used the least of the lines residing in the
cache over a finite period of time could be a good approach. The algorithm demands some way to
keep track of time, and implementing a clock for that purpose is too expensive.
MRU/notMRU: The Most Recently Used algorithm keeps track of the cache line that was accessed
last of the lines in the cache. The line chosen for replacement is randomly picked from those that
are not most recently used.
Dynamic Exclusion: Dynamic exclusion is a replacement method that tries to decide whether a cache
line shall be replaced, or the new entry should bypass the cache and go directly to the processor [128].
This occurs when two references map to the same line. The protocol tries to keep one of
them in the cache and the other one outside. Dynamic exclusion tries to avoid replacing a line
with a line that could degrade performance. A small finite state machine is used to recognize the
common reference patterns where storing a new reference would reduce performance.
Most of the requests from the CPU are read operations (all instruction fetches are reads). When a CPU
issues a read it will stall until the request is fulfilled; therefore, optimizing the cache to reduce latency is
the main objective for read operations [46]. Data blocks can be read simultaneously as the tag comparison
is being performed; depending upon the result of the tag comparison it is a hit or a miss.
Read hit: When the tag check yields a hit for the desired word, the cache will provide the CPU with
the requested data from the pre-initiated read. Replacement information will be updated
according to the rules of the algorithm in use.
18 In direct mapped caches memory blocks can only be mapped onto one specific line; there is no other line to choose.
Read miss: In the case when the requested data is not to be found in the cache, the premature read of
data will be suspended, and data must be brought in from the next level in the memory hierarchy.
A fast but costly and complex hardware method is to deliver the data to the processor in parallel
as the cache is being updated. A slower but less complex alternative is to provide the processor
with the requested data after the cache line has been updated with the new data.
Only about 7% of the total interaction between CPU and memory consists of writes [8]. This is a small but not
negligible part of the memory/CPU interactions, so finding the best way to enhance performance for
those 7% can be worth the effort of finding an optimal write policy for the actual architecture.
Write hit
When a CPU write results in a cache hit, the primary objective is to reduce the bandwidth used [46].
The amount of bandwidth used to the next level in the memory hierarchy when write hits occur
depends on the policy used by the cache. The choice of write policy directly affects how the cache
handles the coherence problem that comes with writes. For handling write hits two policies exist:
Write-through, or sometimes called store-through, caches update both the cache block and the
memory at the next level on every write. During writes to the cache including the nearest lower
level using write-through, the processor will encounter a write stall due to the latency of writing
to lower levels. To overcome this dilemma a write buffer can be installed. During the write-through
the processor issues its write to the cache as usual, but instead of presenting the write
directly to the lower level the processor writes to the write buffer and then continues with its
work. When data has been presented to the write buffer, the buffer is responsible for updating the
lower-level memory.
The write-through method is easy to implement compared to its counterpart write-back. When
using write-through, the next level in the memory hierarchy always contains the most recent
copy of the data.
An opportunistic approach for direct-mapped caches using write-through is to simultaneously
perform the tag check and write the data to the next level, write-before-hit [46]. In case the access turns
out to be a miss, there is no harm done since that line would have been replaced in any case.
A write-back cache must confirm a hit in order to modify cached data. This is vital since, in case
of a miss, dirty data could be overwritten by the modification and result in inconsistency. For
set-associative caches, no matter which write policy is used, tag confirmation is always needed before a
write [46]. As recently mentioned, a set-associative cache, or any cache using write-back, needs to
perform the tag check and the write in two steps, which complicates invoking the cache access in
the CPU pipeline. There is a need for interlocks between the write-back step and the memory
step, which will increase latency [46], whereas a direct mapped write-through cache is
easy to integrate with the CPU pipeline.
Write-through caches have a better error tolerance than write-back caches [46]. The better
tolerance originates in the fact that write-through caches contain no unique data that might
have been modified.
Write-back, also called store-in or copy-back, caches. In contrast to the write-through policy, write-back
only modifies the cached data. By the use of the dirty bit, a cache line can be marked
as dirty, which is an indication to the cache controller that the line, when being replaced, must
be written back to the next level. The write-back policy reduces the write traffic that leaves
the cache by taking advantage of the temporal and spatial locality in writes [46].
Consecutive writes to a block only result in a single write [2], compared to a write for every
modification of the block as in the case of write-through. Since not every write issued by the
processor results in a write to the next memory level, less bandwidth is used, which is
desired in multiprocessor systems [8]. When write-back caches are used in a multiprogrammed
environment, bursts of writes are common when the processor performs a task switch [2].
Write miss
When a requested write to the cache results in a miss, the choice of policy for handling the situation
has a significant effect on the CPU stall time during the cache miss, as well as on the refill traffic
to the cache. The bandwidth is a big concern for write misses, but most of the policies focus on
reducing the latency [46].
Write allocate, also referred to as fetch on write. A cache invoking write allocate as its write miss policy
will load the desired block into the cache when a miss occurs. After the block has been fetched,
the cache acts according to a write hit [8]. Jouppi [46] claims that write allocate and fetch on
write are separable. According to Jouppi, the above is a definition of fetch on write, whereas
write allocate does not fetch data from the level further down the hierarchy; instead the address
written to is allocated in the cache. With those definitions it is possible to have a direct
mapped cache with write allocate without the fetch of data for the referenced address. This will
improve performance compared to non-blocking caches that make use of buffers to handle
write misses, since a subsequent read of the modified data will yield a hit. In the general
case, write allocate is often used along with copy-back caches.
No-write allocate caches treat a write miss by updating the memory in the next level; hence,
on a consecutive read, data must be fetched from that level, introducing a CPU stall.
Generally, write-through caches use no-write allocate.
Larger block size trade-offs: Increasing the block size without changing the total cache size yields
fewer blocks in the cache, which will result in an increased rate of conflict misses, and even capacity
misses will increase. An enlarged block size also requires more bandwidth and increases the latency,
hence an increased miss penalty. Obviously there are important trade-offs to be considered.
Higher associativity trade-offs: Increasing the associativity of the cache will reduce conflict misses
at the expense of a possibly increased hit time.
Miss & victim caches: A technique to reduce conflict misses without increasing the miss penalty or
affecting the clock speed is to insert a small fully associative cache between the cache and its
refill path. Two solutions based on this architecture are presented.
Miss caching: A miss cache is a small, two to five line, fully associative cache [45] for
insertion between the first-level cache and the level closest under it. In case of a cache miss,
data from the lower level is inserted both in the normal cache and in the miss cache, where
the LRU entry is replaced. In parallel with the probing for an address match in the direct
mapped cache, the miss cache is also probed for a hit. In case the cache probe yields a miss
and a hit was made in the miss cache, then the direct mapped cache can be reloaded with
the actual cache block in the next clock cycle, taking the stored data from the miss cache.
The miss cache is better at reducing data conflict misses than instruction conflict
misses.
Victim cache A victim cache is an improvement of the previously introduced miss cache[45]. The improvement lies in eliminating the copying of data to both the cache and the miss cache on a cache miss, which wastes cache space. By using another replacement algorithm, the performance of the miss cache can be enhanced. Requested data from a miss is no longer loaded into the miss cache; instead, cache lines that the direct mapped cache has victimized for replacement are, hence the name victim cache. If a miss occurs in the direct mapped cache and the requested block is stored in the victim cache, the corresponding lines are swapped between the two caches. Further improvements can be achieved by the use of selective victim caches[51]. In contrast to the "simple" victim cache, the selective victim cache can place a block from a lower level either in the direct mapped cache or in the victim cache. This selective placement is done using a prediction scheme based on previous references. Those blocks that are most likely to be accessed in the near future are assigned to the real cache, whereas others, classified as not likely to be referenced within a certain amount of time, are placed in the victim cache. Prediction parameters are recalculated when a miss occurs in the main cache and the requested block is found in the victim cache.
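As a small illustrative sketch of the victim cache mechanism described above (not part of the implementation in this thesis; the class, the set count, and the number of victim lines are invented for the example), the swap on a victim hit and the capture of victimized lines can be modelled as follows:

```python
from collections import OrderedDict

class VictimCache:
    """A direct mapped cache backed by a small fully associative victim cache."""

    def __init__(self, sets=8, victim_lines=4):
        self.sets = sets
        self.main = {}                       # set index -> tag (direct mapped)
        self.victim = OrderedDict()          # block address -> True, LRU order
        self.victim_lines = victim_lines

    def access(self, addr):
        """Return 'hit', 'victim_hit' (lines swapped), or 'miss' (fill from below)."""
        index, tag = addr % self.sets, addr // self.sets
        if self.main.get(index) == tag:
            return "hit"
        evicted = self.main.get(index)       # line victimized by the replacement
        if addr in self.victim:              # probed in parallel with the main tag check
            del self.victim[addr]
            if evicted is not None:
                self._insert_victim(evicted * self.sets + index)
            self.main[index] = tag
            return "victim_hit"              # reload from the victim cache, not memory
        if evicted is not None:              # genuine miss: victimized line is kept
            self._insert_victim(evicted * self.sets + index)
        self.main[index] = tag
        return "miss"

    def _insert_victim(self, block):
        self.victim.pop(block, None)
        self.victim[block] = True
        if len(self.victim) > self.victim_lines:
            self.victim.popitem(last=False)  # drop the LRU victim entry
```

With eight sets, addresses 0 and 8 conflict in set 0; after both have been referenced once, further alternating references hit in the victim cache instead of going to the lower level.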
Pseudo-associativity After exploring the different cache strategies regarding their performance and drawbacks, one could think that combining the best features from different techniques can become beneficial: a cache that possesses the fast hit time of direct mapped caches together with the low miss rate of set-associative caches. This can be achieved with a special case of associativity added to the direct mapped cache, a column-associative or pseudo-associative cache[47]. This special cache reduces conflicts by dynamically choosing another location for the conflicting data using a hashing function. Instead of moving conflicting data to another location within the same set, as is the case for set-associative caches, the column-associative cache, which fundamentally is direct mapped, finds the alternative block place within another set, still in the same cache, hence the name column-associative[47]. The new location is easily found with a bit-flipping technique, i.e. flipping the most significant bit of the set selection bits. When a reference hits in the cache it performs just like an ordinary direct mapped cache; on the other hand, when the reference misses, another set is checked. This gives the column-associative cache two different hit times19, which can result in deteriorating performance and complicated Worst Case Execution Time (WCET) calculation.
19 Regular hits as in an ordinary direct mapped cache, and pseudo hits that are slower due to the hash function and double probing.
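The bit-flipping rehash above can be sketched in a few lines (the set count is invented for illustration; any power of two works). Flipping the most significant set-selection bit is an involution, so applying the rehash twice returns the primary set:

```python
SETS = 64                 # number of sets; must be a power of two (invented value)
MSB = SETS >> 1           # mask selecting the top set-selection bit

def primary_set(addr):
    """The ordinary direct mapped set index."""
    return addr % SETS

def alternate_set(addr):
    """The column-associative alternative set: flip the MSB of the set index."""
    return primary_set(addr) ^ MSB
```

A reference first probes `primary_set(addr)`; only on a miss there is `alternate_set(addr)` probed, which is what produces the two different hit times.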
Prefetching Time wasted by processor stalls waiting for cache misses can be reduced with prefetching, which brings data closer to the processor before it is needed. Prefetching applies to instructions as well as to data. The different prefetching paradigms and techniques will be covered in section 4.6.
Compiler optimizations There exist other approaches than the hardware based techniques to reduce the miss rate. Reduction of the miss rate can be achieved with several compiler optimizations. Those techniques are not in the scope of this analysis, and are therefore left for the interested reader to find elsewhere.
Miss penalty As the processor/memory speed gap continues to increase, the relative time, measured in CPU stall cycles, for a miss to be handled will also increase. With this in mind, reducing the miss penalty is no less significant than hit time reduction.
Read miss before write miss The use of a write buffer to allow the processor to overlap write misses with execution of other instructions introduces problems when a read instruction is issued to a previously modified address not yet written back to the lower level, i.e. the modified value still remains in the write buffer. With a rather large buffer, waiting for the writes to finish before proceeding can waste a considerable amount of CPU cycles, depending on the number of pending writes in the buffer. A better technique is to reuse the data residing in the buffer if no other conflicts are present.
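This forwarding from the write buffer can be sketched as follows (a hypothetical structure, not from the thesis; the class and method names are invented). A read miss first checks the pending writes and returns the buffered value instead of draining the buffer:

```python
class WriteBuffer:
    """Pending writes awaiting write-back to the lower memory level."""

    def __init__(self):
        self.pending = {}                # address -> value not yet written back

    def write(self, addr, value):
        self.pending[addr] = value       # later writes to the same address coalesce

    def read_miss(self, addr, memory):
        """Service a read miss: forward from the buffer if possible, no stall."""
        if addr in self.pending:
            return self.pending[addr]    # newest value is still in the buffer
        return memory.get(addr, 0)       # otherwise fetch from the lower level
```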
Sub-block placement Sub-block placement is based on the idea of having a larger block residing in the cache, and enabling invalidation of fractions of the whole block, hence sub-blocks. Only the sub-block that generated the miss needs to be loaded from memory.
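A minimal sketch of the idea, assuming one valid bit per sub-block (the class and the sub-block count are invented for illustration): a miss marks and fetches only the referenced sub-block rather than the whole line.

```python
class SubBlockedLine:
    """One cache line with per-sub-block valid bits."""

    def __init__(self, sub_blocks=4):
        self.valid = [False] * sub_blocks

    def access(self, sub_index):
        """Return True on a hit; on a miss, fetch just that sub-block."""
        if self.valid[sub_index]:
            return True
        self.valid[sub_index] = True     # only the missing sub-block is loaded
        return False
```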
Fast word access The size of the desired data when the CPU issues a memory access instruction is determined by that specific instruction: byte, word, double word, and further up to the size of the processor registers. In large caches that amount of bytes is often only a subset of the total data stored in the cache block. This relation between cache blocks and request size can be utilized in order to reduce the miss penalty. Since the processor does not need the whole cache block but just a fraction of it, the processor can be given its requested portion of the block as soon as it has been loaded, before carrying on loading the rest of the block; this is called early restart. Another promising and somewhat more aggressive approach is to read the requested data first of all from its memory block and then, once the CPU is satisfied, continue to read the remaining part or parts of the block. This scheme is known as critical word first.
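The critical-word-first fetch order can be written down directly (a sketch with invented parameter names): the requested word comes first and the remaining words of the block follow, wrapping around to the start of the block.

```python
def critical_word_first(block_words, requested):
    """Order in which the words of a block are fetched, requested word first."""
    return [(requested + i) % block_words for i in range(block_words)]
```

For a four-word block where the CPU asked for word 2, the fetch order is 2, 3, 0, 1; the CPU restarts as soon as word 2 arrives.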
Non-blocking loads CPU stalls waiting for a cache miss20 to propagate can to a certain degree be reduced if the cache features non-blocking loads. Non-blocking caches enable the instruction which caused a cache miss to be overlapped by pending instructions while the miss is treated[43]. The amount of possible overlap depends on the number of independent21 instructions available. If, during the overlapped execution of pending instructions, an instruction depends on the data being loaded, the processor must stall and wait for the cache miss to terminate. Another cache miss during overlapped execution will likewise stall the processor.
4.4 MMU
The MMU (Memory Management Unit) is primarily responsible for providing address translation between the virtual and physical address spaces. Virtual memory allows the program address space to exceed the size of the physical memory present; the size of the program is only restricted by the addressing capacity of the processor[134]. Since the physical memory is physically addressed, translation between virtual and physical space must be performed.
20 Miss after issuing a load instruction, hence the name non-blocking loads.
21 Independent with respect to the data being loaded into the cache during the overlap.
A cache that operates on physical addresses that have been translated from virtual addresses by the MMU is a physically indexed cache. A virtually addressed cache operates on virtual addresses. The advantage of having a virtually addressed cache is that there is no need for address translation for entries residing in the cache. The drawback is that it must deal with the synonym or aliasing problem of recognizing all virtual addresses that map to the same specific physical address. The placement of the different caches with respect to memory, MMU, and the processor can be seen in figure 26. Figure 26 (a) represents a configuration with a physically addressed cache, whereas figure 26 (b) shows the case of a virtually addressed cache. Besides address translation, the MMU is also responsible for the detection and processing of missing items22. The structure of the address translation unit depends on the segmentation and paging of the memory. Segmentation is the subdivision of the address space into logically related groups (segments); another approach is to divide the address space into fixed-size pages. A method used to reduce the overhead of address translation encountered in paging systems is to use a translation lookaside buffer (TLB)[134]; the TLB contains an associative memory where recently used descriptors of pages are kept.
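A minimal sketch of paged translation with a TLB front-end (all sizes and names invented for illustration; a real MMU also checks protection bits and raises a fault on a missing page): a TLB hit avoids the page-table walk, a miss walks the table and caches the descriptor for reuse.

```python
PAGE_SIZE = 4096  # invented page size

def translate(vaddr, tlb, page_table):
    """Translate a virtual address to a physical address via a TLB."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                        # TLB hit: no page-table walk needed
        return tlb[vpn] * PAGE_SIZE + offset
    frame = page_table[vpn]               # TLB miss: walk the page table
    tlb[vpn] = frame                      # keep the descriptor for later reuse
    return frame * PAGE_SIZE + offset
```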
Figure 26: Placement of the cache with respect to the CPU, MMU, and main memory: (a) a physically addressed cache; (b) a virtually addressed cache.
All computers can, depending on their level of parallelism in executing instructions and utilizing data, be divided into four categories according to Flynn's taxonomy[2]. The four categories are SISD, SIMD, MISD and MIMD; further information about these abbreviations can be found in section 2.7.4. This chapter is dedicated to the Multiple Instruction Multiple Data (MIMD) architectures, which essentially are multiprocessors.
Multiprocessor systems can further be divided by the use of different programming models. A programming model can be described in the user level communication primitives of the system[1]. It can be said to be the model for how communication is managed by the programmer: implicitly through the assembler instructions generating communication (transparent to the programmer) or via explicit message passing.
In the shared address model, communication is managed through specific shared memory locations. This hides the abstraction of communication from the programmer, who cannot tell whether he/she is programming on a uniprocessor system utilizing multiprogramming or on a multiprocessor.
Message passing, as the name implies (such systems are also often called multicomputers), relies on explicit messages for communication between nodes. In this paradigm each processor has its own local memory, only accessible by that processor[119]. The communication is managed via a software layer protocol, which increases communication time in comparison to the shared memory approach. On the other hand, message passing solutions scale better as more nodes are added[121]. Shared memory multiprocessors can also rely upon message passing; this is the case for distributed shared memory multiprocessors, described in section 4.5.2.
In the data parallel model, data is processed in parallel and communication is for synchronization, which can be either message based or by the use of shared addresses. According to Flynn's taxonomy this model is a SIMD architecture.
22 Parts
The main concern of this chapter is to examine the memory organization used for multiprocessors: what implications there are in maintaining a shared memory multiprocessor with respect to uniformity, non-uniformity, coherence, and scalability. A few of those problems have already been mentioned (i.e. coherence misses, see section 4.3.2). Even though the basics of cache management and architecture remain the same as for uniprocessors, there exist other problems concerning cache and memory management that are specific to multiprocessor architectures and must be addressed.
4.5.1 Symmetric Multiprocessors
Symmetric multiprocessors (SMP), also referred to as UMA (Uniform Memory Access), are systems in which all main memory resides at equal distance from all individual processors. No matter which address a processor issues a reference to, all are accessed in equal time, independent of physical location in the system. Since much of the communication in shared memory systems is performed by memory referencing assembler instructions, the choice of memory organization is a key issue[1].
In general, three types of memory hierarchies are common for shared memory multiprocessor systems.
Figure 27: Three common memory hierarchies for shared memory multiprocessors: shared cache (with interleaved level 1 cache and main memory), bus-based shared memory with I/O, and dance-hall with an interconnection network.
architecture[1]. Coherence in bus-based systems is often obtained through bus snooping protocols, see section 4.5.5.
Dance-hall The dance-hall approach shows many similarities to the bus-based model; the most obvious difference lies in replacing the bus with a more sophisticated interconnect. The other difference is that main memory is divided into small entities which individually connect to the interconnect. This approach is designed to achieve a higher degree of scalability than the previously described solutions. Although the hierarchy is still symmetric in the sense of uniform distance between all processors and memory, the actual distance between the entities can limit performance as the system grows.
4.5.2 Distributed memory
The Distributed Shared Memory (DSM) architecture is an approach to overcome the limitations in scaling encountered in SMP models, while keeping the convenience of a shared address space. This is achieved by using memory that is physically distributed among the nodes, but logically implements a single shared address space[121]. A general description of a distributed system can be viewed in figure 28.
The key distinction between this model and a symmetric multiprocessor is that here each processor node has its own local subset of the total global memory, and communication is done through explicit message passing. The short distance between the CPU and the local memory enables higher speed and low latency for memory references that can be handled by local memory. Processor nodes are tied together by a scalable interconnect. This results in high memory latency for references that do not hit in the local memory system, since those references must be obtained from another processor's local memory. This property of non-uniformity in memory access time gave this approach the name NUMA, Non Uniform Memory Access[121, 116].
Two well known shared memory multiprocessors are the Stanford DASH (Directory Architecture for SHared memory) and the Stanford FLASH (FLexible Architecture for SHared memory); more about those architectures can be found in [119, 117].
Figure 28: Distributed memory: processor nodes with caches and local memories connected by an interconnection network.
Block localization When main memory (AM) has the functionality of a cache, the block's address is a global identifier and not a physical location. Since a block can exist in other processors' local AMs, there needs to be a method to localize a remote block when a miss occurs. The processor that missed a reference in its local AM must communicate with a directory in order to localize the holder of a valid copy. In hierarchical COMA architectures[120] the nodes are organized in either a tree or a ring structure, where each level contains a directory. To obtain a block, traversal of several levels in the hierarchy is sometimes necessary.
Block replacement Since blocks in COMA architectures migrate, they do not have a fixed backup location where a write-back can take place. A block scheduled for replacement must, even if it is unmodified and the only remaining copy, be relocated and not just overwritten and lost. The system must keep track of the last copy of a block and migrate it to another AM when it is replaced.
Memory overhead There must always remain an amount of unallocated memory in the AM in order to manage replication and migration. If no unallocated memory remains in the AM, a replacement has to be done for every new block that is put in the AM.
The main advantage of the COMA architecture is the ability to capture remote capacity misses as hits in the local memory[121]. The latency of the normal hierarchical COMA has led to alternative approaches that try to overcome the problem.
Flat-COMA: The Flat-COMA architecture does not rely on any hierarchy to find a block, which enables the architecture to use a high-speed network[120]. The directories are distributed among the nodes. Memory blocks can still migrate, but directory entries remain at the home node. On a miss in the AM, a request goes to the directory responsible for the block, which redirects the request to the block holder.
Simple-COMA: In Simple-COMA (S-COMA) the replacement is software directed, which differs from the hardware approach used in normal and Flat-COMA.
Multiplexed Simple-COMA: The S-COMA architecture suffers from a memory fragmentation problem, due to allocation of memory in page-sized chunks even if the blocks are much smaller[120], leading to inflated working sets and frequent replacements. This problem is addressed in Multiplexed Simple-COMA (MS-COMA) by allowing multiple virtual pages in a node to map to the same physical page simultaneously.
4.5.4 Coherence
Although coherence is important in uniprocessors with caches at different levels of their memory system, the problem escalates when caches are used in multiprocessor systems. The problem arises when different caches hold the same memory location and one of them updates that address. The other sharers of that location will see different values for the same address; that is also the case for main memory. The system is said to be inconsistent at this instant in time[116]. To introduce caches and not solve the problem of inconsistency puts the programmer in a dilemma, since the intuitive programming model of consistent memory is no longer present[121]. The problem is addressed by having the caches monitor the state of their contents with respect to eventual sharers, write permission, and invalidation. This monitoring and preservation of state information is administrated by the coherence protocol. Different protocols apply to different multiprocessor architectures, dependent on their interconnect architecture and memory model (e.g. shared or distributed). Another factor that can affect the consistency of the system is I/O transfers performed by Direct Memory Access (DMA). The DMA transfers data between some I/O device and its dedicated location in memory directly, without involving the CPU. If the DMA transaction overwrites a location in main memory which still resides in one or several processors' caches, inconsistency will prevail. The problem is analogous in reverse: a DMA transaction can transfer stale values that reside in main memory, while the correct values are still in the cache. This problem can be addressed by the use of uncachable locations or by actions taken by the operating system.
In SMPs, the bus is a convenient device for maintaining cache coherence, since all processors in the system are able to observe the ongoing memory transactions[112]. All the caches snoop the bus in order to monitor the other caches' actions on the bus. When a snooping cache discovers a transaction relevant to some of its own cached blocks, actions must be taken according to the applied coherence protocol. The snoop control is basically a regular tag control, analogous to the tag control performed for normal memory accesses by the processor. Actions are taken dependent upon what the other cache did in combination with the actual state of the cached data.
Snoopy protocols are beneficial solutions in bus based cache coherent multiprocessors, due to the inexpensive and speedy broadcast properties of the bus[115]. A major drawback is the increased bus traffic introduced by coherence actions issued on the shared bus. Contention will increase as more processors issue bus traffic; hence, bus-based solutions are only applicable to small and medium size multiprocessors[115].
The snoopy protocols are classified into three groups based on the write policy used: write invalidate, write update, and adaptive protocols, a combination of both in an attempt to combine the benefits of the other two.
Write-invalidate protocols This class of snoopy protocols allows multiple readers of the same data, but only one at a time is allowed to write[115, 114]. All writes done by the processor are propagated onto the main memory bus. Any cache observing a write to an address corresponding to one of its own cached blocks invalidates that block. This scheme can be implemented using only two states, Valid (V) & Invalid (I); an un-cached line is often considered as being invalid.
Write-back "invalidate" protocols This class also follows the rule of multiple readers and a single writer. In most cases write-through invalidate protocols are impractical due to the heavy bus traffic imposed by the write-through policy and the reloading of previously invalidated lines[114]. By the use of write-back, some of the writes to memory incurred when using write-through are eliminated, if the cache that modified the data can keep it and write it back when invalidated by another cache or replaced. To make this possible it becomes necessary to distinguish unmodified blocks from those that are modified[1]. This can be achieved using a protocol that consists of three states: Modified (M), Shared (S), and Invalid (I), (MSI). The modified state indicates that data has been modified and the other sharers invalidated; the holder of that block is now considered the owner of that block. A cache is said to be the owner of a block if it, on a request for that data, must provide that data to the main memory or sometimes even to other caches[114]. The shared state means that one or several caches can have the block, but none of them have modified it; hence the cache and main memory are consistent. Every write to a shared block must be preceded by the invalidation of all other copies[115]. When a cache has the block in modified state, local reads or writes generate neither bus traffic nor invalidations, since all other copies are already invalidated. A cache holding a block in modified state must provide the memory with the updated data in case another processor writes or reads that address. A disadvantage is the extra bus transaction made on a write hit to data in shared state, even though no other cache may actually share that block.
This problem is addressed in the Illinois MESI protocol. To avoid the extra transaction, the protocol needs to recognize the sharing status of a cached block[115]. With the introduction of the Exclusive (E) state, often called exclusive clean or un-owned[1], a data block can be indicated as the only valid and unmodified copy of the block. By the use of different states for unshared (exclusive) and shared copies, the protocol improves performance in handling private data, by avoiding invalidations on write hits to unmodified blocks with no other sharers[115].
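The MSI transitions described above can be sketched as a small state table (an illustrative model, not from the thesis; event names are invented), seen from one cache line reacting to local operations and snooped bus traffic:

```python
def msi_next(state, event):
    """Next MSI state for a cache line, given a local or snooped event."""
    table = {
        ("I", "local_read"):  "S",   # load a shared copy from memory
        ("I", "local_write"): "M",   # invalidate other copies, take ownership
        ("S", "local_write"): "M",   # upgrade: broadcast invalidation first
        ("S", "bus_write"):   "I",   # another cache wrote: invalidate our copy
        ("M", "bus_read"):    "S",   # supply the data, fall back to shared
        ("M", "bus_write"):   "I",   # another cache takes ownership
    }
    return table.get((state, event), state)  # other events leave the state unchanged
```

Note that local reads and writes in state M hit without generating bus traffic, exactly the property the text attributes to the modified state.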
Write-update protocols This class of coherence protocols makes use of a distributed write approach, which allows several copies of the same cache block to exist in a writable state simultaneously[115, 114]. The cache that issues a write broadcasts the written word of the shared block to all caches in the system. This type of protocol often makes use of a special bus line or lines so that a cache can dynamically detect the sharing status of a block. The extra bus line is only in use when a block is shared by more than one cache. When a block is no longer shared among the caches, it is marked as private and the update broadcast is no longer necessary. This makes this type of protocol use write-through for shared writes and write-back when data is private[115].
A typical implementation of a write-update protocol is the Firefly coherence protocol, used in the DEC Firefly multiprocessor[114]. Coherence is maintained with the use of four states, which are combinations of the two state bits dirty & shared; no invalid state is needed since the protocol is update based[1, 114]. When the dirty bit is set, the cached block is modified with respect to main memory and must be written back to memory in case the block must be replaced. The shared bit indicates that one or several other caches may have the same block; when this occurs, write-through must be used. Reads and writes to unshared memory addresses are satisfied by the cache; no involvement of main memory is needed.
The Xerox Dragon multiprocessor workstation is another known implementation, using a slightly different approach than the Firefly but still a write-update protocol. The most apparent difference is the memory update policy. The memory is not updated using a distributed write like in the Firefly protocol; instead the owner of the block is responsible for writing data back to main memory. This calls for an additional, more precisely an owned modified, state. The Dragon protocol involves four states: Modified (M), already described; Shared clean (Sc), where the block is shared by two or more caches and main memory might be up-to-date; Exclusive clean (E), with the same semantics as the corresponding state in the MESI protocol; and the last state, Shared modified (Sm), where two or more caches hold the block, which is not consistent with main memory; this cache is the owner of the block and is responsible for the write-back of the block when it is scheduled for replacement. A block may reside in Sm state in only one cache at a time; all other possible sharers hold the block in Sc. The main benefit of the Dragon protocol compared to the Firefly is that frequent updates of main memory due to shared write hits are avoided[115].
Even the Dragon and Firefly protocols have their shortcomings. In the Dragon protocol there exist occasions when the main memory responds to a data request even though there exist valid copies in other caches; this situation appears because only a dirty cache is allowed to respond to data requests. In the Firefly protocol clean cache lines can be transferred between caches, but a write to a shared block will impose excessive memory traffic as the main memory must be updated[113].
In an attempt to reduce these unnecessary accesses to main memory, Takahashi et al. propose the CRAC (Coherence solution Reducing memory Access using CCU) protocol[113]. The CCU (Central Coherence Unit) has two responsibilities: it monitors all cache tags and maintains coherence through cache to cache transfers, and it arbitrates concurrent bus requests from different processors and controls the data transfer between off-chip memory and the on-chip caches. In order to keep caches and memory consistent, the protocol incorporates five states: Invalid, Clean-Exclusive, Dirty-Exclusive, Clean-Shared, and Dirty-Shared. Upon a memory request, data can come from any cache holding a copy, reducing memory traffic when copies exist. Writes to shared lines do not result in an access to main memory; data is instead sent to the other caches that have a copy of that line. Only lines in Dirty-Exclusive or Dirty-Shared state are responsible for updating memory, which occurs when the actual line is being replaced. The responsibility for updating the memory can be transferred to other sharers of the block; this occurs when a Dirty-Shared block must be replaced and another cache has that line. Data is transferred and set in Dirty state.
Adaptive protocols Apparently none of the described protocols delivers optimal performance across all types of workloads[115]. A solution that performs better than both pure write-invalidate and write-update protocols is to combine the benefits of each of them into a new type of protocol. These adaptive protocols try to achieve optimal performance by adapting the coherence mechanism used according to observed and predicted data use[115]. This has led to a variety of different protocols, the RWB (Read Write Broadcast) and EDWP (Efficient Distributed Write Protocol) among several others; more information about RWB and EDWP can be found in [103, 104]. This diversity resulted
For computers with distributed memory, bus-based coherence will not deliver enough bandwidth when the system scales, so another approach must be considered. Scalable cache coherence is mainly based upon using a directory that maintains the state of individual memory blocks, and message passing between the directories to keep the system consistent. This class of multiprocessors is often called Cache Coherent NUMA, or just CC-NUMA, architectures[1]. When a processor requests a memory block, it must look up the state of the block in the directory. Every block of main memory has a record associated with it that contains information about all the caches that currently have a copy of that memory block. A node in the system that encounters a cache miss must communicate via the interconnect with the directory that holds that block. The actual location of the directory can be obtained from the memory address, since each directory is coupled to its corresponding main memory (see figure 29). The information gained from the directory determines what must be done in order to obtain a copy of the requested memory block. This includes communication with the actual holder of the block, sending invalidations to other holders, and receiving acknowledgments from those caches. The requesting node must also, when needed, inform the directory of possible changes to the block's state; this is also done using interconnect communication. Most of the communication through the interconnect is performed by a Communication Assist (CA) rather than by the processor itself[1]. The key characteristic of directory-based coherence protocols is the use of a directory, which stores information about the system's global state in regard to coherence between main memory and the caches[115]. Directory-based cache coherence protocols can be either invalidate or update based. Invalidate-based protocols require that the cache that is to write to a block has exclusive ownership of that block, and update protocols require an order preserving network[107].
Figure 29: Distributed memory with cache directories assigned to the memory.
Full-map directory The previous example of a directory can be said to be a full-map directory. Full-map directories are in the class of centralized directory-based cache coherence protocols [Chaiken-90]. They reside in main memory and have entries for every memory block that is cachable[112, 115]. A full-map protocol makes use of directory entries with one bit per processor and a dirty bit, where each bit represents whether the block is present or not in the corresponding processor's cache. The dirty bit indicates that a cache has write permission to the block; only one processor has write permission at any instant in time. The caches have two bits of state information: a valid bit, and a second bit that indicates whether the cache has write permission for that block or not. The drawback is that the protocol does not scale well with respect to memory overhead[112].
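A full-map directory entry as described above can be sketched as follows (field and method names invented for illustration): one presence bit per processor plus a dirty bit; a write leaves exactly one presence bit set, reflecting the invalidation of all other copies.

```python
class FullMapEntry:
    """Directory entry for one memory block: presence bit vector + dirty bit."""

    def __init__(self, n_procs):
        self.present = [False] * n_procs  # memory overhead grows with n_procs
        self.dirty = False                # set => exactly one cache may write

    def record_read(self, proc):
        self.present[proc] = True         # another read-only sharer

    def record_write(self, proc):
        # Writer gains exclusive ownership; all other copies are invalidated.
        self.present = [p == proc for p in range(len(self.present))]
        self.dirty = True
```

The `present` vector is exactly the part that does not scale: its size grows linearly with the number of processors, which is the memory overhead the limited directories below try to reduce.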
Limited directories The motivation for limited directories is the memory overhead problem observed in full-map directories[115]. The approach to reduce the overhead is to restrict the number of simultaneous sharers of a block. This directory replaces the presence vector used by full-map directories with a small number of identifiers that point out the sharers of the block. The performance of the limited directory in comparison to the full-map is dependent on the amount of shared data, the number of processors that access each shared location, and the synchronization method[112].
Chained directories An improvement in scalability is to introduce a chained or distributed directory, which does not impose any limit on the number of possible sharers of a block[115]. The directory is spread across the individual caches. A linked list is used to maintain control over all possible sharers of a memory block. The main memory contains a link to the last cache to become a sharer of the block, and each cache has a pointer entry used to point out the next sharer of that block. Two types of chained directories exist: single linked (i.e. the Stanford Distributed-Directory, SDD[111]) or double linked (i.e. the Scalable Coherent Interface, SCI[110]).
In the simple single linked version of chained directories, the main memory entry has a pointer to the first cache23 that has a copy of the actual block[107]. Each cache in the list contains a link to the next cache that has a copy of the block, except the last, which contains the chain terminator[115]. The single link chain can introduce extra overhead in the replacement of cache lines, causing extensive invalidations[111, 115, 112].
Instead of using only a single pointer, each list entry can contain a forward and a backward pointer; this approach solves the replacement problem encountered with single linked lists[115], since the replaced block easily can be dropped by chaining its predecessor and successor24.
The performance of distributed directories is comparable to full-map directories; whether double linked chains are better than single linked is debated[111, 112].
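Why the double links ease replacement can be shown in a few lines (an illustrative sketch, not from the cited protocols; names invented): the departing sharer is dropped by joining its predecessor and successor, with no traversal from the head of the list.

```python
class Sharer:
    """One node in a doubly linked sharing list for a memory block."""

    def __init__(self, cache_id):
        self.cache_id = cache_id
        self.prev = None
        self.next = None

def unlink(node):
    """Remove a sharer from the chain by rechaining its neighbours."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
```

With only forward pointers, finding the predecessor of the replaced line would require walking the chain from its head, which is the extra overhead attributed to the single linked variant.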
4.6 Hardware-driven prefetching
Prefetching has proven to be an effective method to tolerate memory latency[67]. This is achieved by letting the CPU overlap data accesses with computations. There exist two broad categories of prefetching, hardware driven and software assisted. In this section only hardware based techniques are in focus; software based prefetching is described in section 2.7.5, except for a few minor issues concerning the hardware originating from software controlled prefetching. The main advantage of hardware driven prefetching is the dynamic handling of prefetches at runtime, without the compiler assistance needed for software based prefetch[70]. Hardware-based prefetching techniques rely on speculation in order to predict future reference patterns based on information regarding past reference patterns[70], information which is dynamically provided to the prefetch hardware during runtime. The additional hardware for prediction and management of prefetching ought to be simple and not in the critical path of the processor's cycle time[1].
Applications that have good cache performance or exhibit irregular reference patterns will not benefit
from prefetching; programs that iterate over long arrays can achieve performance improvements[70].
Systems that exhibit large latencies are likely to benefit the most from prefetch, since
stall cycles represent a significant amount of the total execution time in those systems[70]. When a
system utilizes prefetching, memory traffic will increase due to prefetch of obsolete data, an increased number
of cache misses due to conflicts with the established working set, extra invalidations caused by additional
write-sharing, and an increased rate of invalidation misses due to prefetch for writes[67].
23 Head of the list.
24 Further information
When prefetching is used, the demands on the cache increase, as recently prefetched data
must coexist with the cache's current working set. This has negative effects on the total miss rate,
due to conflict misses between the prefetched data and already cached data[69].
4.6.1 One-Block-Lookahead (OBL)
The simplest form of sequential prefetching is the family of one-block-lookahead variations[71, 70]. Common
to all OBL schemes is that they initiate a prefetch of the next consecutive block. This technique is
different from just doubling the block size, since the prefetched block is treated as a separate item with regard to
cache replacement. The different OBL schemes are classified depending on what action triggers the
prefetch. Most of the techniques described as OBL can be modified to prefetch more than just one block.
Prefetch always Every reference generates a prefetch of its successive block.
Prefetch-on-miss When a memory reference causes a miss in the cache, the referenced block will be
fetched, and the next block will then be prefetched if it does not already reside in the cache. This
simple scheme can cut the number of misses in a strictly sequential reference stream in half[45].
Tagged prefetch With tagged prefetch, a tag bit is associated with every block. The bit is used as an
indication of the block's status with regard to when a prefetch of its successive block is to be issued.
When a block has been prefetched, its tag is initialized to zero. On a reference to that
block the tag is set to one; whenever a block's tag turns from zero to one, its successor
block is prefetched. For a strictly sequential reference stream, all misses except the first can be
eliminated[45, 70]. Due to the extra overhead and complexity introduced by the tag bit and
its management, tagged prefetch is more expensive to implement.
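The difference between the two miss-triggered schemes can be made concrete with a small sketch. This is a hedged functional model, not a hardware description: the cache is an unbounded set, blocks are integers, and prefetches complete instantly. Run over a strictly sequential block stream, prefetch-on-miss halves the misses while tagged prefetch leaves only the first.

```python
def prefetch_on_miss(stream):
    """Count misses when the successor block is prefetched only on a miss."""
    cache, misses = set(), 0
    for b in stream:
        if b not in cache:
            misses += 1
            cache.add(b)
            cache.add(b + 1)        # prefetch the successor on a miss
    return misses

def tagged_prefetch(stream):
    """Count misses when a 0->1 tag transition also triggers a prefetch."""
    cache, tag, misses = set(), {}, 0
    for b in stream:
        trigger = False
        if b not in cache:          # demand miss: fetch the block itself
            misses += 1
            cache.add(b)
            tag[b] = 1
            trigger = True
        elif tag.get(b) == 0:       # first reference to a prefetched block
            tag[b] = 1              # tag flips 0 -> 1
            trigger = True
        if trigger and b + 1 not in cache:
            cache.add(b + 1)
            tag[b + 1] = 0          # prefetched, not yet referenced
    return misses
```

For `stream = range(10)` an unprefetched cache would miss 10 times; `prefetch_on_miss` misses 5 times and `tagged_prefetch` only once, matching the claims in the text.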
A severe risk with all OBL schemes is that the actual prefetch is issued too late for the memory
system to respond before the data is needed by the processor[70]. As a further extension to the OBL paradigm,
several consecutive blocks can be prefetched and buffered in a FIFO queue, called a stream buffer.
4.6.2 Stream buffer
The stream buffer (or just buffer), proposed by Jouppi[45], is a cache-line FIFO queue inserted between two
levels in the memory hierarchy for prefetching of consecutive memory blocks that are to be accessed by
the processor in the near future. The stream buffer FIFO consists of stacked cache lines containing a tag field,
an available bit, and storage for the memory block. A tag comparator is assigned to the head entry of the
FIFO; hence, in the first variant of the stream buffer, only the first entry is checked for a hit and transferred
up in the hierarchy. Assigned to the last line in the buffer is an adder responsible for calculating the next
address to prefetch data from, in this case the last address plus one unit stride25. When a cache miss
occurs, the buffer starts to prefetch successive memory blocks, starting with the address that generated the
cache miss. When the prefetch request is sent to the lower level of the memory system, the tag of the
block to be fetched is written into the stream line and the available bit is cleared. Upon the arrival of the
requested block, the data is placed in the entry and the available bit is set. On subsequent cache
misses, the head of the buffer is compared, and if the tag matches the referenced address and the available
bit is set, data is fetched from the buffer in a single cycle. As one entry is moved from the buffer and
up the hierarchy, the entries remaining in the buffer are shifted towards the head, leaving the tail entry
empty. Based on the previous last entry, the following address is calculated and passed on as a prefetch
request. If an access misses both in the cache and the buffer, the contents of the buffer are flushed and a new
prefetch cycle begins by prefetching the address that caused the flush. Write-backs bypass the buffer,
invalidating stale copies that may reside in the buffer. In order to handle interleaved streams of data,
multi-way stream buffers were introduced, enabling prefetching of multiple streams in parallel. When an address
reference misses in the cache, all stream buffers are probed for a hit. The buffer containing the data provides
it; if no stream hits, the oldest stream is flushed and made ready to begin prefetching again.
25 A unit stride corresponds to the instruction length of the processor: word, double word, quad word or greater.
For replacement (i.e. flush), LRU is often used. Design parameters for stream buffers are the depth of
the buffer, which equals the number of prefetched blocks each stream holds, and the number of
buffers used[52]. The optimal depth depends upon the performance of the memory hierarchy; a stream
shall at least be deep enough that the main memory latency is covered.
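The head-comparison, shift and flush behaviour described above can be sketched as a small functional model. This is a hedged illustration, not Jouppi's hardware: memory latency and the available bit are omitted, a miss is modelled as an immediate flush-and-refill, and the class and method names are illustrative.

```python
from collections import deque

class StreamBuffer:
    """Single stream buffer: only the head entry is tag-compared."""

    def __init__(self, depth):
        self.depth = depth
        self.fifo = deque()            # block addresses, head at index 0

    def _refill(self, start):
        # Flush and restart: prefetch `depth` consecutive blocks.
        self.fifo = deque(start + i for i in range(self.depth))

    def access(self, block):
        """Return True on a stream-buffer hit, False on a miss."""
        if self.fifo and self.fifo[0] == block:
            self.fifo.popleft()        # hit: head moves up the hierarchy
            # the freed tail entry prefetches last address + one unit stride
            self.fifo.append(self.fifo[-1] + 1 if self.fifo else block + 1)
            return True
        # miss in the buffer: flush and prefetch the successors of the
        # missed block (the block itself is serviced by the cache fill)
        self._refill(block + 1)
        return False
```

A sequential run of accesses hits in the head entry cycle after cycle, while any non-sequential access costs a flush, which is why the later filter-buffer refinement tries to avoid allocating streams for isolated misses.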
4.6.3 Filter buffers
A problem with the previously described buffer is wasted bandwidth: useless blocks are fetched that are
never referenced. In an attempt to reduce this squandering of memory bandwidth, Palacharla &
Kessler[52] suggest a scheme that filters isolated references from being fetched and inserted into
the buffer, using an allocation filter. The filter prevents the buffer from prefetching a block on the first reference
miss26; only on the second miss will the buffer prefetch the block subsequent to the referenced data.
Blocks i and i+1 will thus be ignored and not prefetched, but blocks i+2 down to i+n will be.
The management of miss counts and reference comparisons is handled by a history buffer simply called
the filter. Simulations[52] indicate that the proposed filter can reduce the extra waste of bandwidth by as
much as 50%. Further optimizations to the scheme can be made. The use of unit strides can be replaced
with dynamically calculated strides based on previous miss addresses. When multiple streams
are in use, there is a chance of data overlapping between streams. In order to benefit the most from multiple
streams, overlapping must be eliminated; this is done by assigning a tag comparator to each line in
the buffer and comparing internal addresses between buffers. This modification results in a much more
complex construction than the first proposed original stream buffer. The buffer can also be extended to
allow not only the head element but also non-head entries to be passed on to the
next level. Hence, the time required for loading data that is not at the top of the FIFO is reduced,
since no flushing is required.
4.6.4 Opcode-driven cache prefetch
An instruction-opcode-based prefetch scheme (IOBP) is proposed by Chi & Lau[68], which performs data
prefetching based on information given by the instruction decode unit. Programs contain predictable
patterns of constant strides in arrays or in pointer references. Such data types are commonly accessed
using an index-displacement addressing mode. Information about these patterns can be extracted from the
instructions used for this kind of reference: load/store-update or load/store-modify instructions.
In addition to loading or storing data, the register used for address calculation is updated with the
calculated effective address of the reference. Executing a load/store-update instruction using index
displacement works as follows: load/store Rt,(Rx+Disp). The effective address (EA) = (Rx) + Disp and Rt
= (EA); when the execution is finished, register Rx is updated with the value of the effective address,
hence Rx = EA. This is normally used to facilitate access to successive data. In the prefetch scheme
the procedure is utilized in order to have Rx prepared to calculate the next expected reference, which
now is EA + Disp. That address is sent to the prefetch unit, which performs the actual prefetch.
These actions are repeated for every issued load/store-update instruction.
A possible problem with this method arises when indexing with a constant stride of 1 while cache blocks
are larger (e.g. by a factor of 4). Issuing a reference to location A will result in a prefetch of location
A+1, which already resides in the cache; hence no prefetch will be issued. This goes on until A+5 is
referenced, which triggers a prefetch. This can result in stall cycles for the processor, which might need
the data faster than the prefetch can provide it. A better approach is to still predict the next reference based
on the data address but prefetch the next data based on cache-block addresses. Thus, when address A+1,
placed in cache block B, is referenced, cache block B+1 is to be prefetched. To further improve performance,
a scheme to prefetch multiple cache blocks can be invoked. The required additional hardware is very
simple, no architectural changes need to be made, and there is no need for new compiler optimizations.
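The address arithmetic of the two policies can be checked with a short sketch. This is only an illustration of the reasoning above, not the IOBP hardware: `BLOCK_SIZE` and the function names are hypothetical, and the factor-of-4 block size follows the example in the text.

```python
BLOCK_SIZE = 4  # words per cache block (the factor-of-4 example above)

def next_data_prediction(rx, disp):
    """Naive policy: predict the next data address, EA + Disp."""
    ea = rx + disp          # effective address; Rx is updated to EA
    return ea + disp        # next expected reference

def next_block_prediction(rx, disp):
    """Improved policy: prefetch the cache block after the one holding EA."""
    ea = rx + disp
    return (ea // BLOCK_SIZE + 1) * BLOCK_SIZE  # start of block B+1
```

With a stride of 1 the naive policy predicts an address that is almost always already cached, while the block-based policy always targets the next block boundary, avoiding the stall described above.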
26 Miss
4.6.5 Stride prefetching
Due to the inability of sequential prefetching techniques to handle strides through non-consecutive memory
blocks, there is a need for a technique that takes advantage of both large- and small-strided array-referencing
patterns. This can be obtained with a special hardware entity which monitors the reference pattern
issued in the processor. Prefetch opportunities are detected by address comparison between consecutive
load/store instructions. When the prefetch hardware detects a predictable reference pattern generated
by a particular load or store, it will start prefetching for that instruction. The monitoring of memory
accesses, prediction of prefetch addresses, and calculation of strides is described in the following example:
during successive loop iterations a memory instruction m references the addresses a1, a2, a3, up to an.
A prefetch is initiated if a2 - a1 = Δ ≠ 0, where Δ represents the stride to use for further accesses. The first
address to be prefetched will be A3 = a2 + Δ, where A3 is the predicted value of address a3. Prediction
continues until An ≠ an.
To record the reference history of the memory instructions during program execution, a special-purpose
cache called the reference prediction table (RPT) is used. Each line in that cache contains the memory instruction
address, the previous address accessed by that instruction, a stride value if any27, and the state of the cache
entry. The RPT is indexed with the program counter.
When an instruction first enters the RPT it is said to be in the initial state and its stride is zero, as it
has not been executed before. When the instruction is executed again, its state is set to transient and a stride
has been calculated. The transient state is an indication that a reference pattern for that instruction is
emerging. The RPT will then issue a prefetch to the address estimated from the instruction's stride and its
previous address. The third time the instruction is executed it is promoted to the steady state,
which indicates that the stride calculated on previous executions is stable. When an incorrect prefetch
is made for an instruction, that instruction is reset to the initial state.
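The initial/transient/steady state machine can be sketched as a single update function. This is a hedged model of the transitions described above, with one RPT entry represented as a plain dictionary; the field and state names are illustrative rather than taken from a specific implementation.

```python
def rpt_update(entry, addr):
    """Update one RPT entry for a new access by its instruction.

    Returns (updated_entry, prefetch_address_or_None).
    """
    if entry is None:                          # first execution of the instruction
        return {"prev": addr, "stride": 0, "state": "initial"}, None
    stride = addr - entry["prev"]
    if entry["state"] == "initial":
        # second execution: a stride exists, a pattern is emerging
        entry.update(prev=addr, stride=stride, state="transient")
        return entry, addr + stride
    if stride == entry["stride"]:              # prediction was correct
        entry.update(prev=addr, state="steady")
        return entry, addr + stride
    # incorrect prediction: reset to the initial state
    entry.update(prev=addr, stride=stride, state="initial")
    return entry, None
```

A constant-stride loop reaches the steady state on its third iteration and then prefetches one stride ahead every iteration; the first irregular access resets the entry, which is why the scheme still pays initial misses and produces useless prefetches at loop exits.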
The RPT scheme performs better than sequential schemes by correctly handling large-stride
arrays[70]. There will still be initial misses before the reference pattern is established, and at the end
of a loop, or for irregular references, this scheme will produce unnecessary prefetches.
4.6.6 Data preloading
Preloading differs from the previous prefetch schemes with regard to what information is used to determine
what to prefetch. All the methods described above rely upon information about previous execution
in order to conduct prefetching. A data preloading scheme proposed by Baer & Chen[72, 118] is instead based
on predictive information from the instruction stream execution. The architecture depends upon a
Branch Prediction Table (BPT); see section 2.2.6 for a more detailed description. The BPT is used to predict
a program's future execution path. A further hardware requirement is a Look-Ahead Program Counter
(LA-PC), used to predict the future of the execution stream; the LA-PC is incremented and modified
using information from the BPT. An RPT as previously described in section 4.6.5 is used, with a few minor changes
(i.e. a new state, no fetch, is added). There is also need for a buffer that holds the addresses of in-progress
or outstanding requests, an Outstanding Request List (ORL). The RPT is used in a similar fashion
as before, but the LA-PC is used for indexing. When the LA-PC comes across a load/store instruction,
the RPT is searched for that instruction. If the instruction resides in the table, three checks must be
performed in order to determine whether a prefetch is to be issued or not. A prefetch will be issued if the
state of the entry is not no fetch and the block to be loaded is not already in the cache or marked in progress in
the ORL. If these checks succeed, a load request is issued and the address is stored in the ORL.
The scheme has proven quite beneficial in environments that exhibit regular access patterns, but performs
only moderately when access patterns are more irregular. The advantages gained from prefetching
can, however, be limited by the memory bandwidth[72].
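The three checks performed when the LA-PC reaches a load/store can be condensed into a small sketch. The cache and ORL are modelled here as plain sets and the function name is illustrative; this is only the decision logic, not the surrounding LA-PC/BPT machinery.

```python
def should_preload(entry_state, predicted_block, cache, orl):
    """Decide whether to issue a preload for an RPT entry found via the LA-PC.

    A prefetch is issued only if all three checks pass; on success the
    predicted block is recorded in the Outstanding Request List (ORL).
    """
    if entry_state == "no fetch":      # 1. the entry must allow fetching
        return False
    if predicted_block in cache:       # 2. block must not already be cached
        return False
    if predicted_block in orl:         # 3. block must not be outstanding
        return False
    orl.add(predicted_block)           # record the in-progress request
    return True
```

The ORL check is what prevents the look-ahead mechanism from flooding the memory system with duplicate requests for a block that is already on its way.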
27 For a stride value to exist it must first have been established, which happens when an instruction is executed the second time.
In multiprocessor systems, preloading of data can originate either from the sender of data (sender-initiated),
usually in producer-consumer situations, or from the receiver of data (receiver-initiated). Sender-initiated
preloads often occur shortly after data has been produced, whereas receiver-initiated preloads, which are the most
frequently used, load data before it is actually needed. A key issue for how far ahead a preload
can be issued in a multiprocessor system is whether the prefetch is binding or non-binding[1]. Binding/non-binding
in this context applies to how the processor sees the data after a prefetch has been conducted.
With a binding preload, data is bound at the time of the actual prefetch: no matter if other processors sharing the same block write to it, the processor issuing the prefetch will see the value as it
was when fetched. The opposite applies for non-binding, where the state of prefetched data is under
constant supervision of the coherence protocol and updated with respect to other processors' interventions
on that block. Timing and "luck" are also important in making prefetching successful[1]. The
use of explicit prefetch instructions (see section 2.7.5) can, in cache-coherent systems, prematurely trigger
a state transition of that block in another cache from exclusive to shared state, which can complicate
the scheme for writes[69]. The use of data preloading in multiprocessor systems can thus complicate cache
coherence. Further, the technique of predicting regular patterns in a uniprocessor system is not inherently
the same as in multiprocessors, where loop iterations can be spread over several CPUs. Memory bandwidth can be affected by increased memory traffic due to incorrect prefetching[72].
In systems with a shared address space but without caching of shared data, there is no need for cache coherence.
Due to the non-caching scheme for shared data, prefetched data is instead stored in a prefetch
buffer from which the processor reads the head entry. The buffer's prefetch depth is adapted to overcome
the latency of the memory system. The actual prefetch is often receiver-initiated, and the preload
instruction that triggers the prefetching must be non-blocking and not stall the processor.
In general, prefetching in shared-memory multiprocessors is a much more complicated task than
prefetching in a uniprocessor with no shared memory. This is compelled by the greater sensitivity of the
memory subsystem and by data-sharing hazards[69]. The fact that shared data can be stored directly in the
processor's cache compels the coherence protocol to involve control of data that is being prefetched.
Prefetching also introduces additional negative effects, especially in bus-based shared-memory multiprocessors, which are
very sensitive to increased memory traffic due to the narrow bandwidth of the bus. This can
make these architectures insensitive to CPU throughput and totally dependent upon memory throughput, which is proportional to the miss rate. Hence, the system's performance will deteriorate with any
prefetch method that increases the miss rate; for the CPU this results in increased execution time as
the memory system saturates[69]. Prefetching of shared data in cache-coherent multiprocessors can
introduce new demands on the entire cache organization. Problems arise when the possible future working set of one processor interferes with the current working set of one or several other processors. Additional
misses will occur when recently prefetched data is invalidated because another processor issues a write
to that block before it is used. The miss rate will also increase when a data block is prefetched
in exclusive mode, generating invalidations among sharers[69].
5 Summary
The evolution of the semiconductor industry has enabled integration of whole systems on a single piece
of silicon. It is the consumer demand for faster, cheaper and smaller products that has forced the
development towards SoC solutions. By combining all the functions into one chip, the system becomes
smaller, faster, and less power-consuming. To decrease time-to-market, SoCs are entirely or partially
built with IP components. Thanks to SoC, a whole new domain of products, like small hand-held
devices, has emerged. The concept has been around a few years now, but there are still challenges
that need to be resolved. There is a lack of standards enabling fast mix and match of cores from
different vendors. Further needs are new design methods, tools, and verification techniques. SoC solutions
need special kinds of CPUs that consume less power and are cheaper and smaller, but still meet high-performance
requirements. To fulfill all these demands, they are getting more and more complex as the number
of transistors rapidly grows, which has led to the emergence of multiprocessor systems-on-a-chip.
Interconnecting all the IP cores in a SoC solution differs from the strategy used in systems-on-board.
Since ordinary interconnects have a different set of constraints than SoC interconnections, they are
unsuitable for connecting components within a SoC. A number of different standards for SoC interconnects
have emerged, all with different strategies to solve the hard task of integration. The memory hierarchy has
been a bottleneck in systems for a while. Unfortunately, the gap between CPU speed and memory
speed continues to grow, leading to larger and larger caches serving memory requests locally. Caches in
multiprocessor systems introduce many difficulties that must be solved. The rate of process refinement
makes it possible to embed more and more of the needed memory on-chip, which in some cases is large
enough for a target application.
References
[1] David E. Culler, Jaswinder Pal Singh, and Anoop Gupta. Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann Inc, San Francisco, California, 1999, ISBN 1-55860-343-3.
[2] Sven Eklund. Avancerad datorarkitektur, Studentlitteratur, Lund, 1994, ISBN 91-44-47671-X.
[3] Global Sources. System-on-a-chip sets new rules in the industry, Global Sources, March 10, 1999.
http://www.globalsources.com/MAGAZINE/EC/9905/SOCREP.HTM
[4] Bill Cordan, Palmchip Corporation. An efficient bus architecture for system-on-chip design, Custom
Integrated Circuits, 1999. Proceedings of the IEEE 1999, pp: 623-626
[5] Sibabrata Ray, Hong Jiang. A reconfigurable bus structure for multiprocessors with bandwidth reuse,
Journal of Systems Architecture 45, 1999
[6] Hammond Lance, Olukotun Kunle. Considerations in the Design of Hydra: A Multiprocessor-on-a-Chip Microarchitecture, Stanford Technical Report CSL-TR-98-749, Stanford University, 1998.
[7] Lars-Hugo Hemert Digitala kretsar, Studentlitteratur, Lund, 1996, ISBN 91-44-00099-5.
[8] John L. Hennessy, David A. Patterson. Computer Architecture: A Quantitative Approach, second edition, Morgan Kaufmann Inc, San Francisco, California, 1996, ISBN 1-55860-329-8.
[9] Vincent P. Heuring & Harry F. Jordan. Computer Systems Design and Architecture, Addison-Wesley,
California, 1997, ISBN 0-8053-4330-X.
[10] Howard Sachs, Mark Birnbaum. VSIA Technical Challenges, Custom Integrated Circuits, 1999. Proceedings of the IEEE 1999, pp: 619-622
[11] Geert Roosseel, Sonics Inc. Decouple core for proper integration eeTimes Jan 3, 2000.
www.eetimes.com.story/OEG20000103S0048
[12] Jon Turino, SynTest Technologies, Inc. Design for Test and Time to Market - Friends or Foes,
Test Conference, 1999. Proceedings. International, 1999, pp: 1098-1101
[13] Lewis, Jeff. Intellectual Property (IP) Components, Artisan Components Inc.
http://www.ireste.fr/fdl/vcl/ip/ip.htm
[14] Olukotun Kunle, Bergman Jules, Kun-Yung Chang and Basem Nayfeh. Rationale, Design and Performance of the Hydra Multiprocessor, Stanford Technical Report CSL-TR-94-645, Stanford University,
1994.
[15] RealChip Custom communication Chips.
System-on-Chips,
http://www.realchip.com/Systems-on-Chips/systems-on-chips.html
[16] Rincon Ann Marie, Cherichetti Cory, Monzel James. A, Stauer David, R, Trick Michael, T.
IBM Microelectronics Corp.
Core Design and System-on-a-Chip Integration, IEEE Design & Test of Computers. Volume: 14 4 ,
Oct.-Dec. 1997, pp: 26{35
[17] Rincon Ann Marie, Lee William. R and Slattery Michael.
IBM Microelectronics Corp.
The Changing Landscape of System-on-Chip Design, Custom Integrated Circuits, 1999. Proceedings of
the IEEE 1999, pp: 83{90
REFERENCES
70
[18] IBM Microelectronics Corp & Synopsys, Inc./Logic Modeling. Design Environment for System-On-A-Chip, Products & Solutions Success Stories
http://www.synopsys.com/products/success/soc/soc wp.html
[19] Wilson, Ron. Is SoC really different?,
eeTimes, November 8, 1999. http://www.eetimes.com/story/OEG19991108S0009
[20] Wilson, Ron. The rest of the SoC task,
eeTimes, October 11, 1999. http://www.eetimes.com/story/OEG19991011S0006
[21] David Patterson. Vulnerable Intel,
The New York Times, June 9, 1998, pp: 44{49
[22] Manfred Schlett. Trends in Embedded-Microprocessor Design,
Computer, August, 1998.
[23] Wulf Wm. A and McKee Sally. A. Hitting the Memory Wall: Implications of the Obvious, Computer
Science Report CS-94-48.
[24] Prince Betty Memory strategies International, USA
Memory in the fast lane, IEEE Spectrum Feb 1994, Vol 31, Issue 2.
PP: 38{41.
[25] Golla C and Ghezzi S.
Flash Memory Architecture, Microelectron Reliab.. Vol 38, No 2, 1998, pp: 179{184.
[26] Abraham S.G, Sugumar R.A, Windheiser D, Rau B. R and Gupta R. Predictability of load/store
instruction latencies, Microarchitecture, 1993, pp: 139-152.
[27] Boland K and Dollas A, AT&T Global Information Predicting and precluding problems with memory
latency IEEE Micro, Aug 1994, Vol 14, Issue 4, pp: 59{67.
[28] Katayama Y, IBM Research, Tokyo, Japan Trends In Semiconductor Memories IEEE Micro, 1997,
Vol 17, Issue 6, pp: 10{17.
[29] Rao R. Tummala, Vijay K. Madisetti System on Chip or System on Package ? IEEE Design & Test
of Computers Volume: 16 2 , April-June 1999 , Page(s): 48 -56
[30] Mark Dorais, ROHM ELECTRONICS Analyze ASIC Designs To Optimize Integration Levels ELECTRONIC DESIGN ONLINE, August 1999
http://devel.penton.com/ed/Pages/magpages/aug0999/digitech/0809dt3.htm
[31] Lage C, Hayden J.D and Subramanian C. Adv. Products Res. & Dev. Lab., Motorola Inc., Austin,
TX, Advanced SRAM technology-the race between 4T and 6T cells Electron Devices Meeting, 1996.,
International Dec 1996, pp: 271{274.
[32] Takai Y, Nagase M, Kitamura M, Koshikawa Y, Yoshida N, Kobayashi Y, Obara T, Fukuzo Y and
Watanabe H. 250 Mbyte/s Synchronous DRAM Using a 3-Stage-Pipelined Architecture IEEE Journal
of Solid-State Circuits. April 1994, Vol 29, Issue 4, pp: 426{431.
[33] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, Lee Todd.
Surviving the SOC Revolution, A Guide to Platform-Based Design, Kluwer Academic Publishers, 1999,
ISBN 0-7923-8679-5.
[34] National Semiconductor, Geode Products
Geode SC1400 (Information Appliance-on-a-Chip)
www.national.com/appinfo/solutions/0,2062,243,00.html
[75] Eggers, S.J.; Emer, J.S.; Levy, H.M., Lo, J.L.; Stamm, R.L.; Tullsen, D.M. Simultaneous multithreading: a platform for next-generation processors IEEE Micro, Volume: 17 5 , Sept.-Oct. 1997 , pp:
12{19
[76] Stefan Pees, Martin Vaupel, Vojin Zivojnovic, Heinrich Meyr On Core and More: A Design Perspective for Systems-on-a-chip
Signal Processing Systems, 1997. SIPS 97 - Design and Implementation., 1997 IEEE Workshop on ,
1997 , pp: 60{63
[77] Design Environment for System-On-A-Chip IBM Microelectronics Corp, Synopsys, Inc. , 1997
http://www.synopsys.com/products/success/soc/soc wp.html
[78] Weiss, A.R. The standardization of embedded benchmarking: pitfalls and opportunities Computer
Design, 1999. (ICCD '99). International Conference on, 1999, pp: 492{508
[79] Peter N. Glaskowsky Silicon Magic:DVINE-LY INSPIRED ? Microprocessor report, March 27, 2000
[80] Roman L. Lysecky, Frank Vahid, Tony D. Givargis Techniques for reducing read latency of core bus
wrapper Proceedings of Design, Automation and Test in Europe 2000, pp:84{91
[81] Lefurgy, C.; Bird, P.; Chen, I.-C.; Mudge, T. Improving code density using compression techniques
Microarchitecture, 1997. Proceedings., Thirtieth Annual IEEE/ACM International Symposium on ,
1997 , pp: 194{203
[82] Lefurgy, C.; Piccininni, E.; Mudge, T. Reducing code size with run-time decompression, High-Performance Computer Architecture, 2000. HPCA-6. Proceedings. Sixth International Symposium on
, 1999, pp: 218-228
[83] IDT Peripheral Bus (IPBus) Intermodule Connection Technology Enables Broad Range of SystemLevel Integration An IDT White Paper www.silicore.net/pdles/idtipbus.pdf
[84] W.A Halang Real-time systems: Another perspective The Journal of Systems and Software. April
1992, pp: 101{108
[85] http://www.sussex.ac.uk/engg/research/vlsi/projects/pibus/
[86] Wade D. Peterson Application Note: WBAN003, Design Philosophy of the WISHBONE SoC Architecture September 7, 1999 www.silicore.net
[87] Lichen Zhang Predictable architecture for real-time systems Information, Communications and Signal
Processing, 1997. ICICS., Proceedings of 1997 International Conference on
Volume: 3 , 1997 , pp: 1761{1765
[88] B. Cogswell and Z. Segall MACS: A predictable architecture for real-time systems IEEE, December
1991, Proceedings of 12th Real-Time Systems Symposium, pp: 296{305
[89] J Turley Evaluating Embedded Processors MICRO Design Resources, Sebastopol, Calif., 1997.
[90] Lennart Lindh, Tommy Klevin. Scalable Architecture for Real-time Applications and use of bus-monitoring, Real-Time Computing Systems and Applications, 1999., pp: 208-211
[91] TriCore Architecture Overview Handbook, release version 1.3.0, 1999, Infineon Technologies Corp.
www.infineon.com/us/micro/tricore/sub arc.htm
[92] FISPbus Foundation Library Mentor Graphics, 1998 http://www.mentor.com/inventra/cores/catalog/sp bus peripheral
[93] www.silicore.net
[94] Wade D. Peterson Application Note: WBAN003. Design Philosophy of the WISHBONE SoC Architecture September 7, 1999, www.silicore.net
[95] Wishbone Interconnection for Portable IP Cores, Specication Revision A www.silicore.net
[96] Ann Harwood Motorola's Peripheral Interface Standards Embedded Processor Forum, May 1999
http://www.mot.com/SPS/MCORE/downloads/nal-epf.pdf
[97] Marc Torrant
Simultaneous Multithreading Presentation, May 14, 1999
http://www.rit.edu/ mxt8837/thesis/Overview 5 14 99/index.htm
[98] MCORE Reference Manual, Motorola Inc., 1997
[99] Triscend homepage: http://www.triscend.com
[100] Tensilica homepage: http://www.tensilica.com
[101] ARC homepage: http://www.arccores.com
[102] Transmeta homepage: http://www.transmeta.com
[103] Rudolph, L & Segall, Z. Dynamic Decentralized Cache for MIMD Parallel Processors. Proceedings
of 11th ISCA, 1984, pp: 340{347.
[104] A Cache Coherence Approach for Large Multiprocessor Systems. Proceedings of Supercomputing
Conf, IEEE, 1988, pp:337{345.
[105] Tyson, G.; Farrens, M.; Pleszkun, A.R. Misc: A Multiple Instruction Stream Computer. Microarchitecture, 1992. MICRO 25., Proceedings of the 25th Annual International Symposium on , 1992 ,
pp: 193{196
[106] Sweazey, P & Smith, A. J. A Class of Compatible Cache Consistency Protocols and their Support
by the IEEE Futurebus. Proceedings of 13th ISCA, 1986, pp: 414-423.
[107] Glasco, D. B. Design and Analysis of Update-Based Cache Coherence Protocols for Scalable Shared-Memory Multiprocessors. Technical Report No. CSL-TR-96-670, 1995.
[108] Sweazey, P. VLSI support for copyback caching protocols on Futurebus. Computer Design: VLSI in
Computers and Processors, 1988. ICCD '88., Proceedings of the 1988 IEEE International Conference,
1988, pp: 240{246.
[109] Compcon Spring '88. Thirty-Third IEEE Computer Society International Conference, 1988, pp:
505{511. Shared memory systems on the Futurebus
[110] James, D.V, Laundrie, A.T, Gjessing, S, & Sohi, G.S. Distributed-directory scheme: Scalable Coherent Interface. IEEE Computer, June 1990, Vol. 23, Issue 6, pp: 74-77.
[111] Thapar, M, & Delagi, B. Distributed-directory scheme: Stanford distributed-directory protocol. IEEE
Computer, June 1990, Vol. 23, Issue 6, pp: 78{80.
[112] Chaiken, D, Fields, C, Kurihara, K, & Agarwal, A. Directory-based cache coherence in large-scale
multiprocessors. IEEE Computer June 1990, Vol 23, Issue 6, pp:49{58.
[113] Takahashi, M, Takano, H, Kaneko, E, & Suzuki, S. A shared-bus control mechanism and a cache
coherence protocol for a high-performance on-chip multiprocessor. Proceedings of High-Performance
Computer Architecture Symposium 1996, pp: 314-322.
[114] Thacker, C.P, Stewart, L.C, & Satterthwaite, E.H., Jr. Firefly: a multiprocessor workstation. IEEE
Transactions on Computers, Vol. 32, Issue 8, Aug 1988, pp: 909-920.
[115] Tomasevic, M, & Milutinovic, V. Hardware approaches to cache coherence in shared-memory multiprocessors, Part 1. IEEE Micro, Vol. 14 Issue 5, Oct. 1994, pp: 52{59.
[116] Tanenbaum, Andrew. S. Distributed Operating Systems, Prentice-Hall, 1995, ISBN 0-13-143934-0.
[117] Kuskin, J, Ofelt, D, Heinrich, M, Heinlein, J, Simoni, R, Gharachorloo, K, Chapin, J, Nakahira, D,
Baxter, J, Horowitz, M, Gupta, A, Rosenblum, M, & Hennessy, J. The Stanford FLASH multiprocessor.
Proceedings of Computer Architecture, 1994, pp: 302{313.
[118] Tien-Fu Chen. An effective programmable prefetch engine for on-chip caches. Proceedings of Microarchitecture, 1995, pp: 237-242.
[119] Lenoski, D, Laudon, J, Gharachorloo, K, Gupta, A, Hennessy, J. The directory-based cache coherence protocol for the DASH multiprocessor. Proceedings of Computer Architecture, 1990. pp: 148{159.
[120] Dahlgren, F, & Torrellas, J. Cache-only memory architectures. IEEE Computer, Vol 32, Issue 6,
June 1999, pp: 72-79.
[121] Hennessy, J, Heinrich, M, & Gupta, A. Cache-coherent distributed shared memory: perspectives
on its development and future challenges. Proceedings of the IEEE Vol. 87 Issue 3 , March 1999, pp:
418{429
[122] www.sonicsinc.com
[123] Drew Wingard, Alex Kurosawa Integration Architecture for System-on-a-Chip Design Custom Integrated Circuits Conference, 1998, Proceedings of the IEEE 1998 pp: 85{88
[124] A. John Anderson Multiple Processing, A systems overview Prentice Hall, 1989, ISBN 0-13-605213-4
[125] Govindan Ravindran, Michel Stumm Performance Comparison of Hierarchical Ring- and Meshconnected Multiprocessor Networks High-Performance Computer Architecture, 1997., Third International Symposium on , 1997 , pp: 58{69
[126] Gasbarro, J.A. The Rambus memory system. International Workshop on, Memory Technology,
Design and Testing, 1995, pp:94{96.
[127] Philofsky, E.M. FRAM-the ultimate memory. IEEE International Nonvolatile Memory Technology
Conference, 1996, pp: 99{105.
[128] McFarling, S. Cache replacement with Dynamic Exclusion ACM 1992.
[129] Gillingham, P. MOSAID Technologies Inc. SLDRAM Architectural and Functional Overview. SLDRAM Consortium 29 Aug 1997.
[130] Kanishka Lahiri, Anand Raghunathan, Sujit Dey Fast Performance Analysis of Bus-Based Systemon-Chip Communication Architectures IEEE/ACM International Conference on Computer-Aided Design, 1999 , pp: 566{572
[131] Alpha Systems - Compaq's commitment to Alpha http://www.compaq.com/alphaserver/news/commit letter.html
[132] W.J. Bainbridge Asynchronous Macrocell Interconnect Using Marble Proceedings of the Fourth
International Symposium on Advanced Research in Asynchronous Circuits and Systems, 1998, pp:
122{132
[133] Eyre, J.; Bier, J. DSP processors hit the mainstream Computer Volume: 31 8 , Aug. 1998 , pp:
51{59
[134] Milenkovic, M. Microprocessor memory Management Units. IEEE Micro. April 1990. pp: 70{85.
SoCrates
- Specifications
Revision: 0.99
Authors
Abstract
This document contains the specifications for the SoCrates Configurable Platform, a SoC multiprocessor system intended for a single FPGA. This is the second of three documents that form our Master Thesis in Computer Engineering.
Contents

1 System Architecture
  1.1 Functionality
    1.1.1 Functionality demands
  1.2 Implementation
    1.2.1 Electrical Interface
    1.2.2 Programming of internal registers
    1.2.3 Explicit demands
    1.2.4 Motivation of design choices/trade-offs
    1.2.5 Future work
  1.3 Testing
    1.3.1 How to test
    1.3.2 What to test

2 CPU Node
  2.1 Functionality
    2.1.1 Motivation for the component
    2.1.2 Functionality demands
    2.1.3 Interactions with other components
  2.2 Implementation
    2.2.1 Collaborating components
    2.2.2 Electrical Interface
    2.2.3 Future work
    2.2.4 Testing
    2.2.5 What to test
    2.2.6 Test environment
    2.2.7 Test methodologies

3 CPU
  3.1 Functionality
    3.1.1 Motivation for the component
    3.1.2 Functionality demands
    3.1.3 Interactions with other components
  3.2 Implementation
    3.2.1 Electrical Interface
    3.2.2 State machines/pseudo code
    3.2.3 Programming of internal registers
    3.2.4 Explicit demands
    3.2.5 Motivation of design choices/trade-offs
    3.2.6 Future work
  3.3 Testing
    3.3.1 How to test
    3.3.2 What to test
    3.3.3 Test environment
    3.3.4 Test methodologies

4 Network Interface
  4.1 Functionality
    4.1.1 Motivation for the component
    4.1.2 Functionality demands
    4.1.3 Interactions with other components
  4.2 Implementation

5 IO Node
  5.1 Functionality
    5.1.1 Motivation for the component
    5.1.2 Functionality demands
    5.1.3 Interactions with other components
    5.1.4 Motivation of design choices/trade-offs
    5.1.5 Future work

6 Interconnect
  6.1 Functionality
    6.1.1 Motivation for the component
    6.1.2 Functionality demands
    6.1.3 Interactions with other components
  6.2 Implementation
    6.2.1 Electrical Interface
    6.2.2 Bus protocols
    6.2.3 Read cycle
    6.2.4 Write cycle
    6.2.5 Explicit demands
    6.2.6 Motivation of design choices/trade-offs
    6.2.7 Future work
  6.3 Testing
    6.3.1 Test environment
    6.3.2 How to test
    6.3.3 What to test

7 Arbitration
  7.1 Functionality
    7.1.1 Motivation for the component
    7.1.2 Functionality demands
    7.1.3 Interactions with other components
  7.2 Implementation
    7.2.1 Electrical Interface
    7.2.2 Request queue
    7.2.3 Programming of internal registers
    7.2.4 Explicit demands
    7.2.5 Motivation of design choices/trade-offs
    7.2.6 Future work
  7.3 Testing
    7.3.1 How to test
    7.3.2 What to test
    7.3.3 Test environment

8 Boot
  8.1 Functionality
    8.1.1 Motivation for the component
    8.1.2 Functionality demands
    8.1.3 Interactions with other components
    8.1.4 Motivation of design choices/trade-offs
    8.1.5 Future work
    8.1.6 Test methodologies

9 Memory Wrapper
  9.1 Functionality
    9.1.1 Motivation for the component
    9.1.2 Functionality demands
    9.1.3 Interactions with other components
  9.2 Implementation
    9.2.1 Entity/Interface
    9.2.2 Explicit demands
    9.2.3 Motivation of design choices/trade-offs
  9.3 Testing
    9.3.1 How to test
    9.3.2 What to test
    9.3.3 Test methodologies
1 System Architecture

1.1 Functionality
The system contains one or more processing units, a real-time unit, an interconnect, and peripheral components (figure 1). The system can be booted by loading code from an external source via the parallel port. During the download phase, each node is halted. When the download is completed, the I/O node "releases" each node by broadcasting a signal telling each node to begin its execution.
A distributed shared no-cache memory allows threads to communicate with each other. The global address space is 32 bits wide, which means that memory can be addressed from 00000000h to FFFFFFFFh. A 32-bit address consists of an 8-bit base concatenated with a 24-bit offset. The base is a one-hot coded unique identification which divides the global address space into several local address spaces. Each local address space ranges from 000000h to FFFFFFh.
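The base/offset decomposition can be sketched as follows (a minimal illustration; the helper names are ours, not part of the specification):

```python
def split_address(addr):
    """Split a 32-bit global address into its 8-bit one-hot base and 24-bit offset."""
    base = addr >> 24 & 0xFF
    offset = addr & 0xFFFFFF
    return base, offset

def node_from_base(base):
    """Recover a node index from a one-hot base (illustrative only)."""
    assert base != 0 and base & (base - 1) == 0, "base must be one-hot"
    return base.bit_length() - 1
```

For example, `split_address(0x04001234)` yields base 04h (node index 2) and offset 001234h within that node's local address space.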
Figure 1: System overview — CPU/DSP nodes, each with local memory (MEM) and a network interface (NI), together with an I/O node, connected by the shared interconnect.
The FPGA that is used is a Xilinx XCV1000, which has 1,124,022 gates and 131,072 bits of RAM organised in 4 Kbit dual-port RAM clusters. The first version of SoCrates will use two processing nodes, meaning that the available RAM will be divided into two parts of 65,536 bits each (8,192 bytes), which results in a physical address space from 000000h to 002000h. Each processing node uses the local memory address space for an exception vector table, a TCB address table, a list of TCBs, thread segments (code, data, stack), Network Interface (NI) registers, I/O registers, and Real-Time Unit (RTU) registers (figure 2).
1.2 Implementation
Figure 2: Memory map of processing nodes 1 to N. From 000000h upward each node holds TCB blocks #1..#N, exception code, user code, user data, and shared data; the Network Interface registers are mapped between the physical limit at 002000h and 002100h.
1.3 Testing
2 CPU Node

2.1 Functionality
The node serves as a container component for the components necessary to create the smallest environment in which the CPU can perform its task. The whole system consists of at least one such node together with other nodes having different functionality and responsibilities; see section 1 for the whole-system specification.
2.2 Implementation

The node implementation consists of making instances of the participating components and connecting them into a functional unit.
NI can be viewed as a component with three different interfaces handling requests and responses from different parts of the system:
- CPU-interface: handles requests/responses from/to the processor (see section 4.2.1 or figure 3).
- MEM-interface: handles all strobes that control the local memory on the node. Addresses originating from a remote access, and data to be delivered to a remote node, are also handled; see section 4.2.1 or figure 3 for further information.
- EXT-interface: the node's link to the rest of the system. Outgoing requests and responses are handled according to section 4.2.1 and figure 3.
CPU: The CPU performs requests to the NI and memory when issuing load, store, swap, and prefetch instructions. Strobes and control signals are transmitted to the NI, while address and data are driven onto data_CPU and address_CPU, which are available to both the memory wrapper and the NI.
Memory wrapper: The memory wrapper serves as an interface to the on-chip RAM blocks. This gives a general interface between the memory, NI, and CPU that is independent of how the on-chip memory is accessed.
Dual-ported memory: A dual-ported memory is attached to the NI via the memory wrapper and to the CPU as shown in figure 3, to avoid a decrease in available bandwidth when both a remote NI and the local CPU are accessing the memory. The protocol between CPU and memory is true SRAM zero wait-state. The NI inserts wait-states via nWait if it cannot satisfy a zero wait-state transaction.
Figure 3: Node structure — CPU, network interface, memory wrapper, and dual-ported memory. The NI's external interface carries data, address, bus_request, bus_grant, ad_strobe, ack, rw, mas, mode, and berr; internally the components are connected via lock, rw_CPU, rw_NI, cs_CPU, cs_NI, nwait, trans, irq, and the data_CPU/address_CPU and data_NI/address_NI buses. Signals and electrical interfaces are described in section 4.2.6.
External interface signals (widths and descriptions are given in section 4.2.6):

Signal Name   Type
data          InOut
address       InOut
bus_request   Out
bus_grant     In
ad_strobe     InOut
ack           InOut
rw            InOut
mode          InOut
berr          InOut
irq           In
2.2.4 Testing

2.2.5 What to test

Internal and external processor-initiated transactions must be verified for correct behavior. Concurrent requests and responses must be simulated.
3 CPU

3.1 Functionality
The instruction set is augmented with a prefetch instruction. The instructions performing multiplication do not have to be implemented in this version of the processor. The processor consists of several internal components (figure 4):
- A register file consisting of 20 32-bit registers
- A 32-bit ALU
- A barrel shifter
- An address increment unit
- A control unit
- An exception handler
The register file contains 20 registers, where R0-R14 are general-purpose registers (R15 is used as a program counter, see figure 5). There is a seventeenth register, the Current Processor Status Register (CPSR), that holds status information. The rest of the registers are banked, according to the ARM organization. The figure also shows which registers are visible when operating in either User or Supervisor mode. The CPSR register is organized in the following way:
Bit(s)  Symbol  Description
31      N       Negative/Less than
30      Z       Zero
29      C       Carry/Borrow/Extend
28      V       Overflow
27-8    -       Reserved
7       I       IRQ disable
6       F       Fast Interrupt Request (FIQ) disable
5       T       State bit
4-0     M4-M0   Mode bits
The CPSR is identical to the ARM CPSR register; further explanation of the semantics of each bit can be found in the ARM7TDMI manual, section 3.8.
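As a sketch, the fields in the table above can be unpacked like this (the function name is ours; the bit positions follow the table):

```python
def decode_cpsr(cpsr):
    """Unpack the CPSR fields per the bit layout in the table above."""
    return {
        "N": cpsr >> 31 & 1,   # Negative/Less than
        "Z": cpsr >> 30 & 1,   # Zero
        "C": cpsr >> 29 & 1,   # Carry/Borrow/Extend
        "V": cpsr >> 28 & 1,   # Overflow
        "I": cpsr >> 7 & 1,    # IRQ disable
        "F": cpsr >> 6 & 1,    # FIQ disable
        "T": cpsr >> 5 & 1,    # State bit
        "mode": cpsr & 0x1F,   # Mode bits M4-M0
    }
```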
Figure 4: CPU internal structure — register bank (18 32-bit registers, 2 status registers), address register and address incrementer, exception handler, instruction decoder and control logic, barrel shifter, and 32-bit ALU, connected to the address and data buses.
Figure 5: Register organization in User and Supervisor modes. R0-R12, R15 (PC), and CPSR are shared between the modes; in Supervisor mode R13 and R14 are replaced by the banked R13_svc and R14_svc, and the banked SPSR_svc is added.
Mnemonic  Instruction                        Action
ADC       Add with carry                     Rd := Rn + Op2 + Carry
ADD       Add                                Rd := Rn + Op2
AND       AND                                Rd := Rn AND Op2
B         Branch                             R15 := address
BIC       Bit Clear                          Rd := Rn AND NOT Op2
BL        Branch with Link                   R14 := R15, R15 := address
CMN       Compare Negative                   CPSR flags := Rn + Op2
CMP       Compare                            CPSR flags := Rn - Op2
EOR       Exclusive OR                       Rd := (Rn AND NOT Op2) OR (Op2 AND NOT Rn)
LDM       Load multiple registers            Stack manipulation (Pop)
LDR       Load register from memory          Rd := (address)
MOV       Move register or constant          Rd := Op2
MRS       Move PSR status/flags to register  Rn := PSR
MSR       Move register to PSR status/flags  PSR := Rm
MVN       Move negative register             Rd := NOT Op2
ORR       OR                                 Rd := Rn OR Op2
RSB       Reverse Subtract                   Rd := Op2 - Rn
RSC       Reverse Subtract with Carry        Rd := Op2 - Rn - 1 + Carry
SBC       Subtract with Carry                Rd := Rn - Op2 - 1 + Carry
STM       Store Multiple                     Stack manipulation (Push)
STR       Store register to memory           (address) := Rd
SUB       Subtract                           Rd := Rn - Op2
SWP       Swap register with memory          Rd := [Rn], [Rn] := Rm
TEQ       Test bitwise equality              CPSR flags := Rn EOR Op2
TST       Test bits                          CPSR flags := Rn AND Op2
PRF       Prefetch (SoCrates extension)      Initiate prefetch of the addressed data
Instruction formats: bits 31-28 hold the condition field, with the op-code and operands in bits 27-0. The prefetch instruction uses the same condition field, with bit 27 set and the prefetch address in the remaining bits.
The endian configuration of the system does not affect reads or stores if only words are used; it only matters when half-words and bytes are used. The following two tables show the appropriate actions when issuing an LDRH and an LDRB:

A[1:0]  Little Endian (BIGEND=0)  Big Endian (BIGEND=1)
00      D[15:0]                   D[31:16]
10      D[31:16]                  D[15:0]

Endian effects for 16-bit fetches (LDRH)

A[1:0]  Little Endian (BIGEND=0)  Big Endian (BIGEND=1)
00      D[7:0]                    D[31:24]
01      D[15:8]                   D[23:16]
10      D[23:16]                  D[15:8]
11      D[31:24]                  D[7:0]

Endian effects for 8-bit data fetches (LDRB)
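The LDRB table can be condensed into a small lane-selection helper (a sketch; the name is ours). It returns which data-bus bits an 8-bit fetch uses:

```python
def ldrb_lane(address, bigend=False):
    """Return the (high, low) data-bus bits a byte fetch at this address
    uses, following the LDRB table above."""
    lane = address & 3   # A[1:0] selects the byte lane
    if bigend:           # BIGEND=1 mirrors the byte lanes
        lane = 3 - lane
    return 8 * lane + 7, 8 * lane
```

For instance, `ldrb_lane(2)` gives `(23, 16)`, matching row 10 of the little-endian column.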
Example: for the data word AABBCCDDh (bits 31-0), a byte load from byte address 0 returns DD in little-endian mode (D[7:0]) and AA in big-endian mode (D[31:24]).
3.2 Implementation

The CPU's external signals (nM, nRW, nWait, nRESET, ABORT, nIRQ, nMREQ, LOCK, and MAS) are described below.
nMode This bus indicates the mode in which the processor is operating. A LOW signal indicates supervisor mode and HIGH indicates user mode.
nRW This signal indicates whether the processor wants to write or read data. When HIGH, this signal
indicates a write cycle; when LOW, a read cycle.
nWait This signal is used when accessing slow peripherals, to let the processor wait for a number of clock cycles. This is achieved by driving nWait LOW. If nWait is not used it must be tied HIGH. In SoCrates, the nWait signal is used to delay processor execution while an external transaction completes.
nReset This signal triggers a hardware reset of the CPU. A LOW level will cause the instruction being
executed to terminate abnormally. When nReset becomes HIGH for at least one clock cycle, the
processor will re-start from address 0. nReset must remain LOW (and nWait must remain HIGH)
during reset.
ABORT This is an input which allows the memory system to tell the processor that a requested access
is not allowed.
nIRQ Must be taken LOW to interrupt the processor when the appropriate enable is active.
nMREQ When LOW, this signal indicates that the processor requires a memory access.
LOCK When LOCK is HIGH, the processor is performing a "locked" memory access, and the memory controller must wait until LOCK goes LOW before allowing another device to access the memory.
There are eight exceptions supported by the ARM7TDMI processor. Only a few of these will be implemented (data abort, reset, undefined instruction, SWI, IRQ). The routines are pointed out by interrupt vectors in the following way (ARM7TDMI):
Address     Exception
0x00000000  Reset
0x00000004  Undefined Instruction
0x00000008  Software Interrupt (SWI)
0x0000000C  Abort (prefetch)
0x00000010  Abort (data)
0x00000014  Reserved
0x00000018  IRQ
0x0000001C  Fast Interrupt Request (FIQ)
When there are multiple exceptions, a fixed priority system determines the order in which they are handled (highest priority first):
1. Reset
2. Data abort
3. FIQ
4. IRQ
5. Prefetch abort
6. Undefined instruction
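The fixed priority order can be modelled as a simple lookup (illustrative only; the exception names used here are ours):

```python
# Highest priority first, per the list above.
PRIORITY = ["reset", "data_abort", "fiq", "irq", "prefetch_abort", "undefined"]

def next_exception(pending):
    """Return the pending exception the processor takes first, or None."""
    for exc in PRIORITY:
        if exc in pending:
            return exc
    return None
```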
Figure: Instruction decoding — bit patterns in the instruction word (condition field in bits 31-28, then the op-code bits downward) distinguish data processing, MSR/MRS, multiply (MUL/MULL with optional accumulate), SWP, half-word immediate and register transfers, block data transfers, branches (B, BL, BX), co-processor instructions, SWI, and address-alignment checks.
3.3 Testing
4 Network Interface

4.1 Functionality
The Network Interface (NI) handles communication both locally, internal to the node, and transfers between different nodes.
- Local Load/Store: an access to memory that is physically located at the same node as the requesting CPU.
- Remote Load/Store: the reference cannot be handled locally, resulting in a transaction of data on the global interconnect.
- Prefetch of data: the NI must be able to handle prefetch requests from the local CPU and to serve prefetch requests from other CPUs. The NI will assume that all prefetches are to non-local memory locations.
Further, at boot of the system, the NI must halt the CPU to let the I/O node write to the memories before starting execution.
4.2 Implementation

As many of the actions as possible should be done in parallel. The local CPU must be able to access the memory simultaneously with accesses coming through the interconnect. All bidirectional signals must be three-stated whenever possible, to avoid signals being driven from two or more sources, which could result in unpredictable behavior.
- When the NI detects a read from local memory, initiated by the local CPU, it controls the chip select and write-enable strobes to the memory wrapper. The memory responds the next cycle and delivers the requested data to the CPU; at the same time the NI resets the strobes to the memory wrapper.
- If the access is a write to local memory, the NI controls the chip select and write-enable to the memory wrapper. No response signals are sent to either the NI or the CPU. The NI resets the strobes to the memory wrapper in the same manner as described above.
- In the case of a read from remote memory, the NI checks whether the address matches the contents of the prefetch address register. In case of a match, and if the valid bit is set, the NI can deliver the data to the local CPU directly from its prefetch register. If the address matches and the valid bit is not set, the NI uses the interconnect as described in section 6.2.3. After the interconnect has delivered the data to the requesting NI, it is forwarded to the local CPU. The cycle after the data is forwarded, the signals to the CPU are reset.
- For a write to remote memory, the NI performs the write according to the protocol described in section 6.2.4. After completion the NI resets the signals to the CPU.
- A prefetch of data is initiated by writing the address of the data into the prefetch register located in the NI. The actions to fetch the data are identical to the case of a read from a remote node (see section 6.2.3), except that the delivered data that comes via the interconnect is written to the internal prefetch data register instead of being forwarded directly to the local CPU. If the NI is busy with an activity that requires communication via the interconnect and a new remote read, remote write, or prefetch is issued during that time, the new action will be delayed until the current transaction is finished.
- A remote read comes via the interconnect and is therefore initiated by another node. The NI sets the address and strobes to the memory wrapper and reads the contents of the memory the following cycle. The response on the interconnect to this request is described in section 6.2.3. The strobes to the memory wrapper are reset one cycle after the memory access. The same delay rule as above applies if the NI is busy with an interconnect transaction.
- If a remote write is requested via the interconnect, the NI sets the address, data, and strobes to the memory wrapper. The following cycle the data is written to the memory; the response on the interconnect is described in section 6.2.4. The strobes to the memory wrapper are reset one cycle after the memory access. The same delay rule applies if the NI is busy with an interconnect transaction.
Table: NI events and actions. Event sources are the local CPU (local read, local write, remote read, remote write, prefetch) and the interconnect (remote read, remote write); the action to perform for each event is described above.
If the access is a locked remote read or write, the global interconnect must be locked by not giving up bus mastership until the lock signal is lowered by the CPU. There is no lock signal on the global interconnect, but the protocol described in section 6.2.2 allows a bus lock by not releasing the bus request signal.
The NI's memory-interface signals are data_NI, address_NI, rw_CPU, rw_NI, cs_CPU, cs_NI, clk, and reset. rw_CPU and rw_NI indicate read or write mode (0=read, 1=write); cs_CPU is the chip select for local accesses and cs_NI is the chip select for accesses from other nodes. The data and address buses are GLOBAL_DATA_WIDTH and GLOBAL_ADDRESS_WIDTH bits wide, respectively; the interconnect-side request and grant vectors are NOF_NODES bits wide.

The NI contains two memory-mapped registers:

Memory location  Width    Description
2000h            32 bits  Prefetch address register (address for the data)
2004h            1 bit    CPU halt register
Prefetch of data to the node is initiated by the CPU by writing an address to the prefetch register. After fetching the data via the interconnect, it is stored in the internal prefetch data register. If a new address is written to the prefetch address register before the former prefetch has been completed, the NI will still fulfill the ongoing prefetch. The new prefetch will start after completion of the ongoing one, and the internal prefetch data register will be overwritten with the newly fetched data.
At reset of the system, nWaitAlwaysActive is set to 1, which causes the NI to assert the nWait signal to the CPU and thus halt it. The nWaitAlwaysActive register is reset to zero when the boot sequence has finished, by a broadcast from the I/O node to all NIs in CPU nodes. All registers are both readable and writable from both the interconnect and the local CPUs.
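A behavioural sketch of the prefetch path (register address from the table above; the class and method names are ours, and the interconnect is reduced to a callback):

```python
class NIPrefetch:
    """Toy model of the NI prefetch path: 2000h holds the prefetch address."""
    PREFETCH_ADDR_REG = 0x2000

    def __init__(self, remote_read):
        self.remote_read = remote_read  # models a fetch over the interconnect
        self.addr = None
        self.data = None
        self.valid = False

    def write_prefetch(self, addr):
        # Writing the register starts the fetch; here it completes immediately.
        self.addr = addr
        self.data = self.remote_read(addr)
        self.valid = True

    def cpu_read(self, addr):
        # A remote read that hits the valid prefetch register is served locally.
        if self.valid and addr == self.addr:
            return self.data
        return self.remote_read(addr)
```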
5 IO Node

5.1 Functionality
6 Interconnect

6.1 Functionality
The interconnect moves data between two arbitrary nodes via their NIs.
6.2 Implementation

The interconnection network consists of a shared bus with transaction protocols for reads and writes. When using a shared-bus solution with multiple masters, arbitration is needed to resolve possible contention between several potential masters. Arbitration is done centrally by an arbitration unit described in section 7.
Bus signals:

Signal       Width                            Type   Description
bus_request  NOF_NODES-1 downto 0             In     Request for bus mastership
bus_grant    NOF_NODES-1 downto 0             Out    Indicates which node has the bus mastership
data         GLOBAL_DATA_WIDTH-1 downto 0     InOut
address      GLOBAL_ADDRESS_WIDTH-1 downto 0  InOut
rw           1                                InOut  Indicates read or write mode. 0=read, 1=write
ad_strobe    1                                InOut  Validity of address and data
ack          1                                InOut  Acknowledge for transactions
mas          1 downto 0                       InOut  Size of transfer
berr         1                                InOut  Bus error - transaction not performed
mode         1                                InOut  0=user mode, 1=supervisor mode
lock         1                                InOut  1=locked transaction, 0=unlocked transaction
clk          1                                In     Global clock
Figure 12: Read cycle, showing clk, bus_request(x), bus_grant(x), ad_strobe, address, data, mas, rw, and ack. Signals with levels between high and low are considered three-stated. The number of cycles between request and grant, n, depends on the arbiter and its arbitration method. The implementation of the NI and the local node sets the number of cycles, m, between a request for data and the response of data.
When the destination node's NI has fetched the data from the bus, it sets the ack signal (4). Detecting the ack, the initiator lowers ad_strobe and three-states address, data, and rw.
Figure 13: Write cycle, showing clk, bus_request(x), bus_grant(x), ad_strobe, address, data, mas, rw, and ack. Signals with levels between high and low are considered three-stated.
in parallel by multiplexing address and data onto the two general buses. This would need a new kind of
arbitration unit and protocol.
6.3 Testing
7 Arbitration

7.1 Functionality
The arbiter supervises the potential bus masters, granting and dividing the work on the bus.
The NI of each node interacts with the centralized arbiter. Communication goes both ways via the bus_request and bus_grant signals of the bus protocol; see section 6.2.2.
7.2 Implementation

Signal       Width                 Type
bus_request  NOF_NODES-1 downto 0  In
bus_grant    NOF_NODES-1 downto 0  Out
clk          1                     In
reset        1                     In
Figure 14: Request, priority, and response ordering for nodes N1 ... Nn.
7.3 Testing
Every node (NI) issues a single request without intervention from other nodes; the correct outcome is a bus_grant signal to that node for as long as the NI holds bus_request high. Two or more network interfaces issue their bus_request signals simultaneously, in any arbitrary order. For a correct result, the nodes are to receive their bus_grant in order of descending node IDs. Tests where all nodes request the bus simultaneously must also be performed.
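A minimal software model of the fixed-priority behaviour this test expects (assuming, as the descending-ID order implies, that among simultaneous requests the highest node ID is granted first; the function names are illustrative):

```python
# Fixed-priority arbitration sketch: among simultaneous bus_request lines,
# the pending node with the highest ID receives bus_grant, matching the
# "descending node ID" order expected by the arbiter test above.

def arbitrate(requests):
    """requests: set of node IDs with bus_request high.
    Returns the ID granted the bus, or None when no one requests."""
    return max(requests) if requests else None

def grant_order(requests):
    """Order in which simultaneously requesting nodes are served,
    assuming each node releases the bus after one transaction."""
    pending, order = set(requests), []
    while pending:
        winner = arbitrate(pending)
        order.append(winner)
        pending.remove(winner)
    return order

# Nodes 1..4 request simultaneously: served in descending ID order.
assert grant_order({1, 2, 3, 4}) == [4, 3, 2, 1]
# A single requester is granted as long as it requests.
assert arbitrate({2}) == 2
```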
8 Boot

8.1 Functionality
9 Memory Wrapper

9.1 Functionality
9.2 Implementation

Smaller memory primitives are instantiated and merged together with the generate/port map command. To suit the ARM, a special wrapper is implemented which has one CS and four WE signals. Four byte-wide memories are instantiated. The data buses are three-stated when the memory is not addressed and no write enable is active.
9.2.1 Entity/Interface

generics: address width, data width
External Interface: NI-Interconnect

Signal Name    Width              Type
clk            1                  In
reset_n        1                  In
we_31_24_NI    1                  In
we_23_16_NI    1                  In
we_15_8_NI     1                  In
we_7_0_NI      1                  In
we_31_24_CPU   1                  In
we_23_16_CPU   1                  In
we_15_8_CPU    1                  In
we_7_0_CPU     1                  In
cs_NI          1                  In
cs_CPU         1                  In
address_NI     depth-1 downto 0   InOut
address_CPU    depth-1 downto 0   InOut
data_NI        31 downto 0        InOut
data_CPU       31 downto 0        InOut
Strobes are active high. Data is bidirectional; direction is controlled by WE. The number of WE signals is defined by (data width / 8), because of the ARM's demand for separately controllable byte lanes.
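The byte-lane gating can be sketched as follows (a behavioural model only; the VHDL instantiates four byte-wide memory primitives, which a dictionary of words stands in for here):

```python
# Sketch of the byte-lane write logic: four byte-wide memories, each gated
# by its own write enable, as required by the ARM's separately controllable
# byte lanes (data_width / 8 = 4 lanes for a 32-bit bus).

DATA_WIDTH = 32
N_LANES = DATA_WIDTH // 8  # one WE per byte lane

def write(mem, address, data, we):
    """mem: dict address -> 32-bit word.
    we: list of 4 bools, we[0] = we_7_0 ... we[3] = we_31_24."""
    word = mem.get(address, 0)
    for lane in range(N_LANES):
        if we[lane]:
            mask = 0xFF << (8 * lane)
            word = (word & ~mask) | (data & mask)
    mem[address] = word

mem = {0x0: 0xAABBCCDD}
# Byte store to lane 0 only: the upper three bytes are preserved.
write(mem, 0x0, 0x000000EE, we=[True, False, False, False])
assert mem[0x0] == 0xAABBCCEE
# Word store: all four lanes enabled.
write(mem, 0x0, 0x11223344, we=[True] * 4)
assert mem[0x0] == 0x11223344
```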
9.3 Testing
SoCrates - Implementation details

Authors

Abstract

This document is the result of a Master Thesis in Computer Engineering, describing the implementation of the first prototype of SoCrates, a configurable platform for a System-on-Chip multiprocessor system. It fits on a single million-gate FPGA, including two processing nodes, memory, embedded software, an IO unit, and a Real-Time Hardware Unit. A detailed overview is given of the implemented CPU core, a non-pipelined clone of the ARM7TDMI processor, and of the Network Interface. A description is given of how to link an application, configure the system, and simulate it, before the final section, where suggestions for future work are given. The document also includes preliminary synthesis results.
Contents

1 CPU
2 Network Interface
   2.1 Address Decoder
   2.2 Control Unit
   2.3 Prefetch Buffer
   2.4 Sender
   2.5 Receiver
3 Arbiter
System setup
Configuring HW-platform
Simulation
Synthesis
6 Current Results
7 Future work
1 CPU
This chapter is a description of the processor implemented in the SoCrates project. The processor is integrated together with a Network Interface and local memory to form a CPU node. One restriction put on the processor was that it should execute ARM code, because many applications in the embedded industry have been implemented on an ARM platform, which makes SoCrates more attractive to potential users. The processor is implemented with similar features to the original ARM, together with some new components. Figure 1 shows a general view of the processor architecture. The components that the processor consists of are described in the following sections.
Figure 1: General view of the processor architecture: control unit, register file, exception handler, pipeline compensators, barrel shifter, ALU, and PC logic.
B(L) The branch and branch-linked instructions perform a PC-relative jump. The destination address is generated by adding an offset to the actual program counter. Whether the instruction is an ordinary branch or a branch linked is determined by the state of bit 24, the L-bit in the op-code (figure 2); one means linked. A branch linked saves the return address in the link register, to be used when returning from subroutines.
Figure 2: The B(L) op-code format: COND, L-bit, and offset.
Data processing The 16 data processing instructions are the instructions that internally process data among the processor registers. The different instructions are defined by the ALU operation the instruction's op-code in figure 3 maps to; see section ALU. The second operand is either an immediate value or a register, depending on the I-flag. When operand 2 is an immediate value, the operand 2 field looks like figure 3a. An 8-bit immediate value is rotated right (ror), the amount given by the bits in the Rotate field. Two different types of register operands exist: register immediate (figure 3d) and register register (figure 3e). When operand 2 is register immediate, the shift amount is a 5-bit immediate value, whereas the shift amount for register register is the lowest five bits of the register Rs. For registers as operand 2, any of the five shifting functions are possible, given by the Fx field. Special awareness must be taken when the PC, register 15, is used, both for the implications of pipeline compensation and of mode changing. The S-bit is also of significance for the control flow of the program: it decides whether the status flags should be updated by an instruction or not.
Figure 3: Data processing op-code formats: (a) COND, OpCode, Rn, Rd and operand 2; (b) immediate operand with Rotate and Imm fields; (c)-(e) register operand forms with Rm, shift Amount, Rs, and Fx fields.
BDT Block Data Transfer enables transfer of data between any number of registers and main memory. This instruction is convenient to use for stack operations, since it supports any stacking mode: post, pre, up, and down indexing. The 16-bit wide register list contains all the registers the instruction shall operate on. Each register that ought to be processed is marked with a one in its corresponding bit position, e.g., register R5 in bit number 5. At least two registers must be present in the list. The behavior of the instruction is derived from the flag bits in the op-code (figure 4).
Special awareness must be taken whenever the program counter is in the register list. The use of the S-flag is also important when using the PC. When using user mode transfer, the base address is obtained from a banked supervisor register and all the registers in the list are non-banked registers.
Figure 4: The BDT op-code format: COND, the P, U, S, W, L flags, Rn, and the register list.
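Decoding the register list can be sketched in a few lines (an illustrative model, not the VHDL):

```python
# Sketch of decoding the BDT 16-bit register list: each set bit i marks
# register Ri for transfer, e.g. R5 corresponds to bit 5.

def registers_in_list(register_list):
    """register_list: 16-bit integer taken from the BDT op-code.
    Returns the register numbers to transfer, in ascending order."""
    return [i for i in range(16) if register_list & (1 << i)]

# Bits 0, 5 and 14 set -> R0, R5 and R14 (lr) are transferred.
assert registers_in_list(0b0100_0000_0010_0001) == [0, 5, 14]
```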
SDT Single Data Transfer loads or stores byte or word quantities. The contents of the destination register Rd is saved at location Rn (base register) + an offset value, or Rd is loaded with the contents of address Rn + offset. The actual instruction and modes are determined by the flag bits in the op-code (figure 5a). When the I-bit is clear, the offset is a 12-bit immediate value, shown in figure 5b. Otherwise the offset is a register (figure 5c), which ought to be shifted in the same way as the data processing register immediate before the addition to the base is performed.
Figure 5: The SDT op-code format: (a) COND, the P, U, B, W, L flags, Rn, Rd, and offset; (b) immediate offset; (c) register offset Rm with shift.
SWP This instruction is primarily a test-and-set instruction, designed for implementing software semaphores. The instruction reads the location given by the base register Rn and stores the contents of register Rm at that address. Then the previously read value is stored in Rd. The instruction requires that the different accesses to memory are atomic; therefore the instruction is able to lock the interconnect during execution.
Figure 6: The SWP op-code format: COND, the B flag, Rn, Rd, and Rm.
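The SWP semantics can be modelled as follows (a behavioural sketch; the register and memory structures are illustrative, and the atomicity/bus-lock aspect is only noted in the comments):

```python
# Sketch of the SWP semantics: an atomic read-modify-write used as
# test-and-set. On hardware the interconnect is locked for the duration,
# so no other master can interleave between the read and the write.

def swp(mem, regs, rd, rm, rn):
    """Rd := [Rn]; [Rn] := Rm, performed atomically."""
    address = regs[rn]
    old = mem[address]        # read the semaphore location
    mem[address] = regs[rm]   # store the new value (e.g. "taken")
    regs[rd] = old            # previously read value lands in Rd
    return regs[rd]

# Spin-lock acquire sketch built on SWP: 1 = taken, 0 = free.
mem = {0x100: 0}
regs = {"r0": 0, "r1": 1, "r2": 0x100}
assert swp(mem, regs, "r0", "r1", "r2") == 0  # was free -> acquired
assert swp(mem, regs, "r0", "r1", "r2") == 1  # already taken
```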
MRS Transfer a status register to a general purpose register. The contents of cpsr or spsr is transferred to any of R0-R14, or even to R15, the PC.
Figure 7: The MRS op-code format: COND, the Ps flag, and Rd.
MSR Transfers the contents of any register, including the PC, to cpsr or spsr. Two modes exist: the first transfers the whole register contents to the status register (figure 8); the other one only updates the condition flag bits of the status register (figure 8a). Two different kinds of flag transfer are possible, register or immediate: the register contents is used directly, whereas the immediate value is rotated before transfer.
Figure 8: The MSR op-code formats: (a)-(b) whole-register transfer with the Pd flag and source register Rm; (c)-(d) flag-only transfer with a rotated immediate or a register source operand.
SWI Software interrupt. This instruction is used to enter supervisor mode in a controlled manner. A software trap will occur, putting the processor in supervisor mode. The return address is saved in the link register (R14 supervisor) and the contents of the cpsr is transferred to the saved status register. The ignored field of the instruction's operation code (figure 9) is ignored by the processor and can be used to pass information to the interrupt service routine (ISR) handling the exception.
Figure 9: The SWI op-code format: COND, 1111, and the ignored field.
PRE The prefetch instruction is an extension to the original ARM7TDMI instruction set, designed and implemented for the SoCrates system-on-chip architecture. The instruction issues a non-blocking read of one data word from any address within the system. Unlike the rest of the instructions, the prefetch instruction cannot be executed conditionally, since the four ones in the normal condition field are the actual op-code of the instruction. A prefetch is issued of the data located on the node given by the eight-bit ID field and the 20-bit wide local address field (figure 10). A prefetch can be viewed as a non-blocking load of possibly remote data that normally would result in a processor stall. When the data is later needed, a read to the same address will deliver the requested data.
1.2 Register file

The original ARM7TDMI makes use of a five-ported register file, due to its pipeline and the fact that some modes of the data processing instructions need to read three registers at one instant in time. Since no pipelining is used in this clone, a "pseudo" dual-ported RAM model ought to perform well for the purpose. The register file is "pseudo" dual-ported in the sense that either two registers can be read simultaneously, or a write to a single register can be performed.
(The first prototype only supports prefetch of one 32-bit word; this will be enhanced in later versions.)
Figure 10: The PRE op-code format: the op-code, the eight-bit ID field, and the local address.
The ARM processor has 16 general purpose registers, ranging from zero to fifteen, where the PC is the sixteenth register R15, plus the current processor status register cpsr. There also exists banking (duplication) of registers that are accessible only in supervisor mode. The banked registers are R13_svc, R14_svc, and the saved processor status register spsr. A table of all general and banked registers can be found in the SoCrates specification's CPU section and the ARM7TDMI manual.
1.2.1 Registers

As stated in the SoCrates specification document, the ARM clone only supports the system/user and supervisor states; therefore, besides the general registers R0 to R15 and the processor status register, the banked registers R13, R14 and the saved status register must be implemented. This makes a total register count of 20. Each register's address, name, usage and implementation can be viewed in the table below.
Register address   Register name    Function         Implementation
00000              R0               R0               Ram
00001              R1               R1               Ram
00010              R2               R2               Ram
00011              R3               R3               Ram
00100              R4               R4               Ram
00101              R5               R5               Ram
00110              R6               R6               Ram
00111              R7               R7               Ram
01000              R8               R8               Ram
01001              R9               R9               Ram
01010              R10              R10              Ram
01011              R11              fp               Ram
01100              R12              ip               Ram
01101              R13              sp               Ram
01110              R14              lr               Ram
01111              R15              UNUSED           Ram
10011              R13 supervisor   sp supervisor    signal
10100              R14 supervisor   lr supervisor    signal
11111              pc               pc               signal
1.2.2 Implementation

The main part of the register bank for this clone consists of the RAM module wr_sync_dpram from the sysa 1 course. The RAM module is configured to contain 16 32-bit wide RAM cells, which is where the main part of the registers is located. The registers not located in RAM are implemented as signals, within or outside the actual register file. The registers that are signals but reside inside the bank are the banked R13 and R14. The two processor status registers are both implemented as signals but are not part of the physical register file unit. This is motivated by the fact that the MSR instruction has the option to update not the whole register but just the condition flags; this update of the flags would have been very hard to realize with the status registers as RAM cells. The program counter is also placed outside the register file as a signal, for speed performance reasons.
A wrapper component is built around the RAM module, where a mapping function translates the incoming signals from the processor's control unit into signals interpreted by the RAM module. The addresses of the different registers are given in the table above. The wrapper and its contents are shown in figure 11. In order to read the contents of any register in the register file, the we_reg strobe must be cleared and the address of the desired register must be presented to one of the two address lines for reads. Data will be ready for use on one of the two out ports the next cycle. Two registers can be read simultaneously. When issuing a write, only a single register is written: we_reg is set high and the register address is put on the add_write line. The data to write is put on data_in. The write is completed in memory after two cycles, due to the synchronous nature of the RAM module.
Figure 11: The register file wrapper. A mapping function translates we_reg, add_write, add_read_A and add_read_B into RAM addresses for R0-R12, R13_sp, R14_lr, the banked R13_svc and R14_svc, and the separate PC_R15, CPSR and SPSR_svc signals; the data ports are data_in, data_A and data_B.
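The read/write behaviour of the wrapper can be sketched as a behavioural model (class and method names are illustrative; the real write completes two cycles after the edge due to the synchronous RAM, which this sketch does not model, and the banked/status signals are only hinted at):

```python
# Sketch of the "pseudo" dual-ported register file: two registers can be
# read per cycle, or a single register written; read data becomes
# available the cycle after the address is applied.

class RegisterFile:
    def __init__(self):
        self.ram = [0] * 16          # R0-R15 in the RAM cells
        self.r13_svc = 0             # banked registers kept as signals
        self.r14_svc = 0
        self._out_a = self._out_b = 0

    def cycle(self, we_reg, add_write=0, data_in=0,
              add_read_a=0, add_read_b=0):
        """One clock edge: either one write, or two parallel reads."""
        if we_reg:
            self.ram[add_write] = data_in
        else:
            self._out_a = self.ram[add_read_a]
            self._out_b = self.ram[add_read_b]
        return self._out_a, self._out_b

rf = RegisterFile()
rf.cycle(we_reg=1, add_write=5, data_in=0xCAFE)        # write R5
a, b = rf.cycle(we_reg=0, add_read_a=5, add_read_b=0)  # read R5 and R0
assert (a, b) == (0xCAFE, 0)
```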
1.2.3 Interface

Signal Name   Width          Type   Description
clk           1              In     Clock pulse
we_reg        1              In     Write enable strobe
add_write     4 downto 0     In     Write address
add_read_A    4 downto 0     In     Read address port A
add_read_B    4 downto 0     In     Read address port B
data_in       31 downto 0    In     Write data input
data_A        31 downto 0    Out    Port A data output
data_B        31 downto 0    Out    Port B data output
clk: the clock needed by the synchronous dpram from the sysa 1 course, of which the register bank mainly consists. Writes are synchronous.
we_reg: write enable strobe. One enables writes, zero disables writes.
add_write: the register address to which the processor wishes to write. This address is only decoded when we_reg is high.
add_read_A: read address for port A, used to gain access to one of the registers R0-R15, independent of mode and eventual banked registers.
data_A: the output resulting from reading the contents of the register corresponding to the address specified by the signal add_read_A.
1.4.1 Interface

Signal Name   Width          Type   Description
Rm            31 downto 0    In     Input argument
amount        4 downto 0     In     Shift amount
Op            1 downto 0     In     Operand code
I_rot         1              In     Indicates immediate shift (no rrx)
cpsr_C        1              In     Carry bit from Status Register
B_out         32 downto 0    Out    Result of operation + carry out from shift
Rm: the 32-bit contents of any register except the status registers, or an immediate value specified by the instruction currently executing.
amount: the five-bit value giving the number of steps to be shifted. The amount is specified by an immediate value or by a register contents supplied by the actual instruction.
Op: gives the function to be performed by the barrel shifter, specified by the instruction.
I_rot: immediate shift flag; indicates that the amount is specified by an immediate value and rrx is not to be used.
cpsr_C: current state of the carry flag.
B_out: output result from shifting Rm x bits, along with the carry out produced by the shifting process. The carry produced by the barrel shifter is used as carry in for the ALU later on.
1.4.2 Functions

Basically the barrel shifter supports three shifting modes: logical shift left (lsl), logical shift right (lsr), and arithmetic shift right (asr). It also supports two rotates: rotate right (ror) and rotate right extended (rrx). The rrx function is a special case of ror #0, which is interpreted as rrx.
Rotate right: ror performs a rotate of the bits in the operand; for a ror #1, the contents of bit zero is placed in the carry and in bit 31. No bits are ever shifted out. Figure 15 illustrates the result of a ror #3.

Figure 15: Examples of the shifting functions LSL #3, LSR #3, ASR #3, ROR #3, and ROR #0 (RRX).
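The ror and rrx functions can be modelled as follows (a behavioural sketch of the two rotates described above):

```python
# Sketch of ror and rrx: ror rotates the 32-bit operand right with the
# last bit rotated out going both to the carry and around to the top;
# ror #0 is interpreted as rrx, a 33-bit rotate through the carry.

MASK = 0xFFFFFFFF

def ror(value, amount):
    """Rotate right; returns (result, carry_out)."""
    amount %= 32
    carry = (value >> (amount - 1)) & 1 if amount else None
    result = ((value >> amount) | (value << (32 - amount))) & MASK
    return result, carry

def rrx(value, carry_in):
    """Rotate right extended: 33-bit rotate through the carry flag."""
    carry_out = value & 1
    result = ((value >> 1) | (carry_in << 31)) & MASK
    return result, carry_out

# ror #1: bit 0 moves to bit 31 and to the carry.
assert ror(0x00000001, 1) == (0x80000000, 1)
# rrx: carry_in enters at bit 31, bit 0 leaves as carry_out.
assert rrx(0x00000003, 1) == (0x80000001, 1)
```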
1.4.3 Functionality

The barrel shifter works in a pipelined fashion with three stages, where each stage except the last produces an intermediate value that is passed on to the next stage for further processing. The result from the last stage is the actual output from the barrel shifter and is passed on to the ALU. By dividing the maximum possible shift amount into three stages, we reduce the chip area used compared to having 31 different shift modules, one for every possible amount, while still obtaining some effectiveness in speed.
1.4.4 Stages

In each stage of the pipeline the amount is calculated and the function is evaluated. The incoming operand, whether it is the unprocessed operand or an intermediate value, is passed to the right functional shifting block according to function and amount. The intermediate value produced, denoted stage n in figure 17, is the result of the shifts carried out by all previous stages.

1. The first stage of the barrel shifter encodes the amount to be shifted from the two lowest bits of the amount vector. This yields a shift amount of {0, 1, 2, 3}, which gives the actual subset of the whole shift to be performed by the first stage. A zero amount in those two bits indicates a bypass of the operand to the next stage. A special case emerges when the shift amount is zero and the operator is ror: if the signal I_rot equals zero, a rotate extended has to be performed, which is expressly done in this stage.

2. Shifts of 0, 4, 8, or 12 bits are performed in the second stage. The number of bits to shift is taken from bits 2 and 3 of the amount signal. As in the first stage, a zero amount means bypass to the next stage.

3. The third and last stage takes the intermediate value from the second stage and performs either a bypass to the output or a shift by 16 bits before passing it on to the output. This depends on whether the last bit of the amount vector is one or zero; zero means bypass.
Figure 17: The three stages of the barrel shifter: stage 1 shifts by 0-3 bits, stage 2 by 0, 4, 8 or 12 bits, and stage 3 by 0 or 16 bits, each stage producing an intermediate value passed to the next.
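The stage decomposition can be sketched for lsl (a behavioural model; the hardware evaluates all five functions and the rrx special case, not just lsl):

```python
# Sketch of the three-stage amount decomposition: stage 1 shifts by
# amount[1:0] (0-3), stage 2 by 4*amount[3:2] (0, 4, 8 or 12), stage 3 by
# 16*amount[4] (0 or 16); a zero amount in a stage means bypass.

MASK = 0xFFFFFFFF

def staged_lsl(operand, amount):
    stage1 = (operand << (amount & 0b11)) & MASK              # shift 0..3
    stage2 = (stage1 << (4 * ((amount >> 2) & 0b11))) & MASK  # 0/4/8/12
    stage3 = (stage2 << (16 * ((amount >> 4) & 0b1))) & MASK  # 0/16
    return stage3

# The three partial shifts compose to the full shift amount.
for amount in range(32):
    assert staged_lsl(1, amount) == (1 << amount) & MASK
```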
1.5 ALU

The Arithmetic Logic Unit (ALU) performs the 16 basic ARM-defined data processing operations. The arguments to the ALU come via the two inputs A and B, and the result is delivered at the output RES and at the four output flag signals. The unit is asynchronous.
Figure 18: ALU interface: inputs A, B, OP, s and c_in; outputs RES and the flags c, v, z, and n.
Signal Name   Type   Description
A             In     Input argument 1 (Op1)
B             In     Input argument 2 (Op2)
OP            In     Operand code
s             In     Set condition code
c_in          In     Carry bit from Status Register
RES           Out    Result of operation
c             Out    Carry flag
z             Out    Zero flag: set if the result of the operation is zero
v             Out    Valid flag: set if the data is not valid
n             Out    Negative flag: set if the result is negative
1.5.2 Operations

Which operation is performed by the ALU is defined by the input on the OP signals. All operations and their semantic actions are shown in the table below:

OP     Operation   Description
0000   AND         RES = A AND B
0001   EOR         RES = A EOR B
0010   SUB         RES = A - B
0011   RSB         RES = B - A
0100   ADD         RES = A + B
0101   ADC         RES = A + B + c
0110   SBC         RES = A - B + c - 1
0111   RSC         RES = B - A + c - 1
1000   TST         Set condition codes on A AND B
1001   TEQ         Set condition codes on A EOR B
1010   CMP         Set condition codes on A - B
1011   CMN         Set condition codes on A + B
1100   ORR         RES = A OR B
1101   MOV         RES = B
1110   BIC         RES = A AND NOT B
1111   MVN         RES = NOT B
The TST, TEQ, CMP and CMN operations all affect the RES signal, which means that it is up to the control unit not to use the result. For further information, refer to the ARM7TDMI Data Sheet: ARM Instruction Set.
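The operation table can be modelled directly (a behavioural sketch; only the N and Z flags are computed here, carry and overflow generation are omitted for brevity):

```python
# Sketch of the 16 ALU operations from the table above. TST/TEQ/CMP/CMN
# still drive RES; the control unit simply ignores the result, as noted.

MASK = 0xFFFFFFFF

def alu(op, a, b, c):
    ops = {
        0b0000: a & b,            # AND
        0b0001: a ^ b,            # EOR
        0b0010: a - b,            # SUB
        0b0011: b - a,            # RSB
        0b0100: a + b,            # ADD
        0b0101: a + b + c,        # ADC
        0b0110: a - b + c - 1,    # SBC
        0b0111: b - a + c - 1,    # RSC
        0b1000: a & b,            # TST (flags only)
        0b1001: a ^ b,            # TEQ (flags only)
        0b1010: a - b,            # CMP (flags only)
        0b1011: a + b,            # CMN (flags only)
        0b1100: a | b,            # ORR
        0b1101: b,                # MOV
        0b1110: a & (~b & MASK),  # BIC
        0b1111: ~b & MASK,        # MVN
    }
    res = ops[op] & MASK
    n = res >> 31                 # negative flag
    z = int(res == 0)             # zero flag
    return res, n, z

assert alu(0b0100, 2, 3, 0)[0] == 5        # ADD
assert alu(0b0010, 3, 3, 0) == (0, 0, 1)   # SUB -> zero flag set
assert alu(0b1111, 0, 0, 0)[0] == MASK     # MVN
```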
1.5.3 Flags

The S input signal controls whether the flags are to be set or not: if S is 1 the flags are affected, and if S is 0 no flags are affected. In the implementation, the S flag in the ALU is always set, and the data path control decides whether the status register should be updated or not, depending on the actual instruction and which flags are set in its op-code.
1.5.5 Testing

All operations are tested with all possible combinations of inputs. The verification of the simulation is made by hand. By using generics, the ALU can be tested for input arguments of size 2.
All handling of interrupts and exceptions is performed in the exception state, in which an offset vector from the exception decoder is processed and loaded into the PC, enabling execution to continue from that point. The address is presented to the address bus and the processor goes to the decode state.
Signal Name   Width         Type   Description
Reset         1             In     Reset line
Undefined     1             In     Undefined instruction line
Software      1             In     Software interrupt line (SWI)
A_prefetch    1             In     Prefetch abort line (unused)
A_data        1             In     Data abort line (bus error)
Reserved      1             In     Reserved for future expansion
IRQ           1             In     RTU interrupt line
FIQ           1             In     Fast IRQ line (unused)
mask          1             In     Interrupt mask from status register
arm_irq       1             Out    Interrupt signal to the processor
offset        2 downto 0    Out    Address offset to the ISR
The state machine is implemented as a Mealy machine with synchronous outputs; all assignments of signals are in the same VHDL process. This may not be the most effective way to implement a state machine for speed performance, but it has been our main technique to date.
Figure 19: The control state machine, with states start, decode (D), execute (E), dp extra (DP), fetch (F), load, store, block address (B_A), block load (B_L), block store (B_S), swap, and exception.
1.8.1 States

The control machine consists of 12 states, which can be viewed in figure 19. A general description of the action performed in each of the processor states is given in this section; general in the sense that not every signal assignment is described, but the function of each state in the execution of the different instructions is.
start The start state is a hardware setup state: internal signals are set to their initial values. The PC, cpsr, and spsr are initiated. All buses are three-stated. When the system boot signal is asserted, the first instruction is fetched and a state transition to decode is done.
fetch (F) The fetch state is, for most of the instructions, the last state in their execution path. It is also the state where the fetch of the next instruction is made. The last part of an instruction's execution is performed here, namely the write-back of the destination register and eventual updates of the status flags for data processing instructions.
decode (D) All instructions come to this state once in their execution cycle. If the nWait signal is high, execution of the instruction can begin; otherwise the processor halts. First the condition field of the op-code is evaluated against the states of the condition flags, as specified in the ARM7TDMI manual. If the conditions for the instruction fail, the PC will be updated and the next instruction fetched; we remain in the decode state to decode the next instruction. Now the op-code is decoded according to the decoding tree presented in the SoCrates specification document, section CPU. Any op-code not recognized by the decoder will generate an undefined
instruction exception, which will be handled by the exception handler. For all instructions that need to pass information from the op-code to other states along the instruction's execution path, an inst_info vector is created and initiated with adequate information; see section 1.8.2 for further detail. For most of the instructions, registers are fetched from the register file, except for the eventual use of immediate values. Depending on what remains to be done to complete the execution of the instruction, the next state differs among the instructions in the instruction set. The MRS instruction is already finished and can issue a fetch of the next instruction and remain in the decode state. Others, like data processing, load/store and SWP, go to the execute state. Prefetch, branch and MSR proceed to the fetch state.
execute (E) For most of the instructions, the register reads initiated in the decode state are here passed on to the different units in the processor architecture. Values are put on the pipeline compensators, shifted in the barrel shifter, and placed on the two ALU ports. Data processing instructions are passed further down their execution path according to their use of registers: immediate values, one register, two registers, up to three registers. The block data transfer instructions are the only instructions that enter this state several times during their execution. Therefore the execute state for those instructions is divided into two parts, distinguished by the number of registers left to process in the register list. If the register list is empty, all registers in the list have been processed and the instruction is finished; an update of the base register, if the W-bit in the op-code is set or post-indexing is used, is the only remaining action. If the list is non-empty, the first register in the list is processed: whether the instruction is a load or a store is determined and action taken, a read of the source register for a store or an address output for a load.
dp extra (DP) This state is only used when operand 2 of the data processing instruction is of register-register type. This is because the register file only enables two registers to be fetched at once, while this instruction needs to access three registers simultaneously. This is overcome by buffering the contents of one of the registers in the execute state and reusing it in this state, with the third register being fetched from the execute state.
load When the load instruction reaches the load state, the effective address calculation is ready. The address must then be evaluated according to the alignment rules: word-aligned for word accesses, while byte accesses need no special alignment. If a word access does not pass the alignment check, an undefined instruction exception will occur, the load will be aborted, and execution will continue with the exception handler for undefined instructions. If the address is aligned correctly, the right address for the load, depending on the addressing mode (pre- or post-indexing), is placed on the address bus. Any eventual write-modify of the base register is performed here, followed by a transition to the fetch state.
store The store state is identical to the load state in regard to alignment control, write-modify of the base register, and its next state. Not only the address bus is updated but also the data bus, with the actual value to be written to memory. Special action must be taken when the program counter is to be stored: it must be pipeline compensated by 12. When the write is of byte size, the lowest byte of the source register is duplicated onto all four bytes of the data bus.
block address (B_A) This state is exclusively used by the block data transfer instructions in order to be able to calculate the address of the transfer before they reach the execute state.
block load (B_L) This state handles write-back of registers for the load part of the block data transfer instruction. Since the actual read from memory was initiated in the decode state, a check of the nWait signal must precede all other actions. After the nWait signal goes high, the destination register is examined. If the program counter is the destination register, the PC is updated with the value currently residing on the data bus; then, if the S-bit in the instruction was set, the contents of spsr is transferred to cpsr (i.e., a mode change). The instruction then returns to the execute state to process the next load.
block store (B_S) In this state the data bus is presented with the contents of the register to be stored. A check must be done to determine whether the register to be stored is the PC or not. The address at which to store the data depends on the indexing modes pre, post, up, or down; this is evaluated and the right address is put on the address bus. Execution then proceeds to the execute state.
swap This state is a special state for the SWP instruction, where the previously read test variable is temporarily stored in the internal swap_reg. The store part of the SWP instruction is also initiated, by presenting the write address and data on the buses.
exception When entering this state, something unwanted has happened or an interrupt from an external source has occurred. The PC value is stored in the link register for supervisor mode, and the PC is updated with the address of the interrupt service routine that will handle the exception. The address is constructed by shifting the offset vector from the exception decoder left two bits, then adding the 8-bit identification field to the upper 8 bits of the address. The address is also presented to the address bus for a fetch of the first instruction of the ISR. The machine is driven to the decode state to begin execution of the next fetched instruction.
17
U Up or down bit.
P Pre or post indexing.
I Immediate or register oset.
1/2 Obsolete.
H Obsolete.
S Obsolete.
SWP The configuration of the swp information can be seen in figure 20d.
Instruction code 100
Rd Destination register.
Rn Base register.
Rm Source register.
B Byte or word quantity.
MSR See figure 20e for the information vector.
Instruction code 110
r/s Destination register.
R Status register cpsr or spsr.
F Whole register or flag bits only.
I Immediate or register transfer.
PRE See figure 20a.
Instruction code 111. The inst info vector for the prefetch instruction expressly contains the
instruction code, due to no other information needing to be passed along. The instruction
code is necessary because the instruction must be recognized in the fetch state.
DP Data processing is the only instruction to utilize the full size of the inst info vector, which can be
seen in figure 20f.
Figure 20: The inst info vector for the B(L), BDT, SDT, SWP, MSR, PRE, and Data processing instructions.
Figure 21: State transition(s) for instructions with false condition code.

Instruction Type | BCET    | WCET
CC false         | 1 cycle | Tfetch + 1 cycle
Figure 22: State transition(s) for Branch & Branch Linked (B, BL) instructions.

Instruction Type | BCET     | WCET
B                | 2 cycles | Tfetch + 2 cycles
BL               | 2 cycles | Tfetch + 2 cycles
[Figure content: state transition(s) for Data Processing (DP) instructions; subfigures (a)-(c) involve the D, DP, and F states.]

BCET     | WCET
2 cycles | Tfetch + 2 cycles
3 cycles | Tfetch + 3 cycles
3 cycles | Tfetch + 3 cycles
3 cycles | Tfetch + 3 cycles
4 cycles | Tfetch + 4 cycles
Figure 24: State transition(s) for Block Data Transfer (BDT) instructions.
Instruction Type | BCET     | WCET
STM              | 4 cycles | Tfetch + 115 + Tbus_request
LDM              | 4 cycles | Tfetch + 147 + Tbus_request
Ttotal = T(D>E) + T(E>STR/LDR) + T(E>F) + Tbus_request + TnWait + T(LDR/STR>F) + T(F>D), where

T(D>E) = 1 cycle
T(E>STR/LDR) = 1 cycle
T(E>F) = 1 cycle
Tbus_request = sum over i = 0..N of Ti (the sum of the execution times of all the nodes (N) in the request queue before you)
TnWait = 0 cycles if local, 5 cycles for a global STR, 7 cycles for a global LDR
T(LDR/STR>F) = 1 cycle
T(F>D) = 1 cycle
The best case is described in figure 25, where there is no transition from state E to LDR or STR.
Instead, there is a direct transition from state E to state F, because the PC isn't located in the
register file, and no cycle is needed to fetch it, as in the general case. In the simple case, the first cycle calculates the address, the second cycle puts the address on the bus (and data if needed), and the
last cycle writes back the data (when needed) at the same time as it fetches a new instruction. The
BCET occurs when the LDR or STR does not need to access the interconnect, which gives a total of
1 + 0 + 1 + 0 + 0 + 0 + 1 = 3 cycles.
The worst case occurs when a LDR instruction must access the interconnect and all nodes are issuing a
blocked transfer. This causes the bus request loop to wait until all blocked transactions are done. The
WCET then becomes 1 + 1 + 0 + Tbus_request + 7 + 1 + 1 = 11 + Tbus_request cycles. The WCET for STR
is basically the same, with the exception that TnWait for STR takes 5 cycles, which gives a total of
9 + Tbus_request cycles.
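The cycle counts above can be checked with a small C model (an illustrative sketch only; the function and parameter names are ours, and Tbus_request is passed in as a parameter):

```c
/* Sketch of the SDT (LDR/STR) cycle-count model described above.
   A local access takes the short D->E->F->D path (3 cycles); a global
   access adds the bus request, the nWait stall (7 cycles for LDR,
   5 for STR) and the extra transfer states. */
static int sdt_cycles(int is_load, int is_global, int t_bus_request)
{
    if (!is_global)                 /* best case: direct E -> F */
        return 1 + 1 + 1;           /* D>E + E>F + F>D = 3 cycles */

    int t_nwait = is_load ? 7 : 5;  /* global LDR vs. global STR */
    /* D>E + E>LDR/STR + Tbus_request + nWait + LDR/STR>F + F>D */
    return 1 + 1 + t_bus_request + t_nwait + 1 + 1;
}
```

With an empty request queue (Tbus_request = 0) this gives 11 cycles for a global LDR and 9 for a global STR, matching the WCET expressions above.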
Instruction Type | BCET     | WCET
LDR              | 3 cycles | Tfetch + 11 + Tbus_request cycles
STR              | 3 cycles | Tfetch + 9 + Tbus_request cycles
Figure 25: State transition(s) for Single Data Transfer instructions (1st operand as PC, 2nd operand
immediate).
Figure 26: State transition(s) for Single Data Transfer (SDT) instructions (general case).
Ttotal = T(D>E) + T(E>swap) + Tbus_request + TnWait_read + T(swap>F) + TnWait_write + T(F>D), where

T(D>E) = 1 cycle
T(E>swap) = 1 cycle
Tbus_request = sum over i = 0..N of Ti (the sum of the execution times of all the nodes (N) in the request queue before you)
TnWait_read = 0 cycles if local, 7 cycles if global
T(swap>F) = 1 cycle
TnWait_write = 0 cycles if local, 5 cycles if global
T(F>D) = 1 cycle
The best case execution time occurs when the instruction is a local swap, because no access to the interconnect is needed. This gives a BCET of 1 + 1 + 0 + 0 + 1 + 0 + 1 = 4 cycles.
The worst case is when the swap instruction must access the interconnect and all other nodes issue
a blocked instruction that takes maximum time to execute (which is a blocked load of all 16 registers).
So, the time for Tbus_request depends on how many nodes are present in the system. With this in
mind, we can estimate the WCET to 1 + 1 + Tbus_request + 7 + 1 + 5 + 1 = 16 + Tbus_request cycles.
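As with LDR/STR, the SWP counts can be expressed as a small C model (illustrative only; names are ours):

```c
/* Sketch of the SWP cycle-count model described above: a local swap
   takes 4 cycles; a global swap adds the bus request and both the
   read (7 cycles) and write (5 cycles) nWait stalls. */
static int swp_cycles(int is_global, int t_bus_request)
{
    if (!is_global)                 /* D>E + E>swap + swap>F + F>D */
        return 1 + 1 + 1 + 1;
    /* D>E + E>swap + Tbus_request + nWait_read + swap>F + nWait_write + F>D */
    return 1 + 1 + t_bus_request + 7 + 1 + 5 + 1;
}
```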
Instruction Type | BCET     | WCET
SWP              | 4 cycles | Tfetch + 16 + Tbus_request cycles
Figure 27: State transition(s) for Single Data Swap (SWP) instructions.
BCET     | WCET
1 cycle  | Tfetch + 1 cycle
2 cycles | Tfetch + 2 cycles

BCET     | WCET
2 cycles | Tfetch + 2 cycles
[Figure content: state transition from state D to the exception state.]
1.9.10 Prefetch
This instruction is implemented to make prefetching of data possible. In the prototype, only prefetching
of 32-bit data is possible. The prefetch instruction operand consists of a 28-bit address (8 bits id + 20 bits
local address) that will be written to the predefined address of the Network Interface prefetch register.
After the write, the instruction is finished and the data should be available when it is needed. The prefetch
instruction always takes 2 cycles, because one cycle is needed to write to the prefetch register at the NI, and
one cycle is needed to fetch a new instruction.
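The 28-bit operand layout can be illustrated with a small C helper (a sketch; the field order — the 8-bit id in the upper bits and the 20-bit local address below it — follows the description above, and the function name is ours):

```c
#include <stdint.h>

/* Pack the prefetch operand: 8-bit node id + 20-bit local address. */
static uint32_t prefetch_operand(uint8_t node_id, uint32_t local_addr)
{
    return ((uint32_t)node_id << 20) | (local_addr & 0xFFFFFu);
}
```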
Instruction Type | BCET     | WCET
PRE              | 2 cycles | Tfetch + 2 cycles
2 Network Interface
This section describes the implementation of the Network Interface (NI). The functionality demands for
this component are described in the Socrates specifications in section 4. To make future modifications easy, the NI
was divided into 5 components, shown in figure 32 and described below:
Address decoder
Listens to accesses made by the CPU and decides whether it is a local access, an external access,
or a prefetch initialization.
Control Unit
The heart of the NI. Keeps track of the state of the NI and controls the Sender, the Prefetch buffer, the
CPU data bus, and the nWait strobe to stop the CPU whenever needed.
Sender
Handles accesses to the interconnect; as the interconnect in the Socrates Prototype is a shared bus,
the accesses follow the bus protocol described in the Socrates 1.0 Specifications.
Receiver
Listens for accesses to memory located at this node or to the nWaitAlwaysRegister. The Receiver
controls the strobes to the memory wrapper. In case of a read, the fetched data from local memory
is delivered to the access initiator via the interconnect.
Prefetch Buffer
This is a cache-like buffer, but the data within the buffer is totally controlled by the local
CPU by initiating prefetches. A prefetch init is done by making a write to a specific address.
Dividing the component into a separate sender and receiver and a control unit with a clear interface
makes it easy to adapt the NI if the bus protocol is changed. In that case there is only need for modifications
in the sender and receiver components. Even a change to a completely different interconnect type will
only affect the sender and receiver, with minor modifications to the rest of the NI. In the following
sections the implementation of the 5 subcomponents will be described.
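The address decoder's classification can be sketched in C (illustrative only; the assumption that the node id occupies the upper byte of the address follows the node memory maps used elsewhere in this document, and all names are ours):

```c
#include <stdint.h>

enum access_type { ACCESS_LOCAL, ACCESS_EXTERNAL, ACCESS_PREFETCH_INIT };

/* Classify a CPU access the way the address decoder does: a write to
   the predefined prefetch-register address starts a prefetch; otherwise
   the id field of the address decides local vs. external. */
static enum access_type decode_access(uint32_t addr, uint8_t my_id,
                                      uint32_t prefetch_reg_addr)
{
    if (addr == prefetch_reg_addr)
        return ACCESS_PREFETCH_INIT;
    return ((addr >> 24) == my_id) ? ACCESS_LOCAL : ACCESS_EXTERNAL;
}
```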
[Figure content: block diagram of the NI with its subcomponents ADDRESS_DECODER, CONTROL_UNIT, SENDER, RECEIVER, and PREFETCH_BUFFER, their CPU-side signals (data_CPU, address_CPU, cs_CPU, rw_CPU, mas, nwait) and interconnect-side signals (data, address, bus_request, bus_grant, ack, lock, berr, trans, rw_i, mas_i, lock_i, ad_strobe).]

Figure 32: The Network Interface subcomponents.

The states of the Control Unit FSM, shown in figure 33, are briefly described below:

PREFETCH A prefetch is ongoing on the interconnect; waiting for the prefetch to finish. Additional accesses before completion of the prefetch will be delayed until the sender has prefetched the data.

RW An access has been started via the interconnect; waiting for acknowledge.

2.3 Prefetch Buffer
The control unit can read this buffer within one clock cycle, since reads are performed asynchronously. Writes
are handled synchronously to avoid latches. A data line in the buffer consists of a 32-bit address, 32-bit
data, and a valid bit. The prefetch buffer can only store one word, but placing the prefetch buffer in its
own subcomponent makes it easy to increase the size. For more information refer to the Socrates 1.0 Specifications
document.

2.4 Sender
This synchronous component is triggered by the Control Unit by setting an access type (remote read or
remote write). After a completed access the control unit is informed, and in case of a load the data is
delivered to the control unit. A general description of the states in the FSM shown in figure 34 is given below:
[Figure content: Control Unit FSM transitions, driven by cmd (REMOTE_WRITE; REMOTE_READ with HIT = 0 or HIT = 1; PREFETCH_INIT) and COMPLETE = 1; they set send_fcn (INTERCONNECT_READ or INTERCONNECT_WRITE), addr_sender, and data_write, and lead to the RW and PREFETCH states.]

Figure 33: The Finite State Machine for the Control Unit
[Figure content: Sender FSM with states IDLE, ARBITRATE_R, ARBITRATE_W, READ_INIT, WRITE_INIT, BUS_LOCKED, COMPLETE_STATE, and SET_BERR_SENDER; transitions are driven by send_fcn, bus_grant, lock, and berr_i, and drive bus_request, ad_strobe_out, address, data, mas_i, rw_i, lock_i, complete, and berr_sender.]

Figure 34: The Finite State Machine for the Sender
2.5 Receiver
The receiver is implemented with one synchronous FSM and parallel VHDL for detecting node accesses
and controlling byte lanes to the memory wrapper. The FSM in figure 35 is briefly described below.
RECEIVER IDLE Listening for accesses on the interconnect whose addresses match this node's
ID-field.
[Figure content: Receiver FSM with states RECEIVER_IDLE, WRITE_TO_LOCAL, READ_FROM_LOCAL, THREE_STATE_DATA, MEM_LOCKED, and SET_BUSERROR; transitions are driven by lock_i and local_lock, and drive ack, berr_i, data_NI, cs_NI, data, and remote_lock.]

Figure 35: The Finite State Machine for the Receiver
WRITE TO LOCAL A write access has been detected; strobes to the memory wrapper are set to perform a write.
READ FROM LOCAL A read access has been detected; strobes to the memory wrapper are set to perform a read.
THREE STATE DATA A read access has been performed. The data bus on the interconnect is three-stated.
SET BUSERROR An invalid access, for example an invalid access size.
3 Arbiter
The Arbiter unit shares resources among several potential bus masters in a round-robin fashion. The
implementation is generic, so it can be configured to handle an arbitrary number of nodes. It is
implemented as an asynchronous unit.
[Figure content: the Arbiter Unit with per-node request(i) inputs and grant(i) outputs.]
3.2 Operations
The arbiter uses an internal FIFO queue to store pending requests in a round-robin fashion. If a node
makes a request, and no one is currently master, then it gets mastership until it lowers its request line.
If there were several pending nodes at the time of the request, the requesting node is inserted last in
the FIFO queue. If several nodes make a request simultaneously, the node with the lowest ID is inserted
in the queue first, then the node with the second lowest ID, and so on. When the current master lowers its
request line, it automatically passes on its mastership to the node that is first in the FIFO queue.
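The queueing policy above can be modelled in a few lines of C (a behavioural sketch only, not the VHDL implementation; all names and the MAX_NODES bound are ours):

```c
#define MAX_NODES 8

typedef struct {
    int queue[MAX_NODES];  /* FIFO of pending node IDs */
    int head, count;
    int master;            /* -1 when the bus is free */
} arbiter_t;

static void arbiter_init(arbiter_t *a)
{
    a->head = 0; a->count = 0; a->master = -1;
}

static void enqueue(arbiter_t *a, int node)
{
    a->queue[(a->head + a->count) % MAX_NODES] = node;
    a->count++;
}

/* requests: bit i set means node i asserts its request line this cycle.
   Simultaneous requesters are queued lowest ID first. */
static void arbiter_request(arbiter_t *a, unsigned requests)
{
    for (int id = 0; id < MAX_NODES; id++)
        if (requests & (1u << id)) {
            if (a->master < 0 && a->count == 0)
                a->master = id;      /* bus free: immediate mastership */
            else
                enqueue(a, id);      /* otherwise wait in the FIFO */
        }
}

/* The current master lowers its request line: mastership passes on
   to the node that is first in the FIFO queue. */
static void arbiter_release(arbiter_t *a)
{
    if (a->count > 0) {
        a->master = a->queue[a->head];
        a->head = (a->head + 1) % MAX_NODES;
        a->count--;
    } else {
        a->master = -1;
    }
}
```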
3.2.2 Testing
The arbiter unit has been tested exhaustively with one and two nodes, and more randomly with three
nodes. Each test has been done with do-files.
[Figure content: phase 1 compiles each node's source file (arm-coff-gcc -c nodeX.c producing NodeX.o), followed by phases 2 through 6.]

Figure 37: The phases needed to create an executable file in the SoCrates system.
applications. Usually, the code for the threads on a node can be implemented in a single file. An example
of a typical user application can be seen below.
#include "Ose_ker.h"
#include "io.h"

#define STACK_SIZE 100

int running_thread;
char ctrl_stack[STACK_SIZE];

void ctrl_thread(void)
{
    while(1)
    {
        outstring("C\n");
    }
}

int thread_main(void)
{
    uart_setup(NO_COM_INT);
    ose_init(1);
    ose_thread_create(1, 1, ose_READY, 0, ctrl_thread, ctrl_stack, STACK_SIZE, 1);
    on_tsw();
    while(1) { }
}
The application shows how a main function (thread main) calls Real-Time Unit (RTU) functions that
initialize and create threads for scheduling during execution. The threads also call IO functions that
make it possible to send messages to a terminal. The supporting libraries are precompiled and can
be used by including their header files (Ose ker.h and io.h). A thread may need to communicate with
another thread that resides on a different node. A common way of communicating is to use variables
or semaphores. In order to make the variables "visible" to both threads, they need to be placed in a predefined section that every node can access. With the attribute directive, one can assign variables and
functions to special sections that can later be placed in the local memory by the linker script (more information about the attribute directive can be obtained from the gcc manual at http://www.gnu.org).
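As a sketch of the attribute directive described above (the variable name is hypothetical; .shared is the section mapped to the shared_at_node1 memory area in the linker scripts below):

```c
/* Place a variable in the .shared section so that the linker script can
   map it to the shared_at_node1 memory area visible to every node.
   The variable name is hypothetical. */
int shared_counter __attribute__ ((section(".shared"))) = 0;
```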
The user application is simply compiled in the following way (phase 1 of figure 37). The -c flag tells the
compiler only to compile, not to link, the application. The output of this compilation is an object
file. Every node produces its own object file; the object files are linked together in the next phase.
In the next phase, the precompiled libraries (or object files) are linked together with the user application
to produce an executable file. The precompiled files are:
Filename     | Routines
io.o         | uart_setup, outchar, getchar, getstring, outstring
Ose_ker.o    | init, create_thread, on_tsw, etc.
crt0_nodeX.o | C runtime 0 functions for node X + task-switch code
These files are linked together with the object files of the user applications to produce an executable file in
srecord format (phase 2 of figure 37). During the linking phase, the linker needs a script that shows where
all sections are placed in the local memory of each node. The following file is a typical linker script that
maps user code and data, RTU routines, startup routines, and IO routines to different memory areas.
OUTPUT_FORMAT("coff-arm-little");
ENTRY(_thread_main)
MEMORY
{
    reset_vector    : org = 0x01000000, LENGTH = 0x4
    undef_vector    : org = 0x01000004, LENGTH = 0x4
    swi_vector      : org = 0x01000008, LENGTH = 0x4
    pref_abort      : org = 0x0100000c, LENGTH = 0x4
    data_abort      : org = 0x01000010, LENGTH = 0x4
    reserved        : org = 0x01000014, LENGTH = 0x4
    irq_vector      : org = 0x01000018, LENGTH = 0x4
    fiq_vector      : org = 0x0100001c, LENGTH = 0x4
    common_vars     : org = 0x01000020, LENGTH = 0x800
    rtu             : org = 0x01000824, LENGTH = 0x600
    boot            : org = 0x01000e28, LENGTH = 0x100
    tsw_code        : org = 0x01000f2c, LENGTH = 0x100
    swi             : org = 0x01001030, LENGTH = 0x100
    code            : org = 0x01001134, LENGTH = 0x500
    data            : org = 0x01001638, LENGTH = 0x500
    shared_at_node1 : org = 0x01001b3c, LENGTH = 0x100
    ni_registers    : org = 0x01002000, LENGTH = 0x100
    io_registers    : org = 0x80000000, LENGTH = 0x10
}
SECTIONS
{
    .common            : { *(COMMON) }   > common_vars
    .reset             : { *(.reset) }   > reset_vector
    .swi               : { *(.swi) }     > swi_vector
    .irq               : { *(.irq) }     > irq_vector
    .tsw               : { *(.tswitch) } > tsw_code
    .rtu               : { Ose_ker.o }   > rtu
    .code              : { *(.text) }    > code
    .data              : { *(.data) }    > data
    .bss               : { *(.bss) }     > data
    .rdata             : { *(.rdata) }   > data
    .shared            : { *(.shared) }  > shared_at_node1
    .init              : { *(.init) }    > boot
    .io_regs1 (NOLOAD) : { *(.io_reg1) } > io_registers
    .io_regs2 (NOLOAD) : { *(.io_reg2) } > io_registers
    .io_regs3 (NOLOAD) : { *(.io_reg3) } > io_registers
    .io_regs4 (NOLOAD) : { *(.io_reg4) } > io_registers
    .io_code           : { io.o(.text) } > code
}
The script begins with the directive OUTPUT FORMAT, which tells the linker what the output format should be.
There are several options available, and they differ depending on which architecture the
linker is configured for. When executing objdump -i, the program will provide information about which
output formats are available. This particular script uses the directive "coff-arm-little", which produces a
binary file in little-endian format. This binary file can later be converted to srecord format with the objcopy
command. Another way is to use "arm-srec" as the output format directive. The next directive informs the
linker where the start function is, when the application does not start from a main function. We chose to always start
our applications from the function thread main and therefore specify thread main in the ENTRY
directive.
The next directive is a MEMORY body that provides the linker with information about where each
section should reside in the local address space. Here's how a general section could be described:
section_name : org = start_address, LENGTH = area_size
The author of the linker script chooses what the section name, start address, and area size should be.
After the MEMORY body, one must connect the sections produced by the linker to the user-defined
sections. This is done in the SECTIONS body, where a general line is written in the following way:

.real_section_name : { chosen_sections } > mapped_section

The real section name specifies which section is addressed. This could be a user-defined section from the
attribute directive or one of the linker-specified sections (e.g. .text, .data, .bss, etc.). The chosen section field
describes which of the sections are chosen (in case there are several sections with the same name). For
instance, it is possible to map all the .text sections to a single section by specifying the chosen section field
as *(.text). The last field, mapped section, specifies to which user-defined section this section should be
mapped (which means that mapped section must match some section name in the MEMORY
body). The linker script makes it possible to map all the sections according to the SoCrates memory structure (more information on where each section is mapped can be obtained by looking at the test application
linker scripts, node1.x and node2.x, or the System chapter in the SoCrates specification). Also, it is possible to force the linker to report the result of the mapping by specifying the -Map flag. There are many more
options available when writing linker scripts, but the aim of this section is not to describe all the possible
script directives. Instead, the interested reader can find more information about how to write linker scripts
at http://www.cygnus.com/pubs/gnupro/5 ut/b Usingld/ldLinker scripts.html#Concepts. The linker produces an output file with the name specified after the -o directive.
The executable file must still go through some changes before it can be used by the SoCrates hardware. First, the linker produces a byte-oriented srecord file, which means that each srecord line can have
a non-word-sized number of bytes. The IO node in the SoCrates system, which handles downloading of the
executable file, wants word-sized srecord lines. To convert byte-sized srecords to word-sized srecords, we
use the SRECORD package, which is a collection of programs that can manipulate srecord files. One of
the programs in the package, srec cat, takes an srecord file as input and produces a word-sized srecord file
(phase 3 in figure 37).
The next phase simply concatenates each node's word-sized srecord file into one srecord file that can
be downloaded by the IO node (phase 4 in figure 37). In the next phase, the srecord file is stripped
of srecord lines that do not contain data (srecord lines of type S5 and S7). This is accomplished by
letting sed delete the unwanted lines (phase 5 in figure 37). Finally, one srecord line is added to the final
executable file. This line allows (when it is written to memory) the initially stalled processors to begin
their execution of the downloaded srecord lines (the code and data) (phase 6 in figure 37). The final
product is an executable file in srecord format that the IO node can download to each processor.
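Phase 5's sed step can equally be sketched in C (illustrative only; real srecord lines also carry byte counts and checksums, which this sketch ignores):

```c
#include <string.h>

/* Sketch of phase 5: copy an srecord image, dropping lines whose type
   is S5 or S7 (they carry no data). 'in' holds newline-separated
   srecord lines; the filtered image is written to 'out'. */
static void strip_non_data(const char *in, char *out)
{
    out[0] = '\0';
    while (*in) {
        const char *eol = strchr(in, '\n');
        size_t len = eol ? (size_t)(eol - in) + 1 : strlen(in);
        if (!(in[0] == 'S' && (in[1] == '5' || in[1] == '7')))
            strncat(out, in, len);   /* keep data-carrying lines */
        in += len;
    }
}
```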
5.3 Simulation
RTL simulation can be done using ModelSim. To compile the system, go to the implementation
directory and type: compile. Compile the SW as described in section 4 and make test-bench-readable
data of the software by typing loadsystem in the arm-code directory. Terminal emulation can be used
to see the output from the IO node. To start the terminal emulators, go to the arm-code directory and
write disp. Two Unix terminals that emulate VT100 should now emerge. These terminals are connected
to the two UARTs in the IO node. Start ModelSim (or any other RTL simulator) and load the entity
SUPERBENCH RTU to simulate the whole system.
5.4 Synthesis
Synthesis time is about 2 hours for Leonardo with optimization, and half an hour for place and route.
A complete system including two CPU nodes, the RTU, the I/O node, and the arbiter occupies only 58% of the
VIRTEX 1000 FPGA. Pin placement is controlled by editing the Socrates.ucf file.
SYSTEM_RTU.vhd
CPU_NODES.vhd
SUPERBENCH_RTU.vhd
system_generics.vhd
generic_or.vhd
terminal1.vhd
terminal2.vhd
terminal1.so
terminal2.so
terminal1.c
terminal2.c
socrates.ucf
modelsim.ini
Makefile
node1.c
node2.c
crt0_node1.S
crt0_node2.S
node1.x
node2.x
Ose_ker.c
Ose_ker.h
Ose_io.h
Ose_tcb.h
my_terminal.c
io.h
io.o
io.c
nWaitAlways.SREC
socload*
loadsystem*
srec_cat*
myterm*
disp*
Makefile
cpu_node/
CPU_NODE.vhd
Makefile
RTU/
rtu_node_top.vhd
SYSTEM
arm-code/
implementation/
src/
wrapper/
interconnect/
cpu-node/
CPU/
wr_sync_dpram_32.vhd
dpram2048x32_sim_distram.vhd*
dpram2048x32.vhd*
dpram2048x8.vhd*
Makefile
arbiter.vhd
NI.vhd
address_decoder.vhd
prefetch_buffer.vhd
control_unit.vhd
receiver.vhd
sender.vhd
README
Makefile
CPU_NODE.vhd
cc.do
Makefile
CPU.vhd
alu.vhd
barrel.vhd
exception.vhd
inc4.vhd
pipe_compensator.vhd
reg_bank.vhd
trippelport.vhd
wr_sync_dpram.vhd
Makefile
-component of CPU
-component of CPU
-component of CPU
-component of CPU
-component of CPU
-compiles VHDL files for simulation
sim.do*
socload/
socload.c
slsim/
IO/
socport/
src/
src/
io.vhd
uart.vhd
slsim*
slsim.c
control.vhd
parstim.vhd
piso.vhd
promrdr.vhd
socport.vhd
soctest.vhd
-submodule of io-node
Filerna.txt*
Figure 38: Directory structure and file descriptions for the Socrates Platform
[Figure 39: Schematic of the simulated system: two CPU nodes (each with CPU, MEMORY, and NI), the RTU, and an I/O node (CPU, MEMORY, NI, UART 1, UART 2), all connected via a SHARED BUS with an ARBITER.]
6 Current Results
The goal from the beginning of the Socrates project was to have a multiprocessor system on a single FPGA
running a real-time demo application communicating with a host via VT100 terminals. However, the lack
of time and problems with integration of the real-time kernel, due to its compiler-specific API, delayed
the actual integration, verification, and software development by two weeks. Also, the lack of detail in
the RTU documentation increased the system debugging time. As a result, our ambitious goals
had to be revised to include only RTL simulation of the system, since we considered it important to have
the RTU integrated.
6.5 RTL-simulation
The whole system was simulated at the RTL level using MentorGraphics ModelSim. The demo application was compiled and linked with the GCC cross compiler for the ARM architecture, and downloaded
to a memory model via the IO node. It executed successfully, writing messages to the VT100 text
terminals and performing task switches for 7 seconds of simulated time, which took a couple of days of
wall-clock time. Figure 39 shows a schematic picture of the simulated HW components.
[Figure content: host software with simulated VT100 terminals showing thread and control output, and simulated hardware with CPU 1, CPU 2, threads 0-2, and the RTU.]

Figure 40: The test environment, simulated hardware with software threads executing on different processors, interacting with the RTU and printing strings to the simulated software host.
Unit     | Gates   | Slices | Maximum frequency
SYSTEM   | 653,349 | 7,243  | 14.7 MHz
CPU-NODE | 41,929  | 2,414  | 18 MHz
CPU      | 33,834  | 1,925  | 29.6 MHz
RTU      | 5,873   | 0,352  | 50 MHz
NI       | 6,625   | 0,318  | 64.9 MHz
IO       |         |        |
ARBITER  | 0,271   | 0,020  | -
7 Future work
The objective of the implementation phase of this master thesis has been to specify, develop, and implement
the first prototype of the Socrates scalable system on a chip. A processor clone able to execute a subset of
the ARM7TDMI instruction set has been developed. The processor clone also features an addition to the
original instruction set: a prefetch instruction has been designed and implemented. A network interface,
handling internal as well as external bus transactions, has also been developed. The system also features a Real-Time
Unit (RTU) capable of handling several processors. The interface between hardware and software is
handled by kernel routines. For software development a cross compiler has been used, which enables
us to create software in a serious manner; it also gives us the possibility to reuse other people's C code.
The linker scripts for linking the code onto the system are also developed and in use. However, as first
stated, this is the very first prototype, and things do not become perfect at once. This applies to this
prototype both on a system and architectural level and in component design and implementation.
The following sections will deal with the shortcomings that exist and how to solve them.
7.1.2 Register file
The use of the sysa 1 dpram module might save us some CLBs on the FPGA, but the module is inefficient
when it comes to speed. There must exist other modules worth examining and evaluating to see if they can be used
instead of the module currently in use. If no better alternative can be found, the only remaining option is
to design a new RAM module within the project, in VHDL or layout. For the instructions that need to
obtain three registers at once, the number of ports needs to be increased.
Conclusions
We have shown that it is possible to develop a multiprocessor SoC (including a whole CPU core) that
fits on a single FPGA in a very short time. By placing memory on-chip and reducing complexity in the
CPU core to gain system predictability, we take advantage of the benefits that come with developing
systems on a single chip. When working closely in a small group of 3-5 people, information, new ideas,
and bug reports during verification are propagated immediately. A platform concept gives rapid design
time and keeps the designers focused on debugging the application and not the platform itself. One of
the biggest challenges remaining to be solved is the actual verification of the platform, because SoCs
have embedded components that are hard to debug. Another issue is how to make software development easier to integrate with the platform. Although our ambition was to have a complete
multiprocessor system running on an FPGA, the final stage of the master thesis ended at RTL simulation
of a demo application running on the target system.
SoCrates - Appendix
The demo application consists of two source files: node1.c and node2.c. The first node (node1) creates
and executes a control thread that simply prints out messages to the terminal. The other node (node2)
creates and executes two threads, where each thread prints a message and lets the other thread take over
the execution. Every node has a thread main function that creates and starts up the threads.
#define CTRL_ID    1
#define CTRL_PRIO  1
#define STACK_SIZE 100

char ctrl_stack[STACK_SIZE];

void ctrl_thread(void)
{
    while(1)
    {
        outstring("Control \n");
    }
}

int thread_main(void)
{
    uart_setup(NO_COM_INT);
    ose_init(1);
    ose_thread_create(1, 1, ose_READY, 0, ctrl_thread, ctrl_stack, STACK_SIZE, 1);
    on_tsw();
    while(1) { }
}
The I/O routines communicate with a UART (9600 baud) in polled mode. This means that each
character is received or transmitted by reading/writing the UART control registers. Each node that
needs to write messages to a terminal needs to initialize its UART with uart setup. After initialization,
sending or receiving messages can be done with one of the following routines listed in the header file.
/* UART registers */
extern unsigned char UART_TX;
extern unsigned char UART_RX;
extern unsigned char UART_CR;
extern unsigned char UART_SRG;

#define COM1_INT_REC   1
#define COM2_INT_REC   16
#define COM1_INT_SEND  2
#define COM2_INT_SEND  32
#define NO_COM_INT     0

2.2

unsigned char UART_TX __attribute__ ((section(".io_reg1")));
unsigned char UART_RX __attribute__ ((section(".io_reg2")));
unsigned char UART_CR __attribute__ ((section(".io_reg3")));
unsigned char UART_SRG __attribute__ ((section(".io_reg4")));
#define ASCII_CR 13

void uart_setup(unsigned char setup)
{
    UART_CR = setup;
}

void outchar(unsigned char c)
{
    while( (UART_SRG & (unsigned char)32) != 32)
        ;
    UART_TX = c;
}
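The polled transmit protocol can be exercised off-target with mock registers (a sketch only: the ready mask 32 matches the status-register test in outchar above, but the mock registers, the sink buffer, and all names here are ours, not the SoCrates io library):

```c
/* Mock of the polled UART transmit path: status-register mask 32 means
   "transmitter ready"; each character is written to UART_TX once the
   spin loop sees the bit set. The sink buffer records what was "sent". */
static unsigned char UART_SRG = 32;  /* mock: transmitter always ready */
static unsigned char UART_TX;
static char sink[64];
static int  sink_len;

static void outchar(unsigned char c)
{
    while ((UART_SRG & (unsigned char)32) != 32)
        ;                       /* spin until the UART can accept a byte */
    UART_TX = c;
    sink[sink_len++] = (char)c; /* mock side effect, for testing only */
}

static void outstring(char *ptr)
{
    while (*ptr)
        outchar((unsigned char)*(ptr++));
}
```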
The task switch routine must save the context of the currently running thread, then fetch the identification
of the next thread, and finally restore the context of the new thread. The difference between switching
threads at node 1 and node 2 is where EXECTHREAD is located in memory.
    msr spsr, R1          /* save the status reg in R1 in the saved status reg */
    ldmia LR, {R0-R14}^   /* restore registers from tcb; TCB is now in LR */
    ldr LR, [LR, #84]     /* restore pc from tcb (to LR) */
    adds PC, PC, #0       /* Force mode change, SUPER->USER mode */
A detailed explanation of the meaning of each row in the linker script can be found in the Compiling
& Linking the System Software section.
The node1.x linker script:

MEMORY
{
    reset_vector    : org = 0x01000000, LENGTH = 0x4
    undef_vector    : org = 0x01000004, LENGTH = 0x4
    swi_vector      : org = 0x01000008, LENGTH = 0x4
    pref_abort      : org = 0x0100000c, LENGTH = 0x4
    data_abort      : org = 0x01000010, LENGTH = 0x4
    reserved        : org = 0x01000014, LENGTH = 0x4
    irq_vector      : org = 0x01000018, LENGTH = 0x4
    fiq_vector      : org = 0x0100001c, LENGTH = 0x4
    common_vars     : org = 0x01000020, LENGTH = 0x800
    rtu             : org = 0x01000824, LENGTH = 0x600
    boot            : org = 0x01000e28, LENGTH = 0x100
    tsw_code        : org = 0x01000f2c, LENGTH = 0x100
    swi             : org = 0x01001030, LENGTH = 0x100
    code            : org = 0x01001134, LENGTH = 0x500
    data            : org = 0x01001638, LENGTH = 0x500
    shared_at_node1 : org = 0x01001b3c, LENGTH = 0x100
    ni_registers    : org = 0x01002000, LENGTH = 0x100
    io_registers    : org = 0x80000000, LENGTH = 0x10
}
SECTIONS
{
    .common            : { *(COMMON) }   > common_vars
    .reset             : { *(.reset) }   > reset_vector
    .swi               : { *(.swi) }     > swi_vector
    .irq               : { *(.irq) }     > irq_vector
    .tsw               : { *(.tswitch) } > tsw_code
    .rtu               : { Ose_ker.o }   > rtu
    .code              : { *(.text) }    > code
    .data              : { *(.data) }    > data
    .bss               : { *(.bss) }     > data
    .rdata             : { *(.rdata) }   > data
    .shared            : { *(.shared) }  > shared_at_node1
    .init              : { *(.init) }    > boot
    .io_regs1 (NOLOAD) : { *(.io_reg1) } > io_registers
    .io_regs2 (NOLOAD) : { *(.io_reg2) } > io_registers
    .io_regs3 (NOLOAD) : { *(.io_reg3) } > io_registers
    .io_regs4 (NOLOAD) : { *(.io_reg4) } > io_registers
    .io_code           : { io.o(.text) } > code
}
4.2
11
Listing 4.2: node2map.x, the linker script for node 2, with the same caveat about lost names ("..."); the I/O section names are shown as in node1map.x. Note that the shared region keeps node 1's address (0x01001b3c) while the other regions move to the 0x02000000 range, and the boot region moves to 0x80000100.

    MEMORY
    {
        ... : org = 0x02000000, LENGTH = 0x4
        ... : org = 0x02000004, LENGTH = 0x4
        ... : org = 0x02000008, LENGTH = 0x4
        ... : org = 0x0200000c, LENGTH = 0x4
        ... : org = 0x02000010, LENGTH = 0x4
        ... : org = 0x02000014, LENGTH = 0x4
        ... : org = 0x02000018, LENGTH = 0x4
        ... : org = 0x0200001c, LENGTH = 0x4
        ... : org = 0x02000020, LENGTH = 0x800
        ... : org = 0x02000824, LENGTH = 0x600
        ... : org = 0x02000e28, LENGTH = 0x100
        ... : org = 0x02000f2c, LENGTH = 0x100
        ... : org = 0x02001030, LENGTH = 0x100
        ... : org = 0x02001134, LENGTH = 0x500
        ... : org = 0x02001638, LENGTH = 0x500
        ... : org = 0x01001b3c, LENGTH = 0x100   /* still in node 1's address space */
        ... : org = 0x02002000, LENGTH = 0x100
        ... : org = 0x80000100, LENGTH = 0x10
    }

    SECTIONS
    {
        ... : { *(COMMON) }   > common_vars
        ... : { *(.reset) }   > reset_vector
        ... : { *(.swi) }     > swi_vector
        ... : { *(.irq) }     > irq_vector
        ... : { *(.tswitch) } > tsw_code
        ... : { Ose_ker.o }   > rtu
        ... : { *(.text) }    > code
        ... : { *(.data) }    > data
        ... : { *(.bss) }     > data
        ... : { *(.rdata) }   > data
        ... : { *(.shared) }  > shared_at_node1
        ... : { *(.init) }    > boot
        .io_regs1 (NOLOAD) : { *(.io_reg1) } > io_registers
        .io_regs2 (NOLOAD) : { *(.io_reg2) } > io_registers
        .io_regs3 (NOLOAD) : { *(.io_reg3) } > io_registers
        .io_regs4 (NOLOAD) : { *(.io_reg4) } > io_registers
        .io_code           : { io.o(.text) } > code
    }
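The scripts above work together with section attributes in the C sources: a symbol is tagged with a named output section, and the linker script maps that section onto a memory region. The following is a minimal sketch of how data could be tagged for the .shared section, assuming GCC's section attribute; the variable and function names are illustrative, and only the section name .shared comes from the scripts.

```c
/* Sketch (not SoCrates source): tagging a symbol for one of the custom
 * sections used by node1map.x/node2map.x. The code accesses the variable
 * like any other global; the linker script, not the code, decides where
 * it is placed (with a default script it behaves like an ordinary
 * global). */

/* Placed in the .shared output section, which the node scripts map to
 * the shared_at_node1 memory region. */
__attribute__((section(".shared"))) volatile int shared_counter;

/* Increment the shared variable and return the new value. */
int bump_shared(void)
{
    shared_counter += 1;
    return shared_counter;
}
```

When linked with node1map.x or node2map.x, shared_counter ends up in the shared region resident at node 1, so both nodes address the same word; the source code itself needs no knowledge of the address.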
This article, which is a summary of the SoCrates project, has been accepted at the DATE 2001 conference. As we are currently waiting for instructions and feedback from the review committee, we cannot provide the camera-ready version.
SoCrates
- A Multiprocessor SoC in 40 days
Mikael Collin, Raimo Haukilahti, Mladen Nikitovic and Joakim Adomat
Department of Computer Engineering, Mälardalen Real-Time Research Center (MRTC)
Mälardalen University, P.O. Box 883, S-721 23 Västerås, Sweden
{mci, rhi, mnc, jat}@mdh.se
Abstract
The design time of System-on-a-Chip (SoC) is today rapidly increasing due to high complexity and a lack of efficient tools for development and verification. This article describes the design and implementation of a Multiprocessor SoC (MSoC) conducted by three master students. We propose a generic platform generator as a way to reduce time-to-market and verification time. With the project, we have shown that it is possible to develop, in a short time, a MSoC that fits on a single FPGA.
1. Introduction
This paper describes the design of the first prototype of
Socrates, a generic scalable platform generator that creates
a synthesizable HDL description of a multiprocessor system. The platform is a result of a master thesis by three
students. The goal was to build a predictable multiprocessor system with mechanisms for data prefetching on a single
FPGA. This means that all development has to be done in a
very short time and all the software and hardware must fit
on a single FPGA.
2. Motivation
Design time, including verification, has become one of the largest challenges in SoC design. The productivity gap shows the problem with developing complex SoCs. To reduce this gap, new design methods are needed. In order to meet the demands of fast verification and time-to-market, the system needs to be designed at a higher abstraction level. This can be obtained by using a platform generator, where the individual components are already verified.
This means that the platform can instantly be tested and verified at system-level, which reduces the overall development time. To decrease design time we propose a parameterised system generator. SoCrates (the name stands for SoC for real-time systems) can easily scale to a large number (1-20) of processing nodes and adapt the remaining components to given parameters at compile time.
3. System Overview
Socrates is a distributed shared memory (DSM) multiprocessor with non-uniform memory access time. The architecture is based on a shared bus [1] onto which nodes are connected (figure 1). A typical node contains local memory, a network interface, and a CPU or DSP, where the CPU is an ARM7 [5] clone. There exists a version with a DSP node, but due to area limitations, no DSPs are included in this version [3]. Software applications are divided into threads and distributed onto the processors. Scheduling of threads is managed and controlled by a Real-Time Unit (RTU) [2]. Interprocess communication is performed through shared variables, whereas communication between hardware devices is made via memory-mapped registers. In order to compensate for the non-uniform memory latency, the system supports prefetch functionality: remotely located data words can be fetched in advance of the point in execution where they are required. I/O is managed by a centralised node, which is responsible for code download and external communication.
[Figure 1: system overview — CPU/DSP nodes, each with local MEM and an NI, an I/O node, the RTU, and an Arbiter on a shared Interconnect; Interrupt lines connect the RTU to the nodes and carry External Interrupts.]
[Figure 2: software build flow — node1.c and node2.c each contain int main(void) with calls to createThread(...); the thread files (n1Thread1.c, n1Thread2.c, n2Thread1.c, n2Thread2.c) are compiled with the gcc cross-compiler, linked with io.o, OSkernel.o, applicationNode1.o and node1map.x (respectively applicationNode2.o and node2map.x) into node1.SREC and node2.SREC, and combined into application.SREC.]
The hardware-implemented real-time kernel performs thread scheduling and synchronisation. RTU communication with the nodes is implemented with a memory-mapped, register-based handshake scheme. Interrupts are sent out upon a context switch and the nodes fetch their next thread ID. The kernel is parameterised with respect to the maximum number of threads, priorities and CPUs.
4. Problems
Despite fast workstations, time for simulation and verification is clearly a problem. Run-times of several days were not unusual, especially for back-annotated data. System simulation with pre-place-and-route timing proved to be useful, as place-and-route took several days for the final implementation.
5. Current Results
A demo application that runs on two CPU nodes and utilises the RTU and I/O nodes has been successfully implemented. The whole system, including hardware and software, fits on a single FPGA. The developed linker scripts map the user software to local address spaces, and the system is parameterised by using generics. Figure 4 shows synthesis results for each hardware component.
    Component    Used gates
    System          653 349
    CPU node         47 897
    RTU              33 834
    CPU core         41 929
    NI                5 873

Figure 4. Synthesis results.
6. Future Work
Design for Test (DFT) is assumed to be implemented with standard tools and design flows. This can be accomplished due to our all-synthesizable approach. Our scalability relies on a perfect interconnect. Today, a shared bus is used, which must be improved in the future with point-to-point/crossbar options in the generator. Software targeting is an important area and was solved by letting the user partition the application at thread level. This can be done automatically with respect to thread communication (locality) and code/data size. Interprocess communication HW support will be integrated from existing research projects at MRTC. With automatic generation of synthesis scripts, a pushbutton design flow seems within reach. Virtual memory and dynamic memory allocation HW support will be added.
7. Conclusions
This project shows that it is possible to develop a multiprocessor SoC (including a whole CPU core) that fits on a single FPGA in a very short time. By working closely in a small group of 3-5 people, information such as new ideas and bug reports from verification propagates immediately. A platform concept gives rapid design time and keeps the designers focused on debugging the application and not the platform itself. An all-synthesizable approach is far from feasible in most projects, but it is a convenient way of getting rapid results. A system view is crucial when designing SoCs: all parts interact, and focusing on narrow HW or SW issues does not give a satisfying solution. The platform concept will dominate until IP companies have solved today's problems with reusability and interoperability.
References
[1] D. E. Culler et al., Parallel Computer Architecture: A Hardware/Software Approach, Morgan Kaufmann, San Francisco, California, 1999, ISBN 1-55860-343-3.
[2] L. Lindh et al., "From Single to Multiprocessor Real-Time Kernels in Hardware," IEEE Real-Time Technology and Applications Symposium, Chicago, USA, May 15-17, 1995.
[3] Carnegie Mellon Low Power Group, http://www.ece.cmu.edu/ lowpower/benchmarks.html
[4] The GNU project, www.gnu.org
[5] ARM Ltd., www.arm.com
[6] Xilinx Inc., www.xilinx.com
[7] Mentor Graphics Corporation, www.mentor.com/modelsim