OVERVIEW
From the functional point of view, it makes little difference whether a given
specification is implemented using a centralized architecture or a distributed
architecture. In this chapter, a number of arguments are presented in favor of a
distributed approach for the implementation of a hard real-time system. The chapter
starts with an overview of a distributed real-time system architecture, which consists
of a set of nodes and a communication network that interconnects these nodes. The
interface between the communication network and the host computer within a node is
discussed at length, and the concepts of event messages and state messages are introduced.
The semantics of state messages facilitates the exchange of state information among
the nodes, and enforces a high degree of autonomy of the nodes and the
communication system.
The next sections argue for a distributed architecture as the preferred alternative for the
implementation of a hard real-time system. The first and most important argument is
that of composability. In a composable architecture, system properties follow from
subsystem properties. In such an architecture, a constructive approach to system
building and to system validation is supported. In the context of real-time systems,
composability requires that the communication network interface between the host
computer in a node and the communication network be fully specified, not only in the
value domain, but also in the time domain. Scalability requires that there exist no
limits on the extensibility of a system, and that the complexity of reasoning about
the proper operation of any system function be independent of the system size. It is
shown that the economics of VLSI production favor the distributed approach for
building a scalable large real-time system. Dependability arguments advocate a
distributed approach because such an approach makes it possible to implement well-
defined error-containment regions, and to achieve fault tolerance by replicating nodes.
30 CHAPTER 2 WHY A DISTRIBUTED SOLUTION?
Event semantics may also require guarantees about the order in which messages are
delivered by the communication system, such as the FIFO order of message arrival.
If the message contains state information, e.g., the current temperature, it is
reasonable to overwrite the old version of a state value by a new version because the
receiver is normally interested in the latest version of the state. State values are not
disposed of on reading because many readers may be interested in the same state
value, e.g., the current value of the particular temperature sensor. In real-time
systems, state semantics is needed much more frequently than event semantics.
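The overwrite-on-write, non-consuming-read behavior of a state message can be sketched in C; the type and function names below are illustrative and not part of any particular protocol:

```c
#include <stddef.h>

/* A state-message slot at the CNI (a sketch; names are illustrative).
 * A new version simply overwrites the old one, and reading is
 * non-destructive, so any number of readers can observe the latest
 * state. */
typedef struct {
    double   value;    /* e.g., the current temperature */
    unsigned version;  /* incremented on every update */
} state_msg_t;

/* Writer side: overwrite the previous version of the state. */
static void state_write(state_msg_t *slot, double v)
{
    slot->value = v;
    slot->version++;
}

/* Reader side: the value is NOT consumed; the slot keeps its contents
 * so that many readers may read the same state value. */
static double state_read(const state_msg_t *slot)
{
    return slot->value;
}
```

After two successive writes, every reader sees only the latest version, and repeated reads return the same value.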
Control Strategy: The decision when a message must be sent can reside either
within the sphere of control of the host computer (external control) or within the
sphere of control of the communication system (autonomous control). The most
common temporal control strategy used in computer networks is external control.
The execution of a "send" command in the host computer causes the transfer of a
control signal across the CNI and initiates the transmission of a message by the
communication system. When a message arrives at the receiver, a control signal
(interrupt) from the communication system crosses the CNI and unblocks a "receive"
command in the receiving host computer.
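The queueing discipline that event messages impose under external control can be sketched as follows; this shows only the FIFO, consume-on-read behavior at the CNI, not the control signals themselves, and all names are illustrative:

```c
/* An event-message queue at the CNI (a sketch; names are illustrative).
 * Under external control, the host's "send" would also raise a control
 * signal across the CNI, and an interrupt would unblock the receiver's
 * "receive". Here only the queueing discipline is shown: events are
 * delivered in FIFO order and each event is consumed exactly once. */
#define QSIZE 8

typedef struct {
    int buf[QSIZE];
    int head, tail, count;
} event_queue_t;

/* "send": enqueue an event; fails if the queue overflows. */
static int event_send(event_queue_t *q, int ev)
{
    if (q->count == QSIZE) return -1;
    q->buf[q->tail] = ev;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    return 0;
}

/* "receive": dequeue the oldest event; the event is consumed on
 * reading, which is why one-to-one synchronization is required. */
static int event_receive(event_queue_t *q, int *ev)
{
    if (q->count == 0) return -1;   /* nothing pending: receiver blocks */
    *ev = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    return 0;
}
```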
In the case of autonomous control, the communication system decides autonomously
when to send the next message, and when to deliver the message at the CNI of the
receiver. The receiving host is not interrupted when a new message arrives because
this would compromise the autonomy of the receiver. Autonomous control is
normally time-triggered. The communication system contains a transmission
schedule, a table of points in time that determine which message must be transmitted
next. If the control of the communication system is autonomous, no control signals
need to cross the CNI. In this case, the CNI is used strictly for sharing data.
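A transmission schedule of this kind can be sketched as a static table; the table contents and tick granularity below are illustrative assumptions, not taken from any particular protocol:

```c
#include <stddef.h>

/* A transmission schedule for autonomous (time-triggered) control
 * (a sketch; the table contents are illustrative). The communication
 * system consults this static table of points in time to decide which
 * message to transmit next; no control signal ever crosses the CNI. */
typedef struct {
    unsigned send_slot;  /* slot within the cyclic round (in ticks) */
    int      msg_id;     /* message to transmit in that slot */
} sched_entry_t;

#define ROUND_LEN 8u     /* length of one round of the cyclic schedule */

static const sched_entry_t schedule[] = {
    { 0u, 10 },          /* e.g., a temperature state message */
    { 2u, 11 },          /* e.g., a pressure state message */
    { 5u, 12 },          /* e.g., a valve-position state message */
};

/* Return the id of the message due at the given global tick, or -1
 * if no transmission is scheduled for that tick. */
static int due_message(unsigned tick)
{
    unsigned slot = tick % ROUND_LEN;
    for (size_t i = 0; i < sizeof schedule / sizeof schedule[0]; i++)
        if (schedule[i].send_slot == slot)
            return schedule[i].msg_id;
    return -1;
}
```

Because the schedule is cyclic, the same message recurs once per round, and the decision to transmit depends only on the progression of time, never on a command from the host.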
The Design Space: The sixteen different combinations of data semantics and
control strategies, at the CNI of the sender and at the CNI of the receiver, are listed in
Table 2.1. Some of these combinations make no sense, and this is indicated by a
"no" in the appropriate field. Of these sixteen combinations, the beginning and the
end of the table are of special significance. They represent event messages and state
messages.
2.1.5 Gateways
The purpose of a gateway is to exchange relative views between two interacting
clusters. A gateway node must have either an instrumentation interface and a
communication interface (interface node), or two communication interfaces (two
CNIs), each interfacing to one of the interacting clusters. An interface node is thus a
special type of gateway node. The gateway host must pass the relevant information
from the CNI of the sending cluster to the CNI of the receiving cluster. In most
cases, only a small subset of the information of one cluster is relevant to the other
cluster. In general, it cannot be assumed that the structure of the messages and the
representation of the information are identical in both clusters. Thus, the gateway host
must transform the data formats of one cluster to those expected by the other cluster.
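Such a transformation can be sketched in C; the message layouts, field names, and units below are illustrative assumptions about two hypothetical clusters:

```c
/* A gateway transformation (a sketch; both message layouts and the
 * units are illustrative assumptions). Cluster A reports temperature
 * in tenths of a degree Celsius; cluster B expects degrees Fahrenheit.
 * The gateway host forwards only the relevant subset of the
 * information and converts the representation. */
typedef struct {
    short    temp_dC;   /* temperature in tenths of a degree Celsius */
    unsigned seq;       /* sequence number, internal to cluster A */
} cluster_a_msg_t;

typedef struct {
    double temp_F;      /* temperature in degrees Fahrenheit */
} cluster_b_msg_t;

static cluster_b_msg_t gateway_translate(cluster_a_msg_t in)
{
    cluster_b_msg_t out;
    double celsius = in.temp_dC / 10.0;
    out.temp_F = celsius * 9.0 / 5.0 + 32.0;  /* change of representation */
    /* in.seq is a cluster-internal detail and is deliberately hidden */
    return out;
}
```

Note that the gateway drops the cluster-internal sequence number: only the subset of information relevant to the other cluster crosses the gateway.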
In a time-triggered (TT) architecture, a gateway has data-sharing interfaces. There are
no control signals passing through a gateway component, i.e., in a TT architecture a
gateway does not reduce the autonomy of control of the interconnected clusters.
Many large real-time systems are not designed from scratch according to a single
master plan, but evolve gradually over many years, using different generations of
hardware and software technology. Gateways are important for designing interfaces to
such legacy systems with reasonable effort. One CNI can be designed to conform to
the data representation and protocol conventions of the legacy architecture, while the
other CNI can be designed according to the rules of the newly-implemented extension
to this legacy system. The gateway thus encapsulates and hides the internal features
of the legacy system and provides a clean and flexible interface.
2.2 COMPOSABILITY
In many engineering disciplines, large systems are built by integrating a set of well-
specified and tested subsystems. It is important that properties that have been
established at the subsystem level are maintained during system integration. Such a
constructive approach to system design is only possible if the architecture supports
composability.
2.2.1 Definition
An architecture is said to be composable with respect to a specified property if the
system integration will not invalidate this property once the property has been
established at the subsystem level. Examples of such properties are timeliness or
testability. In a composable architecture, the system properties follow from the
subsystem properties.
All devices that cooperatively accomplish some function must know what to expect
from the other devices. Without this agreement, the interoperability of designs
cannot be ensured.
2.3 SCALABILITY
Almost all large systems evolve over an extended period of time, i.e., over many
years or even decades. A successful computer system changes its environment, and
this changed environment imposes new requirements on the computer system itself.
Only unsuccessful systems, which are never used, escape change. Evolving
requirements are thus not an exception but the rule. Existing functions have to be
modified, and many new functions have to be
added over the lifetime of the system. A scalable architecture is open to such changes,
and does not limit the extensibility of a system by some predefined upper limit.
2.3.1 Extensibility
A scalable architecture must have no central bottleneck, in either processing
capacity or communication capacity. Only distributed architectures provide the
necessary framework for unlimited growth, since:
(i) Nodes can be added within the given capacity of the communication channel.
This introduces additional processing power to the system.
(ii) If the communication capacity within a cluster is fully utilized, or if the
processing power of a node has reached its limit, a node can be transformed into
a gateway node to open a way to a new cluster. The interface between the
original cluster and the gateway node can remain unchanged (Figure 2.4).
This transformation of a node into a gateway supporting a new cluster requires the
proper design of the name space. This issue is discussed in Section 8.4.1.
2.3.2 Complexity
Large systems can only be built if the effort required to understand the system
operation, i.e., the complexity of the system, remains under control as the system
grows. The complexity of a system relates to the number of parts, and the number
and types of interactions among the parts, that must be considered to understand a
particular function of the system. The effort required to understand any particular
function should remain constant, and independent of the system size. Of course, a
large system provides many more different functions than does a small system.
Therefore the effort needed to understand all functions of a large system grows with
the system size.
As indicated before, the complexity of a large system can be reduced if the inner
behavior of the subsystems can be encapsulated behind stable and simple interfaces.
Only those aspects of the behavior of a subsystem that are relevant to the function
under consideration must be examined to understand the particular function.
The partitioning of a system into subsystems, the encapsulation of the subsystem,
the preservation of the abstractions in case of faults, and most importantly, a strict
control over the interaction patterns among the subsystems, are thus the key
mechanisms for controlling the complexity of a large system.
In a time-triggered architecture, the structure depicted in Figure 2.4 encapsulates the
inner operation of one cluster from that of other clusters via the data-sharing gateway
interfaces. When reasoning about a particular function in a given cluster, the only
knowledge that is needed about the behavior of the other clusters is contained in the
temporal and value attributes of the data at the gateway interface. Because every TT
cluster exercises autonomous control, there are no control signals passing through
these gateway interfaces. In essence, the momentary values of the state variables at
the gateway CNIs to the other clusters form a sufficient abstraction of the
environment for the function under investigation. Such an architecture is scalable,
because the complexity of reasoning about the correctness of any system function is
determined by the cluster under investigation, and does not depend on the number of
clusters in the total system, i.e., on the system size.
2.3.3 Silicon Cost
The expected further advances of microelectronics technology will make the design of
large distributed systems even more competitive when compared with the
implementation of the same functionality using a centralized system.
The implementation of a given specification with a distributed architecture requires
more hardware (i.e., silicon area) than an implementation with a centralized
architecture. The additional hardware is needed for the realization of the
communication system among the nodes and for the replicated implementation of
certain system functions (e.g., the operating system in each node); there is also
additional packaging cost. This extra hardware causes a higher initial investment
for a distributed solution than for a centralized one.
Example: Assume that the implementation of the communication system requires
about 100,000 transistors. If the transistor count in a single-chip node is just
400,000, this amounts to 25% of the silicon area of a node. However, if a node
contains 10,000,000 transistors (a size that is within the reach of today's
technology), then the implementation of the same communication system requires
only 1% of the silicon area of the node (Figure 2.5).
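The arithmetic of this example can be restated in a one-line helper; the function name is illustrative:

```c
/* Relative silicon overhead of the communication system, restating the
 * arithmetic of the example above: the overhead shrinks as the
 * transistor count of a node grows. */
static double comm_overhead_percent(double comm_transistors,
                                    double node_transistors)
{
    return 100.0 * comm_transistors / node_transistors;
}
```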
Although a centralized system with a powerful single CPU starts with a lower initial
cost, the growth curve of the cost, as a function of system size, rises with a larger
gradient for a centralized system than for a distributed system (Figure 2.6).
2.4 DEPENDABILITY
Implementing a dependable real-time service requires distribution of functions to
achieve effective fault containment, error containment, and fault tolerance so that the
service can continue despite the occurrence of faults. The term responsive system has
been introduced to denote a system that has all three attributes: distribution,
fault tolerance, and real-time performance [Mal94].
The probability that an error occurring within an error-containment region is
detected at one of the interfaces of this error-containment region is called the
error-containment coverage.
An error-containment region can be introduced at different levels, e.g., at the level of
a functional hardware block, or at the level of the task. In a distributed computer
system, it is reasonable to regard a complete node as an error-containment region and
to perform the error detection at the node's message interface to the communication
system. At this interface, error detection must be performed in the value domain and
in the time domain.
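The acceptance check at the node's message interface can be sketched as follows; the specification values and names are illustrative:

```c
/* Error detection at the node's message interface (a sketch; the
 * specification values are illustrative). A message is accepted only
 * if its value lies within the specified range (value domain) and it
 * arrives within its specified time window (time domain); otherwise
 * the error is contained at this interface. */
typedef struct {
    double   min_val, max_val;   /* value-domain specification */
    unsigned earliest, latest;   /* time-domain window (in ticks) */
} msg_spec_t;

static int msg_accepted(const msg_spec_t *spec, double value,
                        unsigned arrival)
{
    int value_ok = (value >= spec->min_val && value <= spec->max_val);
    int time_ok  = (arrival >= spec->earliest && arrival <= spec->latest);
    return value_ok && time_ok;
}
```

A message that fails either check is rejected at the interface, so the faulty node's error does not propagate into the rest of the system.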
It is difficult to implement clean error-containment regions in a centralized
architecture because many system resources are multiplexed over many services. For
example, it is not possible to predict the consequences of an error in the hardware of a
central processor on a particular service because such an error can lead to a failure of
any one of the system services that use the central processor.
2.4.2 Replication
In a distributed system, a node must represent a unit of failure, preferably with a
simple failure mode, e.g., fail-silence. All inner failure modes of a node are mapped
into a single external failure mode: silence (see Section 6.1.1). The implementation
of the node must guarantee, with a high probability, that such a failure hypothesis
stipulated at the architectural level remains valid during operation.
Given that this failure hypothesis holds, node failures can be masked by providing
actively replicated nodes. The replicas must show deterministic behavior. The issue
of replica determinism, i.e., ensuring that replicated nodes "visit" the same states at about the same
time, requires careful hardware/software design. It is close to impossible to
implement a deterministic behavior in a large system that supports many concurrent
tasks and relies on asynchronous preemptive scheduling. The topic of replica
determinism is discussed at length in Section 5.6.
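Masking under the fail-silence hypothesis can be sketched as follows; the replica count and names are illustrative assumptions:

```c
/* Masking a node failure under the fail-silence hypothesis (a sketch;
 * the replica count is illustrative). A fail-silent replica either
 * delivers a correct message or nothing at all, and replica
 * determinism guarantees that all delivered values are identical, so
 * the receiver may take the first value that actually arrived. */
#define NREPLICAS 3

typedef struct {
    int    delivered;   /* nonzero if this replica's message arrived */
    double value;
} replica_msg_t;

/* Return 0 and store the masked value if at least one replica
 * delivered; return -1 if all replicas were silent, i.e., the failure
 * hypothesis was violated. */
static int mask_failures(const replica_msg_t r[NREPLICAS], double *out)
{
    for (int i = 0; i < NREPLICAS; i++) {
        if (r[i].delivered) {
            *out = r[i].value;
            return 0;
        }
    }
    return -1;
}
```

Note how the fail-silence assumption does all the work here: because a faulty replica never delivers a wrong value, no voting over the delivered values is required.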
(ii) Hazardous: Fault that reduces the safety margin of the redundant system to an
extent that further operation of the system is considered critical.
(iii) Major: Fault that reduces the safety margin to an extent that immediate
maintenance must be performed.
(iv) Minor: Fault that has only a small effect on the safety margin. From the safety
point of view, it is sufficient to repair the fault at the next scheduled
maintenance.
(v) No Effect: Fault that has no effect on the safety margin.
The key concern in this categorization is the remainder of the safety margin after the
occurrence of a primary fault. The safety case must present trustworthy arguments
that the occurrence of a catastrophic or hazardous fault is extremely unlikely. The
classification of a fault into one of the listed categories is based on rational
arguments relating to the probability of the fault under investigation causing a
catastrophic failure.
The consequence of a fault is an error, i.e., damage to the system state. An error in
a non-safety-critical subsystem must be detected before it can propagate into a
safety-critical subsystem. The issue of error containment thus plays a crucial role.
If it is not possible to demonstrate that the error-containment coverage is very
close to one, i.e., that the consequences of a particular error in a non-safety-critical
function cannot have an adverse effect on a safety-critical system function, then the
error in the non-safety-critical function must be classified as catastrophic. It is of
utmost importance to provide an architecture that rules out by design any unintended
interactions among subsystems of differing criticality. This brings us back to the
issue of composability.
Otherwise, the complete system must be designed and validated to the most stringent
safety requirements, a very expensive endeavor if one considers that the design and
validation of safety-critical software is 2 to 10 times more expensive than the design
and validation of non-safety-critical software.
POINTS TO REMEMBER
It is an established architectural principle that the function of an object should
determine its physical form. Distributed computer architectures enable the
application of this proven architectural principle to the design of computing
systems.
The interface between the communication controller within a node and the host
computer of the node, the communication-network interface (CNI), is the most
important interface within a distributed architecture.
An event message combines event semantics with external control and requires
one-to-one synchronization between the sender and the receiver.
A state message combines state-value semantics with autonomous control.
State-message semantics corresponds naturally to the requirements of control
applications.
The purpose of a gateway is to implement the relative views of two interacting
clusters. In most cases, only a small subset of the information in one cluster is
relevant to the other cluster.
In a time-triggered architecture, a gateway consists of a strictly data-sharing
interface. There are no control signals passing through a gateway component.