
Cellular MultiProcessing and Uniform Memory Access
A White Paper
Unisys
Contents
Introduction
What is SMP?
What is ccNUMA?
How do these relate to clustering?
What is CMP?
Who should care about SMP performance and capacity?
What are the factors affecting application growth?
What are the problems to be solved by SMP design?
What options are available for building an SMP platform?
What does this mean to applications?
Summary
Introduction
With the Unisys e-@ction Enterprise Server ES7000 and its Cellular MultiProcessing (CMP) clustered
architecture, Unisys has created a server product line that is ideally suited to the most demanding of
environments: mission-critical transaction processing. The ES7000's foundation is a crossbar technology
based on extensive and successful experience with mainframe systems in such environments. This
technology has prevailed even in the extremely demanding environments of major banks, airline reservation
systems, and stock exchanges. Unisys is committed to using this same crossbar architecture for its future
ranges of both Unisys e-@ction Enterprise Server ES7000 and ClearPath mainframe systems.
This white paper examines the ES7000 server's crossbar symmetric multiprocessing (SMP) technology
from the customer's viewpoint, comparing it to alternative multiprocessing technologies such as ccNUMA
(cache coherent Non-Uniform Memory Access).
The meaning of terms, such as SMP, CMP, and clustering, is often blurred by press announcements and
by the trade press, so this white paper begins with a few questions and definitions.
What is SMP?
The industry has been using multiprocessing mainframe systems for over thirty years. These systems consist
of a set of shared memory storage units and a number of CPUs (central processing units or processors)
under the control of a single instance of an operating system (OS). So, what is the significance of the term
"symmetric" in SMP?
Until recently, the accepted definition was that the "symmetric" in SMP referred to the roles of the
processors within the operating system. In an SMP system, all CPUs can "see" all of memory and are
capable of executing any task the OS schedules for them. In an "asymmetric" system, one CPU is designated
the master. The master may be the only CPU that can execute the OS, with the others (the slaves) being
restricted to application execution. Such asymmetric systems were a stopgap measure until the OS could be
modified for multiprocessor execution, and they disappeared long ago from the mainframe scene because of
their capacity and resiliency limitations. Within the context of this original academic definition, all the
systems discussed by this paper are SMP systems.
Figure 1: Original definition of Symmetric MultiProcessing (SMP)
With the advent of NUMA (Non-Uniform Memory Access) technologies, hardware vendors found a need to
differentiate NUMA from existing server designs. As the access times to all the memories in the existing
servers were uniform or "symmetric", they were referred to both as UMA (Uniform Memory Access) systems and
as SMPs. The latter name prevailed. Within the context of this new industry definition, the NUMA systems
discussed here are not SMPs.
What is ccNUMA?
The "cache coherent" or "cc" aspects of the name are dealt with later in the paper. In a NUMA system, all
CPUs can see all of memory, but the access of any particular CPU to local memory may be significantly
faster (e.g., 7-10 times) than its access to another CPU's local memory. From an OS point of view, NUMA
systems are "symmetric" in that all processors have equal responsibilities. But from a hardware performance
viewpoint, they are not SMP systems.
Figure 2: Current industry definition of Symmetric MultiProcessing
Figure 3: Non-Uniform Memory Access architecture
How do these relate to clustering?
Systems are clustered in order to improve capacity, availability, or both. Responsibility for a single application
or database is spread over multiple systems. Each system has its own memory and instance of the OS. It is
often referred to as a "node" in the cluster. The nodes are usually SMPs, though that is not a requirement.
Clustered nodes cannot "see" each other's memory, but they communicate by message passing, usually across
some form of LAN. In the case of "failover" clustering, there is only one active instance of the application
and one node acts as a backup in case of a failure in the other. In capacity clustering, an instance of the
application runs in all nodes. Such systems require special locking protocols to resolve database update
conflicts. They also provide additional resilience, as the surviving node(s) will assume the workload of a
failed node.
The nodes in a cluster may be uniprocessor or multiprocessor systems, either SMP or NUMA.
Figure 4: Clustered systems
What is CMP?
Cellular MultiProcessing is a technology pioneered by Unisys and available for the first time in the ES7000
enterprise server. The ES7000 can include up to 32 CPUs. These CPUs can be configured as a single, large
SMP system or they can be partitioned into up to 8 "cells." For example, in a system partitioned into 4
cells, each cell could be an 8-way (8 processor) SMP system, with its own CPUs, dedicated memory, and
instance of the OS. These cells can be configured as independent systems, or they can be interconnected
and clustered around a single application or database.
On many clustered systems, the performance of the message-passing medium is a limiting factor on the
cluster's capacity. In ES7000 CMP-based systems, the cluster can use "shared memory" as an alternative to a
LAN. This memory is additional to the private memory used by the OS instances and is dedicated to message
passing. It is shared by all the nodes in the cluster.
This unique technology enables the nodes to pass messages at memory speeds and eliminates the latency
inherent with LAN hardware connections. Future enhancements may also permit the nodes to directly share
performance-critical data.
Although this is exciting clustering technology, the cluster is only as good as the sum of its component SMP
nodes.
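As a rough illustration of the idea (not Unisys code), the sketch below passes a message between two processes through a POSIX shared-memory segment instead of a LAN. The segment name "/cmp_mailbox" and the message layout are invented for this example.

```c
/*
 * A minimal sketch (not Unisys code) of message passing through shared
 * memory rather than a LAN, in the spirit of the CMP cluster interconnect.
 */
#include <fcntl.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    atomic_int ready;       /* 0 = mailbox empty, 1 = message present */
    char       text[120];   /* message payload                        */
} mailbox_t;

int main(int argc, char **argv) {
    /* Both "nodes" map the same named segment into their address space. */
    int fd = shm_open("/cmp_mailbox", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, sizeof(mailbox_t)) != 0) return 1;
    mailbox_t *box = mmap(NULL, sizeof(mailbox_t),
                          PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (box == MAP_FAILED) return 1;

    if (argc > 1 && strcmp(argv[1], "send") == 0) {
        strcpy(box->text, "transaction commit record");
        /* Release store: the payload is visible before the flag flips. */
        atomic_store_explicit(&box->ready, 1, memory_order_release);
    } else {
        /* Receive at memory speed: no LAN adapter or protocol stack. */
        while (atomic_load_explicit(&box->ready, memory_order_acquire) == 0)
            ;  /* spin until the sender's flag appears */
        printf("received: %s\n", box->text);
    }
    munmap(box, sizeof(mailbox_t));
    close(fd);
    return 0;
}
```

Run once with the argument send and once without to see the exchange; a production design would of course use a ring of message slots and a wakeup mechanism rather than a spin loop.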
Figure 5: Cellular MultiProcessing clustered systems
Who should care about SMP performance and capacity?
Although Moore's Law has given the industry some amazing growth in CPU performance over the decades,
the appetite of applications for MIPS has grown even faster. SMP systems are ubiquitous in both mainframe
and server environments, and the number of CPUs per SMP system has been growing steadily. Anyone
deploying an application or database needs to be concerned about whether SMP technology can keep up
with application growth.
What are the factors affecting application growth?
The major factors affecting growth are the application's design, the OS' capabilities, and the hardware
platform's ability to support both.
The applications of interest are those that lend themselves to simultaneous execution by multiple CPUs.
This may happen because the designers realized that an application would perform better if split into a
number of cooperating concurrent processes. It may also be the result of needing to map hundreds of
incoming transactions onto a number of serving processes. For the purposes of this paper, these are referred to
as "engineered" and "intrinsic" multiprocessing applications. Examples in the "engineered" category are
parallel scientific applications and query and search engines. The "intrinsic" category includes almost all
online transaction processing (OLTP) applications. These are very different applications and designs, but
they share some common needs. They need to share some static data, and they need to be able to synchronize
control data such as input and output queues. In addition, the transaction application needs to be able to
synchronize a high volume of updates to records, logs, audits, etc. In both cases, it is the application
designer's responsibility to minimize bottlenecks in the application. For example, synchronization (locking)
must be at a low enough granularity to avoid contention, and individual processes must have sufficient
capacity to avoid stalling their peers.
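As an illustration of the granularity point, the sketch below (plain C with POSIX threads; the structure and names are invented for the example) protects a shared table with one mutex per hash bucket instead of one global lock, so concurrent updates contend only when they land on the same bucket.

```c
#include <pthread.h>
#include <string.h>

#define BUCKETS 256

struct entry { struct entry *next; char key[32]; long value; };

static struct entry   *table[BUCKETS];
static pthread_mutex_t bucket_lock[BUCKETS];

/* Call once at startup, before any worker threads run. */
void table_init(void) {
    for (int i = 0; i < BUCKETS; i++)
        pthread_mutex_init(&bucket_lock[i], NULL);
}

static unsigned hash(const char *key) {
    unsigned h = 5381;
    while (*key)
        h = h * 33u + (unsigned char)*key++;
    return h % BUCKETS;
}

/* Only the bucket holding this key is locked, so transactions that touch
 * different buckets never wait on each other. */
void add_to_value(const char *key, long delta) {
    unsigned b = hash(key);
    pthread_mutex_lock(&bucket_lock[b]);
    for (struct entry *e = table[b]; e != NULL; e = e->next)
        if (strcmp(e->key, key) == 0) {
            e->value += delta;
            break;
        }
    pthread_mutex_unlock(&bucket_lock[b]);
}
```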
The OS executes on all the CPUs in the system and has many of the same needs for access to shared
data and synchronization of updates. Its designers have similar responsibilities to avoid bottlenecks. In
addition, the OS has the responsibility for mapping application processes across multiple CPUs in a manner
that maximizes performance while satisfying the priority policies of the application environment.
The major factor in the hardware platform's design is the limit in memory performance. Although the
memory performance of systems has improved over the years, it has not matched the adherence of CPUs to
Moore's Law. This has led to increasingly complex memory caching schemes where the CPU keeps a local
copy of the most recently used memory locations. The desire is to place those copies as close to the CPU's
arithmetic units as possible, yet to make the cache large enough that trips to main memory are infrequent.
As a result of these conflicting requirements, CPUs typically have two levels of caching: a very close first
level cache (FLC) holding tens of kilobytes, and a more independent second level cache (SLC) holding hundreds of kilobytes.
The FLC is a store-through cache, so updates get written out to the SLC. The SLC is a "store-in" cache,
which means that updates do not get written out to main memory until the cache needs to replace them
with more recent data. Access to cache memory takes nanoseconds whereas access to main memory takes
around 50 nanoseconds, so the matching of the cache to the current process's memory references (the cache
"hit rate") is critical to the performance of the CPU. Managing the interactions of all the CPU caches in the
system is also one of the major design challenges for building a successful multiprocessor system.
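A back-of-the-envelope calculation, not taken from the paper, shows how strongly the hit rate drives effective access time. It assumes a 3-nanosecond cache access and uses the roughly 50-nanosecond main memory access quoted above.

```c
/* A back-of-the-envelope calculation, not taken from the paper, using an
 * assumed 3 ns cache access and the ~50 ns main memory access quoted above. */
#include <stdio.h>

int main(void) {
    const double cache_ns  = 3.0;    /* assumed cache access time */
    const double memory_ns = 50.0;   /* main memory access time   */
    const double hit_rates[] = { 0.90, 0.95, 0.99 };

    for (int i = 0; i < 3; i++) {
        double h = hit_rates[i];
        /* average access = hits served by cache + misses served by memory */
        double avg = h * cache_ns + (1.0 - h) * memory_ns;
        printf("hit rate %2.0f%% -> average access %.1f ns\n", h * 100.0, avg);
    }
    return 0;
}
```

With these assumed numbers, the average access time falls from about 7.7 ns at a 90% hit rate to about 3.5 ns at 99%, which is why cache management dominates multiprocessor design.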
What are the problems to be solved by SMP design?
CPU affinity
In any system, the OS is constantly responding to events and juggling processes, priorities, and resources to
get maximum performance within the constraints of the customer's priority and fairness needs. Following
an event such as a CPU interrupt, the OS will select the most eligible process for execution on that CPU. In
an SMP system, any process may run on any CPU, but the effect of restarting a previously suspended process
on a different CPU is to divorce it from its cached data. The new CPU will have to run at main memory
speeds until the process has made enough references to reload the data into this CPU's cache.
Responsibility for alleviating this problem falls to the OS. The OS must keep track of which CPU was
last used for the process and try to maintain that relationship. Balancing this with the desire to not let CPUs
go idle is a complex task. It is called "CPU affinity scheduling."
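Most modern operating systems also expose affinity to applications. The sketch below uses the Linux-specific sched_setaffinity() interface purely as an illustration (it is not something from this era of Windows NT or UnixWare); the CPU number is arbitrary.

```c
/* A minimal sketch using the Linux-specific sched_setaffinity() call.
 * The point is only that staying on one CPU keeps a process close to
 * its cached data. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(2, &mask);   /* ask to run only on CPU 2 */

    /* From now on the scheduler keeps this process on CPU 2, so it keeps
     * hitting data already loaded into that CPU's cache instead of
     * refilling another CPU's cache at main-memory speed. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPU 2\n");
    return 0;
}
```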
Figure 6: CPU cache components
Cache coherence
As mentioned earlier, updates do not get written out to main memory until there is a need to replace them
with more recently referenced data. Clearly, this is a problem if two processes share a piece of data and the
last update is visible only to the one that updated it.
It is possible to write applications that are aware of this problem and are willing to explicitly flush the
cache out to main memory when necessary, but these are very specialized, e.g., solutions to differential
equations. Consequently, the platform hardware is expected to solve this problem and hide the mechanics
from the executing applications and much of the OS.
The mechanism used is called "cache coherence." It is a requirement for all general-purpose SMP
systems. It is the "cc" in ccNUMA. (There are some other specialized varieties of NUMA that are not cache
coherent.)
For performance reasons, it is desirable to allow a piece of data to be resident in all the caches that need
it. For data integrity reasons, only one cache must be able to update the data at a time, and the other caches
must find out that a piece of cached data has been updated and that any local copy must be invalidated.
Note that this is a more fundamental issue than the semaphores and locks used by software to ensure the
integrity of shared data. The software's locking instructions rely on the hardware to provide this
fundamental integrity. The mechanisms used to achieve both data integrity and performance are heavily
interdependent with the design of the hardware platform.
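The sketch below makes that dependency concrete. It is a generic C11 example, not the locking code of any particular OS: a spinlock is just an atomic flag in memory, and the loop is correct only because the hardware keeps every CPU's cached copy of that flag coherent.

```c
/* A generic C11 sketch, not the locking code of any particular OS. */
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    /* Atomically set the flag and learn its previous value; spin while
     * some other CPU still holds the lock. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;  /* busy-wait; cache coherence makes the release below visible here */
}

void release(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}
```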
What options are available for building an SMP platform?
This paper is not intended to be an exhaustive study; instead, it compares the three most common designs:
those based on a bus, a Scalable Coherent Interconnect (SCI), and a crossbar.
Bus
The simplest and most common method of building an SMP platform is to attach the CPUs and memory to
a bus. The good news about buses is that the CPUs can easily talk to memory and to each other. The bad
news is that only one message can travel the bus at a time, so it is a potential bottleneck.
The bus used for Intel servers is referred to as a "snoopy" bus, as the CPUs can "snoop" on each other and
see each other's memory requests. This is how cache coherency is handled. If a CPU sees that another CPU
is requesting a memory address that it owns (has the latest copy of the data), the CPU will respond to the
request and deliver the data from its cache. A CPU will request ownership when it wishes to write to an
address, and will keep it until it "ages" the written data out to the memory or until another CPU requests
ownership.
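The following is a deliberately simplified model of those ownership rules, written as C only for clarity; real snoopy protocols (MESI and its variants) track more states and handle many more cases than shown here.

```c
/* A deliberately simplified model of the ownership rules described above. */
typedef enum { INVALID, SHARED, OWNED } line_state;

/* What this CPU's copy of a cache line becomes when it snoops another
 * CPU's request for the same address on the bus. */
line_state on_snoop(line_state mine, int requester_will_write) {
    if (mine == OWNED) {
        /* We hold the latest data, so we supply it from our cache, then
         * keep a shared copy if the requester only reads, or invalidate
         * our copy if the requester is taking ownership to write. */
        return requester_will_write ? INVALID : SHARED;
    }
    if (mine == SHARED && requester_will_write)
        return INVALID;   /* our copy is about to become stale */
    return mine;
}
```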
If a process is switched from one CPU to another, the second CPU will suffer from cache misses until it
has loaded the process's data from the original CPU. The CPU affinity software alleviates this situation by
attempting to keep the process on the same CPU.
This approach has been a very successful configuration for 4-way (four CPU) servers, with many vendors
taking advantage of Intel's "quad board." It does, however, have limitations that keep it out of the high
volume transaction processing environment.
The first of these limitations is its lack of scalability. The 4-way configuration appears to be optimal.
Adding more CPUs increases bus contention, and increasing the length of the bus to accommodate the extra
CPUs also slows the bus down. Although some good 6-, 8-, and 12-way systems have been built around high
performance "heroic" buses, these buses have been rapidly made obsolete by improvements in CPU
performance.
The other limitation is poor availability. A failure in one of the components can make the bus unusable
by the others. It is hard to determine which component has failed, particularly if the OS cannot run, and
the whole system will be unavailable during repair and test. In contrast, both ccNUMA and crossbar
designs have components ("quads" and "subpods") that can be isolated, diagnosed, and in some cases
repaired, while the rest of the system still continues to support the OS and the application.
Figure 7: Quad board architecture
SCI and ccNUMA
The price/performance of the Intel quad board is so attractive that there is a natural desire to string
them together in some fashion and build an inexpensive, high-capacity system. Intel has done
just this with the Profusion chip set, which couples two buses for an 8-way server. But this
technology does not extend beyond an 8-way to building a 16- or 32-way system.
The Scalable Coherent Interconnect technology addresses this need. Multiple quad boards
are attached to the SCI and configured as a large, multiprocessor system.
Figure 8: ccNUMA 12-way system
Though this may look like a LAN connection, it is not. The SCI is a bit-serial ring that runs at one gigabyte
per second with very low latency. The system looks like a cluster, but it is not. There is only one instance of
the OS, and the OS "sees" a single, contiguous, large memory spread over all the quad boards. From an OS
viewpoint, this is an SMP system as all the CPUs have equal responsibility. But clearly the CPUs' access to
the individual memories and to each other's cached data is not symmetric. A CPU takes 7 to 10 times longer
to access data in the remote memories than it does to access local data. This is why these systems are
referred to as NUMA, Non-Uniform Memory Access. This NUMA condition can be partially alleviated by
additional hardware and software.
CPU affinity scheduling becomes more complex with NUMA configurations. The desire to keep a
process on one CPU is unchanged, but the overhead of reloading a process's current data from a remote
processor is much higher than from a local one. Consequently, the affinity-scheduling algorithm must be
enhanced to prefer rescheduling a process on a local processor if the original is not available.
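A minimal sketch of such an enhanced policy is shown below. The preference order (same CPU, then an idle CPU on the same quad board, then anywhere) follows the description above; the quad size and all names are illustrative.

```c
/* A minimal sketch of the preference order described above. */
#define CPUS_PER_QUAD 4

int pick_cpu(int last_cpu, const int idle[], int ncpus) {
    if (idle[last_cpu])
        return last_cpu;                       /* best: warm CPU caches      */

    int quad_base = (last_cpu / CPUS_PER_QUAD) * CPUS_PER_QUAD;
    for (int c = quad_base; c < quad_base + CPUS_PER_QUAD; c++)
        if (idle[c])
            return c;                          /* good: local memory and TLC */

    for (int c = 0; c < ncpus; c++)
        if (idle[c])
            return c;                          /* last resort: remote quad   */

    return last_cpu;                           /* nothing idle: queue here   */
}
```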
The problem of memory access is much harder to overcome. It can be made less critical by adding a third
level of caching to each quad board. This third level cache (TLC) is a store-through
cache for the rest of memory. If applications can be confined largely to one quad board or have a low
incidence of writes (for example, a Web server with static pages), this cache works quite well. However,
maintaining cache coherency on writes is much more complicated than with the snoopy bus.
The CPU must get ownership from a local or remote cache, CPU, or memory, and a write cannot
complete until all copies of the old contents have been invalidated. As the SCI is a "daisy chain," requests
and responses are passed serially from quad to quad.
It is this aspect of NUMA performance that prompted Unisys to design a memory architecture that
better supports its primary market: high-volume transaction processing.
Crossbar
Unisys was an early investor in SCI technology and considered using ccNUMA for its premier line of
Microsoft Windows NT and UnixWare servers, but it was apparent that it would not be sufficient for the
target market: mission-critical, high-volume transaction processing. Consequently, Unisys looked to its
mainframe systems for a solution. These systems have direct connections between CPUs and memories and
exhibit UMA, uniform memory access. They do this by designing for the maximum system from the
ground up. In this case, it meant designing the infrastructure for a 32-way SMP system.
First, Unisys abandoned the quad board componentry. Although it was very attractive on a cost basis,
the additional traffic generated in order to maintain cache coherency with the other quad boards' CPUs
makes the quad board's single bus a potential bottleneck. Instead, a Unisys "quad" (known as a "subpod")
has two buses and only two processors per bus. The four CPUs share an additional "third level" cache
(TLC). The TLCs are much larger than the CPUs' caches (16 megabytes initially, and 32 megabytes for the
64-bit CPUs), and they are about five times faster than main memory. A system can have 8 TLCs for a total
of 128 megabytes of cache.
The third level caches (TLCs) are attached to all the system's memories by a non-blocking crossbar. In
diagrams, crossbars often look like buses, but they do not suffer from the same capacity and availability
limitations. The TLCs have their own discrete path to each of the memories. The diagram below has been
expanded to show the discrete connections.
Figure 9: Crossbar architecture
The memories themselves have a typical mainframe design, able to handle hundreds of requests concurrently.
Even higher performance can be achieved by interleaving requests between the memories. When interleaving
is enabled, data can be spread across all four memories (e.g., bytes 0-63 from memory 1, bytes 64-127 from
memory 2, etc.), and the CPU can fetch from all four in parallel. This has the effect of spreading software
"hotspots" across all available memory for maximum performance.
Maintaining cache coherency is still a complex issue, but these systems have the advantage of being able
to handle invalidation requests in parallel via the crossbar, rather than sequentially via an SCI.
Figure 10: Full 32-way SMP system
What does this mean to applications?
In an ideal world, applications would simply scale up to take advantage of a platform's additional MIPS,
without regard for the number of CPUs or the organization of the memories. After all, the "tuning" of
applications for a particular hardware architecture can be difficult and expensive. The only tuning
enthusiasts are vendors competing for benchmark supremacy.
Earlier in this paper, SMP applications were divided into two broad categories, "engineered" and
"intrinsic." Engineered applications have tasks that can be spread across a number of CPUs and so executed
in parallel. As the spread is artificial, it may be possible to arrange the data in such a way that most of a
CPU's references are to ccNUMA's local memory. However, this is significantly more complex than mapping
tasks onto threads and eligible CPUs. With ccNUMA systems, it now matters which CPU is used for a
thread. There are also some difficult sizing questions, such as what to do if the workload grows beyond an
integral number of quad boards, or how to handle configuration changes. The mapping has to be
reengineered if part of the ccNUMA system is offline for maintenance or if the system is upgraded with
more CPUs, more powerful CPUs, or more memory. Essentially, tuning a ccNUMA system is a multi-
dimensional problem, whereas mapping an engineered application onto a crossbar-based system like the
ES7000 is a single-dimensional problem.
Intrinsic applications, in particular transaction systems, present a further layer of complexity. With any
SMP application, the challenge is to map incoming transactions from hundreds or thousands of end users
onto the available CPUs. Very often, data access patterns will limit the number of CPUs that can practically
be used. The protection of data, the use of database locks, and single thread algorithms may impose a
natural limit on the concurrency of the application, the database manager, or the OS. Beyond this limit, the
addition of CPUs will be ineffective and may result in a drop in performance because of contention costs
(queuing).
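One rough way to put numbers on that natural limit, though the paper itself does not use it, is Amdahl's law: if even a small fraction of the work is serialized by locks or single-threaded code, the achievable speedup flattens quickly, and contention costs can then push it back down. The figures below assume a 5% serialized fraction purely for illustration.

```c
/* A rough illustration, not taken from the paper: Amdahl's law, assuming
 * 5% of the work is serialized by locks or single-threaded code. */
#include <stdio.h>

int main(void) {
    const double serial = 0.05;          /* assumed serialized fraction */
    const int cpus[] = { 4, 8, 16, 32 };

    for (int i = 0; i < 4; i++) {
        int n = cpus[i];
        double speedup = 1.0 / (serial + (1.0 - serial) / n);
        printf("%2d CPUs -> at most %.1fx speedup\n", n, speedup);
    }
    return 0;
}
```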
High-volume transaction applications have two characteristics that may give problems on ccNUMA
systems. The first is the sharing of common data. Data that is frequently accessed and updated by the
transactions has no obvious home and will be remote most of the time. The second is the temporal and
seasonal variation in workloads. A system that is tuned for the average day may perform badly at the
morning opening or in the holiday rush. The application can be retuned for differing loads, but changing
the tuning dynamically (and automatically) is a pipe dream.
Can the operating system help out in this situation? Most operating systems have grown up in a UMA
environment, or at least one in which there is little variance in memory access times. Though it may be
possible to plug ccNUMA-specific algorithms into an "open" OS like Microsoft Windows NT, the majority
of the OS has been designed with the assumption of a UMA environment.
The one advantage that ccNUMA systems have over crossbar systems is their additional flexibility in
adding CPU power. A crossbar-based system is designed for a specific maximum configuration, 32 CPUs in
the case of the ES7000, whereas ccNUMA systems can be expanded by adding more quads to the SCI. The
argument is that the ccNUMA inefficiencies can be offset by using hundreds of relatively cheap components.
The problem with this argument is there are few applications and even fewer operating systems that can take
advantage of that much parallelism. Note that Windows 2000 is the first Microsoft server operating system
to support configurations of 16 and 32 CPUs.
Summary
                              ccNUMA                 Crossbar
Third level cache             For remote memories    For all memories
Number of CPUs                Up to 256              Fixed upper limit, e.g., 32
Can be isolated on failure    Yes                    Yes
Cache coherency               Daisy-chained          Parallel
Interleave between memories   No                     Yes
CPU affinity                  Complex                Simple
Application tuning            Multi-dimensional      Single-dimensional
Specifications are subject to change without notice.
Unisys is a registered trademark and e-@ction is a trademark of Unisys Corporation. Intel is a registered trademark
of Intel Corporation. Microsoft is a registered trademark and Windows NT and Windows 2000 are trademarks of
Microsoft Corporation. UnixWare is a registered trademark of Santa Cruz Operation, Inc. All other trade names are
the exclusive property of their respective owners.
© 1999 Unisys Corporation
All rights reserved.
Printed in U S America 12/99
