Bullion-Efficient Server Architecture For Virtualization

An efficient server architecture
for the virtualization of business-critical applications

This white paper presents the different server architectures available on
the market to virtualize business-critical applications.
A particular focus is provided on the architecture designed by Bull
and implemented in its bullion server to virtualize business-critical
applications on a massive scale.
www.bull.com
Contents
Scale-out and scale-up architectures
The two main scale-up server architectures
BCS: Bull's implementation of a glued architecture
bullion server based on BCS
VMware vSphere 5 on bullion
Conclusion
Introduction
By the end of 2012, over 50% of applications running on x86 platforms will be virtualized.
This figure illustrates the massive interest that businesses are showing in a technology which
guarantees more flexible and lightweight infrastructure, and is indispensable in a smart
switch-over to Cloud computing.
However, if you look at the number of applications in use (a figure set to expand four or five times
by 2015, according to Gartner) and at the fact that only 20% of mission-critical applications have
been virtualized so far, there is still a long way to go.
Indeed, it is gradually becoming clear that the most commonly used virtualization technologies,
which depend on large numbers of blade servers, only partially meet the demand for virtualization
on a massive scale, as well as the high availability needs of critical applications and power-hungry
applications such as ERP implementations and large databases.
Moreover, the deployment of multiple physical servers for virtualization purposes may increase
the instability and the complexity of virtualized infrastructures, thus creating new problems for IT
departments. Consequently, the choice of the target architecture for virtualization is essential for the
enterprise.
There are several types of server architecture dedicated to virtualization, and each one has its own
advantages and drawbacks.
In this white paper, a particular focus is provided on the architecture designed by Bull and
implemented in its bullion server to virtualize business-critical applications on a massive scale.
Scale-out and scale-up architectures
As enterprises move to the second wave of virtualization, on the road to private cloud computing, extending the agility of
virtualization to business-critical applications is placing today's enterprise virtualization clusters under enormous stress.
Challenges include:
Inadequate scaling of compute capabilities: scalable solutions are required to effectively handle increasingly complex
and demanding workloads and support exponential data growth over the life of business-critical applications.
Insufficient reliability: resilient solutions are required in particular to handle high-density virtualization. The ability to
ensure availability can have a clear economic benefit. Downtime can result in revenue loss, damage to company
reputation, and lower employee productivity.
Increased management complexity and operating cost: underutilized IT resources consume too much space, and cost
too much to power, cool, maintain, administer, and service. Efficient solutions exist to enable you to free up and redirect
operational dollars into business innovations that will strengthen competitiveness.
Methods of adding more resources for a particular application fall into two broad categories: scale-out (scale horizontally)
and scale-up architectures (scale vertically).
Scale-out
To scale horizontally (or scale out) means to add more nodes to the infrastructure, such as adding a new computer
to a distributed software application. An example might be scaling out from one Web server system to three.
As computer prices drop and performance continues to increase, low-cost "commodity" systems can be used for
nearly any type of requirement. Hundreds of small computers may be configured in a cluster to obtain the
aggregate computing power required to serve its users.
Scale-up
To scale vertically (or scale up) means to add resources to a single node in an infrastructure, typically
involving the addition of CPUs or memory to a single computer. Scale-up architectures provide a larger reservoir
of resources (I/O, memory and processing power). With this architecture even the largest database can be run
efficiently in production. In addition, with virtualization technologies this server can be sliced into pieces
and support a multitude of applications on the same physical architecture, which results in fewer components
(networking, storage and servers) inside the IT infrastructure.
The two main scale-up server architectures
To handle increasingly demanding workloads, sockets are gradually added in a seamless way within a single server.
The sockets are connected together, as well as to the memory and the I/O boards. Applications are thus able to benefit from
more and more compute power, memory, I/O and networking capabilities. However, no bottleneck should slow down the
applications: the performance obtained must be in line with the resources added. Otherwise the scale-up
architecture loses its purpose.
There are two broad scale-up server architectures: the glueless architecture and the glued architecture.
The glueless architecture

The glueless architecture is designed by Intel and implemented in the Intel Xeon E7 series. When building
servers above four sockets, the sockets are directly connected together through QPI links. The same links
are used to access the memory, I/Os and networks. Each socket in a glueless 8-socket system uses one of
these links to connect to I/O and the remaining three QPI links to interconnect the processor sockets.
This forces a topology in which each processor socket connects directly to three other sockets, while its
connections to the other four processor sockets are indirect.
4-socket glueless architecture
Advantages of a glueless architecture

One of the major advantages of this architecture for server manufacturers is that there is no need
to develop specific components, and therefore no need for additional skills or expertise. If a
manufacturer can build an Intel 2-socket server, it can build a 4- to 8-socket server.
Consequently, another advantage is the cost of the server itself (not to be confused with the total
cost of ownership, which is probably not in favor of glueless servers).
8-socket glueless architecture
Drawbacks of a glueless architecture

Glueless architectures are limited in scalability because caches need to stay coherent; to keep them so,
these architectures lose up to 65% of their bandwidth. With this architecture, servers can only scale
to 8 sockets.
Consequently, each time the number of processor sockets is increased, the performance obtained is not in
line with the resources added. As a result, the price/performance ratio deteriorates sharply as resources
are added. Although glueless server architectures make it possible to benefit (to a limited extent) from
supporting a larger number of VMs, or from virtualizing large databases, the efficiency is not optimal.
In a way, you don't get good value for money.
In addition, these architectures are not optimized to offer the Quality of Service, provided by enhanced
RAS features, that is required to serve business-critical applications efficiently.
Connecting multiple sockets
Each processor socket has four QPI links. A glueless 8-socket system implementation uses one of these
links to connect the socket to I/O and the remaining three QPI links to interconnect the processor
sockets. This forces a topology in which each processor socket connects directly to three other sockets,
but its connections to the other four processor sockets are indirect, requiring multiple hops for any
request to those processors.
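The hop counts described here can be illustrated with a small sketch. It assumes, purely for illustration, that the eight sockets are wired as a cube (one possible topology with three links per socket; the actual wiring of a given server may differ) and uses a breadth-first search to count QPI hops from socket 0. The names `cube_links` and `hops_from` are hypothetical.

```python
from collections import deque

def cube_links(n_sockets=8):
    # Assumed topology: sockets are cube corners; two sockets are linked
    # when their 3-bit indices differ in exactly one bit (3 links each).
    return {s: [s ^ (1 << b) for b in range(3)] for s in range(n_sockets)}

def hops_from(source, links):
    # Breadth-first search: number of link hops from source to every socket.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for peer in links[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    return dist

links = cube_links()
dist = hops_from(0, links)
direct = [s for s, d in dist.items() if d == 1]
indirect = [s for s, d in dist.items() if d > 1]
print(len(direct), "direct peers,", len(indirect), "indirect peers")
```

Under this assumed wiring, every socket has three direct peers and four peers that can only be reached through at least one intermediate socket.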
Cache coherency
To achieve cache coherency, a read request
must be reflected to all other processor caches
as a snoop. It can be compared to doing a
broadcast on an IP network.
Each processor must check for the requested
memory line and provide the data if it has
the most up-to-date version. If the read was
for exclusive access, then all other caches
must also invalidate their copies. If the modified line is available in another cache, this
source snooping protocol provides the minimum latency when the line is copied from one cache
to the next. However, this solution has limited scalability for many workloads, especially
when virtualizing Java applications or running large databases (Big Data) or latency-sensitive
applications.
In a source snooping coherency protocol all
reads result in snoops to all other caches.
This consumes link and cache bandwidth as
these snoop packets use cache cycles and link
resources that could otherwise be used for
data transfers. These source snoops can also
impact memory latency when snoops or snoop
responses require multiple hops. A source snooping memory controller cannot return the
memory data until it has collected all the snoop responses and is sure that no cache holds a
more recent copy of the memory line. Accessing local memory is fast enough that it is the multi-hop
snoops and snoop responses, rather than the memory read itself, that delay the delivery
of the data. In 8-socket glueless systems, snoops can consume up to 65% of the bandwidth.
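To make the cost of source snooping concrete, here is a deliberately simplified count of coherency messages. It assumes one snoop and one snoop response per other socket for every read, and ignores hops, message sizes and caching effects; the function name `source_snoop_messages` is hypothetical.

```python
def source_snoop_messages(n_sockets, reads):
    # Each read is broadcast as a snoop to every other socket,
    # and each of those sockets sends one response back.
    snoops = reads * (n_sockets - 1)
    responses = reads * (n_sockets - 1)
    return snoops + responses

for n in (2, 4, 8):
    print(n, "sockets:", source_snoop_messages(n, reads=1000),
          "coherency messages per 1000 reads")
```

In this model the coherency traffic per read grows with the number of sockets, which is the effect the text describes for glueless systems.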
The "glued" architecture based on node controllers

The glued architecture uses node controllers
to interconnect CPU sockets.
Communication and coordination optimization
When a system scales to a large number of interconnected processors, the communication
and coordination between the processors grows far faster than the number of processors,
creating a system bottleneck. To optimize this communication and coordination, hardware
manufacturers have added a "glue" to the architecture: the node controller.
Depending on the quality of the node controller, these manufacturers are able to provide
server architectures that actually scale in step with the resources added to the system,
because bandwidth and CPU cycles are used for their intended purpose rather than for
maintaining cache coherency, as in glueless architectures.
Performance benefits
In a ccNUMA system, the hardware ensures
cache coherency by tracking where the most
up-to-date data is for every cache line held
in a processor cache. Latencies between
processor and memory in a ccNUMA system
vary depending on the location of these two
components in relation to each other and the
quality of the node-controller.
When scaling to eight processors, the glued,
node controller implementation provides
performance benefits beyond those offered by
a glueless implementation.
Reduction of average memory latency and bandwidth consumption
A node-controller-based scale-up x86 system using the Intel Xeon E7-4800 series, with its
embedded memory controller, is a Cache Coherent Non-Uniform Memory Access (ccNUMA) system.
The Intel Xeon E7-4800 series uses the source snooping variant of the Intel QuickPath
Interconnect (QPI). The design goal of a node controller architecture is to reduce average
memory latency and minimize the bandwidth consumed by coherency snoops. By contrast, glueless
architectures are limited in scalability because caches need to stay coherent, and to keep
them so these architectures lose up to 65% of their bandwidth.
Glued architecture
Glueless architecture versus Bull's node controller architecture

                                              Glueless       Bull's glued architecture
                                              architecture   based on a node controller
Maximum number of processors supported        8              16
Processor socket configurations supported     4, 8           2, 3, 4, 6, 7, 8, 10, 11, 12, 14, 15, 16
Number of processor cores                     80             160
Number of processor hops to address memory
% of bandwidth required for cache coherency   65%            5-10%
Maximum memory support (with 16GB DIMMs)      2TB            4TB
Max SPECint_rate performance
  in 4-socket configuration                   1100           1100
  in 8-socket configuration                   < 2000         2200
  in 16-socket configuration                  NA             4100
BCS: Bull's implementation of a glued architecture
The Bull Coherence Switch (BCS) Architecture is Bull's implementation of the glued node controller architecture.
The BCS architecture is the design foundation for servers that need to deliver more scalability, resiliency, and efficiency to
meet the requirements of the most demanding applications in high performance computing as well as in business computing.
In business computing, the BCS technology is the foundation of bullion servers dedicated to virtualization and critical
applications (see chapter 5 for more information on bullion servers). In high performance computing, the BCS technology is
the foundation of the bullx Supernodes series, designed to run HPC applications that require huge volumes of shared resources,
in particular shared memory.
BCS Architecture
The BCS enables two key functionalities: CPU caching and the resilient node controller fabric.
These features serve to reduce communication and coordination overhead and provide
availability features consistent with those of the Intel Xeon E7-4800 series processors.
The BCS meets the most demanding requirements in terms of performance, RAS features and ease
of use. Servers based on the BCS Architecture scale to sixteen processors, supporting up to
160 processor cores and up to 320 logical processors with Hyper-Threading enabled. The server
processing capacity is balanced with 256 DDR3 DIMM slots, for a maximum of 4TB of memory using
16GB DIMMs, and up to 24 I/O slots.
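The headline figures quoted for a fully populated system follow from simple arithmetic, which the short sketch below reproduces. The constant names are illustrative; the value of two threads per core is inferred from the 160-core and 320-logical-processor figures.

```python
SOCKETS = 16
CORES_PER_SOCKET = 10      # up to ten cores per socket
THREADS_PER_CORE = 2       # with Hyper-Threading enabled (inferred: 320 / 160)
DIMM_SLOTS = 256
DIMM_SIZE_GB = 16

cores = SOCKETS * CORES_PER_SOCKET
logical_processors = cores * THREADS_PER_CORE
memory_tb = DIMM_SLOTS * DIMM_SIZE_GB / 1024

print(cores, "cores,", logical_processors, "logical processors,", memory_tb, "TB of memory")
```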
The node controller developed by Bull plays a critical role in enabling the bullion system to
scale from 4 to 16 processors with availability features consistent with the Intel QPI fabric
and appropriate for a scale-up 16-socket system.
BCS architecture with 16 sockets

BCS architecture
As shown in the above figure, the bullion architecture can scale up to sixteen processor
sockets; each socket accepts a processor with up to ten cores. Each bullion module contains
four processor sockets. Each module groups the processor sockets into a single QPI island of
four directly connected CPU sockets. This direct connection provides the lowest latencies.
Each node controller stores information about all data located in the processor caches. This
functionality is called CPU caching.
BCS key technical characteristics

ASIC chip of 18x18mm with 9 metal layers
321 million transistors
1837 ball connectors
6 QPI and 3x2 X-QPI links
Power-conscious design with selective power-down capabilities
Aggregated data transfer rate of 230GB/s
Up to 300Gb/s bandwidth
The international SPECint_rate2006 benchmark, run on a bullion 16-CPU Intel Xeon E7-4870
configuration, highlighted the exceptional capabilities of bullion, which is almost
twice as powerful as its fastest competitors.
bullion's BCS technology demonstrates its superior power compared with the glueless
technology adopted by most competitors: it enables bullion to deliver 120% more power.
In addition, the performance achieved is 30% higher than that of Oracle Sparc servers
in the same category.
Source: Specint_rate 2006 benchmark July 2012
Performance testing using industry-standard benchmarks confirms the advantages of the
Bull BCS Architecture. By reducing processor overhead, the bullion server delivers better
system performance for a diverse set of workloads than competitive 8-socket and
larger systems.
The benchmark, which was run on a 16-socket configuration, highlights bullion's
exceptional capabilities. The bullion server clearly demonstrated its superior power
compared with its competitors, while also offering greater scalability and lower energy
consumption.
The result posted by the bullion server is tangible proof of the extreme efficiency of its
architecture and of the Bull Coherence Switch (BCS) technology, reassuring users of bullion
servers that they are getting the full benefit of the power of the Intel Xeon processors.
Source: Specint_rate 2006 benchmark July 2012
Uncontested performance

The 6 fastest x86 business servers

Supplier          System                   Performance   Architecture
Bull              bullion                  4110          x86
Hewlett-Packard   ProLiant DL980 G7        2180          x86
Fujitsu           PRIMEQUEST 1800E2        1890          x86
Oracle            Sun Fire X4800           1380          x86
IBM               System x3850 X5          1250          x86
Cisco             UCS C460 M2              1160          x86

bullion vs RISC servers

Supplier          System                   Performance   Architecture
Fujitsu Oracle    Sparc Enterprise         3150          RISC
Hewlett-Packard   HP Integrity Superdome   1650          EPIC
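The relative figures quoted in the text can be checked against the scores listed above. The sketch below takes three of those scores and computes by what percentage bullion exceeds each system; the helper name `gain_over` is hypothetical.

```python
scores = {
    "bullion": 4110,
    "ProLiant DL980 G7": 2180,
    "Sparc Enterprise": 3150,
}

def gain_over(system, reference="bullion"):
    # Percentage by which the reference score exceeds the given system's score.
    return (scores[reference] / scores[system] - 1) * 100

print(round(gain_over("ProLiant DL980 G7")), "% over the fastest x86 competitor")
print(round(gain_over("Sparc Enterprise")), "% over the Sparc Enterprise system")
```

The first ratio corresponds to "almost twice as powerful", and the second to the 30% advantage over the Sparc system.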
Enhanced system performance with CPU caching
CPU caching provides significant benefits for system performance:
Minimized inter-processor coherency communication and reduced latency to local memory:
processors in each 4-socket module have access to the smart CPU cache state stored in the
node controller, thus eliminating the overhead of requesting and receiving updates from all
other processors.
Dynamic routing of traffic: when an inter-node-controller link is overused, Bull's
dynamic routing avoids performance bottlenecks by routing traffic through the
least-used path. In this way, the system uses all available lanes and maintains full
bandwidth.
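The idea of routing through the least-used path can be sketched as follows. This is not Bull's actual routing algorithm, which the paper does not detail; it is a minimal illustration that picks, among candidate paths, the one whose busiest link is the least loaded. The link names and utilization figures are made up.

```python
def pick_path(paths, utilization):
    # Choose the candidate path whose busiest link is the least loaded.
    return min(paths, key=lambda path: max(utilization[link] for link in path))

# Hypothetical utilization (0.0 to 1.0) of three inter-module links.
utilization = {"A-B": 0.9, "A-C": 0.2, "C-B": 0.3}
paths = [("A-B",), ("A-C", "C-B")]

print(pick_path(paths, utilization))
```

Here the direct link A-B is heavily used, so the sketch selects the detour through C.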
With the Bull BCS architecture, thanks to CPU caching, coherency snoops and responses
consume only 5 to 10% of the bandwidth of the QPI links and of the switch fabric. The Bull
implementation provides latency for local memory access comparable to that of a traditional
4-socket system, and 44% lower latency than an 8-socket glueless system. The bullion
architecture diagram shows the 4-socket modules, which form source snooping QPI islands,
together with the BCS.
By recording when a cache in a remote 4-socket module has a copy of a memory line,
the node controller can respond on behalf of all remote caches to each source snoop.
This not only stops snoop traffic from consuming bandwidth on the links and in remote
socket caches, it also reduces the memory latency in cases where the data is not held in
any other cache.
Via the eXtended QPI (X-QPI) network, each 4-socket module communicates with the other
three modules as part of a 16-socket system.
Within a 4-socket source snooping module, no snoop ever has to hop across an intermediate
QPI link between the requesting core, the CPU cache in the BCS, and the memory controller.
Therefore all accesses to local memory have the bandwidth and latency of a traditional
4-socket system. Through its tagging of remote ownership of memory lines, the BCS targets any
remote access to the specific location of the requested memory line. With the BCS Architecture,
bullion uses the four QPI links to connect the four processor sockets inside one module, like
any traditional 4-socket server, and provides 2x more eXtended QPI links to interconnect the
different 4-socket modules, whereas a glueless 8-socket system has four QPI links.
In addition, CPU caching uses the links more efficiently because it reduces the overhead
of cache coherency snoops. Because local memory latency is lower than on glueless
8-processor systems, virtual environments will have higher performance on bullion. With
the NUMA-aware VMware vSphere 5, system performance will scale nearly linearly.
bullion measured performance vs the maximum theoretical performance
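The CPU caching mechanism described in this section, in which the node controller records which remote modules hold a copy of a memory line and answers snoops on their behalf, can be sketched as a simple directory. This is a conceptual illustration only, not the BCS implementation; the class and method names are hypothetical.

```python
class NodeController:
    def __init__(self):
        # memory line address -> set of remote modules holding a copy
        self.remote_holders = {}

    def record_remote_copy(self, line, module):
        self.remote_holders.setdefault(line, set()).add(module)

    def snoop_targets(self, line):
        # Answer on behalf of remote caches: only modules known to hold
        # the line need to be contacted; an empty set means none do.
        return self.remote_holders.get(line, set())

bcs = NodeController()
bcs.record_remote_copy(0x1000, module=2)
print(bcs.snoop_targets(0x1000))   # remote access targeted at module 2 only
print(bcs.snoop_targets(0x2000))   # no remote copy: no remote snoop needed
```

Compared with a broadcast to every socket, a lookup of this kind keeps snoop traffic off the inter-module links whenever no remote cache holds the line.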
Tackle the availability challenge

The BCS architecture makes it possible to drastically reduce the number of servers within
the data center and to greatly simplify maintenance and operation tasks. This type of
architecture requires highly reliable machines, i.e. machines with built-in redundancy.
bullion fully meets these requirements.
In addition, bullion's reliability is based on native RAS (Reliability, Availability and
Serviceability) components and technologies. Such technologies are not usually available on
standard x86 servers, and they significantly boost server availability rates.
Reliability
Resilient system fabric
The Bull BCS Architecture extends the advanced reliability of the Intel Xeon processor E7-4800
series in bullion with a resilient eXtended QPI (X-QPI) fabric. This interconnect fabric
provides higher interconnect bandwidth to improve performance and scaling, and availability
features consistent with the QPI fabric. The BCS X-QPI fabric enables:
No more hops to reach the information inside any of the other processor caches.
Redundant data paths: the X-QPI fabric's provision of 100% more interconnect links
(eight here versus four in most competitive 8-socket systems without a node controller)
improves system performance by providing more bisection bandwidth and dynamically
balancing the traffic on the links.
The fabric redundancy also helps reduce unscheduled downtime. A failure of an X-QPI
link is automatically resolved by using the redundant link. Should, in the most extreme
case, a complete module fail, the system reboots automatically. It then re-initializes
and routes around the failure, allowing the service event to be scheduled for a
convenient time (rather than requiring immediate service to get the server back up).
Rapid recovery: improved error logging and diagnostics information means that
administrators can easily take corrective actions. If a fatal error occurs, the bullion
server captures the error log on the reboot to assist in diagnosis. This log can optionally
be transferred automatically to iCare, for faster diagnostics or auto-surveillance. With
the system running, the administrator can then use the log information to diagnose the
error and rapidly determine which repair assemblies are needed.
High-speed link resiliency

The BCS Architecture of bullion uses multiple high-speed links:
Intel QuickPath Interconnect (QPI): connects processor to processor, processor
to I/O hub, and processor to the BCS.
Intel Scalable Memory Interconnect (SMI): connects the processor to the Intel 7510
memory buffer (Millbrook-2) for DDR3 memory.
The eXtended QPI (X-QPI) fabric: used by the BCS to interconnect the different BCS
modules with each other.
Bull designed the BCS to have RAS features consistent with Intel's QPI RAS features.
The point-to-point links (QPI, SMI and BCS fabric) that connect the chips in the bullion
system have many reliability, availability, and serviceability (RAS) features in common.
These features, including Cyclic Redundancy Check (CRC), link level retry, link width
reduction (LWR), and link retrain, work together to keep the system operational in the
presence of link errors. They demonstrate the level of RAS sophistication in bullion.
Normally, transaction requests such as memory reads, and responses such as data returned
from reads, are sent to their destinations through these narrow (typically 10 to 20 bits
wide) but high-speed (4.8 to 6.4GT/s) links. A CRC (Cyclic Redundancy Check) accompanies
each transaction request and response so that link errors, which could affect many bits,
are detected.
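The principle of CRC-based error detection can be shown in a few lines. The sketch below uses the CRC-32 function from Python's standard `zlib` module as a stand-in; the CRC width and polynomial actually used on the links are not stated in this paper and will differ. The helper names `send` and `is_intact` are hypothetical.

```python
import zlib

def send(payload: bytes):
    # Sender appends a CRC computed over the payload.
    return payload, zlib.crc32(payload)

def is_intact(payload: bytes, crc: int) -> bool:
    # Receiver recomputes the CRC and compares it with the one received.
    return zlib.crc32(payload) == crc

payload, crc = send(b"memory read request")
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]   # flip one bit in transit

print(is_intact(payload, crc), is_intact(corrupted, crc))
```

A mismatch on the receiving side is what would trigger a retry of the transaction.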
Once CRC detects that a transmission error has occurred, the transaction is retried on
the link (link level retry).
If the link continues to experience errors detected by CRC, then the link is reset and
retrained (link retrain), and the transaction is retried again.
If the link reset and retrain are unsuccessful in making the link operational, the system
analyzes the failure and, if possible, invokes a spare wire (for either data or clock), and
retries the transaction once again.
If using the spare wire does not make the link functional, then the system attempts to
find a part of the link that is still functional, first trying the upper and lower halves
of the link, then trying to find a functional quarter of a link if neither half works, and
again retries the transaction (link width reduction).
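The escalation sequence described above can be summarized as an ordered list of recovery steps tried one after another. The sketch below is a conceptual model only: `link_ok_after` is a hypothetical callback standing in for the hardware's check that the retried transaction succeeded after a given step.

```python
def recover(link_ok_after):
    # link_ok_after: hypothetical callback returning True when the
    # transaction succeeds after the named recovery step.
    steps = [
        "link level retry",
        "link retrain",
        "spare wire",
        "link width reduction (half)",
        "link width reduction (quarter)",
    ]
    for step in steps:
        if link_ok_after(step):
            return step          # first step that restores the link
    return None                  # no step succeeded

# Example: a fault that only the spare wire can work around.
print(recover(lambda step: step == "spare wire"))
```

Each step is attempted only when the cheaper ones before it have failed, so a transient error costs no more than a retry.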
All of the above link resiliency features apply to
both Intel links and the X-QPI fabric of the BCS,
and occur transparently with no system crashes
or hypervisor involvement.
If reset and retraining are successful, the error
logs record a warning to alert the system
administrator that the system is running with less
than full protection (for example, no additional
spare wires are available). The administrator
can then schedule a service call for a convenient
future time.
Similarly, if the link can be made functional
through link width reduction, the system
continues to run, but a log indicates that the
system is running with less than full protection,
and there may be some performance impact
as the result of the link width reduction. Again,
this allows the system to keep running without
interruption, and the service call can be
scheduled for a convenient time.
In all of these instances, the system remains
operational without requiring any special user
intervention, and more significantly without
crashing.
bullion server based on BCS
The BCS Architecture implemented in bullion servers leverages Bull's years of experience
in designing mission-critical servers in RISC, CISC and EPIC environments for the most
business-critical mainframe and UNIX environments. The bullion design makes the most of
this expertise. As a result, bullion provides significant benefits beyond basic 4-socket
system scaling. The BCS Architecture uses a directory cache to reduce memory latency and
provide more efficient performance scaling.
The BCS Architecture also employs a high-speed fabric to interconnect up to four modules,
each containing four Intel Xeon E7-4800 series processors, into a single system with
a single memory image and hypervisor. The BCS interconnect architecture provides resiliency
features used in mission-critical servers. bullion provides capabilities more commonly
associated with big iron systems while offering the cost-efficiency associated with
industry-standard x86 servers.
Through a modular approach, bullion servers scale with your needs, with pricing
proportional to performance.
To prevent inadequate scaling of compute capabilities, insufficient reliability, and
increased management complexity and operating cost, Bull resolved the key issue at the
source by breaking through the performance limits of a 4-socket or even an 8-socket server,
scaling from 4 to 8 and all the way to 16 sockets.
Native RAS features to meet QoS requirements

On top of the RAS features provided by the BCS architecture, bullion benefits from the following native RAS features
to meet QoS requirements.
Reliability
Memory management
One of the unique reliability features of the bullion server is its RAM memory management.
Memory protection mechanisms guarantee up to 100% memory reliability on bullion. Over
and above traditional memory correction mechanisms such as ECC memory, which keeps the
memory system effectively free from single-bit errors, bullion provides much more
sophisticated mechanisms such as DDDC (Double Device Data Correction), which corrects
dual errors.
The commonly available DIMM sparing is now enhanced to provide rank sparing.
With rank sparing of dual-rank DIMMs, only 12.5% of the memory capacity is set aside to
enhance the Quality of Service. If, for example, bullion servers are equipped with 32GB
dual-rank memory kits, each consisting of two 16GB DIMMs, each 32GB kit provides 28GB of
usable space, while the rest is reserved for fail-over if the level of ECC errors becomes
too high.
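The usable capacity quoted for rank sparing is a direct consequence of the 12.5% share that is set aside, as the following minimal calculation shows. The function name `usable_with_rank_sparing` is illustrative.

```python
def usable_with_rank_sparing(kit_gb, spare_fraction=0.125):
    # Capacity left for the system once the spare share is set aside.
    return kit_gb * (1 - spare_fraction)

print(usable_with_rank_sparing(32), "GB usable out of a 32GB kit")
```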
Another example of a mechanism to improve memory reliability is MCA recovery, which
ensures that detected memory errors are forwarded to the VMware hypervisor, so that the
hypervisor no longer uses the deficient memory address space.
These two features limit the impact of memory failures to the affected VMs, without having
to provide the additional memory DIMMs needed for memory mirroring.
Finally, for 100% memory reliability, bullion servers allow the memory to be configured
in mirroring mode, with data being written simultaneously to two different memory modules.
Thanks to the large number of memory DIMMs available in bullion servers and the falling
cost of memory, mirroring is finally becoming more accessible. bullion can thus support 2TB
of memory for each VMware host, even when memory mirroring is applied.
Extra levels of redundancy

For customers who require reliability levels previously only found in UNIX systems,
bullion can be equipped, besides its unique RAM memory protection solutions, with:
Dual Path I/O HBAs

In traditional blade server architectures, when applications require more memory, the
physical infrastructure is extended with more blades. Each blade has to be connected to
the networks, even though the throughput and latency required of each connection are
reduced as more servers inside the farm handle the I/Os.
With bullion, by contrast, the networks can be optimized and limited to the number of I/O
connections needed to serve the I/O throughput requested by the applications.
To improve the QoS and limit security concerns, bullion customers require fast, low-latency
network connections; with bullion, the I/O connections used can actually be balanced
against the requirements.
The maximum bullion server configuration is equipped with 24 PCIe slots, divided into
8 groups of 3 slots, each group managed by a single I/O controller (IOH). Each bullion
module makes it possible to use up to 3 different HBAs attached to one I/O controller (IOH)
and mirrored, inside the same bullion module, by HBAs attached to the second I/O controller.
By using so-called teaming technologies, a single I/O channel is connected to two different
HBAs and I/O controllers, providing near fault-tolerant I/O connectivity. By using
multi-path I/O drivers, the load can be spread over the two HBAs, increasing bandwidth and
reducing latency.
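The combination of load spreading and failover across two HBAs can be sketched as a round-robin selector that skips failed adapters. This is a conceptual model, not the behavior of any particular multi-path driver; the class `MultipathChannel` and the HBA names are hypothetical.

```python
from itertools import cycle

class MultipathChannel:
    def __init__(self, hbas):
        self.healthy = {hba: True for hba in hbas}
        self._order = cycle(hbas)

    def fail(self, hba):
        self.healthy[hba] = False

    def next_hba(self):
        # Round-robin over the HBAs, skipping any that have failed.
        for _ in range(len(self.healthy)):
            hba = next(self._order)
            if self.healthy[hba]:
                return hba
        raise RuntimeError("no healthy HBA left")

channel = MultipathChannel(["hba0", "hba1"])
print([channel.next_hba() for _ in range(4)])   # load spread over both HBAs
channel.fail("hba0")
print([channel.next_hba() for _ in range(2)])   # traffic continues on hba1
```

While both adapters are healthy the requests alternate between them; after one fails, all requests flow through the survivor without interruption.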
High-performance processors and large memories are of little value unless a system
has enough I/O to feed all of the components. I/O bottlenecks are one of the main issues
in large enterprise deployments, and limited I/O capacity can constrain virtual machines
(VMs). bullion provides up to 24 PCI Express (PCIe) expansion slots. This enables you, for
example, to provide dedicated I/O for VMs when consolidating. Bull customers often choose
to use 10Gb/s Ethernet networks, with separate physical ports for production, management
(vMotion) and potentially backup and restore.
To provide booting from SAN and fast data access, it is recommended to configure at least
4 Fibre Channel SAN port connections. Additional 1Gb/s or 10Gb/s Ethernet or 8Gb
Fibre Channel connections could be required to meet high bandwidth requirements for
specific virtual machines or groups of virtual machines.
EMC has shown 1,000,000 IOPS throughput on a single VMware host and 300,000 IOPS per
virtual machine, providing sufficient throughput to virtualize nearly any database. In
addition, the BCS benefits I/O bandwidth when I/O and processor traffic are combined.
By reducing the coherency snoops and responses required to support both types of
memory access, it leaves more bandwidth available for transferring requested data.
Ultra-Efficient cooling
Each bullion module contains eight strategically
located hot-swap fans in N+N configuration,
combined with the efficient airflow paths defined
by the unique positioning of the memory DIMM
modules. It provides a highly efficient system
cooling. The fans are arranged to cool four
separate zones each with their own pair of
fans for optimum redundancy and superior
availability levels.
The fans automatically adjust speeds in
response to changing thermal requirements,
depending on the zone, redundancy and
internal temperatures. When the temperature
inside the server increases, the fans speed up to
maintain the proper ambient temperature. When
the temperature returns to a normal operating
level, the fans return to their default speed. With
this solution, Bull significantly reduces
the ambient noise, the wear and tear
on the fans and the server's electricity
consumption.
The optimized front bezel contributes as well:
it combines the unique form and shape of the
ventilation holes with an innovative design
that underlines the reconciliation of energy
consumption and management efficiency.
RAID storage and hot-plug drives

Although Bull strongly recommends booting its
bullion servers from the Storage Area Network,
any bullion server can be equipped with an
industry-standard LSI RAID controller supporting
any standard RAID configuration
to increase the availability of the attached 2.5-inch
Seagate Savvio SAS disks (10k or 15k rpm).
Availability
Active/Passive Power-supplies to reconcile
availability and energy efficiency
The bullion servers are equipped with two
1600W common slot power supplies, which are
80+ Platinum level certified. These two 1600W
power supplies provide full grid
N+N redundancy as standard for maximum availability.
To increase efficiency even further, Bull has
developed a patented solution based on an
active/passive power supply principle.
Active/passive power supplies provide the
highest possible efficiency, regardless of the
requirement, while still providing the maximum
possible uptime. In fact, with Bull's unique
active/passive power supply solution, Bull provides
embedded fault resiliency against the most
common electrical grid outages, the so-called
micro-outages. Rather than having to rely on
heavy and expensive UPS systems, bullion
servers are equipped with an ultra-capacitor
that makes it possible to switch from the
active to the passive power supply in case of
failure, and that protects against
micro-outages. The ultra-capacitor provides 300ms
of autonomy, sufficient to switch over or to avoid
application unavailability during micro-outages.
Serviceability
Maintainability and Availability
With bullion, a new simplified path is taken to
ease the replacement of the most frequently
failing motorized components, such as the
fans, power supplies and disk drives.
These three components are responsible for over
80% of hardware failures, but their failure has no
impact whatsoever on production on bullion servers.
Bull engineers have worked to ease
the maintainability of these
components. Thanks to these efforts,
these components are now
Customer Replaceable Units (CRUs). The CRU program
empowers you to repair your own machine, and it is easier than you may think. In situations
where a computer failure can be attributed to
an easily replaceable part (a CRU), Bull sends
you the new part. Without needing any special
tools or skills, you swap the old part for the new
one. It is simple. The major advantages: really
fast service for you and reduced support and
maintenance fees.
Elements replaceable by Support are the Field
Replaceable Units (FRUs):
All FRUs contain identification and technical
revision information in a local EEPROM.
Under the correct conditions, some FRUs
can be excluded from the system at boot
time. This could be for a variety of reasons,
ranging from hardware failure to energy
savings. PSUs, processors, cores, QPI links,
X-QPI links, PCIe boards and embedded Ethernet
controllers are among the elements that
can be excluded.
Exclusion and hot plug minimize the downtime
of a bullion server after a hardware failure.
RAS Monitoring and Service Processor Functions
Each bullion module contains an embedded
Baseboard Management Controller (BMC) for
monitoring and administration functions. This
embedded controller runs the Server Hardware
Console (SHC). bullion servers offer the
following built-in functions:
SHC access to all of the module components
by standard out-of-band (non-functional)
paths: the I2C and SMBus interfaces.
A dedicated network to interconnect all of
the SHCs of a server without affecting the
customer's network.
Dynamic communications between the SHC
and the BIOS.
Finally, various hardware mechanisms exist that
allow the SHC to perform monitoring functions
within each module of a server. Some of the
items monitored are:
Hardware errors
Power supply and distribution system
Fans
Temperature
Bull System Manager (BSM)

The Bull System Manager allows system
administration of heterogeneous data centers
from a single application. BSM supports the
entire Bull catalog of systems, including bullion.
It provides a rich set of data center monitoring
and remote administration features. It interacts
with both the bullion hardware (via the SHC)
and with the VMware hypervisor as well as the
Windows and Linux Virtual Machines.
iCare
bullion can also interface with the iCare
software package developed by Bull. The iCare
package facilitates the maintenance of the
bullion system by collecting error events and
log files transmitted by the bullion systems into
a central database. It provides a suite of tools
to aid in the analysis and diagnosis of system
events, and assistance in identifying possible
preventive maintenance actions. It can also
serve as an autocall concentrator, allowing
rules and actions for autocalls to be defined for
events.
The SHC provides this information to Bull System
Manager, vCenter or any other industry-standard
system management solution, with
support for IPMI, SNMP and other industry-standard
interfaces.
VMware vSphere 5 on bullion
VMware vSphere 5 is a NUMA-aware hypervisor that allocates memory local to or close to a requesting core or thread to
minimize memory latency and link bandwidth consumption. VMware vSphere 5 automatically optimizes the Virtual Machines
deployed and can further be tuned with parameters and other mechanisms to adapt the default NUMA behaviour for various
workloads to achieve improved performance scalability.
The scalability of vSphere 5 enables the
highest guest VM density available today
vSphere 5 also provides the highest level of virtual
hardware support for resource-intensive
applications. While providing the highest level
of scalability, vSphere 5 also delivers the highest
levels of availability, through industry-proven
features including HA, vMotion, and SRM.
VMware vSphere 5 is the natural choice
for the virtualization of business-critical
applications
VMware vSphere 5 is consequently the natural
choice for the virtualization of business-critical
applications such as Microsoft SQL Server or Exchange.
VMware provides multiple tiers of package
licensing in order to allow customers to acquire
feature support according to their individual
needs. However, the ROI for VMware is best
achieved through a high density of virtualized
guests.
bullion: the ideal server platform with
hardware scalability and availability
features that are consistent with those of
VMware vSphere 5
Along these lines, customers considering
advanced versions of ESX should select a
server platform, such as bullion, with hardware
scalability and availability features that are
consistent with those of VMware vSphere 5.
The bullion architecture provides this scalability
along with industry-leading RAS (Reliability,
Availability, and Serviceability) hardware
enhancements that are a best fit for server
consolidation needs.
VMware provides a virtualization solution
to address nearly all scenarios
With such key characteristics of performance,
availability, scalability in addition to a broad
portfolio of product offerings and guest/host
support, VMware provides a virtualization
solution to address nearly all scenarios.
Advanced features may require a significant
investment in hypervisor and management
infrastructure; therefore, a supporting hardware
system such as bullion should be used in order
to best maintain availability of the virtual
environment and achieve maximum ROI through
high virtual machine density.
Limited number of servers inside the architecture

The number of VMs is not altered, but the
number of hypervisor images and physical
assets to manage is reduced.
Server management costs are much higher than
acquisition costs, and having fewer physical
systems to manage reduces them.
Virtualizing databases efficiently on larger Virtual Machines

Until recently, some production databases
and large applications could not benefit from
virtualization. Many databases can only scale up
within a single instance and cannot scale out.
This is true for Microsoft SQL Server, PostgreSQL and
IBM DB2 databases. The amount of resources
available on a single VM limited the expansion
of databases and large applications in
production.
With the introduction of VMware vSphere 5, the
amount of resources available on a single VM
increased fourfold, compared to the previous
VMware version. These huge VMs are called
Monster-VMs.
For a Monster-VM to do any work, the hypervisor
has to find and schedule 32 physical CPU
cores. It is inadvisable to put a 32 vCPU VM on
a 32 core system: for this machine to do any
work, the scheduler has to clear out all other
virtual machines and put their work on hold.
This reduces the performance of both the large
VM and the other VMs on the server. With more
physical CPU cores to choose from, there are
more ways to schedule multiple VMs.
This relationship is based on the formula
n choose r, where n is the number of CPU cores
available to the scheduler and r is the number
of vCPUs in the VM. n choose r is not a linear
relationship. If we set r constant at 2 and look at
the scheduling opportunities on 4, 8, 16 and 32
core systems, we get the following graph:
[Figure: Scheduling permutations of a 2 vCPU VM
on different sized hosts]
The chart shows that the opportunities to fit VMs
onto a host in any particular time slice grow
much faster than the number of cores
(quadratically for a 2 vCPU VM).
A 16 core system has more permutations than
two 8 core machines. Likewise, a 32 core system
has many times more permutations than four 8
core machines. It is very difficult to quantify how
much this phenomenon will affect the performance
of a particular environment. There are too many
variables involved and it would require degree
level statistical analysis.
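The counts behind this comparison can be reproduced with a few lines of Python. This is a minimal sketch that only evaluates n choose r with the standard `math.comb` function; it ignores everything else a real hypervisor scheduler takes into account, and the function name `placements` is illustrative.

```python
import math

def placements(cores: int, vcpus: int = 2) -> int:
    """Number of ways to pick `vcpus` physical cores out of `cores` (n choose r)."""
    return math.comb(cores, vcpus)

# Single hosts of increasing size, for a 2 vCPU VM.
single_host = {n: placements(n) for n in (4, 8, 16, 32)}

# One large host versus several 8 core hosts with the same total core count.
# A VM runs entirely on one host, so the counts of the small hosts simply add up.
one_16 = placements(16)
two_8 = 2 * placements(8)
one_32 = placements(32)
four_8 = 4 * placements(8)

for n, ways in single_host.items():
    print(f"{n:2d} cores: {ways} ways")
print(f"one 16 core host: {one_16}, two 8 core hosts: {two_8}")
print(f"one 32 core host: {one_32}, four 8 core hosts: {four_8}")
```

For r = 2 the count is n(n-1)/2, so it roughly quadruples each time the number of cores doubles.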
Thus a scalable server with more than 4
processor sockets is ideally suited for this type
of VM. With Monster-VMs on a bullion server,
Bull enables its customers to virtualize their
production databases.
Bull ran the SPECint_rate2006 industry
standard benchmark on a bullion server,
both in a virtualized and in a non-virtualized
environment. In the virtualized environment, a
single Monster-VM was used. The result shows a
remarkable efficiency and scalability of VMware
on bullion servers. The efficiency is confirmed
with less than 5% overhead versus a
non-virtualized environment. bullion benchmark
results in a non-virtualized environment: 723
(http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111010-18660.html),
in a virtualized environment: 693
(http://www.spec.org/cpu2006/results/res2011q4/cpu2006-20111010-18661.html),
resulting in an overhead of
1 - 693/723 = 4.1%.
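The overhead figure follows directly from the two published scores. As a quick check, using only the numbers quoted above (the variable names are illustrative):

```python
native_score = 723       # SPECint_rate2006, non-virtualized (from the text)
virtualized_score = 693  # SPECint_rate2006, single Monster-VM (from the text)

# Overhead is the fraction of the native score lost when virtualized.
overhead = 1 - virtualized_score / native_score
print(f"Virtualization overhead: {overhead:.1%}")
```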
Memory de-duplication ratios

In VMware vSphere 5, this memory deduplication
is known as Transparent Page
Sharing. The kernel of the hypervisor
scans the pages of all of its VMs for identical pages.
If it finds two pages that are exactly the same,
it deduplicates them and points both VMs at the
same physical address space. If one of the
VMs tries to write to the page, the memory
sharing for that page is forced to end and the VM
receives its own private copy. If all of the VMs on
a host run the same O/S, it is very likely
that deduplication will be possible. You
can then have many VMs all reading from the same
physical address space, and your deduplication
advantage will be higher.
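The mechanism can be illustrated with a toy model. The sketch below is not VMware's implementation: the class `PageSharingHost` and its methods are hypothetical, and deduplication happens at write time here, whereas a hypervisor scans memory in the background. It only shows the two behaviours described above: identical pages share one physical frame, and a write ends the sharing for the writing VM.

```python
class PageSharingHost:
    """Toy model of content-based page sharing with copy-on-write (illustrative)."""

    def __init__(self):
        self.frames = {}      # frame id -> page content
        self.refcount = {}    # frame id -> number of guest pages mapped to it
        self.by_content = {}  # page content -> frame id (exact match only)
        self.mapping = {}     # (vm, guest page number) -> frame id
        self._next_frame = 0

    def write(self, vm: str, page: int, content: bytes) -> None:
        """Store `content` for a guest page, sharing an existing frame if identical."""
        self._unmap(vm, page)  # a write ends any sharing for this guest page
        frame = self.by_content.get(content)
        if frame is None:      # no identical page yet: allocate a new frame
            frame = self._next_frame
            self._next_frame += 1
            self.frames[frame] = content
            self.refcount[frame] = 0
            self.by_content[content] = frame
        self.refcount[frame] += 1
        self.mapping[(vm, page)] = frame

    def _unmap(self, vm: str, page: int) -> None:
        frame = self.mapping.pop((vm, page), None)
        if frame is None:
            return
        self.refcount[frame] -= 1
        if self.refcount[frame] == 0:  # last user gone: free the frame
            del self.by_content[self.frames[frame]]
            del self.frames[frame]
            del self.refcount[frame]

host = PageSharingHost()
os_page = b"\x00" * 4096  # an identical page, e.g. from the same guest O/S
for vm in ("vm1", "vm2", "vm3"):
    host.write(vm, 0, os_page)
frames_when_shared = len(host.frames)    # one frame backs three guest pages

host.write("vm2", 0, b"\x01" * 4096)     # vm2 writes and gets a private frame
frames_after_write = len(host.frames)
print(frames_when_shared, frames_after_write)
```

The more VMs on a host run the same O/S, the more guest pages map onto the same frames in this model, which is the deduplication advantage the text describes.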
Fewer host servers, fewer unused pools of resources

One of the reasons that SANs
have become so popular is that they avoid the
silos of unused disk space that occur when you
attach disks locally. The same argument can be
made about using smaller virtual host systems.
Any CPU time or memory space that is not
being used on one host cannot be used by a
VM struggling to find resources on another host.
A live migration may be initiated to rebalance
the VM, but this also requires CPU resources and
uses up network bandwidth. Having fewer large
hosts reduces the probability that a VM would
need to jump ship just to find more resources,
which in turn reduces live migration network
traffic and CPU overhead.
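A small example makes the stranded-resource effect concrete. The figures below are hypothetical and chosen only for illustration: the same 24 GB of free memory is either spread over four small hosts or available on one large host.

```python
def can_place(vm_memory_gb: int, free_memory_per_host_gb: list[int]) -> bool:
    """A VM runs on exactly one host, so one host must have enough free memory."""
    return any(free >= vm_memory_gb for free in free_memory_per_host_gb)

# Hypothetical figures: 24 GB free in total in both cases.
four_small_hosts = [6, 6, 6, 6]
one_large_host = [24]
vm_memory_gb = 16

fits_on_small = can_place(vm_memory_gb, four_small_hosts)
fits_on_large = can_place(vm_memory_gb, one_large_host)
print(fits_on_small)  # False: the free memory is stranded in 6 GB silos
print(fits_on_large)  # True: the same capacity is usable as one pool
```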
Fewer network ports needed and integrated networks

Fibre Channel switches, HBAs, cables, SFPs
and port licenses are expensive, so customers
will want to make sure they are utilizing them
to their full capacity. Having fewer physical
hosts will help keep costs of these components
down and help to increase their utilization.
Many hypervisors allow you to create integrated
networking objects such as virtual switches
within the kernel of the hypervisor. This means
that traffic between VMs on the same host does
not have to be sent out over the network. IP traffic
between these virtual servers incurs
minimal latency and helps reduce the load on the
external Ethernet switches.
Conclusion
This white paper highlights the superiority
of the scale-up architecture over the scale-out
architecture, in particular in terms of
administration simplification, and the many
advantages of the glued architecture based on a
node controller over the glueless architecture.
It shows that, among the glued architectures,
the BCS architecture designed by Bull and
implemented in its bullion server delivers
more scalability, resiliency and efficiency to
meet the requirements of the most demanding
applications in high performance computing as
well as in business computing.
On top of the RAS features provided by the
BCS architecture, bullion benefits from native
RAS features to meet QoS requirements. bullion
provides an unprecedented level of service whilst
improving TCO and is designed for the best
energy efficiency in the market.
bullion servers provide customers with the means to
obtain UNIX-level reliability, availability and
serviceability to virtualize any application on
their journey to cloud computing.
With bullion servers, customers can significantly
simplify their IT infrastructure, with a
significant reduction of TCO, thanks to Bull's
innovative and unique server architecture.
And that is what Bull's engineers have been
working on for over five years now: to design
an infrastructure that effectively reconciles
complexity, low energy consumption, high
availability, scalability and performance,
and above all, without piling on more
problems; to bring simplicity back into the
heart of IT infrastructures. bullion is specifically
designed to virtualize business-critical
applications.
bullion is designed to take full advantage of
the most advanced and powerful Intel Xeon
processors, coupled to BCS, Bull's unique
architecture based on a node controller.
Thanks to BCS, the bullion server can grow
with application and SLA demand, with pricing
proportional to performance.
Bull, Intel and VMware technologies are
combined in the bullion server to deliver the
balanced scaling and self-healing resiliency
needed to virtualize the most demanding
business-critical applications.
Bull innovations add breakthrough efficiency to
enable you to control costs and focus more of
your resources on service delivery.
The bullion server architecture and key technologies
contribute to its scalability, resiliency, and
efficiency. This white paper should help IT professionals
understand the capabilities of bullion to
virtualize business-critical applications and to
accelerate the road to cloud computing.
Bull rue Jean Jaurès - 78340 Les Clayes-sous-Bois France

UK: Bull Maxted Road, Hemel Hempstead, Hertfordshire HP2 7DZ
USA: Bull 300 Concord Road, Billerica, MA 01821
This white paper is printed on paper combining 40% eco-certified fibers from sustainable forests management and 60% recycled fibers in line with current environment
standards (ISO 14001).
Design: T2BH / Photos: Bull - Getty images
W-BCS-en1
Bull SAS - 2012 - Bull acknowledges the rights of proprietors of trademarks mentioned herein. Bull reserves the right to modify this document at any time without notice. Some offers or parts of
offers described in this document may not be available in your country. Please consult your local Bull correspondent for information regarding the offers which may be available in your country.
This document has no contractual significance. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation in the US and other countries.
