REFERENCE ARCHITECTURE
Single Rack Object Store
Summary
This document presents a reference architecture for a small digital content repository, designed for simplicity and low cost, while still
delivering moderate throughput and high reliability. A good example might be an Indie film production company that needs:
highly reliable storage for their valuable raw footage and edited results
high performance temporary storage for their editing and rendering tools
a system that can be built for a very low initial cost, and operated inexpensively
a system that can grow incrementally as they grow
There are many other applications, with similar needs, for which this system would also be appropriate:
It would also be well suited as a proof-of-concept implementation for a much larger system: a small system on which performance, reliability, and operational-scenario testing can be performed to validate its suitability for a much larger deployment.
This system requires only a single 10G switch, simple networking, and a single Ceph Object Gateway. Higher throughput and availability
can be obtained by adding additional switches, networks, object gateways, and load balancers.
The remainder of this document provides:
1. a brief overview of the use case, key system characteristics, and the hardware, software, and networking components
2. a detailed discussion of the servers and networking, and why those choices are right for this use case
3. a discussion of the recommended software distributions, versions, and configuration
4. a brief overview of Inktank Proof of Concept and product support services
Solution architects and system administrators tasked with designing and deploying Ceph-based storage solutions will benefit
from studying the design considerations of this reference architecture. Developers looking to improve their content repository
solution by integrating with Ceph can also get a sense of how the storage subsystem will be deployed.
The storage sub-system presents S3-compatible RESTful APIs to the repository management software running on one or more
servers. The Ceph system is built across four commodity servers, each holding twelve 3TB SATA drives. This provides the studio with approximately 144TB of raw storage capacity.
Such a system should be able to service up to 2000 storage requests per second. Streaming write throughput is expected to
reach approximately 200 MB/s and reads up to 600 MB/s.
This system could easily be expanded to five times this capacity and throughput by adding only additional servers. Further growth would also be incremental, but would require additional racks and switches.
Summary Diagram
The following diagram shows the logical components of the system: four applications that are consumers and producers of data, and the storage sub-system composed of four machines.
1. Solution Overview
This chapter provides an overview of the ingredients that went into the reference architecture, describes how the software components are deployed on the participating nodes, and outlines their dependencies on the underlying operating system.
From a technical viewpoint, it should be recognized that because this system uses only a single switch and a single (active) Ceph
Object Gateway:
it is not highly available (the switch is a single point of failure)
the aggregate client throughput is limited to what can be handled by a single Ceph Object Gateway.
However, for our Indie film company, these two limitations are not a major concern, and they are willing to make these trade-offs. They are getting excellent durability for their data on a very low budget. In the event of a switch outage, they are willing to take the risk of waiting until a replacement part is installed.
This is a relatively small system, designed for a minimum of four nodes, and expandable to around twenty nodes and several hundred terabytes. It is intended to fit entirely in a single rack, served by a single switch. Because this system needs to be able to run on a small number of nodes, we have chosen to co-locate all of the services on identical servers (each with 12+2 disks and 64GB of RAM). In larger systems one would use different types of machines for storage nodes, monitor nodes, and Gateway servers.
A small cluster, served by a single Gateway server, can carry all client and internal traffic on a single 10G network, served by a single switch, and requires only a single 10G NIC per storage node. Even small clusters must be manageable lights-out, i.e. remotely reachable even after the failure of a NIC or switch. For this reason, we recommend that separate 1G networks be set up for IPMI and management.
A reasonable Ceph system (whether for testing or deployment) should have at least three nodes:
three nodes must be running the monitor service so that two can still form a quorum if one fails
three nodes must be providing storage service so that we can still maintain two copies if one fails
fortunately we can run both monitor and object storage daemons on the same node
If three-copy replication is to be used, then a minimum of four nodes is needed. To run a cluster with a minimum number of servers it is necessary to co-locate multiple services on each node. In a minimal four-node system we might distribute functionality among the four nodes as follows: object storage daemons run on all four nodes, monitors run on three of them, and the Ceph Object Gateway is co-located on one of the storage nodes.
The Object Storage Daemons, Monitors, and Ceph Object Gateway are 100% user-mode code and able to run on most recent Linux distributions. That said, these systems should be running stable releases with 3.0 or later kernels (to take advantage of bug fixes and the syncfs system call) and the best available version of the chosen OSD file system.
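A quick sanity check on each node (a minimal sketch, assuming the ceph package is already installed and on the PATH; the expected release follows the recommendation in the Software Components chapter):

    uname -r          # kernel should report 3.0 or later
    ceph --version    # should report the recommended Bobtail release (0.56.x)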
2. Hardware Components
In this chapter we will recommend specific classes of hardware for each component and briefly discuss the rationale for those recommendations.
There is a tradeoff to be made here, discussed below. For smaller Ceph deployments we recommend a balanced architecture that utilizes a standard 12-drive 2U chassis configuration that is offered by multiple popular hardware vendors. For much larger systems (with only moderate throughput demands) many more disks can easily be supported per node, as long as the memory and CPU power are increased accordingly. Generally we recommend roughly 1GHz of one CPU core and at least 1-2GB of memory per OSD; for the 12-OSD nodes in this architecture, that works out to roughly 12GHz of aggregate CPU (e.g. six cores at 2GHz or better) and 12-24GB of RAM for the OSDs alone.
One of the most fundamental system design questions is how many disks we want per storage node:
More disks per node generally result in a denser, and lower cost solution.
Each disk represents added throughput capacity, but only up to the point of saturating the node's NIC, CPU, memory, or storage adaptor.
A storage node is a single point of failure. The more disks per node, the greater the fraction of our storage that can
be lost in a single incident. The amount of time, network traffic and storage node load required to deal with a storage
failure is proportional to the amount of storage that has been lost.
It is recommended that the operating system and Ceph software be installed on (and boot from) a RAID-mirrored disk pair. This prevents the (0.7%/year) failure of a system disk from taking an entire node out of service. If that cost is deemed too high, a single local disk can be used.
For a small operation, booting off of local disks is almost surely the right answer. In larger organizations that have the appropriate networking and image management infrastructure, centralized boot images may make node management much easier.
Network booting reduces our dependency on local disks, but is a slower process that is dependent on network infrastructure
and multiple additional servers.
Ceph storage nodes use a journal device (or partition) to quickly persist and acknowledge writes, while retaining the ability to
efficiently schedule disk updates. For systems that are expected to receive heavy write traffic, performance can be increased
by maintaining these journals on separate SSD drives. Journals can alternatively be stored on the same drives that hold the
corresponding data. This is simpler, less expensive, and more reliable (having fewer components), but will not be capable of as
high a write throughput.
Because this reference architecture is optimized for simplicity and low cost rather than high write throughput, we recommend
the simpler same-disk journal configuration. Clearly, this is the better fit for the tight budget of the Indie film company.
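As a minimal sketch of how journal placement is expressed in ceph.conf (the device path and daemon ID below are hypothetical; the option names are standard Ceph settings):

    # ceph.conf excerpt -- same-disk journal (the default, and the configuration recommended here)
    [osd]
        osd journal size = 10240    ; journal size in MB; lives in /var/lib/ceph/osd/$cluster-$id/journal by default

    # alternative for write-heavy clusters: point an individual OSD at an SSD partition
    [osd.0]
        osd journal = /dev/sdk1     ; hypothetical SSD journal partition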
When determining what kind of disk controller to use with Ceph, there are two distinct classes of controllers that should be considered. The first is basic SAS JBOD controllers with no on-board cache; these work well when SSD journals are utilized, as there is no contention between journal writes and data writes on the same device. The second class is RAID-capable controllers with battery backup units and write-back cache. This kind of controller is extremely useful when journals and data are stored on the same disk. Write-back cache reduces contention between journal and data writes and generally improves performance, though not necessarily to the levels that SSD journals do.
To see examples of how SSDs and write-back cache affect write performance, please see our Ceph Argonaut vs Bobtail
Performance Preview:
http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/#4kbradoswrite
Typical System:
This reference architecture focuses on building simple, reliable, and well-balanced Ceph nodes for small to medium sized clusters. To that end, we've chosen a very common 12-disk platform that is available from many different hardware vendors. Journals have been left on the same disks as the data, but utilize a controller with write-back cache to improve performance. We've specified 64GB or more of RAM, which is more than the minimum needed to support the OSDs on the system. The extra RAM provides additional buffer cache, allows the systems to also host MON or RGW services, and should not add significantly to the price.
System Specifications:
System Disks: twelve 3TB SATA data drives plus a mirrored pair of boot drives (12+2)
CPU(s): at least 6 Intel or AMD cores running at 2.2GHz+ (2.0GHz is acceptable if monitor or RGW services are not on the same nodes)
Memory: 64GB+
Storage Controller: battery-backed write-back cache recommended (e.g. LSI SAS2208 class card with BBU, or similar)
Network Ports: one 10GbE data port
Management Ports: one 1GbE port on the management network
IPMI Port: one, on a dedicated IPMI network
There are several offerings from different vendors that meet these specifications:
Vendor: Supermicro
Vendor: Dell, Model: R720xd (Note: Flex bay option, H710 or H710p controller, and 10GbE adapter recommended. Please speak with your Dell representative.), Link: http://www.dell.com/us/enterprise/p/poweredge-r720xd/pd
Vendor: HP, Link: http://shopping1.hp.com/is-bin/INTERSHOP.enfinity/WFS/WWUSSMBPublicStore-Site/en_US/-/USD/ViewStandardCatalog-Browse?CatalogCategoryID=DSwQ7hacs9sAAAE3Do9ObFx_
For small clusters like this one, Ceph monitor services can be run on the same nodes that the OSDs are running on. We recommend slightly over-provisioning the CPU and memory resources if OSD nodes are also used for monitoring. For example, a node
hosting 12 OSDs and 1 monitor could be configured with a 2.2+GHz 6-core CPU and 64GB of RAM to support the OSDs, MON, and
provide additional memory for buffer cache. A larger system disk may be desired to store additional logs as well.
This configuration has been optimized for simplicity and low price. Larger clusters with more storage nodes and disks will cause
the monitors to use more CPU and memory resources. For larger configurations we generally recommend dedicated monitor
nodes:
System Disks:
CPU(s): 64-bit Intel or AMD CPU (Xeon E3-1200, Xeon E5-2400, or Opteron 4100 series processor acceptable)
Memory: 8GB+
Network Ports:
Management Ports:
IPMI Ports:
Example offerings from hardware vendors that meet these specifications include:
Vendor: Supermicro
Vendor: Dell, Model: R420, Link: http://www.dell.com/us/enterprise/p/poweredger420/fs
Vendor: HP, Model: DL160, Link: http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/15351-15351-3328412-241644-3328421-5211699.html?dnr=1
A Ceph Object Gateway implements RESTful (S3 or Swift) operations on top of a RADOS cluster. It receives S3/Swift requests
from client nodes, and translates those into operations on the RADOS objects that represent the users, buckets, and file
objects. Most of the processing in the Gateway server is receiving and sending network messages. All of the actual data storage
is in the RADOS cluster. The same platforms described (above) for Monitor nodes would also be a good choice for a dedicated Ceph Object Gateway, with two key differences, networking and log storage:
For high throughput applications it might be desirable to put incoming RESTful (S3 and Swift) traffic on a separate NIC
(and perhaps network) from the outgoing RADOS object traffic. Forcing these two data streams to compete for a single NIC could significantly reduce the achievable throughput.
Ceph Object Gateways maintain extensive logs of all of the requests they serve. These logs are often critical for diagnosing customer complaints (to determine exactly what requests were made when). For this reason, it is a good practice to dedicate a 1TB drive (or perhaps even a RAID-1 pair) to log storage.
Because this reference architecture is optimized for simplicity and low cost rather than high write throughput, we recommend the simpler configuration, where the Ceph Object Gateway is co-located on one of the storage nodes. Adding a load balancer would make it possible to support multiple active Ceph Object Gateways, significantly improving both throughput and availability. But if our primary concern is availability, a stand-by Object Gateway can be run on another node, and DNS can be used to reroute traffic to the stand-by if the primary Object Gateway fails.
Network design is fairly simple for small systems, because it does not have to address high availability and inter-rack throughput requirements.
A basic four-node proof-of-concept system can be served by a few spare ports (four 10G and four 1G) on an existing switch. Even
the largest system covered by this reference architecture (20 servers, each with separate front-side and back-side 10G NICs
and separate 1G IPMI and management networks) can easily be handled by a pair of 48 port switches (one 1G, one 10G). But, as
mentioned previously, putting all of the data traffic through a single switch creates a single point of failure for the entire cluster. Larger clusters that need to provide higher availability require multiple switches (and are described in other Reference Architectures).
In a fully separated deployment one can distinguish four data networks: (i) client traffic to the load balancers, (ii) traffic between the load balancers and the Ceph Object Gateways, (iii) front-side traffic between the Gateways and the RADOS storage nodes, and (iv) back-side replication traffic between storage nodes. In this (small) configuration, there are no load balancers (eliminating network ii), and the Ceph Object Gateway is co-located with RADOS storage nodes (combining networks i and iii). Because all traffic in this cluster is funneled through a single Ceph Object Gateway, it is not likely that there will be enough traffic to justify the separation of networks iii and iv. In larger configurations (with load balancers and discrete Gateway servers) these four networks would probably be distinct.
Whether you choose to use spare ports on an existing switch, dedicated small switches, or dedicated large switches depends on
your expectations for the future:
If this is a temporary proof-of-concept where you expect to do some testing and then recycle the components, there
is little reason to dedicate new switches to this system.
If this is expected to always be a small system (e.g. starting at four nodes and perhaps growing to eight), relatively
small (e.g. eight or 16 port) switches will surely suffice.
If this system is expected to grow to a full rack (or even multiple racks) you would be well advised to start out with
rack-scale (e.g. 48 port) switches and separate front-side and back-side data networks.
A single client can easily generate data at rates of 1 gigabyte per second or more. A storage node with twelve drives could easily stream data to or from disk at an aggregate rate of 1 gigabyte per second or more. Unless it is known that data will only be trickling into this system (e.g. this is an archival service), a 1G network fabric (or a Layer 1 switch) would surely become a critical bottleneck. We recommend at least a Layer 2, non-blocking, 10G switch.
If this cluster is to be more than four nodes and we expect it to see a great deal of traffic from clients who are not on the same switch, the interconnection to the client network may need to be much faster (e.g. 40Gb/s).
If a RADOS cluster is expected to receive significant write traffic, it is recommended that the cluster be served by separate 10G
front-side and back-side data networks:
the client can easily use 100% of its NIC throughput to write data into the RADOS cluster (front-side network).
if multiple copies are to be made, the server that received the initial write will forward copies to secondary servers
(over the back-side network). Thus, if the storage pool is configured for three copies, each front-side write
will give rise to two back-side writes.
in addition to initial write-replication, the back-side network is also used for rebalancing and recovery.
If a cluster is expected to make N copies of each write, the back-side network should be able to handle roughly N-1 times the traffic that is on the front side, plus headroom for rebalancing and recovery; with three-way replication, for example, 200 MB/s of incoming client writes implies roughly 400 MB/s of back-side replication traffic. In extremely high throughput situations (continuous large writes) it may even be desirable to bond together multiple 10G interfaces to handle the corresponding back-side traffic. As with the front side, if there is to be a separate back-side data network, we recommend at least a Layer 2, non-blocking, 10G switch.
Because the only traffic carried on the back-side network is data transfers between storage nodes, it may be desirable to provision this network as a distinct VLAN. There is no reason for any other systems to have access to this network.
You may wonder why we do not recommend putting the front-side and back-side networks on different switches. The switches
are probably more reliable than the servers or their NICs, and adding a second 10G switch to this configuration would greatly
increase the cost, without greatly increasing reliability. In larger (multi-rack) configurations, however, careful thought must be
given to which servers are on which switches.
While client data and replication can easily saturate 10G front-side and back-side networks, status reporting, statistics collection, logging, and management activity generate (comparatively) little traffic, and can easily be accommodated by a 1G network.
Creating a separate 1G network for management offers a few advantages:
it prevents management traffic from interfering with performance-critical data traffic.
it creates a completely independent path (including the switch) to each node, enabling better failure detection and
easier diagnosis of failures in the data path.
in larger systems (where multiple switches are required) it enables the use of a less expensive switch for the traffic
that does not require 10G throughput.
As general (non-management) clients will have no need to participate in these interactions, this network too can be put on a
distinct VLAN.
Independently of whether or not you are running Ceph software, remotely hosted servers in a lights-out environment will probably need additional networking to enable server and switch problems to be corrected without a service call:
remote serial console access (to both servers and switches) from a highly accessible serial console server.
a distinct IPMI subnet (and perhaps VLAN), preferably served by a different switch than the one that serves the
management subnet.
3. Software Components
Ceph Client OS: Since the Ceph Object Gateway will be sharing resources with OSD daemons, the Ceph Client OS should be the same as the Ceph Server OS.
Ceph Version: Inktank recommends the use of the latest Ceph Bobtail release (0.56.4) for this Reference Architecture.
The underlying filesystem for OSDs should be XFS, formatted with the following option: -i size=2048. The OSD filesystem should be mounted with the following options: noatime,inode64.
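A minimal sketch of how these settings are typically applied (the device name and OSD ID below are hypothetical; the ceph.conf option names are standard Ceph settings):

    # format and mount an OSD data disk by hand (hypothetical device and OSD ID)
    mkfs.xfs -f -i size=2048 /dev/sdb1
    mount -o noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0

    # or let Ceph apply the same options automatically via ceph.conf
    # [osd]
    #     osd mkfs type = xfs
    #     osd mkfs options xfs = -i size=2048
    #     osd mount options xfs = noatime,inode64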
Inktank recommends the use of Ceph's apache2 and mod_fastcgi forks. The apache2 and mod_fastcgi forks have been optimized for the HTTP 100-continue response, resulting in performance improvements. Our mod_fastcgi fork also provides support for HTTP 1.1 Chunked Transfer Encoding.
For better performance, it is also recommended to deactivate RGW operations logging on the host running the gateway. While it is possible to send RGW operations logs to a socket, that configuration is out of the scope of this RA. Logging should be deactivated for performance testing and reactivated afterwards.
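A minimal sketch of the relevant ceph.conf entry (the gateway section name below is a common convention, not mandated; the option name is a standard RGW setting):

    # /etc/ceph/ceph.conf on the gateway host
    [client.radosgw.gateway]
        rgw enable ops log = false    ; disable per-request operations logging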
CRUSH will, by default, create replicas on different hosts. The default number of replicas is 2: the primary and one copy. If you wish to have more replicas, you can do so by recreating the pools used by RGW. Note, however, that a replication level higher than 4 will not be possible in this RA, since CRUSH places each copy on a distinct host and there are only four storage nodes.
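For illustration, replica counts can be inspected and raised with the standard pool commands (the pool name .rgw.buckets is the default bucket-data pool in Bobtail-era releases; adjust to your deployment):

    # inspect the current replica counts ('rep size' in the pool listing)
    ceph osd dump | grep 'rep size'

    # raise the RGW bucket-data pool to three copies (primary plus two replicas)
    ceph osd pool set .rgw.buckets size 3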
We also recommend the use of the default path values for OSD and mon directories, especially when using Upstart and/or Ceph
deployment solutions (Chef cookbooks, ceph-deploy).
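For reference, a sketch of those default locations (the daemon IDs shown are examples):

    # default data directories expected by Upstart, ceph-deploy, and the Chef cookbooks
    /var/lib/ceph/osd/ceph-0    # osd data  (osd data = /var/lib/ceph/osd/$cluster-$id)
    /var/lib/ceph/mon/ceph-a    # mon data  (mon data = /var/lib/ceph/mon/$cluster-$id)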
It is important to make sure that your system disks do not fill up, especially on the nodes hosting the monitors. Setting a proper log rotation policy for all system services, including Ceph, is very important. Regular inspection of disk utilization is also suggested. Be aware that increasing Ceph debugging verbosity can generate over 1GB of data per hour. If you are planning on creating a separate partition for the /var directory on the system, please plan accordingly.
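A minimal sketch of the kind of routine check this implies (the paths are the Ceph defaults; adjust them if /var is a separate partition):

    # watch the filesystems that hold monitor data and logs
    df -h /var/lib/ceph /var/log/ceph

    # identify the largest log files, worth checking after raising debug verbosity
    du -sh /var/log/ceph/* | sort -h | tail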
4. Next Steps
If this sounds like an interesting architecture, Inktank can help you realize it.
Unlike proprietary storage solutions from established vendors, it does not take much to get started with Ceph. There is no need for an upfront investment in specialized new hardware or software licenses. To try out the basic functionality, you can use any existing commodity server hardware, attach a bunch of hard drives, deploy Ceph, and take it for a test drive. Many hobbyists and software-defined storage enthusiasts have stood up sizeable Ceph clusters just by following the Ceph documentation and the discussions on the ceph-users community mailing list.
However, for many corporate users that is not an option. Firstly, because of resource constraints: the ideal person to do a Ceph POC is someone who understands Open Source technology and has an appreciation for scalable storage clusters as well as networking infrastructure. Such people are not easy to find, and if you are lucky enough to have them, they will be in high demand for many projects. Secondly, tight project timelines often make it prohibitive to spend too much time on a proof-of-concept. For management to make a decision, you need to gather facts and have answers much more quickly.
Because of these reasons, many users turn to Inktank Professional Services to assist with proof-of-concept projects. Inktank
PS can assist with your POC by providing the following services:
analysing your use case, and documenting functional and non-functional requirements for your storage cluster
selecting hardware to match these requirements, including review of bill of materials for server and networking
hardware (from CPU power to disk drives)
designing a solutions architecture that best fits the requirements, including design for future scaling
performing performance analysis of the assembled configuration (on various levels, from pure disk performance to
cluster performance under heavy load)
making recommendations on how insights from the POC will apply if you build a much larger production system
Last but not least, the Inktank PS engineers are experts with proof of concept projects and pilot implementations. They have
plenty of document templates, proven deployment scripts, and hands-on expertise. They are familiar with common problems that you might run into, can quickly help out with advice, and can restore the health of your cluster if you should accidentally damage it during experimentation.
Inktank Professional Services has the technical talent, business experience, and long-term vision to help you implement, optimize, and manage your infrastructure as your needs evolve. We are committed to helping you get the most value out of Ceph by
leveraging our expertise, dedication, and enterprise-grade services and support.
If you would like to have a conversation with Inktank to plan a proof of concept, contact sales@inktank.com or
call +(855) 465-8265 * 1 (Sales Team).