
CEPH

REFERENCE
ARCHITECTURE
Single Rack Object Store

Copyright, Inktank Storage Inc. 2013

Summary
This document presents a reference architecture for a small digital content repository, designed for simplicity and low cost, while still delivering moderate throughput and high reliability. A good example might be an Indie film production company that needs:

- highly reliable storage for their valuable raw footage and edited results
- high-performance temporary storage for their editing and rendering tools
- a system that can be built for a very low initial cost and operated inexpensively
- a system that can grow incrementally as they grow
There are many other applications with similar needs for which this system would also be appropriate:

- storage for work-group collaboration products
- photo, video, and music storage for a web site
- archival storage for on-line backups or data tiering
It would also be well suited as a proof-of-concept implementation for a much larger system: a small system on which performance, reliability, and operational scenario testing can be performed to validate its suitability for a much larger deployment.

This system requires only a single 10G switch, simple networking, and a single Ceph Object Gateway. Higher throughput and availability can be obtained by adding additional switches, networks, object gateways, and load balancers.

Structure of this document

1. brief overview of the use case, key system characteristics, and the hardware, software and networking components
2. detailed discussion of the servers and networking, and why those choices are right for this use case
3. discussion of the recommended software distributions, versions, and configuration
4. brief overview of Inktank Proof of Concept and product support services

Intended Audience for this Document

Solution architects and system administrators tasked with designing and deploying Ceph-based storage solutions will benefit
from studying the design considerations of this reference architecture. Developers looking to improve their content repository
solution by integrating with Ceph can also get a sense of how the storage subsystem will be deployed.

The storage sub-system presents S3-compatible RESTful APIs to the repository management software running on one or more servers. The Ceph system is built across four commodity servers, each holding twelve 3TB SATA drives. This provides the studio with:

- 35 TB of editing scratch space
- 20 TB (6,000 hours) of triple-replicated, high-resolution film projects
- sufficient free space to maintain this redundancy after the failure of any server

Such a system should be able to service up to 2,000 storage requests per second. Streaming write throughput is expected to reach approximately 200 MB/s and reads up to 600 MB/s.

This system could easily be expanded to five times this capacity and throughput by adding only additional servers. Further growth would also be incremental, but would require additional racks and switches.

Brief Description of System

Summary Diagram

The following diagram shows the logical components of the system: four applications that are consumers and producers of data, and the storage sub-system composed of four machines.

1. Solution Overview

This chapter provides an overview of the ingredients that went into the reference architecture, describes how the software components are deployed on the participating nodes, and outlines dependencies on the underlying operating system.

1.1 Relevant Use Cases and Environments

This is a good solution for the Indie film company because:

- it can provide three-copy redundancy for valuable raw footage and finished products, while allowing scratch space to be unreplicated.
- it can provide very good streaming write throughput for a small number of editing stations and excellent streaming read throughput for a larger number of editing and viewing stations.
- it will continue providing service after the complete failure of any single node.
- it can be implemented with a single rack, a single switch, and four servers.

From a technical viewpoint, it should be recognized that because this system uses only a single switch and a single (active) Ceph Object Gateway:

- it is not highly available (the switch is a single point of failure)
- the aggregate client throughput is limited to what can be handled by a single Ceph Object Gateway.

However, for our Indie film company, these two limitations are not a major concern, and they are willing to make these trade-offs. They get excellent durability for their data on a very low budget. In the event of a switch outage, they are willing to accept the risk of waiting until a replacement part is installed.

1.2 Component Overview

This is a relatively small system, designed for a minimum of four nodes and expandable to around twenty nodes and several hundred terabytes. It is intended to fit entirely in a single rack, served by a single switch. Because this system needs to be able to run on a small number of nodes, we have chosen to co-locate all of the services on identical servers (each with 12+2 disks and 64GB of RAM). In larger systems one would use different types of machines for storage nodes, monitor nodes, and Gateway servers.

1.3 Connectivity Overview

A small cluster, served by a single Gateway server, can carry all client and internal traffic on a single 10G network, served by a single switch, and requires only a single 10G NIC per storage node. Even small clusters must be lights-out manageable, i.e. remotely manageable even after the failure of a NIC or switch. For this reason, we recommend that separate 1G networks be set up for IPMI and management.

1.4 Software Overview

A reasonable Ceph system (whether for testing or deployment) should have at least three nodes:

- three nodes must be running the monitor service so that two can still form a quorum if one fails
- three nodes must be providing storage service so that we can still maintain two copies if one fails
- fortunately, we can run both monitor and object storage daemons on the same node

If three-copy replication is to be used, then a minimum of four nodes is needed. To run a cluster with a minimum number of servers, it is necessary to co-locate multiple services on each node. In a minimal four-node system, we might distribute functionality among the four nodes as follows; a minimal configuration sketch illustrating this layout appears after the next paragraph.

The Object Storage Daemons, Monitors, and Ceph Object Gateway are 100% user-mode code and able to run on most recent Linux distributions. That having been said, these systems should be running stable releases with 3.0 or later kernels (to take advantage of bug fixes and the syncfs system call) and the best available version of the chosen OSD file system.
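As a rough sketch of such a layout (the hostnames and addresses below are hypothetical, and only the first few OSD sections are shown), a minimal ceph.conf might place a monitor on three of the four nodes and OSDs on all of them:

[mon.a]
        host = node1
        mon addr = 192.168.1.1:6789

[mon.b]
        host = node2
        mon addr = 192.168.1.2:6789

[mon.c]
        host = node3
        mon addr = 192.168.1.3:6789

; one [osd.N] section per data disk (twelve per host, across node1-node4);
; node4 carries OSDs and the Ceph Object Gateway, but no monitor
[osd.0]
        host = node1

[osd.12]
        host = node2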

2. Hardware Components
In this chapter we will recommend specific classes of hardware for each component and briefly discuss the rationale for those recommendations.

2.1 OSD Nodes

One of the most fundamental system design questions is how many disks we want per storage node:

- More disks per node generally result in a denser, lower-cost solution.
- Each disk represents added throughput capacity, but only up to the point of saturating the node's NIC, CPU, memory, or storage adaptor.
- A storage node is a single point of failure. The more disks per node, the greater the fraction of our storage that can be lost in a single incident. The amount of time, network traffic, and storage node load required to deal with a storage failure is proportional to the amount of storage that has been lost.

Thus there is a tradeoff to be made. For smaller Ceph deployments we recommend a balanced architecture that utilizes a standard 12-drive 2U chassis configuration offered by multiple popular hardware vendors. For much larger systems (with only moderate throughput demands), many more disks can easily be supported per node, as long as the memory and CPU power are increased accordingly. Generally, we recommend roughly 1GHz of one CPU core and at least 1-2GB of memory per OSD.

2.1.1 System Disks

It is recommended that the operating system and Ceph software be installed on (and boot from) a RAID-mirrored disk pair. This prevents the (roughly 0.7%/year) failure of a system disk from taking an entire node out of service. If that cost is deemed too high, a single local disk can be used.

For a small operation, booting off of local disks is almost surely the right answer. In larger organizations that have the appropriate networking and image management infrastructure, centralized boot images may make node management much easier. Network booting reduces our dependency on local disks, but is a slower process that is dependent on network infrastructure and multiple additional servers.

2.1.2 Journal Configuration

Ceph storage nodes use a journal device (or partition) to quickly persist and acknowledge writes, while retaining the ability to efficiently schedule disk updates. For systems that are expected to receive heavy write traffic, performance can be increased by maintaining these journals on separate SSD drives. Journals can alternatively be stored on the same drives that hold the corresponding data. This is simpler, less expensive, and more reliable (having fewer components), but will not be capable of as high a write throughput.

Because this reference architecture is optimized for simplicity and low cost rather than high write throughput, we recommend the simpler same-disk journal configuration. This is also a better fit for the Indie film company's tight budget.
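A minimal ceph.conf sketch of the same-disk journal choice (the journal size is expressed in MB and matches the 2GB journal partition in the hardware specification below):

[osd]
        ; with same-disk journals, each OSD keeps its journal on a small
        ; partition (or file) on its own data disk; no separate SSDs are needed
        osd journal size = 2048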

2.1.3 Storage Controllers

When determining what kind of disk controller to use with Ceph, there are two distinct classes of controllers that should be considered. The first is basic SAS JBOD controllers with no on-board cache; these work well when SSD journals are utilized, as there is no contention between journal writes and data writes on the same device. The second class is RAID-capable controllers with battery backup units and write-back cache. This kind of controller is extremely useful when journals and data are stored on the same disk. Write-back cache reduces contention between journal and data writes and generally improves performance, though not necessarily to the levels that SSD journals do.

To see examples of how SSDs and write-back cache affect write performance, please see our Ceph Argonaut vs Bobtail
Performance Preview:
http://ceph.com/uncategorized/argonaut-vs-bobtail-performance-preview/#4kbradoswrite

Typical System:

This reference architecture focuses on building simple, reliable, and well-balanced Ceph nodes for small to medium sized clusters. To that end, we've chosen a very common 12-disk platform that is available from many different hardware vendors. Journals have been left on the same disks as the data, but utilize a controller with write-back cache to improve performance. We've specified 64GB or more of RAM, which is more than the minimum needed to support the OSDs on the system. The extra RAM provides additional buffer cache, allows the systems to also host MON or RGW services, and should not add significantly to the price.

System Specifications:

System Disks: 1 or 2 (RAID1) 250GB+ hard drive(s)
OSD Data Disks: 12 x 3.5" 3TB+ 7200 RPM hard drives
OSD Journal Disks: 2GB partition on each OSD Data Disk
CPU(s): At least 6 Intel or AMD cores running at 2.2GHz+ (2.0GHz is acceptable if monitor or RGW services are not on the same nodes)
Memory: 64GB+
Storage Controller: Battery-backed write-back cache recommended (i.e. LSI SAS2208-class card with BBU or similar)
Network Ports: At least 1 10GbE port for data
Management Ports: 1 1GbE or 10GbE port for management
IPMI Port: Optional dedicated IPMI port

There are several offerings from different vendors that meet these specifications:

Vendor: Supermicro
Model: 6027R-E1R12T (Note: optional parts MCP-220-82609-0N and BTR-0022L-LSI00279 recommended. CPU, memory, and disk purchased separately. Please speak with your Supermicro representative or system integrator.)
Link: http://www.supermicro.com/products/system/2U/6027/SSG-6027R-E1R12T.cfm

Vendor: Dell
Model: R720xd (Note: flex bay option, H710 or H710p controller, and 10GbE adapter recommended. Please speak with your Dell representative.)
Link: http://www.dell.com/us/enterprise/p/poweredge-r720xd/pd

Vendor: HP
Model: DL380e (Note: optional rear drive bay, P420 controller, and 10GbE adapter recommended. Please speak with your HP representative.)
Link: http://shopping1.hp.com/is-bin/INTERSHOP.enfinity/WFS/WWUSSMBPublicStore-Site/en_US/-/USD/ViewStandardCatalog-Browse?CatalogCategoryID=DSwQ7hacs9sAAAE3Do9ObFx_

2.2 Monitor Nodes

For small clusters like this one, Ceph monitor services can be run on the same nodes as the OSDs. We recommend slightly over-provisioning the CPU and memory resources if OSD nodes are also used for monitoring. For example, a node hosting 12 OSDs and 1 monitor could be configured with a 2.2+GHz 6-core CPU and 64GB of RAM to support the OSDs and MON, and provide additional memory for buffer cache. A larger system disk may be desired to store additional logs as well.

This configuration has been optimized for simplicity and low price. Larger clusters with more storage nodes and disks will cause the monitors to use more CPU and memory resources. For larger configurations we generally recommend dedicated monitor nodes:

System Disks: 1 or 2 (RAID1) 3.5" 250GB+ hard drive(s)
CPU(s): 64-bit Intel or AMD CPU (Xeon E3-1200, Xeon E5-2400, or Opteron 4100 series processor acceptable)
Memory: 8GB+
Network Ports: 1 1GbE or 10GbE port for monitor traffic
Management Ports: 1 1GbE or 10GbE port for management
IPMI Port: Optional dedicated IPMI port

Example offerings from hardware vendors that meet these specifications include:

Vendor: Supermicro
Model: 5017R-MTRF (Note: CPU, memory, and disk purchased separately. Please speak with your Supermicro representative or system integrator.)
Link: http://www.supermicro.com/products/system/1U/5017/SYS-5017R-MTRF.cfm

Vendor: Dell
Model: R420
Link: http://www.dell.com/us/enterprise/p/poweredge-r420/fs

Vendor: HP
Model: DL160
Link: http://h10010.www1.hp.com/wwpc/us/en/sm/WF25a/15351-15351-3328412-241644-3328421-5211699.html?dnr=1



2.3 Ceph Object Gateway Nodes

A Ceph Object Gateway implements RESTful (S3 or Swift) operations on top of a RADOS cluster. It receives S3/Swift requests from client nodes, and translates those into operations on the RADOS objects that represent the users, buckets, and file objects. Most of the processing in the Gateway server is receiving and sending network messages; all of the actual data storage is in the RADOS cluster. The same platforms described above for Monitor nodes would also be a good choice for a dedicated Ceph Object Gateway, with two key differences: networking and log storage.

- For high throughput applications it might be desirable to put incoming RESTful (S3 and Swift) traffic on a separate NIC (and perhaps network) from the outgoing RADOS object traffic. Forcing these two data streams to compete for a single NIC could significantly reduce the achievable throughput.

- Ceph Object Gateways maintain extensive logs of all of the requests they serve. These logs are often critical for diagnosing customer complaints (to determine exactly what requests were made when). For this reason, it is a good practice to dedicate a 1TB drive (or perhaps even a RAID-1 pair) to log storage.

Because this reference architecture is optimized for simplicity and low cost rather than high write throughput, we recommend the simpler configuration, where the Ceph Object Gateway is co-located on one of the storage nodes. Adding a load balancer would make it possible to support multiple active Ceph Object Gateways, significantly improving both throughput and availability. But if our primary concern is availability, a stand-by Object Gateway can be run on another node, and DNS can be used to reroute traffic to the stand-by if the primary Object Gateway fails.

Network design is fairly simple for small systems, because it does not have to address high availability and inter-rack throughput requirements.

A basic four-node proof-of-concept system can be served by a few spare ports (four 10G and four 1G) on an existing switch. Even the largest system covered by this reference architecture (20 servers, each with separate front-side and back-side 10G NICs and separate 1G IPMI and management networks) can easily be handled by a pair of 48-port switches (one 1G, one 10G). But, as mentioned previously, putting all of the data traffic through a single switch creates a single point of failure for the entire cluster. Larger clusters that must provide higher availability require multiple switches (and are described in other Reference Architectures).

It is useful to distinguish as many as four data networks in an object storage system:


i. the client service network by which client requests reach the load balancer(s).
ii. the Gateway service network which interconnects the load balancer(s) to the Ceph Object Gateways.
iii. the front-side data network by which the Ceph Object Gateway reaches the RADOS servers.
iv. the back-side data network across which RADOS storage nodes perform replication, data redistribution, and recovery.

In this (small) configuration, there are no load balancers (eliminating network ii), and the Ceph Object Gateway is co-located with
RADOS storage nodes (combining networks i and iii). Because all traffic in this cluster is funneled through a single Ceph Object
Gateway, it is not likely that there will be enough traffic to justify the separation of networks iii and iv. In larger configurations
(with load balancers and discrete Gateway servers) these four networks would probably be distinct.

2.4 Rack Networking


Whether you choose to use spare ports on an existing switch, dedicated small switches, or dedicated large switches depends on your expectations for the future:

- If this is a temporary proof-of-concept where you expect to do some testing and then recycle the components, there is little reason to dedicate new switches to this system.
- If this is expected to always be a small system (e.g. starting at four nodes and perhaps growing to eight), relatively small (e.g. eight or 16 port) switches will surely suffice.
- If this system is expected to grow to a full rack (or even multiple racks), you would be well advised to start out with rack-scale (e.g. 48 port) switches and separate front-side and back-side data networks.

2.4.1 Front-Side Data Network

A single client can easily generate data at rates of 1 gigabyte per second or more. A storage node with twelve drives could easily stream data to or from disk at an aggregate rate of 1 gigabyte per second or more. Unless it is known that data will only be trickling into this system (e.g. this is an archival service), a 1G network fabric (or a Layer 1 switch) would surely become a critical bottleneck. We recommend at least a Layer 2, non-blocking, 10G switch.

If this cluster is to be more than four nodes and we expect it to see a great deal of traffic from clients who are not on the same switch, the interconnection to the client network may need to be much faster (e.g. 40Gb/s).

2.4.2 Back-Side Data Network


If a RADOS cluster is expected to receive significant write traffic, it is recommended that the cluster be served by separate 10G front-side and back-side data networks:

- a client can easily use 100% of its NIC throughput to write data into the RADOS cluster (over the front-side network).
- if multiple copies are to be made, the server that received the initial write will forward copies to secondary servers (over the back-side network). Thus, if the storage pool is configured for three copies, each front-side write will give rise to two back-side writes.
- in addition to initial write replication, the back-side network is also used for rebalancing and recovery.

If a cluster is expected to make N copies of each write, the back-side network should be able to handle roughly N-1 times the traffic that is on the front side, plus rebalancing and recovery traffic. In extremely high throughput situations (continuous large writes) it may even be desirable to bond together multiple 10G interfaces to handle the corresponding back-side traffic. As with the front-side, if there is to be a separate back-side data network, we recommend at least a Layer 2, non-blocking, 10G switch.
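For example, at the roughly 200 MB/s of streaming client writes targeted by this architecture, three-copy replication would generate on the order of 2 x 200 MB/s = 400 MB/s of replication traffic, which fits comfortably within a single 10G link; rebalancing after the loss of a node with twelve 3TB drives would add substantially more for the duration of the recovery.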

Because the only traffic carried on the back-side network is data transfers between storage nodes, it may be desirable to provision this network as a distinct VLAN. There is no reason for any other systems to have access to this network.

You may wonder why we do not recommend putting the front-side and back-side networks on different switches. The switches
are probably more reliable than the servers or their NICs, and adding a second 10G switch to this configuration would greatly
increase the cost, without greatly increasing reliability. In larger (multi-rack) configurations, however, careful thought must be
given to which servers are on which switches.

2.4.3 Management Network

While client data and replication can easily saturate 10G front-side and back-side networks, status reporting, statistics collection, logging, and management activity generate comparatively little traffic, and can easily be accommodated by a 1G network.

Creating a separate 1G network for management offers a few advantages:

- it prevents management traffic from interfering with performance-critical data traffic.
- it creates a completely independent path (including the switch) to each node, enabling better failure detection and easier diagnosis of failures in the data path.
- in larger systems (where multiple switches are required) it enables the use of a less expensive switch for the traffic that does not require 10G throughput.

As general (non-management) clients will have no need to participate in these interactions, this network too can be put on a distinct VLAN.

[Diagram: a simple design with no separate back-side data network]

2.4.4 Emergency Network

Independently of whether or not you are running Ceph software, remotely hosted servers in a lights-out environment will probably need additional networking to enable server and switch problems to be corrected without a service call:

- remote serial console access (to both servers and switches) from a highly accessible serial console server.
- a distinct IPMI subnet (and perhaps VLAN), preferably served by a different switch than the one that serves the management subnet.

[Diagram: a network design with higher throughput, separation, reliability, and flexibility]

3. Software Components

3.1 Distributions and Versions


Ceph Server OS: The recommended operating system for this RA is Ubuntu Precise (12.04). Inktank provides support for the following distributions: CentOS/RHEL, Debian, Ubuntu, SLES, and OpenSUSE.

Ceph Client OS: Since the Ceph Object Gateway will be sharing resources with OSD daemons, the Ceph Client OS should be the same as the Ceph Server OS.

Ceph Version: Inktank recommends the use of the latest Ceph Bobtail release (0.56.4) for this Reference Architecture.

3.2 Ceph Configuration


Ceph Policies

The underlying filesystem for OSDs should be XFS, formatted with the following option: -i size=2048. The OSD filesystems should be mounted with the following options: noatime,inode64.
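As an illustration (the device name and mount point below are placeholders for whichever data disk and OSD directory are being prepared), this corresponds to commands along the lines of:

# format an OSD data disk with 2048-byte inodes
mkfs.xfs -i size=2048 /dev/sdb1

# mount it with the recommended options
mount -o noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0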

Inktank recommends the use of Ceph's apache2 and mod_fastcgi forks. These forks have been optimized for the HTTP 100-continue response, resulting in performance improvements. Our mod_fastcgi fork also provides support for HTTP 1.1 Chunked Transfer Encoding.

For better performance, it is also recommended to deactivate RGW operations logging on the host running the gateway. While it is possible to send RGW operations logs to a socket, that configuration is out of the scope of this RA. Logging should be deactivated for performance testing and reactivated afterwards.

CRUSH will, by default, create replicas on different hosts. The default number of replicas is 2: the primary and one copy. If you wish to have more replicas, you can do so by recreating the pools used by RGW. Note, however, that a replication level higher than 4 will not be possible in this RA, since there are only four storage nodes.
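As a sketch (.rgw.buckets is the default RGW data pool in Bobtail; substitute whichever pools your gateway actually uses), raising the replica count to three might look like:

# list the pools in the cluster (RGW creates pools such as .rgw and .rgw.buckets)
ceph osd lspools

# set three-copy replication on the RGW data pool
ceph osd pool set .rgw.buckets size 3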

For this RA, we recommend the following default RGW configuration:



[client.radosgw.`hostname`]
        host = `hostname`
        keyring = /etc/ceph/ceph.client.radosgw.`hostname`.keyring
        rgw socket path = /tmp/radosgw.sock
        log file = /var/log/ceph/radosgw.log
        rgw enable ops log = false

We also recommend the use of the default path values for OSD and mon directories, especially when using Upstart and/or Ceph
deployment solutions (Chef cookbooks, ceph-deploy).
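For reference, the default locations used by Upstart and the deployment tools are the standard Ceph data directories (shown here for illustration; they normally do not need to be set explicitly):

[osd]
        osd data = /var/lib/ceph/osd/$cluster-$id
[mon]
        mon data = /var/lib/ceph/mon/$cluster-$id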

3.3 Other Service Configuration

It is important to make sure that your system disks do not fill up, especially on the nodes hosting the monitors. Setting a proper log rotation policy for all system services, including Ceph, is very important. Regular inspection of disk utilization is also suggested. Be aware that increasing Ceph debugging verbosity can generate over 1GB of data per hour. If you are planning on creating a separate partition for the /var directory on the system, please plan accordingly.

For more information on setting Ceph log rotation policy, see:


http://ceph.com/docs/master/rados/operations/debug/
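As a minimal illustration (the retention values here are arbitrary; the Ceph packages install their own logrotate policy under /etc/logrotate.d that can be tuned instead), a rotation rule for the Ceph logs might look like:

/var/log/ceph/*.log {
        daily
        rotate 7
        compress
        missingok
        notifempty
}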

4. Next Steps
If this sounds like an interesting architecture, Inktank can help you realize it.

4.1 Proof of Concept

Unlike proprietary storage solutions from established vendors, it does not take much to get started with Ceph. There is no need for upfront investment in specialized new hardware or software licenses. To try out the basic functionality, you can use any old existing commodity server hardware, attach a bunch of hard drives, deploy Ceph, and take it for a test drive. Many hobbyists and software-defined storage enthusiasts have stood up sizeable Ceph clusters just by following the Ceph documentation and discussions on the ceph-users community mailing list.

However, for many corporate users that is not an option. Firstly, because of resource constraints: the ideal person to do a Ceph POC is someone who understands open-source technology and has an appreciation for scalable storage clusters as well as networking infrastructure. Those resources are not easy to find, and if you are lucky enough to have them, they will be in high demand for many projects. Secondly, tight project timelines often make it prohibitive to spend too much time on a proof-of-concept. For management to make a decision, you need to gather facts and have answers much more quickly.


Because of these reasons, many users turn to Inktank Professional Services to assist with proof-of-concept projects. Inktank PS can assist with your POC by providing the following services:

- analysing your use case, and documenting functional and non-functional requirements for your storage cluster
- selecting hardware to match these requirements, including review of the bill of materials for server and networking hardware (from CPU power to disk drives)
- designing a solution architecture that best fits the requirements, including design for future scaling
- performing performance analysis of the assembled configuration (on various levels, from pure disk performance to cluster performance under heavy load)
- making recommendations on how insights from the POC will apply if you build a much larger production system

Last but not least, the Inktank PS engineers are experts with proof-of-concept projects and pilot implementations. They have plenty of document templates, tried deployment scripts, and hands-on expertise. They are familiar with common problems that you might run into, can quickly help out with advice, and can restore the health of your cluster if you should accidentally damage it during experimentation.

4.2 Inktank Professional Services

Inktank Professional Services has the technical talent, business experience, and long-term vision to help you implement, optimize, and manage your infrastructure as your needs evolve. We are committed to helping you get the most value out of Ceph by leveraging our expertise, dedication, and enterprise-grade services and support.

If you would like to have a conversation with Inktank to plan a proof of concept, contact sales@inktank.com or call +(855) 465-8265 * 1 (Sales Team).
