
Why Pods?

This document was written in June 2014, by Scott Hansma & Ian
Varley, to explain why we scale with Pods (and Superpods) in
Salesforce infrastructure. sfdc.co/pods
New! Now in presentation form and video form.

For noobs: What are Pods (& Superpods)?

Pods are identical[1] collections of hardware & software that support a discrete subset of
our customer base. Any customer organization (org) exists on exactly one Pod, and
only moves between them via migration or split. We call Pods instances to our
customers, and the name of the instance is a visible part of the URL they use (e.g.
http://na1.salesforce.com)[2]. As of mid-2014 we have ~50 Pods. Read more.
Superpods are sets of several Pods, in the same location, along with shared services
used by all of those Pods (like DNS, Hadoop, etc). Superpods aren't directly visible to
customers. Pods can move[3] between Superpods during Pod migrations. Superpod is a
confusing term; it also refers to the design of how multiple Pods and services are laid
out, and to just the shared services; more on that below: Naming Is Hard.
Every bit of hardware and data in a production Pod has an identical mirror-image copy in
a DR (Disaster Recovery) data center, somewhere else in the world. (So technically, an
instance is composed of two Pods in different data centers, but only one is ever
running. We haven't done many failovers; we're getting better.)

Why do we have Pods?


To be honest, we kind of chose the Pod design accidentally; our databases were growing
faster than we could scale them, and splitting them was the only answer. (For a fun look
back in time, check out this deck from the NA1 split back in 2006.)
Our Pods started to proliferate, and we realized we'd hit a limit on vertical growth. The
choice was to either double down on the Pod strategy, or start powering the Oracle
databases with magical uranium, because shit was getting crazy[4]. So we doubled down.

[1] They're not really identical. They're all beautiful snowflakes. But that is a story for another time.
[2] It can be masked by custom domains (like org62.my.salesforce.com), but not all customers do that. In fact, hardly any do.

[3] Move logically, not physically; we don't actually move the machines in a Pod migration, just the data.
[4] Basically, we exceeded our capacity to scale and had a painful few months. There were two distinct historical developments: the move from E25k to Linux 8-node RAC, and the proliferation of new Pods once we recognized that we'd hit a limit on vertical growth of a single Pod. Ask an old-timer about it; fun times.

Nowadays, the Pod strategy is our explicit choice for scaling. Why? We like it for 3 big
reasons:
The database is the center of gravity
Fault domains work
We like predictability
Let's explore each one in turn.

One: The Database Is The Center of Gravity


Data is the center of the Salesforce universe; specifically, our
relational database backbone (which causeth us both joy and
sorrow). The transactional correctness of Salesforce, as a product,
relies directly on having each org's master data stored in a single
database, which is shared across many orgs, and has near-perfect
availability and low latency from every other server in the Pod.
We share one relational database across many customers (multi-tenancy is
the cornerstone of our architecture[5]). But we don't share one database
across all customers, because that would be too big. Relational databases
aren't designed to scale horizontally; they have max size limits. For
Salesforce's Oracle databases, we've found the practical happy size limit
to be around 30TB, run by ~8 beefy RAC nodes and an attached SAN. That
configuration runs about 10K small orgs, give or take a few big ones.
Fortunately, we have a clean way to shard our data: by customer. So instead of scaling
up, we scale out, with multiple databases. Each customer lives on one database; when
the DB gets to 60% capacity, we stop letting new orgs sign up there (and fill the rest of
the space via organic growth, from the existing orgs on that Pod).
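To make that concrete, here's a rough sketch of the two rules above: an org maps to exactly one instance, and an instance stops taking new sign-ups once its DB passes 60% capacity. (The class and method names are made up for illustration; the real routing and provisioning machinery is obviously a lot more involved.)

```java
import java.util.Map;

/** Hypothetical sketch of org-to-Pod routing and the 60% sign-up cutoff. */
public class PodDirectory {

    /** Every org lives on exactly one instance; this map is the source of truth. */
    private final Map<String, String> orgToInstance;

    /** Fraction of DB capacity in use, per instance (0.0 - 1.0). */
    private final Map<String, Double> dbUtilization;

    private static final double SIGNUP_CUTOFF = 0.60;

    public PodDirectory(Map<String, String> orgToInstance, Map<String, Double> dbUtilization) {
        this.orgToInstance = orgToInstance;
        this.dbUtilization = dbUtilization;
    }

    /** All requests for an org are served by its one instance, e.g. "na1". */
    public String instanceFor(String orgId) {
        String instance = orgToInstance.get(orgId);
        if (instance == null) {
            throw new IllegalArgumentException("Unknown org: " + orgId);
        }
        return instance;
    }

    /** New orgs may only be provisioned on instances below the 60% DB threshold;
        the remaining headroom is left for organic growth of the orgs already there. */
    public boolean acceptsNewOrgs(String instance) {
        return dbUtilization.getOrDefault(instance, 1.0) < SIGNUP_CUTOFF;
    }
}
```

The point of the sketch is just the two invariants: one org, one Pod; and a capacity cutoff that decides where new orgs land.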
So why should all the other infrastructure components (like app servers, networks, file
servers, etc) also be sharded into Pods, like the DB? There are two answers: one for
stateless services, and one for stateful ones.
Stateless Services
For stateless services (like The Core App), the main concern is database
connectivity. Every customer request does multiple DB reads & writes, so
we keep a pool of database connections open at all times (more on that
here). We've talked about hooking the same app servers up to multiple
databases (in a project code-named 2-headed-chicken), but for a lot of
stupid reasons, it's harder than it should be.
[5] Good video on that multi-tenant magic from 2009, by former CTO Craig Weissman: Salesforce Multi-Tenant Architecture

So instead, we cluster an appropriate amount of compute (about 30 app servers)
around a fixed size of database (8 RAC nodes), and it works pretty well. We also size
memcached (which runs on the app servers[6]) accordingly. The same goes for MQ and
other services that provide compute on top of the database.
Now, the core app isn't just Java; it also includes a cool million lines of PL/SQL that run on
the database, and must be in sync with the app. So if your app servers talked to many
databases, they'd need to be running the exact same version of PL/SQL, down to the
e-release level. That would be a pain in the ass.[7] (This is part of what's hard about
2-headed chicken.)
So, keeping all of the stateless processing logic in the app in orbit around the relational
DB makes sense.
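As a rough illustration of what "orbiting the DB" means for a stateless tier: each app server holds a warm pool of connections to its own Pod's database, and only that database. (This sketch uses Apache Commons DBCP purely as an example pooling library, with made-up sizes; it's not our actual connection-management code.)

```java
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;

/** Sketch: each app server keeps a warm pool of connections to its own Pod's DB. */
public final class PodDatabasePool {

    /** One pool per app server, pointed at exactly one Pod's RAC service. */
    public static DataSource forPod(String podJdbcUrl, String user, String password) {
        BasicDataSource pool = new BasicDataSource();
        pool.setUrl(podJdbcUrl);          // the RAC service for this Pod only
        pool.setUsername(user);
        pool.setPassword(password);
        pool.setInitialSize(20);          // keep connections open at all times
        pool.setMaxTotal(40);             // bounded: ~30 app servers share 8 RAC nodes
        pool.setValidationQuery("SELECT 1 FROM dual");
        return pool;
    }

    private PodDatabasePool() {}
}
```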
Stateful Services
For stateful services (in particular, for data stores like FFX and HBase), there's a more
pressing reason to orbit the relational database: DR (disaster recovery). If we fail over
to another data center, we must be able to reliably fail over all the data, and we need to
be able to prove that it's correct. For individual data items, these other stores are their
own master; but for the overall org state, the relational database is the brain. If you
could fail over some parts of the data but not others, you'd get into a situation that's
very difficult to reason about (and likely has app servers talking to data stores in another
data center!). The Keystone project is our attempt to reason about this explicitly, but
we're not there yet.
Many data stores are not capable of taking writes from two data centers at the same
time. That means that if a larger data store were shared by several Pods and a single
Pod's DB failed, you would need to fail over all the Pods that used that shared store,
which makes the DR domain larger than we'd like it to be: you'd end up having to DR a
lot of Pods when you only wanted to DR one, because you can't disentangle the
arbitrary network of databases. That'd be confusing and bad.
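One way to see the blast-radius problem is as a simple set computation: the Pods you'd have to fail over together are all the Pods that share any stateful store with the one whose DB just died. A hypothetical sketch (made-up class; one hop only, the real entanglement could be worse):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: the DR domain of a Pod is every Pod that shares a stateful store with it. */
public class DrBlastRadius {

    /** Which Pods write to each stateful store (DB, FFX, HBase, ...). */
    private final Map<String, Set<String>> storeToPods;

    public DrBlastRadius(Map<String, Set<String>> storeToPods) {
        this.storeToPods = storeToPods;
    }

    /** If this Pod's DB fails, every Pod sharing a store with it must fail over too. */
    public Set<String> podsToFailOverWith(String pod) {
        Set<String> domain = new HashSet<>();
        domain.add(pod);
        for (Set<String> pods : storeToPods.values()) {
            if (pods.contains(pod)) {
                domain.addAll(pods);  // one hop; transitive sharing would widen it further
            }
        }
        return domain; // when every store is Pod-local, this is just {pod}
    }
}
```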
One area where we currently violate this is File Force, and the "TI-Don't-Copy" feature,
which avoids copying files to sandboxes, and instead points at those same files in the
production org. In those cases, a single sandbox org's dataset spans both its sandbox
Pod and its original source Pod, and that leads to this exact type of grief: it makes it
really hard to reason about DR.

[6] If memcached lived on, say, 2-4 boxes, rather than the whole app tier, it would reduce response variance a lot. We should do that.
[7] Of course, if the database were completely presented as a service with forward and backward compatibility, then you wouldn't have that problem; instead you'd have a worse problem: trying to replace the standard database abstraction layer (SQL) with a "better" one. Many have died trying.
There are also some services that are in the gray area between stateful and stateless.
One example is search, which is not technically a System of Record (SOR) because it's
just a transformed copy of the primary database (so you could make multiple copies,
recreate it, etc[8]). But, because it can't be recreated fast enough to deal with an outage,
we have to treat it like SOR data. So here again, it makes sense for search to orbit the
single relational database it indexes.

[8] Though in reality, losing it for big orgs would be tantamount to a service outage, because it would take several days to recreate it.

Two: Fault Domains: It Works


Look at this screen grab from trust.salesforce.com:

[screen grab of the trust.salesforce.com instance status grid]

You see how there's no column where all the icons show a disruption or a degradation
at the same time? Exactly: it doesn't happen. (Often.[9])

When something goes wrong with Salesforce's service, it's really important to our
business that any disruption is localized; we don't want all our eggs (AOV[10]) in one basket
(Pod). And if you have radically uncorrelated systems, you have a much greater shot at
doing this. The worst shit in the world could happen to one Pod, and we wouldn't kill all
the golden geese. (We'd have a lot of egg on our faces, though. (Sorry folks, just a yolk.))
This kind of protection isn't just about failures of software or infrastructure: it's also
about service protection. Customers can do sophisticated things on our platform, like run
massive reports and Apex triggers and Pig pipelines. Part of service protection is that
when we do find a customer abusing the system, the impact of that degradation is
limited. No customer can hose customers in other Pods, no matter how hard they try[11].

[9] It does happen sometimes, most recently in April 2014 when a DNS provider failure brought all Pods down. (This was outside our control, but of course we still should have had a strategy that didn't depend on a single one, and we do now.)
[10] AOV = Annual Order Value, i.e. the money our customers (including renewals) pay us. For this and other handy acronyms, see here.
"But wait!" you say. "What about Superpods? Do we really have fault isolation, if failure
in a shared service like NTP can hose many Pods at once?" Yeah, you're right. Superpods
are a compromise, so we can divorce our real estate strategy from our scaling strategy.
But look at the evidence: the number of times when screw-ups at the Superpod level
cause customer issues is a tiny fraction of how often Pod-level problems do.
In the end, nothing provides perfect fault isolation; to
paraphrase Randy, Earth is a single point of failure. You could
argue it's pointless to have a smaller fault zone (Pods)
because, hey, a datacenter can still fail, right? But the reality
is about probability: servers fail more often than Pods, which
fail more often than datacenters, which fail more often than
the entire planet. There will always be some larger fault
domain, but that doesn't make the smaller ones pointless.

[11] They most certainly can hose other customers in their own Pod (and org). Preventing that is the work of the service protection team.

3: Goldilocks & The Three Bears of Predictability


Here's a story. Goldilocks went to the bears' house. One of the bears was
a total asshole. He made people test every damn thing in lab
environments, and was always requiring a zillion approvals, and imposing
change moratoriums and shit. He said that nothing should change, ever.
Nobody liked this, but it was very predictable.
Another bear was a total stoner. He didn't test anything, changed stuff randomly in
production and then went on vacation. What a dick. That might be OK for Etsy, buddy.
But the third bear, she was pretty cool. She understood that the infrastructure would
have to change over time, and you can't stop it. You need to add new services, roll in
new hardware, and build cool new stuff. But you can't play fast and loose with a service
that runs critical services for hundreds of thousands of businesses.
The third bear had a strategy called the Immutable Pod Design Pattern. It goes like
this: once you get a known-good Pod (or Superpod) design, you pretty much stick with it.
When capacity starts to be an issue, you just stamp out another instance; you don't
change what it means to be an instance. So, e.g., when you need 8 more nodes of Oracle
RAC capacity, you stamp out another tried-and-true 8-node cluster (and all the
supporting services around it) instead of expanding your existing cluster from 8 to 16
and praying that 16 nodes works in production.
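A toy sketch of the Immutable Pod idea, using the ballpark numbers from this doc (the class is hypothetical, and the real template is racks of gear and config, not a Java constant): capacity problems are answered by stamping out another copy of the template, not by editing the template.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: the known-good Pod design is a fixed template; scaling means more copies of it. */
public class PodTemplate {

    // Ballpark shape of a Pod from this doc; the template itself rarely changes.
    static final int RAC_NODES = 8;
    static final int APP_SERVERS = 30;

    private final List<String> pods = new ArrayList<>();

    /** Need more RAC capacity? Don't grow a cluster from 8 to 16 nodes in place;
        stamp out another whole instance of the proven design. */
    public String stampOutNewPod(String name) {
        pods.add(name);
        return name + ": " + RAC_NODES + " RAC nodes, " + APP_SERVERS + " app servers";
    }
}
```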
Now, of course, sometimes you do need to make a change (like, say, adding another
Fileforce buddy pair, or rolling out a new kind of Search service). But in these cases, you
do it carefully: you dark launch[12] it in one instance, you watch the stats like a hawk, and
only when it's known to be stable in production do you roll it out everywhere. You don't
treat infra changes lightly, because in a heavily coupled, complex system like ours, they
can have unpredictable effects.
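Footnote 12 defines dark launching; a minimal sketch of the "parallel, unobserved" flavor might look something like this (hypothetical interfaces, not our feature-flag framework): the new code path runs alongside the old one and gets measured, but the customer only ever sees the old path's answer.

```java
import java.util.function.Supplier;

/** Sketch of a dark launch: run the new path in parallel, but never let it affect the response. */
public class DarkLaunch<T> {

    private final Supplier<T> currentPath;   // what customers actually get
    private final Supplier<T> newPath;       // launched dark: verified, not served

    public DarkLaunch(Supplier<T> currentPath, Supplier<T> newPath) {
        this.currentPath = currentPath;
        this.newPath = newPath;
    }

    public T serve() {
        T result = currentPath.get();
        try {
            T shadow = newPath.get();
            // Record agreement for the watch-the-stats-like-a-hawk phase.
            System.out.println("dark launch match=" + result.equals(shadow));
        } catch (RuntimeException e) {
            // Failures in the dark path are logged, never surfaced to the customer.
            System.out.println("dark launch failed: " + e.getMessage());
        }
        return result;
    }
}
```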
This is what we aim for at Salesforce. We're a multi-billion dollar company, and we do
need predictability. Pods let us avoid science experiments in our critical production
services, so we can stamp out repeatable, known-good designs.[13]

[12] A dark launch is one where you expose new functionality in a way that allows you to verify it, without turning it on for all live customer requests. That can mean either exposing it to a small subset of customer requests, or launching it in a parallel, unobserved way.

[13] We learned this lesson anew with the autobuild infrastructure (which runs all our bazillion unit tests every time someone checks in a change to a component, like Hodor). We increased capacity, only to be greeted with raging cascading failures.

Having multiple Pods gives us more ways to canary[14] our changes. We roll out first to
GS0, then the sandboxes, then NA1 in R0; then, over the following weeks, we roll out R1
and R2 to the rest of the Pods. Glaring issues at any stage give us time to react and fix
the problems before the majority of our customers see them. We can do this with
infrastructure changes, too: roll out HBase capacity adds to one Pod first, verify that the
sky didn't fall, and then do the rest. This is a good thing.
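That rollout order reads naturally as a staged plan. Here's a sketch of the idea (the stage names come from this doc; the structure is illustrative, not our actual release tooling):

```java
import java.util.List;
import java.util.function.Predicate;

/** Sketch: the rollout order described above, smallest blast radius first. */
public class ReleaseStages {

    static final List<List<String>> STAGES = List.of(
            List.of("GS0"),              // internal-ish first
            List.of("sandboxes"),        // then customer sandboxes
            List.of("NA1"),              // R0: one production Pod
            List.of("R1 Pods"),          // then a wave of Pods...
            List.of("R2 Pods"));         // ...then the rest

    /** Stop the train at the first stage that shows glaring issues. */
    public static void rollOut(Predicate<List<String>> healthyAfter) {
        for (List<String> stage : STAGES) {
            System.out.println("rolling out to " + stage);
            if (!healthyAfter.test(stage)) {
                System.out.println("halting rollout; fix before the next stage");
                return;
            }
        }
    }
}
```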
Now, this approach is not about complete immutability. In particular, some services are
themselves intended to be horizontally scalable, like Keystone and HBase. It's fine for
horizontally scalable services to have different server counts per Pod. It would be
madness to say that because NA5 needs some extra File Force or HBase space, we have
to add File Force or HBase capacity everywhere.
But for now, adding a new rack of servers to any of these
services follows the same careful rollout model,[15] because
it's not just your service you're changing; it's the chemistry
of the whole Superpod (the network, memcached, the WAN
pipe, etc). Even with a scalable data store like HBase, the
philosophy would be to pick some expected max size (say,
300TB), and if we're pushing the limits of that, consider
splitting the Pod, just like we would if the DB got too big or
APT got too high. (And maybe over time, that threshold
changes to 350TB or 400TB, but it doesn't suddenly jump
to 30PB.)
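The "pick an expected max and split rather than grow forever" rule sketches out to something like this (the 30TB and 300TB figures are the ones mentioned in this doc; the 80% trigger and the APT limit parameter are made-up placeholders):

```java
/** Sketch: a Pod pushing its expected max size is a candidate for a split,
    not for unbounded in-place growth. */
public class SplitCheck {

    // Illustrative ceilings in the spirit of the doc; real values get tuned over time.
    static final double MAX_HBASE_TB = 300.0;
    static final double MAX_DB_TB = 30.0;

    public static boolean shouldConsiderSplit(double hbaseTb, double dbTb,
                                              double aptMillis, double aptLimitMillis) {
        return hbaseTb > 0.8 * MAX_HBASE_TB   // HBase pushing its expected max
                || dbTb > 0.8 * MAX_DB_TB     // DB getting too big
                || aptMillis > aptLimitMillis; // APT getting too high
    }
}
```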
We'll always have variability between Pods. But our goal is to track and understand that
variability. Exceptions and irregularities need to justify their existence (i.e. be
explained in an obvious, visible way). If the justification for variance isn't good enough,
we generally want to eradicate it in the name of predictability.

Salesforce as Distributed System: Splits & Migrations


One way to think about the Pod & Superpod design is as one
giant, massive distributed database. In some distributed
databases (like HBase), when the load on a single node in the
system gets too high, the node is often split (for HBase, this
means that when a Region gets too big, it's automatically split
into two smaller ones).
[14] As in "canary in a coal mine": the idea that you make a change to a small portion of infrastructure first to see if it dies a horrible painful death, because the horrible painful death of a canary is much better than that of a miner.

[15] Or, as Scott would say, when you add capacity, a puppy dies.

At the macro level for Salesforce, this same process happens, but at the Pod level: when
a Pod gets too big, we split it. We haven't done many lately, but in the next year we have
over a dozen splits planned. It's critical for this process to be easy and repeatable. Right
now, it takes months and hundreds of people. :(
The other option we have in this process is org migration. Because of our multi-tenant
architecture and the way traditional relational databases work, it's quite difficult to
migrate an org from one Pod to another without significant downtime. Recent projects like
Buffalo[16] have made huge strides in improving this, but we're far from the holy grail of
seamless, instant, zero-touch migration. Keystone aims to be a leg up in that fight. If we
could do fast, seamless migration, we really shouldn't ever have to do a split again.

[16] Buffalo is a recursive acronym for BUFFalo A Live Org. Also, this is important for you to read.

What goes in the Pod, vs the Superpod?


The default answer is that most stuff should go in the Pod, because of the 3 reasons
listed above. Things that have to stay in-Pod are:
Services that provide system of record data storage
Services that are transactionally coupled with our Systems of Record (Oracle, FFX, HBase, etc)
Services that are built in such a way that they can't be shared across DBs (like memcached, QPID, etc)
Things that can live at the Superpod level include:
Much of the network (routers, load balancers, etc)
Hadoop, because it's not a SOR, and it's a batch system with no SLA
Insights, because it's not a SOR, and needs to be horizontally scalable and agile
The ops & M&M stacks (Gigantor, Ajna, kerberos infrastructure, etc)
Other logging and monitoring functions
Various other shared services like Raiden, UMPS, LA, etc.

And actually, some of this stuff lives at an even broader level than Superpod: some of it
lives at the Data Center level, like iDB. You can see a lot more detail about what goes
where in the Pod & Superpod link library.

Pods Also Suck


From this document, you might think that Pods are all sweetness and
light. They're not; there are definitely downsides we should be up front
about. Here are a few.

They prohibit an elastic (AWS-like) model


One of the great infrastructure trends in recent years is elasticity; AWS
(including EC2, S3, etc) is the prime example. As traffic increases, you bring on more
capacity transparently, and as it decreases, you shed it. At a high level, this is exactly
what we offer to our customers (we're a SaaS platform), but we don't have it ourselves on
the implementation side. Tying the service to specific metal also makes the service more
vulnerable to outages and security breaches (unlike VMs, which can move around).

Truly global shared data is hard


There are a small number of things (like the global set of users, orgs, ISV packages, etc)
that need to be synchronized across all Pods at the database level. This prompted us to
build a feature in ~2004 called Org Replication, which syncs database tables
(ALL_USERS, ALL_ORGANIZATIONS) and makes them identical everywhere. The move
from ~4 Pods to ~50 Pods caused a rewrite in 2011, and we'll need another rewrite when
the number of Pods crosses 100, because the process is still inherently O(n^2) in the
number of instances.
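The O(n^2) point is just arithmetic: if every Pod has to exchange these global tables with every other Pod, the number of replication streams grows quadratically. A quick illustration (assuming naive pairwise sync, which is roughly why the periodic rewrites are needed):

```java
/** Sketch: pairwise replication of global tables between n Pods means n*(n-1) one-way streams. */
public class OrgReplicationCost {

    public static long pairwiseStreams(int pods) {
        return (long) pods * (pods - 1);
    }

    public static void main(String[] args) {
        System.out.println(pairwiseStreams(4));    // ~4 Pods (circa 2004): 12 streams
        System.out.println(pairwiseStreams(50));   // ~50 Pods (2014): 2,450 streams
        System.out.println(pairwiseStreams(100));  // 100 Pods: 9,900 streams
    }
}
```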

Inherently scalable services suffer by being partitioned


Some services, like Keystone and HBase, are built on a horizontally scalable architecture;
they don't just tolerate being deployed in larger installations, they actually thrive on it.
Deploying a 20-node HBase cluster has relatively high overhead (5 master nodes vs 17
data nodes), whereas a larger cluster has lower overhead (100 data nodes would still
need only 5 master nodes). The degree of parallelism is generally higher in a larger
cluster, and fluctuations in load are amortized. The Pod architecture forces us to deploy
services like this in many small clusters rather than the preferred approach of fewer
big clusters; but for now, we can't.

Getting a holistic view of your services is hard


Splunk indexes are per-Pod; Graphite too. You can combine several in a single view, but
you face a slowdown with each Pod you add, so you're discouraged from doing that. But
looking individually at 50+ different graphs is madness. This is part of why we are so
shite at looking at things in production: it's really difficult because of Pods.

We can't take advantage of our many Pods for High Availability (HA)

Pods are the unit of availability. But we have (un)scheduled downtime on Pods. So things
in Pods are not HA. While we'd love to get to HA within Pods (and have projects, like
Active Data Guard and ReadOnlyDB, to help with that), until then, we should be smarter
about services that can't go down (e.g. Login/Identity).

Modifying global state is hard


If you want to make a change to something in black tab, or add a rule in Gater, you have
to do it once per Pod; there's no master switch. As the rest of this document makes
clear, this is a pain in the ass by design, for the protection of the service. But it's still a
pain in the ass. This is a problem that automation can fix.
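A sketch of the sort of automation that helps (hypothetical; real black tab and Gater changes go through their own tooling): apply the same change once per Pod, in a loop, instead of by hand.

```java
import java.util.List;
import java.util.function.Consumer;

/** Sketch: there's no master switch, so a "global" change is really one change per Pod. */
public class PerPodChange {

    public static void applyEverywhere(List<String> pods, Consumer<String> change) {
        for (String pod : pods) {
            // One Pod at a time, so a bad change is caught before it's everywhere.
            System.out.println("applying change to " + pod);
            change.accept(pod);
        }
    }
}
```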

Without modern automation, it's a big snowflake party


We started down the Pods road before we understood the importance of avoiding
snowflake infrastructure. But, due to the Pod design, we're stuck with a LOT of
snowflakes today, and automation is an uphill battle.

Naming Is Hard
The names Pod and Superpod (as described in this doc) are primarily internal usages.
There are also some other words in use, so this section is a (possibly futile) attempt to
make a little sense of it.
You may have heard the words kingdom and estate, which relate to future Data Center
Automation ideas:
"Kingdom" is the group of machines controlled by a single R2 controller.
"Estate" is a group of services within a kingdom. It's also a security
boundary: all services within an estate can talk to each other freely and directly to
hosts (without any ACLs, etc).
So depending on how we implement it, a Superpod may be a kingdom; or a kingdom may
contain multiple Superpods[17]. For more on that, see here.
For now, the external name for Pod is Instance. There's no external name for Superpod
(Instance Group has been proposed but isn't widely used). Superpods are an
implementation detail, so we don't talk about them (except when we do). There's been
talk of banning "pod" and using "instance" internally, too.
People use the word Superpod to mean many different things:
1. A single collection of several specific Pods.
2. A generation of Superpod design (sp1, sp2, sp3, etc). We should call this Superpod Generation.
3. The supporting services for a collection of Pods (e.g. Insights is part of the Superpod, not the Pod).
4. The collection of instances AND the supporting services (1 + 3 above).
This means that if someone says "We're going to build[18] a Superpod!", it could mean one
of two things, depending on what's already in place:
If the Pods are in place, it means "we're going to build the supporting services";
If the Pods aren't in place, it means "we're going to build the supporting services and the Pods".

[17] We haven't decided yet. Likely it would be 1:1. And a Pod may be a single estate, though more likely we'd have a few estates making up a Pod, something like hbase-na1, app-na1, db-na1 estates.
[18] Also, "build" means two things: either physically rack and cable metal, or set up software on said metal. And "pod", to some people, means a room in the data center. Good times.
We also have the not-very-helpful addition of the HP
Superpod to our marketing, which has exactly nothing to do
with any of this. (Thanks, Marc!) That's the idea that one
company could have a Pod all to itself. If it were just one org,
that would be somewhat impractical (since our RAC node
architecture is deeply predicated on an org living on exactly one RAC node, or at most
two). But in reality, most big companies have a lot of individual orgs (business units,
acquisitions, etc), so it's not quite as goofy as it sounds: their set of orgs would
legitimately get service protection from the foibles of other companies. We're internally
referring to things like this (HP Superpods, or other dedicated Pods) as "Blue Pods" to
reduce the confusion.
