This document was written in June 2014, by Scott Hansma & Ian
Varley, to explain why we scale with Pods (and Superpods) in
Salesforce infrastructure. sfdc.co/pods
New! Now in presentation form and video form.
Pods are identical1 collections of hardware & software that support a discrete subset of
our customer base. Any customer organization (org) exists on exactly one Pod, and
only moves between them via migration or split. We call Pods instances to our
customers, and the name of the instance is a visible part of the URL they use (e.g.
http://na1.salesforce.com)2. As of mid-2014 we have ~50 Pods. Read more.
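As a purely illustrative sketch of that mapping (the org IDs here are made up): every org resolves to exactly one instance, and the instance name is visible in the URL.

```python
# Hypothetical org-to-instance table; in reality this mapping is an
# internal routing concern, not a literal dict. Each org lives on
# exactly one Pod, and the Pod's instance name appears in the URL.
ORG_TO_INSTANCE = {
    "00D000000000001": "na1",
    "00D000000000002": "na7",
    "00D000000000003": "eu2",
}

def instance_url(org_id: str) -> str:
    """Resolve an org to the base URL of its single instance."""
    instance = ORG_TO_INSTANCE[org_id]  # exactly one Pod per org
    return f"http://{instance}.salesforce.com"

print(instance_url("00D000000000001"))  # http://na1.salesforce.com
```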
Superpods are sets of several Pods, in the same location, along with shared services
used by all of those Pods (like DNS, Hadoop, etc). Superpods aren't directly visible to
customers. Pods can move3 between Superpods during Pod migrations. Superpod is a
confusing term; it also refers to the design of how multiple Pods and services are laid
out, and to just the shared services; more on that below: Naming Is Hard.
Every bit of hardware and data in a production Pod has an identical mirror-image copy in
a DR (Disaster Recovery) data center, somewhere else in the world. (So technically, an
instance is composed of two Pods in different data centers, but only one is ever
running. We haven't done many failovers; we're getting better.)
1 They're not really identical. They're all beautiful snowflakes. But that is a story for another time.
2 It can be masked by custom domains (like org62.my.salesforce.com) but not all customers do that. In
fact, hardly any do.
3 Move logically, not physically; we don't actually move the machines in a Pod migration, just the data.
4 Basically, we exceeded our capacity to scale and had a painful few months. There are 2 distinct historical
developments: the move from E25k to Linux 8-node RAC, and the proliferation of new Pods, recognizing that
we'd hit a limit on vertical growth of a Pod. Ask an old-timer about it; fun times.
Nowadays, the Pod strategy is our explicit choice for scaling. Why? We like it for 3 big
reasons:
The database is the center of gravity
Fault domains work
We like predictability
Let's explore each one in turn.
7 Of course if the database were completely presented as a service with forward and backward
compatibility then you wouldn't have that problem; instead you'd have a worse problem: trying to replace
the standard database abstraction layer (SQL) with a "better" one. Many have died trying.
Pod and original source Pod, and that leads to this exact type of grief: it makes it really
hard to reason about DR.
There are also some services that are in the gray area between stateful and stateless.
One example is search, which is not technically a System of Record (SOR) because it's
just a transformed copy of the primary database (so you could make multiple copies,
recreate it, etc8). But, because it can't be recreated fast enough to deal with an outage,
we have to treat it like SOR data. So here again, it makes sense for search to orbit the
single relational database it indexes.
8 Though in reality, losing it for big orgs would be tantamount to a service outage because it would take
several days to recreate it.
You see how there's no column where the icons are all up or all down? Exactly: it
doesn't happen. (Often.9)
When something goes wrong with Salesforce's service, it's really important to our
business that any disruption is localized; we don't want all our eggs (AOV10) in one basket
(Pod). And if you have radically uncorrelated systems, you have a much greater shot at
doing this. The worst shit in the world could happen to one Pod, and we wouldn't kill all
the golden geese. (We'd have a lot of egg on our faces, though. (Sorry folks, just a yolk.))
This kind of protection isn't just about failures of software or infrastructure: it's also
about service protection. Customers can do sophisticated things on our platform, like run
massive reports and Apex triggers and Pig pipelines. Part of service protection is that
9 It does happen sometimes, most recently in April 2014 when a DNS provider failure brought all Pods
down. (This was outside our control, but of course we still should have had a strategy that didn't depend
on a single one; and we do now.)
10 AOV = Annual Order Value, i.e. the money our customers (including renewals) pay us. For this and
other handy acronyms, see here.
when we do find a customer abusing the system, the impact of that degradation is
limited. No customer can hose customers in other Pods, no matter how hard they try.11
"But wait!" you say. "What about Superpods? Do we really have fault isolation, if failure
in a shared service like NTP can hose many Pods at once?" Yeah, you're right. Superpods
are a compromise, so we can divorce our real estate strategy from our scaling strategy.
But, look at the evidence; the number of times when screw-ups at the Superpod level
cause customer issues is a tiny fraction of how often Pod-level problems do.
In the end, nothing provides perfect fault isolation; to
paraphrase Randy, Earth is a single point of failure. In other
words, you could argue it's pointless to have a smaller
fault zone (Pods) because, hey, a datacenter can still fail,
right? But the reality is about probability: servers fail
more often than Pods, which fail more often than datacenters,
which fail more often than the entire planet. There will always
be some larger fault domain, but that doesn't make the smaller
ones pointless.
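To make that probability argument concrete with entirely made-up numbers: isolation doesn't change how often things break, it shrinks how much of the customer base each break touches.

```python
# Entirely made-up numbers: ~50 Pods, and some hypothetical count of
# Pod-level incidents per year across the whole fleet. Isolation doesn't
# reduce how often incidents happen; it reduces the blast radius of each.
NUM_PODS = 50
POD_INCIDENTS_PER_YEAR = 5  # hypothetical fleet-wide count

# Expected "fraction of an outage" a given customer sees per year:
expected_impact_isolated = POD_INCIDENTS_PER_YEAR * (1 / NUM_PODS)  # ~0.1
expected_impact_shared = POD_INCIDENTS_PER_YEAR * 1.0               # 5.0

# Same failure frequency, 50x smaller blast radius per incident.
print(expected_impact_isolated, expected_impact_shared)
```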
11 They most certainly can hose other customers in their own Pod (and org). Preventing that is the work of
the service protection team.
12 A dark launch is one where you expose new functionality in a way that allows you to verify it, without
turning it on for all live customer requests. That can mean either exposing it to a small subset of customer
requests, or launching it in a parallel, unobserved way.
13 We learned this lesson anew with the autobuild infrastructure (which runs all our bazillion unit tests
every time someone checks in a change to a component, like Hodor). We increased capacity, only to be
greeted with raging cascading failures.
Having multiple Pods gives us more ways to canary14 our changes. We roll out first to
GS0, then the sandboxes, then NA1 in R0, then over the following weeks, we roll out R1
and R2 to the rest of the Pods. Glaring issues at any stage give us time to react and fix
the problems before the majority of our customers see them. We can do this with
infrastructure changes, too: roll out HBase cap adds to one Pod first, verify that the sky
didn't fall, and then do the rest. This is a good thing.
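A minimal sketch of that staged rollout: the stage order comes from the text, but the deploy and health-check functions (and the R1/R2 Pod groupings) are hypothetical stand-ins for the real release machinery.

```python
# Sketch of the canary-style rollout described above: push to each stage
# in order, and halt at the first stage that looks unhealthy, leaving time
# to react and fix before the majority of customers see the change.
def rollout(stages, deploy, check_health):
    """Deploy stage by stage; stop at the first unhealthy stage."""
    for stage in stages:
        for pod in stage:
            deploy(pod)
        if not all(check_health(pod) for pod in stage):
            return f"halted at {stage}"
    return "rolled out everywhere"

# GS0 first, then sandboxes, then NA1, then the rest over following weeks.
# The later Pod groupings here are illustrative, not the real R1/R2 lists.
stages = [["GS0"], ["sandboxes"], ["NA1"], ["NA2", "NA3"], ["NA4", "NA5"]]
print(rollout(stages, deploy=lambda pod: None, check_health=lambda pod: True))
```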
Now, this approach is not about complete immutability. In particular, some services are
themselves intended to be horizontally scalable, like Keystone and HBase. It's fine for
horizontally scalable services to have different server counts per Pod. It would be
madness to say that because NA5 needs some extra File Force or HBase space, we have
to add File Force or HBase capacity everywhere.
But for now, adding a new rack of servers to any of these
services follows the same careful rollout model,15 because
it's not just your service you're changing; it's the chemistry
of the whole Superpod (the network, memcached, the WAN
pipe, etc). Even with a scalable data store like HBase, the
philosophy would be to pick some expected max size (say,
300TB) and, if we're pushing the limits of that, consider
splitting the Pod, just like we would if the DB got too big or
APT got too high. (And maybe over time, that threshold
changes to 350TB or 400TB, but it doesn't suddenly jump
to 30PB.)
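A hypothetical sketch of that threshold philosophy. Only the 300TB HBase figure comes from the text; the DB and APT thresholds, and the 90% headroom factor, are invented for illustration.

```python
# Made-up Pod-level capacity thresholds; only the 300TB HBase figure is
# from the text. The idea: pick an expected max per dimension, and when a
# Pod pushes the limits of any of them, consider splitting the Pod.
THRESHOLDS = {"db_tb": 40, "hbase_tb": 300, "apt_ms": 250}

def should_consider_split(metrics, headroom=0.9):
    """True if any metric exceeds `headroom` of its Pod-level threshold."""
    return any(metrics[k] > headroom * THRESHOLDS[k] for k in THRESHOLDS)

# DB at 38TB exceeds 90% of its (hypothetical) 40TB threshold -> True.
print(should_consider_split({"db_tb": 38, "hbase_tb": 120, "apt_ms": 180}))
```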
We'll always have variability between Pods. But our goal is to track and understand that
variability. Exceptions and irregularities need to justify their existence (i.e. be
explained in an obvious, visible way). If the justification for variance isn't good enough,
we generally want to eradicate it in the name of predictability.
15 Or, Scott would say, when you add capacity, a puppy dies.
At the macro level for Salesforce, this same process happens, but at the Pod level: when
a Pod gets too big, we split it. We haven't done many lately, but in the next year we have
over a dozen splits planned. It's critical for this process to be easy and repeatable. Right
now, it takes months and hundreds of people. :(
The other option we have in this process is org migration. Because of our multi-tenant
architecture and the way traditional relational databases work, it's quite difficult to
migrate an org from one Pod to another without significant downtime. Recent projects like
Buffalo16 have made huge strides in improving this, but we're far from the holy grail of
seamless, instant, zero-touch migration. Keystone aims to be a leg up in that fight. If we
could do fast, seamless migration, we really shouldn't ever have to do a split again.
16 Buffalo is a recursive acronym for BUFFalo A Live Org. Also this is important for you to read.
And actually, some of this stuff lives at an even broader level than Superpod: some of it
lives at the Data Center level, like iDB. You can see a lot more detail about what goes
where in the Pod & Superpod link library.
We can't take advantage of our many Pods for High Availability (HA)
Pods are the unit of availability. But we have (un)scheduled downtime on Pods. So things
in Pods are not HA. While we'd love to get to HA within the Pods (and have projects, like
Active Data Guard and ReadOnlyDB, to help that), until then we should be smarter about
services that can't go down (e.g. Login/Identity).
To be clear, this is a pain-in-the-ass by design, for the protection of the service. But it's still a
pain-in-the-ass. This is a problem that automation can fix.
Naming Is Hard
The names Pod and Superpod (as described in this doc) are primarily internal usages.
There are also some other words in use, so this section is a (possibly futile) attempt to
make a little sense of it.
You may have heard the words kingdom and estate, which relate to future Data Center
Automation ideas:
"Kingdom" is the group of machines controlled by a single R2 controller.
"Estate" is a group of services within a kingdom. It's also a security
boundary: all services within an estate can talk to each other freely and directly to
hosts (without any ACLs, etc).
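A toy model of the estate-as-security-boundary rule, using the hypothetical estate names from footnote 17 (hbase-na1, app-na1, db-na1); the host names here are invented.

```python
# Sketch of the kingdom/estate idea: estates group services within a
# kingdom, and an estate is a security boundary. Hosts in the same estate
# talk freely; traffic across estates would need ACLs. Estate names follow
# footnote 17's examples; the host names are made up.
ESTATE_OF = {
    "app-host-1": "app-na1",
    "app-host-2": "app-na1",
    "db-host-1": "db-na1",
    "hbase-host-1": "hbase-na1",
}

def can_talk_freely(host_a: str, host_b: str) -> bool:
    """Same estate -> free and direct traffic; different estates -> not."""
    return ESTATE_OF[host_a] == ESTATE_OF[host_b]

print(can_talk_freely("app-host-1", "app-host-2"))  # True
print(can_talk_freely("app-host-1", "db-host-1"))   # False
```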
So depending how we implement it, a Superpod may be a kingdom; or a kingdom may
contain multiple Superpods17. For more on that, see here.
For now, the external name for Pod is Instance. There's no external name for Superpod
(Instance Group has been proposed but isn't widely used). Superpods are an
implementation detail, so we don't talk about them (except when we do). There's been
talk of banning pod and using instance internally, too.
People use the word Superpod to mean many different things:
1. A single collection of several specific Pods.
2. A generation of Superpod design (sp1, sp2, sp3, etc). We should call this
Superpod Generation.
3. The supporting services for a collection of Pods (e.g. insights is part of the
Superpod, not the Pod).
4. The collection of instances AND the supporting services (1 + 3 above).
This means that if someone says "We're going to build18 a Superpod!", it could mean
one of two things, depending on what's already in place:
17 We haven't decided yet. Likely it would be 1:1. And a Pod may be a single estate, though more likely
we'd have a few estates making up a Pod, something like hbase-na1, app-na1, db-na1 estates.
18 Also "build" means 2 things: either physically rack and cable metal or set up software on said metal.
And pod, to some people, means a room in the data center. Good times.
If the Pods are in place, it means "we're going to build the supporting
services";
If the Pods aren't in place, it means "we're going to build the supporting services and
the Pods".
We also have the not-very-helpful addition of the HP
Superpod to our marketing, which has exactly nothing to do
with any of this. (Thanks, Marc!) That's the idea that one
company could have a Pod all to itself. If it were just one org,
that would be somewhat impractical (since our RAC node
architecture is deeply predicated on an org living on exactly one RAC node, or at most
two). But in reality, most big companies have a lot of individual orgs (business units,
acquisitions, etc), so it's not quite as goofy as it sounds: their set of orgs would
legitimately get service protection from the foibles of other companies. We're internally
referring to things like this (HP Superpods, or other dedicated Pods) as Blue Pods to
reduce the confusion.