GitHub Engineering

Orchestrator at GitHub
shlomi-noach · December 08, 2016

GitHub uses MySQL to store its metadata: Issues, Pull Requests, comments, organizations, notifications and so forth. While git repository data does not need MySQL to exist and persist, GitHub's service does. Authentication, API, and the website itself all require the availability of our MySQL fleet.

Our replication topologies span multiple data centers and this poses a challenge not only for availability but also for manageability and operations.

Automated failovers

We use a classic MySQL master-replicas setup, where the master is the single writer, and replicas are mainly used for read traffic. We expect our MySQL fleet to be available for writes. Placing a review, creating a new repository, adding a collaborator, all require write access to our backend database. We require the master to be available.

To that effect we employ automated master failovers. The time it would take a human to wake up & fix a failed master is beyond our expectancy of availability, and operating such a failover is sometimes non-trivial. We expect master failures to be automatically detected and recovered within 30 seconds or less, and we expect failover to result with minimal loss of available hosts.

We also expect to avoid false positives and false negatives. Failing over when there's no failure is wasteful and should be avoided. Not failing over when failover should take place means an outage. Flapping is unacceptable. And so there must be a reliable detection mechanism that makes the right choice and takes a predictable course of action.

orchestrator

We employ orchestrator to manage our MySQL failovers. orchestrator is an open source MySQL replication management and high availability solution. It observes MySQL replication topologies, auto-detects topology layout and changes, understands replication rules across configurations and versions, detects failure scenarios and recovers from master and intermediate master failures.
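Automated recovery, and the clusters it may act on, are opt-in via orchestrator's configuration. A hedged sketch of the relevant entries in the JSON config file (names and values here are illustrative; consult the orchestrator documentation for your version):

  # Illustrative excerpt from orchestrator's configuration file
  $ grep -E '"FailureDetection|"Recover' /etc/orchestrator.conf.json
    "FailureDetectionPeriodBlockMinutes": 60,
    "RecoverMasterClusterFilters": ["*"],
    "RecoverIntermediateMasterClusterFilters": ["*"],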
Failure detection

orchestrator takes a different approach to failure detection than the common monitoring tools. The common way to detect master failure is by observing the master: via ping, via simple port scan, via simple SELECT query. These tests all suffer from the same problem: what if there's an error?

Network glitches can happen; the monitoring tool itself may be network partitioned. The naive solutions are along the lines of "try several times at fixed intervals, and on the n-th successive failure, assume the master has failed". While repeated polling works, it tends to lead to false positives and to increased outages: the smaller n is (or the smaller the interval is), the more potential there is for a false positive: short network glitches will cause unjustified failovers. However, larger n values (or longer poll intervals) will delay a true failure case.

A better approach employs multiple observers, all of whom, or the majority of whom, must agree that the master has failed. This reduces the danger of a single observer suffering from network partitioning.

orchestrator uses a holistic approach, utilizing the replication cluster itself. The master is not an isolated entity. It has replicas. These replicas continuously poll the master for incoming changes, copy those changes and replay them. They have their own retry count/interval setup. When orchestrator looks for a failure scenario, it looks at the master and at all of its replicas. It knows what replicas to expect because it continuously observes the topology, and has a clear picture of how it looked the moment before failure.

orchestrator seeks agreement between itself and the replicas: if orchestrator cannot reach the master, but all replicas are happily replicating and making progress, there is no failure scenario. But if the master is unreachable to orchestrator and all replicas say: "Hey! Replication is broken, we cannot reach the master", our conclusion becomes very powerful: we haven't just gathered input from multiple hosts. We have identified that the replication cluster is broken de facto. The master may be alive, it may be dead, it may be network partitioned; it does not matter: the cluster does not receive updates and for all practical purposes does not function. This situation is depicted in the image below:

[image: the master is unreachable to orchestrator, and all of its replicas report broken replication]

Masters are not the only subject of failure detection: orchestrator employs similar logic for intermediate masters: replicas which happen to have further replicas of their own.

Furthermore, orchestrator also considers more complex cases, such as having unreachable replicas or other scenarios where decision making turns more fuzzy. In some such cases it is still confident enough to proceed to failover. In others, it suffices with a detection notification only.

We observe that orchestrator's detection algorithm is very accurate. We spent a few months testing its decision making before switching on auto-recovery.
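The replica-side signal orchestrator aggregates is the ordinary replication state every replica already exposes. As a rough, hedged illustration (not the actual query orchestrator runs; the host name is hypothetical), a replica that has lost its master reports something like:

  # Ask one replica for its view of the master (illustrative host name)
  $ mysql -h instance-fadf -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Last_IO_Error'
            Slave_IO_Running: Connecting
               Last_IO_Error: error reconnecting to master 'repl@master-host:3306' - retry-time: 60  retries: 10

When orchestrator sees this from every replica while also failing to reach the master itself, the cluster is effectively broken regardless of what a single ping would say.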

Failover

Once the decision to failover has been made, the next step is to choose where to failover to. That decision, too, is non-trivial.

In semi-sync replication environments, which orchestrator supports, one or more designated replicas are guaranteed to be most up-to-date. This allows one to guarantee one or more servers that would be ideal to be promoted. Enabling semi-sync is on our roadmap and we use asynchronous replication at this time. Some updates made to the master may never make it to any replicas, and there is no guarantee as to which replica will get the most recent updates. Choosing the most up-to-date replica means you lose the least data. However, in the world of operations not all replicas are created equal: at any given time we may be experimenting with a recent MySQL release that we're not ready yet to put in production; or may be transitioning from STATEMENT based replication to ROW based; or have servers in a remote data center that preferably wouldn't take writes. Or you may have a designated server of stronger hardware that you'd like to promote no matter what.

orchestrator understands all replication rules and picks a replica that makes the most sense to promote, based on a set of rules and the set of available servers, their configuration, their physical location and more. Depending on server configuration, it is able to do a two-step promotion by first healing the topology in whatever setup is easiest, then promoting a designated or otherwise best server as master.
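Such preferences can also be hinted ahead of time. A hedged example of marking a server as a preferred promotion candidate from the command line (flag names may differ between orchestrator versions; the host name is hypothetical):

  # Mark a strong-hardware server as preferred for promotion
  $ orchestrator -c register-candidate -i db-strong-hw.example.com:3306 --promotion-rule=prefer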

We build trust in the failover procedure by continuously testing failovers. We intend to write more on this in a later post.

Anti-flapping and acknowledgements

Flapping is strictly unacceptable. To that effect orchestrator is configured to only perform one automated failover for any given cluster in a preconfigured time period. Once a failover takes place, the failed cluster is marked as blocked from further failovers. This mark is cleared after, say, 30 minutes, or until a human says otherwise.

To clarify, an automated master failover in the middle of the night does not mean stakeholders get to sleep it over. Pages will arrive, even as failover takes place. A human will observe the state, and may or may not acknowledge the failover as justified. Once acknowledged, orchestrator forgets about that failover and is free to proceed with further failovers on that cluster should the case arise.
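Operationally this is a configuration setting plus a human acknowledgement. A hedged sketch (parameter and command names are illustrative of the shape; verify against your orchestrator version):

  # Block further automated recoveries on a cluster for 30 minutes after a failover
  $ grep RecoveryPeriodBlockSeconds /etc/orchestrator.conf.json
    "RecoveryPeriodBlockSeconds": 1800,

  # A human acknowledges the recovery; any instance in the cluster identifies it
  $ orchestrator -c ack-cluster-recoveries -i instance-e854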

Topology management

There's more than failovers to orchestrator. It allows for simplified topology management and visualization.

We have multiple clusters of differing size that span multiple datacenters (DCs). Consider the following:

[image: a replication topology spanning three data centers]

The different colors indicate different data centers, and the above topology spans three DCs. The cross-DC network has higher latency, and network calls are more expensive than within the intra-DC network, and so we typically group DC servers under a designated intermediate master, aka local DC master, and reduce cross-DC network traffic. In the above, instance-64bb (blue, 2nd from bottom on the right) could replicate from instance-6b44 (blue, bottom, middle) and free up some cross-DC traffic.

This design leads to more complex topologies: replication trees that go deeper than one or two levels. There are more use cases for having such topologies:

- Experimenting with a newer version: to test, say, MySQL 5.7 we create a subtree of 5.7 servers, with one acting as an intermediate master. This allows us to test 5.7 replication flow and speed.
- Migrating from STATEMENT based replication to ROW based replication: we again migrate slowly by creating subtrees, adding more and more nodes to those trees until they consume the entire topology.
- By way of simplifying automation: a newly provisioned host, or a host restored from backup, is set to replicate from the backup server whose data was used to restore the host.
- Data partitioning is achieved by incubating and splitting out new clusters, originally dangling as sub-clusters, then becoming independent.

Deep nested replication topologies introduce management complexity:

- All intermediate masters turn out to be points of failure for their nested subtrees.
- Recoveries in mixed-version or mixed-format topologies are subject to cross-version or cross-format replication constraints. Not any server can replicate from any other.
- Maintenance requires careful refactoring of the topology: you can't just take down a server to upgrade its hardware; if it serves as a local/intermediate master, taking it offline would break replication on its own replicas.

orchestrator allows for easy and safe refactoring and management of such complex topologies:

- It can failover dead intermediate masters, eliminating the point-of-failure problem.
- Refactoring (moving replicas around the topology) is made easy via GTID or Pseudo-GTID (an application-level injection of sparse GTID-like entries).
- orchestrator understands replication rules and will refuse to place, say, a 5.6 server below a 5.7 server.

orchestrator also serves as the de-facto topology state/inventory indicator. It complements puppet or service discovery configuration, which imply desired state, by actually observing the existing state. State is queryable at various levels, and we employ orchestrator in some of our automation tasks.
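Because orchestrator already holds this observed state, automation can simply ask it. A hedged example of querying state from the command line (output omitted; commands may differ slightly across versions):

  # List the clusters orchestrator currently knows about
  $ orchestrator -c clusters

  # Print the observed replication tree for the cluster containing a given instance
  $ orchestrator -c topology -i instance-e854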
Chatops integration

We love our chatops as they make our operations visible and accessible to our greater group of engineers. While the orchestrator service provides a web interface, we rarely use it; one's browser is her own private command center, with no visibility to others and no history.

We rely on chatops for most operations. As a quick example of the visibility we get by chatops, let's examine a cluster:

shlomi-noach
.orc cluster sample-cluster

Hubot
host             lag     status  version          mode  format     extra
---              ---     ---     ---              ---   ---        ---
instance-e854    0s      ok      5.6.26-74.0-log  rw    STATEMENT  >>,P-GTID
+ instance-fadf  0s      ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-9d3d  0s      ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-8125  0s      ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-b982  0s      ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-c5a7  0s      ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-64bb  0s      ok      5.6.31-77.0-log  rw    nobinlog   P-GTID
+ instance-6b44  0s      ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID
+ instance-cac3  14400s  ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID

Say we wanted to upgrade instance-fadf to 5.6.31-77.0-log. It has two replicas attached that I don't want to be affected. We can:

shlomi-noach
.orc relocate-replicas instance-fadf below instance-c5a7

Hubot
instance-9d3d
instance-8125

To the effect of:

shlomi-noach
.orc cluster sample-cluster

Hubot
host             lag     status  version          mode  format     extra
---              ---     ---     ---              ---   ---        ---
instance-e854    0s      ok      5.6.26-74.0-log  rw    STATEMENT  >>,P-GTID
+ instance-fadf  0s      ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-b982  0s      ok      5.6.26-74.0-log  ro    STATEMENT  >>,P-GTID
+ instance-c5a7  0s      ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-9d3d  0s      ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-8125  0s      ok      5.6.31-77.0-log  ro    STATEMENT  >>,P-GTID
+ instance-64bb  0s      ok      5.6.31-77.0-log  rw    nobinlog   P-GTID
+ instance-6b44  0s      ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID
+ instance-cac3  14400s  ok      5.6.31-77.0-log  rw    STATEMENT  >>,P-GTID

The instance is now free to be taken out of the pool.
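The same refactoring is available outside of chat. A hedged equivalent using the orchestrator command line (the exact command name has varied across versions):

  # Move the replicas of instance-fadf so they replicate from instance-c5a7 instead
  $ orchestrator -c relocate-replicas -i instance-fadf -d instance-c5a7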
Other actions are available to us via chatops. We can force a failover, acknowledge recoveries, query topology structure, etc. orchestrator further communicates with us on chat, and notifies us in the event of a failure/recovery.

orchestrator also runs as a command-line tool, and the orchestrator service supports a web API, and so can easily participate in automated tasks.
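For example, a script can pull the same cluster view over HTTP. A hedged sketch against orchestrator's API (host name is hypothetical; port 3000 is the common default, and endpoint paths may differ by version):

  # List known clusters as JSON
  $ curl -s http://orchestrator.internal.example.com:3000/api/clusters

  # Fetch the instances of a specific cluster
  $ curl -s http://orchestrator.internal.example.com:3000/api/cluster/sample-cluster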
orchestrator @ GitHub

GitHub has adopted orchestrator, and will continue to improve and maintain it. The github repo will serve as the new upstream and will accept issues and pull requests from the community.

orchestrator continues to be free and open source, and is released under the Apache License 2.0.

Migrating the project to the GitHub repo had the unfortunate result of diverging from the original Outbrain repo, due to the way import paths are coupled with the repo URI in golang. The two diverged repositories will not be kept in sync; and we took the opportunity to make some further diverging changes, though we made sure to keep the API & command line spec compatible. We'll keep an eye out for incoming Issues on the Outbrain repo.

Outbrain

It is our pleasure to acknowledge Outbrain as the original author of orchestrator. The project originated at Outbrain while seeking to manage a growing fleet of servers in three data centers. It began as a means to visualize the existing topologies, with minimal support for refactoring, and came at a time when massive hardware upgrades and datacenter changes were taking place. orchestrator was used as the tool for refactoring and for ensuring topology setups went as planned and without interruption to service, even as servers were being provisioned or retired.

Later on, Pseudo-GTID was introduced to overcome the problems of unreachable/crashing/lagging intermediate masters, and shortly afterwards recoveries came into being. orchestrator was put to production in very early stages and worked on busy and sensitive systems.

Outbrain was happy to develop orchestrator as a public open source project and provided the resources to allow its development, not only for the specific benefit of the company but also for the wider community. Outbrain authors many more open source projects, which can be found on the Outbrain engineering GitHub page.

We'd like to thank Outbrain for their contributions to orchestrator, as well as for their openness to having us adopt the project.

Further acknowledgements

orchestrator was later developed at Booking.com. It was brought in to improve on the existing high availability scheme. orchestrator's flexibility allowed for simpler hardware setup and faster failovers. It was fortunate to enjoy the large MySQL setup Booking.com employs, managing various MySQL vendors, versions and configurations, running on clusters ranging from a single master to many hundreds of MySQL servers and Binlog Servers across multiple data centers. Booking.com continuously contributes to orchestrator.

We'd like to further acknowledge major community contributions made by Google/Vitess (orchestrator is the failover mechanism used by Vitess) and by Square, Inc.

Related projects
We are working to release a public puppet module for orchestrator, and will edit this post once released.

Chef users, please consider this Chef cookbook by @silviabotros.

Shlomi Noach
Senior Infrastructure Engineer

Website GitHub Profile Twitter Profile
