Keeping Master Green at Scale

Sundaram Ananthanarayanan, Masoud Saeida Ardekani, Denis Haenikel, Balaji Varadarajan, Simon Soriano, Dhaval Patel, Ali-Reza Adl-Tabatabai
Uber Technologies

Abstract

Giant monolithic source-code repositories are one of the fundamental pillars of the back-end infrastructure in large and fast-paced software companies. The sheer volume of everyday code changes demands a reliable and efficient change management system with three uncompromisable key requirements — always green master, high throughput, and low commit turnaround time. Green refers to a master branch that always successfully compiles and passes all build steps, the opposite being red. A broken master (red) leads to delayed feature rollouts because a faulty code commit needs to be detected and rolled back. Additionally, a red master has a cascading effect that hampers developer productivity — developers might face local test/build failures, or might end up working on a codebase that will eventually be rolled back.

This paper presents the design and implementation of SubmitQueue. It guarantees an always green master branch at scale: all build steps (e.g., compilation, unit tests, UI tests) successfully execute for every commit point. SubmitQueue has been in production for over a year, and can scale to thousands of daily commits to giant monolithic repositories.

ACM Reference Format:
Sundaram Ananthanarayanan, Masoud Saeida Ardekani, Denis Haenikel, Balaji Varadarajan, Simon Soriano, Dhaval Patel, and Ali-Reza Adl-Tabatabai. 2019. Keeping Master Green at Scale. In Fourteenth EuroSys Conference 2019 (EuroSys '19), March 25–28, 2019, Dresden, Germany. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3302424.3303970

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
EuroSys '19, March 25–28, 2019, Dresden, Germany
© 2019 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-6281-8/19/03.
https://doi.org/10.1145/3302424.3303970

1 Introduction

Giant monolithic source-code repositories (monorepos hereafter) form one of the fundamental pillars of backend infrastructure in many large, fast-paced software companies. Recent studies [27, 38] have shown that the monolithic model of source code management results in simplified dependency management, unified versioning, ease of collaboration, improved code visibility, and development velocity.

As shown by Perry et al. [37], concurrently committing code changes by thousands of engineers to a big repository, with millions of lines of code, may lead to a master breakage (e.g., a compilation error, unit test failure, integration test failure, or an unsignable artifact). Tracking down and rolling back the faulty change is a tedious and error-prone task, which very often needs human intervention. Since developers may potentially develop over a faulty master (mainline hereafter), their productivity may be hampered substantially. Additionally, a breakage in the master can cause delays in rolling out new features.

For example, prior to the launch of a new version of our mobile application for riders, hundreds of changes were committed in a matter of minutes after passing tests individually. Collectively, though, they resulted in a substantial performance regression, which considerably delayed shipping the application. Engineers had to spend several hours bisecting the mainline in order to identify a small list of changes that caused the performance regressions. New changes further aggravated the problem, such that the mainline required blocking while the investigation continued.

To prevent the above issues, the monorepo mainline needs to remain green at all times. A mainline is called green if all build steps (e.g., compilation, unit tests, UI tests) can successfully execute for every commit point in the history. Keeping the mainline green allows developers to (i) instantly release new features from any commit point in the mainline, (ii) roll back to any previously committed change, and not necessarily to the last working version, and (iii) always develop against the most recent and healthy version of the monorepo.

To guarantee an always green mainline, we need to guarantee serializability among developers' requests, called changes. Conceptually, a change comprises a developer's code patch padded with some build steps that need to succeed before the patch can be merged into the mainline. Therefore, committing a change means applying its patch to the mainline HEAD only if all of its build steps succeed. Observe that totally ordering changes is different from totally ordering
code patches, which a conventional code management system (e.g., a git server) does. While the latter can still lead to a mainline breakage, totally ordering changes guarantees a green mainline, since it ensures that all build steps succeed before committing the patch to the mainline.

In addition to ensuring serializability, the system also needs to provide SLAs on how long a developer would need to wait before her changes are merged into the mainline, called turnaround time. Ideally, this has to be small enough that engineers are willing to trade speed for correctness (i.e., an always green mainline).

Ensuring the above two properties at scale is challenging. The system needs to scale to thousands of commits per day, while it may take hours to perform all build steps. For example, performing UI tests of an iOS application requires compiling millions of lines of code on Mac Minis, uploading artifacts to different iPhones, and running UI tests on them. Additionally, the emergence of bots that continuously generate code (e.g., Facebook's Configurator [42]) further highlights the need for a highly scalable system that can process thousands of changes per day. Thus, many existing solutions try only to alleviate, rather than eliminate, the aforementioned issues by building tools [38, 47] for rapid detection and removal of faulty commits.

This paper introduces a change management system called SubmitQueue that is responsible for continuous integration of changes into the mainline at scale, while always keeping the mainline green. Based on all possible outcomes of pending changes, SubmitQueue constructs, and continuously updates, a speculation graph that uses a probabilistic model powered by logistic regression. The speculation graph allows SubmitQueue to select the builds that are most likely to succeed, and speculatively execute them in parallel. Our system also uses a scalable conflict analyzer that constructs a conflict graph among pending changes. The conflict graph is then used to (1) trim the speculation space to further improve the likelihood of using the remaining speculations, and (2) determine independent changes that can commit in parallel. In summary, this paper makes the following contributions:

• Based on our experience, we describe the problem space and the real challenges of maintaining large monorepos, and why it is crucial to keep the mainline green all the time.
• We introduce our change management system called SubmitQueue, and detail its core components. The main novelty of SubmitQueue lies in picking the right set of techniques, and applying them together to build a highly scalable change management system.
• We evaluate SubmitQueue, compare it against similar approaches, and report how SubmitQueue performs and scales in production.

The outline of this paper is as follows: we first review background work in the next section. Section 3 explains the development life cycle, and gives a high-level overview of SubmitQueue. We explain our speculation model and conflict analyzer in Section 4 and Section 5, respectively. Section 6 explains how SubmitQueue performs build steps. The SubmitQueue implementation and evaluation are discussed in Section 7 and Section 8. We review additional related work in Section 9. We discuss limitations and future work in Section 10, and conclude the paper in Section 11.

2 Background

2.1 Intermittently Green Mainline

The trunk-based development approach typically commits a code patch into the mainline as soon as it gets approved and passes pre-submission tests. A change management system then launches an exhaustive set of tests to make sure that no breakage exists, and that all artifacts can be generated. In case an error is detected (e.g., due to a conflict or a failing test case), the system needs to roll back the committed change. However, since several patches could have been committed in the meantime, detecting and reverting the faulty patch is tedious and error-prone. In many cases, build sheriffs and engineers need to get involved in order to cherry-pick a set of patches for rollback. Besides, a red mainline has the following substantial drawbacks:

• Cost of delayed rollout: for many companies, delaying rollouts of new features or security patches, even by a day, can result in substantial monetary loss.
• Cost of rollbacks: in case an issue arises with a release, engineers can only roll back to the last working version, and not to an arbitrary version.
• Cost of hampered productivity: a red mainline can significantly affect the productivity of engineers. At the very least, engineers might face local test and build failures. Even worse, they can end up working on a codebase that will eventually be rolled back.

To alleviate the above issues, software companies try to minimize post-submit failures by building pre-submit infrastructures (e.g., Google's presubmit infrastructure [38]), which test patches before committing them to the mainline. The pre-submit testing approach solely focuses on individual changes, and does not consider concurrent changes. Therefore, testing infrastructures still need to (1) perform tests (e.g., integration tests) on all affected dependencies upon every commit, (2) quickly identify patches introducing regressions [47], and (3) ultimately roll back the faulty patch in case of a breakage. These steps are commonly referred to as post-submit testing.

Figure 1 shows the probability of real conflicts as the number of concurrent (i.e., pending) and potentially conflicting changes increases for our Android and iOS monorepos over the course of nine months. Roughly speaking, n concurrent changes are considered potentially conflicting if they touch the same logical parts of a repository. On the other hand, n
Figure 1. Probability of real conflicts as the number of concurrent and potentially conflicting changes increases.

Figure 2. Probability of a mainline breakage for iOS/Android monorepos as change staleness increases.
concurrent changes have real conflicts if all the following conditions hold: (1) applying changes 1 to n − 1 does not lead to a mainline breakage, (2) applying the nth change individually does not break the mainline, and (3) applying all the changes together results in a master breakage. Consequently, the nth change must have a real conflict with one or more of the changes 1 to n − 1. Therefore, pre-submit tests cannot detect these real conflicts, and they can only be captured post-submit. Observe that even with two concurrent and potentially conflicting changes, there is a 5% chance of a real conflict. This number grows to 40% with only 16 concurrent and potentially conflicting changes. This shows that, despite all the efforts to minimize mainline breakages, it is very likely that the mainline experiences daily breakages, due to the sheer volume of everyday code changes committed to a big monorepo.

Using data collected over one year, Figure 2 plots the probability of a mainline breakage as change staleness increases. Change staleness is measured as how old a submitted change is with respect to the mainline HEAD at the time of submission. As expected, the probability of a mainline breakage increases as change staleness increases. However, observe that even changes with one to ten hours of staleness have between a 10% and 20% chance of making the mainline red. Thus, while frequent synchronization with the mainline may help avoid some conflicts, there is still a high chance of a breakage.

2.2 Always Green Mainline

The simplest solution to keep the mainline green is to enqueue every change that gets submitted to the system. A change at the head of the queue gets committed into the mainline if its build steps succeed. For instance, the rust project [9] uses this technique to ensure that the mainline remains healthy all the time [2]. This approach does not scale as the number of changes grows. For instance, with a thousand changes per day, where each change takes 30 minutes to pass all build steps, the turnaround time of the last enqueued change will be over 20 days.

An obvious solution for the scalability limitations of a single queue is to batch changes together, and commit the whole batch if all build steps for every change in the batch succeed. Otherwise, the batch needs to get divided into smaller batches, and get retried. As the batch size grows, which is the case when there are thousands of commits every day, the probability of failure (e.g., due to a conflict) increases significantly. This leads to batches getting divided and retried frequently. Therefore, SLAs cannot be guaranteed, as they would vary widely based on how successful changes are, as well as on the incoming rate of changes. In addition, this approach also suffers from scalability issues, since we still need to test and build one batch at a time.

Chromium uses a variant of this approach called Commit Queue [4] to ensure an always green mainline. In Commit Queue, all pending changes go through two steps before getting merged into the mainline. The first step is called the pre-commit queue, where the system tries to compile a change. Changes passing the first step are picked every few hours by the second step to undergo a large suite of tests, which typically takes around four hours. If the build breaks at this stage, the entire batch gets rejected. Build sheriffs and their deputies spring into action at this point to attribute failures to particular faulty changes. Non-faulty changes, though, need to undergo another round of batching, and can only get accepted if no faulty change exists in a round. This approach does not scale for fast-paced development, as batches often fail, and developers have to keep resubmitting their changes. Additionally, as mentioned earlier, finding faulty changes manually in large monorepos with thousands of commits is a tedious and error-prone process, which results in long turnaround times. Finally, observe that this approach leads to shippable batches, and not shippable commits. Thus, a second change that fixes an issue of the first change must be included in the same batch in order for the first change to pass all its tests. Otherwise, the first patch will never succeed.

Optimistic execution of changes is another technique used by production systems (e.g., Zuul [12]). Similar to optimistic concurrency control mechanisms in transactional systems, this approach assumes that every pending change in the system can succeed. Therefore, a pending change starts performing its build steps assuming that all the pending changes that were submitted before it will succeed.
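The 20-day figure for the strict single-queue approach follows from simple arithmetic, which can be checked with a short sketch (numbers taken from the example above; not part of the original system):

```python
# Back-of-the-envelope turnaround time for a strict single-queue model:
# 1,000 changes/day, 30 minutes of build steps per change, processed
# strictly one at a time.
changes_per_day = 1000
build_minutes_per_change = 30

# The last enqueued change waits for every change ahead of it.
worst_case_minutes = changes_per_day * build_minutes_per_change
worst_case_days = worst_case_minutes / (60 * 24)

print(f"worst-case turnaround: {worst_case_days:.1f} days")  # ≈ 20.8 days
```

The worst case grows linearly with the incoming rate, which is why the text concludes that a single queue cannot keep up with thousands of daily changes.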
Figure 5. Speculation tree of builds and their outcomes for changes C1, C2, C3. (B1 fails → C1 rejects; B1 succeeds → C1 commits. B2 fails → C2 rejects; B2 succeeds → C2 commits; B1.2 fails → C2 rejects; B1.2 succeeds → C2 commits. Third level: B3, B2.3, B1.3, B1.2.3.)

However, we do not need to speculate on all the possible builds. For example, consider the example in Figure 5. If B1 succeeds, and C1 can be committed, the result of B2 (i.e., the build steps for H ⊕ C2) becomes irrelevant, since now we can either commit C1 or C1 ⊕ C2. Thus, B2 can be stopped. Similarly, if B1 fails, and C1 gets rejected, then B1.2, containing both C1 and C2, can never be successful, and should be stopped. Thus, depending on the outcome of B1, either B2 or B1.2 becomes irrelevant, and its computation is wasted. Likewise, for C3, only one of {B3, B1.3, B2.3, B1.2.3} is needed to commit C3, once the result of B1, B2, or B1.2 becomes known.

To generalize, only n out of the 2^n − 1 builds will ever be needed to commit n pending changes. Considering that the system receives hundreds of changes an hour, this approach leads to a substantial waste of resources, since the results of most speculations will not be used.
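The 2^n − 1 count can be reproduced by enumerating the speculation tree explicitly. The sketch below is illustrative (the tuple-based build naming is an assumption, mirroring the B1.2-style labels in Figure 5): a build for change i applies some subset of the earlier changes plus change i itself on top of HEAD.

```python
from itertools import combinations

def speculative_builds(n):
    """All builds that might be needed to decide changes C1..Cn, assuming
    every pending change conflicts with every earlier one."""
    builds = []
    for i in range(1, n + 1):
        for k in range(i):  # choose k of the i-1 earlier changes
            for subset in combinations(range(1, i), k):
                builds.append(subset + (i,))  # e.g. (1, 2, 3) is B1.2.3
    return builds

for n in (1, 2, 3, 4):
    # Only one build per change lies on the realized outcome path, so just
    # n of these speculative builds are ever used.
    print(n, len(speculative_builds(n)), 2 ** n - 1)
```

For n = 3 this yields the seven builds of Figure 5, of which only three results are ultimately consumed.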
determine if C2 can commit or not. If the prediction does not hold (i.e., B1 fails), then we need to build C2 alone by running B2. Therefore, the key challenge is to determine which set of builds we need to run in parallel, in order to improve turnaround time and throughput, while ensuring an always green mainline. To this end, the speculation engine builds a binary decision tree, called a speculation tree, annotated with prediction probabilities for each edge. The speculation tree allows the planner engine to select the builds that are more likely to succeed, and speculatively execute them in parallel. Figure 5 shows the speculation tree for changes C1, C2, and C3, where C1 is the first change submitted to SubmitQueue, and C3 is the last. We note that in this section, and for simplicity, we assume that all pending changes conflict with each other. In the next section, we explain how we detect independent changes, and change the speculation tree to a speculation graph.

4.1 Speculate Them All

The fastest and most expensive approach is to speculate on all possible outcomes for every pending change, by assuming that the probability of committing a change is equal to the probability of rejecting it. For instance, we need to execute all seven builds shown in Figure 5 in order to commit C1, C2, and C3. Therefore, we would need to perform 2^n − 1 builds for n pending changes.

where P^{needed}_{B_C} is the probability that the result of build B_C will be used to make the decision to commit or reject a set of changes C, and B_{B_C} is the benefit (e.g., monetary benefit) obtained by performing the build on the set of changes C. In this paper, and for the sake of simplicity, we assume the same benefit for all builds. In practice, we can assign different values to different builds. For instance, builds for certain projects or with certain priority (e.g., security patches) can have higher values, which in turn will be favored by SubmitQueue. Alternatively, we may assign different quotas to different teams, and let each team manage the benefits of its changes. In our current implementation, we consider the benefit of all builds to be one.

Consider the speculation tree shown in Figure 5. In this example, there are three pending changes, named C1, C2, and C3. The root of the tree, B1, is always needed, as it is used to determine if C1 can be committed safely. Thus, P^{needed}_{B_1} = 1. B1.2 is only needed when B1 succeeds. If B1 fails, B1.2 is not needed, but B2 is needed. Thus,

    P^{needed}_{B_{1.2}} = P^{succ}_{B_1}
    P^{needed}_{B_2} = 1 − P^{succ}_{B_1}        (1)

The probability that an individual change Ci succeeds (denoted as P^{succ}_{C_i}) is equal to the probability that its build independently succeeds: P^{succ}_{B_1} = P^{succ}_{C_1}. By substituting this into Equation 1, we get

    P^{needed}_{B_{1.2}} = P^{succ}_{C_1}
    P^{needed}_{B_2} = 1 − P^{succ}_{C_1}        (2)
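Under the simplifying assumption that each change's build succeeds independently with a fixed probability (the paper's full model also conditions on earlier builds, as in Equation 3), the needed-probabilities for the three-change tree of Figure 5 can be sketched as:

```python
def needed_probabilities(p_succ):
    """P(needed) for each speculative build of Figure 5, assuming each
    change's build succeeds independently with the given probability
    (a simplification of the paper's conditional probabilities)."""
    p1, p2 = p_succ["C1"], p_succ["C2"]
    return {
        "B1": 1.0,                    # root: always needed to decide C1
        "B1.2": p1,                   # needed only if B1 succeeds (Eq. 1/2)
        "B2": 1 - p1,                 # needed only if B1 fails
        # Third level: exactly one of these four decides C3.
        "B1.2.3": p1 * p2,
        "B1.3": p1 * (1 - p2),
        "B2.3": (1 - p1) * p2,
        "B3": (1 - p1) * (1 - p2),
    }

probs = needed_probabilities({"C1": 0.9, "C2": 0.8})
print(probs)
```

Note that the four third-level probabilities always sum to one, matching the earlier observation that exactly one of {B3, B1.3, B2.3, B1.2.3} is ever used to decide C3.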
The above equations can be extended to builds in the next level: B1.2.3 is only needed under the condition that both B1 and B1.2 succeed. This can be written as

    P^{needed}_{B_{1.2.3}} = P^{succ}_{B_1} · P^{succ}_{B_{1.2}|B_1}        (3)

where P^{succ}_{B_{1.2}|B_1} denotes the probability that B1.2 succeeds given the success of B1. Let us consider the inverse. There are … builds to consider for C3, and the choice is solely based on C1. Therefore, the total number of possible builds decreases from seven to five.

Figure 7. Conflict graph and speculation graph for C1 conflicting with both C2 and C3.

5.1 Build Systems & Targets

In order to build a conflict graph among pending changes, the conflict analyzer relies on the build system. A build system (e.g., Bazel [1], Buck [3]) partitions the code into smaller entities called targets. Roughly speaking, a target is a list of source files and their dependencies, along with a specification of how these entities can be translated into one or more output files. The directed acyclic graph of targets is then executed in a topological order, such that the executed build actions are hermetic.

5.2 Conflict Detection

Every change modifies a set of build targets, which is the transitive closure of the targets whose source files are modified by the change, along with the set of targets that depend on them. Roughly speaking, two changes conflict if they both affect a common set of build targets.

To this end, we associate every build target with a unique target hash that represents its current state. We then use target hashes to detect which build targets are affected by a change. Algorithm 1 demonstrates how a target hash is computed for a given target. First, we find all the dependencies of a target, recursively compute their target hashes, and update a message digest with these hashes (lines 3 to 5). Second, we find all the source files that the target depends on, and update the message digest with their contents (lines 6 to 8). Third, we convert the message digest to a target hash: a fixed-length hash value.

Algorithm 1 Calculating Target Hash Algorithm
1: function thash(target)
2:     md ← ∅
3:     deps ← dependentTargetsOf(target)
4:     for dep ∈ deps do
5:         md ← md ⊕ thash(dep)
6:     srcs ← sourceFilesOf(target)
7:     for src ∈ srcs do
8:         md ← md ⊕ contentsOf(src)
9:     return hash(md)

Formally, let δ_{H⊕Ci} denote the set of targets affected by Ci against HEAD. An affected target is defined as a 2-tuple (name, hash), where name represents the target's name, and hash represents the target's hash value after applying Ci. Also, let ∩name denote intersection on target names. Thus, δ_{H⊕Ci} ∩name δ_{H⊕Cj} ≠ ∅ implies that Ci and Cj conflict, since there are some common targets affected by both changes.

Two changes might still conflict even if they do not affect a common set of targets. For example, consider the case shown in Figure 8. The original build graph for the HEAD contains three targets, where target Y depends on target X, while target Z is independent. Numbers inside circles denote target hash values. Applying C1 to the HEAD leads to targets X and Y being affected. Thus, they have different target hashes compared to targets X and Y in the original build graph. More precisely, δ_{H⊕C1} = {(X, 4), (Y, 5)}. After applying C2 to the HEAD, target Z along with its dependencies changes: δ_{H⊕C2} = {(Z, 6)}. Observe that while these two changes conflict with each other, simply computing the intersection of affected targets does not yield any conflict.

Consequently, to accurately determine that two changes are independent, we also need to make sure that every hash of an affected target after applying both changes to the HEAD is either observed after applying the first change to the HEAD, or observed after applying the second change to the HEAD. If neither holds, it implies that the composition of both changes results in a new affected target, and the two changes conflict. Formally, Ci conflicts with Cj if the following equation holds:

    δ_{H⊕Ci} ∪ δ_{H⊕Cj} ≠ δ_{H⊕Ci⊕Cj}        (6)

For instance, in Figure 8, δ_{H⊕C1} ∪ δ_{H⊕C2} = {(X, 4), (Y, 5), (Z, 6)}, while δ_{H⊕C1⊕C2} = {(X, 4), (Y, 5), (Z, 7)}.

Determining whether two changes conflict based on the above approach requires building the target graphs four times, using H, H ⊕ Ci, H ⊕ Cj, and H ⊕ Ci ⊕ Cj. Therefore, for committing n changes, we need to compute the build graphs approximately n² times. Since it may take several minutes to compute the build graph for a change on a big monorepo with millions of lines of code, the above solution cannot scale to thousands of commits per day.

Observe that Equation 6 is only needed if a build graph changes as a result of applying a change. We observed that only 7.9% (resp. 1.6%) of changes actually cause a change to the build graph for the iOS (resp. Backend) monorepos over a span of five months.
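A minimal sketch of the conflict test follows: the quick name-intersection check, plus the Equation 6 check on affected (name, hash) sets, using the Figure 8 numbers. The data layout (plain sets of tuples) is a toy assumption, not SubmitQueue's actual representation.

```python
def names(delta):
    """Target names appearing in an affected set of (name, hash) pairs."""
    return {name for name, _ in delta}

def conflict(delta_i, delta_j, delta_ij):
    # Quick check: common affected target names imply a conflict.
    if names(delta_i) & names(delta_j):
        return True
    # Equation 6: a conflict also exists when composing both changes
    # produces an affected set different from the union of the two.
    return delta_i | delta_j != delta_ij

# Figure 8 example: (name, hash) pairs affected by each application.
delta_c1 = {("X", 4), ("Y", 5)}
delta_c2 = {("Z", 6)}
delta_c1c2 = {("X", 4), ("Y", 5), ("Z", 7)}  # Z hashes differently when composed

print(conflict(delta_c1, delta_c2, delta_c1c2))  # True, despite disjoint names
```

This reproduces the point of the example: the name intersection is empty, yet the composed affected set differs from the union, so the changes conflict.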
Figure 8. Changes in the target graph with different changes, showing the build graphs for H, H ⊕ C1, H ⊕ C2, and H ⊕ C1 ⊕ C2. A circle denotes a target hash value. An arrow from target X to target Y implies that the output of target X is required for building target Y. A shaded target node means the target is affected by the corresponding change.
Therefore, if the build graph remains unchanged, we just need to use the intersection of changed targets to decide whether the two changes conflict. This leads to a substantial elimination of needed build graphs.

Alternatively, we can also determine if Ci and Cj conflict by solely computing a union of the build graphs for H, H ⊕ Ci, and H ⊕ Cj, instead of also computing a build graph for H ⊕ Ci ⊕ Cj. This approach requires n build graphs for detecting conflicts among n changes.

Let GH denote the build graph at HEAD, where a node is a target, and an edge is a dependency between two nodes. The following algorithm determines if two changes conflict:

• Step 1: we first build a union graph among GH, GH⊕Ci, and GH⊕Cj as follows: for every node in any of these build graphs, we create a corresponding node in the union graph. Every node in the union graph contains all the target hashes of its corresponding nodes in the individual build graphs. Precisely, a node in the union graph is a 4-tuple of the target name and all of its corresponding hashes in GH, GH⊕Ci, and GH⊕Cj. For example, considering Figure 8, we will have the following nodes: (A, 2, 5, 2), (B, 1, 4, 1), and (C, 3, 3, 6). Lastly, we add an edge between two nodes in the union graph if there exists an edge between their corresponding nodes in any of the build graphs.
• Step 2: a node is tagged as affected by Ci (resp. Cj) if its target hash in GH is different from its target hash in GH⊕Ci (resp. GH⊕Cj). Considering the example shown in Figure 8, targets X and Y are tagged affected by C1, and target Z is tagged affected by C2 in the union graph.
• Step 3: we then traverse the union graph in topological order, and mark a visited node as affected by Ci (resp. Cj) if any of the nodes it depends on is affected by Ci (resp. Cj). Therefore, in our example, target Z is tagged affected by both C1 and C2.
• Step 4: Ci and Cj conflict if there exists a node that is affected by both Ci and Cj.

6 Planner Engine & Build Controller

Based on the number of available resources, the planner engine contacts the speculation engine on every epoch, and receives a prioritized list of needed builds. The planner engine then terminates builds that are currently running, but are not part of the received list. It then schedules new builds according to their priorities in the list using the build controller. In turn, the build controller employs the following techniques to efficiently schedule builds.

Minimal Set of Build Steps. Instead of performing all build steps, the build controller eliminates build steps that are being executed by prior builds. For instance, when scheduling B1.2.3 (the build steps for H ⊕ C1 ⊕ C2 ⊕ C3), the build controller can eliminate the build steps that are already performed by prior builds (i.e., B1 or B1.2). Therefore, the build controller only needs to perform build steps for the targets affected by C3 (i.e., δ_{H⊕C1⊕C2⊕C3} − δ_{H⊕C1⊕C2}).

Load Balancing. Once the list of affected targets is determined, the build controller uniformly distributes the work among the available worker nodes. Accordingly, it maintains a history of the build steps that were performed, along with their average build durations. Based on this data, the build controller assigns build steps to workers such that every worker has an even amount of work.

Caching Artifacts. In order to further improve turnaround times, the build controller also leverages the caching mechanisms that exist in build systems to reuse generated artifacts, instead of building them from scratch.

7 Implementation

7.1 API & Core Services

SubmitQueue's API Service is a stateless backend service that provides the following functions: landing a change, and getting the state of a change. The service is implemented in 5k lines of Java using Dropwizard [6], a popular Java framework for developing RESTful web services. It also features a web UI using cycle.js [5] to help developers track the state of their changes.

The core service is implemented in 40k lines of Java code. It uses Buck [3] as the underlying build system for constructing conflict graphs, MySQL as the backend storage system, Apache Helix for sharding queues across machines, and RxJava for communicating events within a process.
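The four-step union-graph check described in Section 5 can be sketched as follows. Each build graph is modeled as a toy dict mapping a target name to (hash, dependency names), which is an assumption made for illustration; the numbers mirror the Figure 8 example, where C2 makes Z depend on X.

```python
def union_graph_conflict(g_h, g_i, g_j):
    """Sketch of the four-step union-graph check for H, H+Ci, and H+Cj."""
    nodes = set(g_h) | set(g_i) | set(g_j)

    # Step 2: seed tags where a hash differs from its hash at HEAD.
    def seed(g):
        return {t for t in nodes
                if g.get(t, (None,))[0] != g_h.get(t, (None,))[0]}
    aff_i, aff_j = seed(g_i), seed(g_j)

    # Step 1 (edges): union of edges from all three graphs (dep -> dependents).
    dependents = {t: set() for t in nodes}
    for g in (g_h, g_i, g_j):
        for t, (_, deps) in g.items():
            for d in deps:
                dependents.setdefault(d, set()).add(t)

    # Step 3: propagate "affected" along dependency edges (a fixpoint in
    # place of the paper's topological traversal).
    for aff in (aff_i, aff_j):
        frontier = list(aff)
        while frontier:
            for t in dependents.get(frontier.pop(), ()):
                if t not in aff:
                    aff.add(t)
                    frontier.append(t)

    # Step 4: a node affected by both changes means Ci and Cj conflict.
    return bool(aff_i & aff_j)

g_h = {"X": (1, ()), "Y": (2, ("X",)), "Z": (3, ())}
g_c1 = {"X": (4, ()), "Y": (5, ("X",)), "Z": (3, ())}      # C1 touches X (thus Y)
g_c2 = {"X": (1, ()), "Y": (2, ("X",)), "Z": (6, ("X",))}  # C2 makes Z depend on X

print(union_graph_conflict(g_h, g_c1, g_c2))  # True: Z ends up affected by both
```

As in the text's example, the propagation step tags Z as affected by both changes, so the two changes are reported as conflicting even though their seeded affected sets are disjoint.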
One challenge in dealing with a speculation graph is the number of builds that need to be considered to find the most valuable builds. More precisely, for n pending changes in the system, the speculation engine needs to consider and sort 2^n builds in the worst case. In order to scale to hundreds of concurrent pending changes, SubmitQueue uses a greedy best-first algorithm that visits the nodes having the highest probability first. Because edges in the speculation graph represent the probability that a path is needed, the value of a node always goes down as the path length increases. Hence, greedy best-first search helps SubmitQueue navigate the speculation graph optimally in memory while scaling to hundreds of pending changes per hour. Observe that the required space complexity of this approach is O(n).

7.2 Model Training

We trained our success prediction models predictSuccess(Ci) and predictConflict(Ci, Cj), which estimate P^{succ}_{C_i} and P^{conf}_{C_i,C_j}, in a supervised manner using logistic regression. We selected historical changes that went through SubmitQueue, along with their final results, for this purpose. We then extracted around 100 handpicked features. Our feature set is almost identical for training both of the above models, and can be categorized as follows:

Change. Clearly, changes play the most important role in predicting the likelihood of a change succeeding, or conflicting with another change. Thus, we considered several features of a change, including (i) the number of affected targets, (ii) the number of git commits inside a change, (iii) the number of binaries added/removed, (iv) the number of files changed, along with the number of lines/hunks added or removed, and (v) the status of initial tests/checks run before submitting a change.

Revision. As explained in Section 3, a revision is a container for storing changes. We observed that revisions with multiple submitted changes tend to succeed more often than newer revisions. Additionally, we noticed that the nature of a revision has an impact on its success. Therefore, we included features such as (i) the number of times changes are …

… we selected several developer features such as their name, employment length, and levels.

Speculation. The features identified above, associated with revisions/changes, are static: they do not change over time for a given land request. Therefore, while they can initially help to determine the likelihood of a change succeeding or conflicting with another change, they cannot help as SubmitQueue performs different speculations. This leads to high mis-speculation penalties when initial speculations are not accurate. To sidestep this issue, the numbers of speculations that succeeded or failed were also included for training.

The dataset was then divided into two sets: 70% for training the model and 30% for validating it. The model was trained using scikit [10] in Python, and has an accuracy of 97%. To avoid overfitting the models, and to keep the computation during actual prediction fast, we also ran our model through recursive feature elimination (RFE) [25]. This helped us reduce the set of features to just the bare minimum.

In our best performing model, the following features had the highest positive correlation scores: (1) the number of succeeded speculations, (2) revision revert and test plans, and (3) the number of initial tests that succeeded before submitting a change. In contrast, the number of failed speculations, and the number of times changes were submitted to a revision, had the most negative weights. We also note that while developer features such as the developer name had high predictive power, the correlation varied based on the developer.

8 Evaluation

We evaluated SubmitQueue by comparing it against the following approaches:

• Speculate-all: as discussed in Section 4.1, this approach tries all possible combinations that exist in the speculation graph. Therefore, it assumes that the probability of a build succeeding is 50%.
• Single-Queue: where all non-independent changes are enqueued and processed one by one, à la Bors [2]. Independent changes, on the other hand, are processed in parallel.
submitted to a revision, and (ii) revert and test plan suggested • Optimistic speculation: similar to Zuul [12], this ap-
as part of the revision. proach selects a path in the speculation graph by as-
suming that all concurrent changes will succeed. There-
Developer. Developers also play a major role in determining fore, a pending change starts performing its build steps
whether a change may succeed or not. For instance, expe- assuming that all the pending changes that were sub-
rienced developers do due diligence before landing their mitted before it will succeed. If a change fails, all the
changes while inexperienced developers tend to land buggy builds that speculated on the success of the failed
changes more often. Some developers may also work on change need to get aborted, and a new path in the
more fragile code-paths (e.g., core libraries). Thus their ini- speculation graph is selected.
tial land attempts fail more often. Additionally, we observed To compare these approaches in a more meaningful way,
that developer data helped us tremendously in detecting we also implemented an Oracle, and normalized turnaround
chances of conflicts among changes. This is due to the fact time and throughput of the above approaches against it. Our
that developers working on the same set of features (or code Oracle implementation can perfectly predict the outcome of
path) conflict with each other more often. Consequently, a change. Since Oracle always makes the right decision in
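The greedy best-first exploration of the speculation graph (Section 7.1) can be sketched as follows. This is a minimal illustration, not Uber's implementation: the tuple encoding of outcome prefixes, the fixed build budget, and the binary success/failure model are all assumptions made for the sketch.

```python
import heapq

def select_builds(success_probs, budget):
    """Greedily pick the most valuable speculative builds.

    Each node of the (implicit) binary speculation tree is a prefix of
    assumed pass/fail outcomes for the pending changes; its value is the
    probability that this prefix actually occurs. A child's value is its
    parent's value scaled by an outcome probability (<= 1), so values can
    only shrink with depth -- which is why expanding the highest-value
    frontier node first visits builds in non-increasing value order
    without ever materializing the full 2^n-node tree.
    """
    # Max-heap entries: (negated value, depth, outcome prefix).
    # The first build assumes nothing and is needed with probability 1.
    heap = [(-1.0, 0, ())]
    chosen = []
    while heap and len(chosen) < budget:
        neg_value, depth, prefix = heapq.heappop(heap)
        chosen.append((prefix, -neg_value))
        if depth < len(success_probs):
            p = success_probs[depth]
            heapq.heappush(heap, (neg_value * p, depth + 1, prefix + (True,)))
            heapq.heappush(heap, (neg_value * (1 - p), depth + 1, prefix + (False,)))
    return chosen
```

With `success_probs=[0.9, 0.8]` and a budget of 3, the sketch picks the empty prefix, then the "first change succeeds" build (value 0.9), then "first two succeed" (value 0.72), skipping the unlikely failure branches. The frontier heap grows by at most one entry per selected build, which is where the linear space bound comes from.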
EuroSys ’19, March 25–28, 2019, Dresden, Germany Ananthanarayanan et al.
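The Section 7.2 training setup can be sketched roughly as follows. This is a hedged illustration using scikit-learn with synthetic stand-in data: the real models are trained on around 100 handpicked features extracted from historical changes, and the 97% accuracy reported above applies to those models, not to this toy example.

```python
# Sketch of the training pipeline: logistic regression, a 70/30
# train/validation split, and recursive feature elimination (RFE).
# The synthetic feature matrix below is a placeholder assumption.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))  # stand-in for the handpicked features
# Stand-in label: "did the change's build succeed?", driven by two features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# 70% for training the model, 30% for validating it.
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

# RFE prunes the feature set down to the most predictive subset, keeping
# prediction-time computation cheap and reducing overfitting.
model = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
model.fit(X_tr, y_tr)

accuracy = model.score(X_va, y_va)
# Probability estimate analogous to predictSuccess(Ci) for one change.
p_succ = model.predict_proba(X_va[:1])[0, 1]
```

The fitted model's `predict_proba` output plays the role of the P^succ_Ci and P^conf_Ci,Cj estimates that drive speculation.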
[Figure: turnaround time of each approach, normalized against Oracle, as #Changes/Hour (100–500) and #Workers (100–500) vary. Panels: (a) SubmitQueue P50, (b) SubmitQueue P95, (c) SubmitQueue P99, (d) Speculate-all P50, (e) Speculate-all P95, (f) Speculate-all P99, (g) Optimistic P50, (h) Optimistic P95, (i) Optimistic P99 turnaround time.]
behavior with 200 changes per hour. Additionally, for 100 changes per hour, we noticed that SubmitQueue's throughput matches Oracle's. Throughput results for 100 and 200 changes per hour are not included.

Among the studied solutions, Single-Queue has the worst throughput (i.e., a 95% slowdown). Surprisingly, we also observed that not only does the Optimistic approach do worse than Speculate-all, its throughput remains unchanged as we increase the number of workers. This is due to the fact that, with the Optimistic approach, the system assumes that all changes will succeed.
[Figure: throughput of each approach, normalized against Oracle, for (a) 300, (b) 400, and (c) 500 changes per hour, as #Workers varies from 100 to 500.]
[Figure 13. P95 turnaround time improvement when using conflict analyzer, for (a) 300, (b) 400, and (c) 500 changes per hour, as #Workers varies from 100 to 500.]
Since the number of independent changes is always less than 100 in our workload, the throughput remains the same as we increase the number of machines.

Finally, observe that adding more workers does not have any effect on the Speculate-all approach. Since our speculation graph is deep, adding a few hundred workers does not have any effect on throughput. However, for wide speculation graphs (i.e., more independent changes), we expect much better performance.

8.4 Benefits of Conflict Analyzer
Figure 13 shows how the conflict analyzer improves the 95th-percentile turnaround time of the different approaches. The turnaround time of Oracle improves by up to 60% with the conflict analyzer. Likewise, both SubmitQueue and Speculate-all substantially benefit from the conflict analyzer, as it helps them perform more parallel builds.

Surprisingly though, the effect of the conflict analyzer on the Optimistic approach is only 20%. Moreover, observe that for both the Optimistic and Single-Queue approaches, the turnaround time improvement remains constant as we increase the number of workers. This is due to the fact that the build graph of the iOS monorepo is very deep (i.e., it has only a handful of leaf-level nodes), resulting in a large number of conflicts among changes. Consequently, the speculation graph has few independent changes that can execute and commit in parallel. Therefore, we expect substantially better improvements when using the conflict analyzer for repositories that have a wider build graph.

[Figure 14. State of the mainline for iOS monorepo prior to SubmitQueue launch (8/31–9/6).]

8.5 Mainline State Prior to SubmitQueue
Figure 14 shows the state of the iOS mainline prior to SubmitQueue. Over a span of one week, the mainline was green
Keeping Master Green at Scale EuroSys ’19, March 25–28, 2019, Dresden, Germany
only 52% of the time, affecting both development and rollouts. This clearly exhibits the need for a change management system that guarantees an always green mainline for fast-moving projects involving thousands of developers. Since its launch, our mainlines have remained green at all times.

8.6 Survey on Benefits of SubmitQueue
We conducted an internal survey and asked for feedback from developers, release engineers, and release managers. Specifically, we sought answers to the following two questions: (1) what is the perceived impact of an always green master on personal productivity and on getting code to production, and (2) how good is the performance of SubmitQueue in production? Out of the 40 people who answered the survey, close to 70% had some previous experience committing to monorepos, and 60% of responders had experience with an always green mainline.

92% of survey takers believed that enforcing an always green mainline positively impacts their personal productivity, while 8% saw no impact. Interestingly, none of the responders claimed that an always green master negatively impacts their productivity. Similarly, 88% of responders believed that an always green master positively impacts deploying code to production, while only one responder thought it imposes a negative impact.

As for the performance of SubmitQueue on a scale of 1 to 5 (where 5 is the best), 94% of responders felt that it was either on par with or significantly better than prior experiences.

9 Related Work
In Section 2, we reviewed various solutions for preventing master breakages, and different techniques for recovering from them. In this section, we focus on other research works that are relevant to SubmitQueue but not directly comparable to it.

Proactive Conflict Detection. An alternative approach to reducing the number of regressions in a monorepo is proactive detection of potential conflicts among developers working on the same project. These solutions provide awareness to developers, notifying them that parallel changes are taking place. The granularity of these solutions can be file-level [15], method-level using static analysis [21, 24, 41], or repository-level [16]. Yet, these solutions do not prevent mainline breakages; they solely notify developers of potential breakages, and it is up to developers to further investigate and prevent a breakage. Unlike SubmitQueue, these solutions do not scale, because the number of commits that needs to be considered is extremely high in large monorepos. Additionally, these approaches encourage developers to rush committing their changes in order to avoid merging with conflicting changes [20].

Transactional Systems. Transactional systems, including distributed data stores, software transactional memories, and databases, use concurrency control schemes to ensure that committing a transaction does not break their consistency model (e.g., serializability). To guarantee consistency, they employ different concurrency control mechanisms, such as pessimistic two-phase locking (2PL) (e.g., Spanner [19] and Calvin [43]), optimistic concurrency control (e.g., Percolator [36] and CLOCC [13]), or conflict reordering (e.g., Rococo [34]). These schemes are tailored for relatively short-running transactions. SubmitQueue, on the other hand, aims at ensuring serializability for transactions (i.e., changes) that may need hours to finish. Therefore, the penalty of both optimistic and pessimistic execution is extremely high, as shown in Section 8. Additionally, conventional transactional systems can easily validate a transaction to make sure that it does not conflict with another transaction before committing it. SubmitQueue, on the other hand, needs to ensure the integrity of the mainline by performing several build steps before being able to commit a change.

Bug & Regression Detection. Bug and regression detection has been studied extensively by the PL, software engineering, and systems communities. These works can be classified into the following categories: (i) techniques [28, 45] to identify bugs prior to submitting changes to repositories, (ii) tools [7, 29, 30] to find bugs after a change is submitted, and (iii) solutions [14, 18, 31, 44] that help engineers and developers detect and localize bugs post deployment.

Kamei et al. [28] and Yang et al. [45] introduce different ML techniques to detect whether a change is risky and might cause regressions before it is submitted to a repository. To this end, they use information such as reviews, authors, and modified files. Similarly, SubmitQueue uses an ML model to determine the likelihood of a change's build succeeding or not. However, unlike these techniques, SubmitQueue also considers concurrent pending changes. Additionally, instead of rejecting a change, SubmitQueue uses the outcome of its model to determine the builds that are more likely to succeed.

Developers can easily leverage many of the above tools to detect bugs prior to deploying their code by introducing new build steps, and let SubmitQueue run those build steps before committing the code into the repository. For instance, one can use various tools [8, 11, 22, 26, 35] to eliminate NullPointerException errors, or detect data races, simply by adding extra build steps that need to run upon submitting new changes.

Test Selection. As the number of changes and the number of tests grow in many giant repositories, it may not be feasible to perform all tests during builds. To address this issue, several solutions have been proposed based on static analysis [17, 33, 40, 46], dynamic analysis [23, 39], and data-driven approaches [32] using an ML model. Therefore, upon committing
[17] Ahmet Celik, Marko Vasic, Aleksandar Milicevic, and Milos Gligoric. 2017. Regression Test Selection Across JVM Boundaries. In Joint Meeting on Foundations of Software Engineering (ESEC/FSE). 809–820.
[18] Trishul M. Chilimbi, Ben Liblit, Krishna Mehra, Aditya V. Nori, and Kapil Vaswani. 2009. HOLMES: Effective Statistical Debugging via Efficient Path Profiling. In International Conference on Software Engineering (ICSE). 34–44.
[19] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google's Globally-Distributed Database. In Symposium on Operating Systems Design and Implementation (OSDI). 251–264.
[20] Cleidson R. B. de Souza, David F. Redmiles, and Paul Dourish. 2003. "Breaking the code", moving between private and public work in collaborative software development. In International Conference on Supporting Group Work (GROUP). 105–114.
[21] Prasun Dewan and Rajesh Hegde. 2007. European Conference on Computer Supported Cooperative Work (ECSCW). 159–178.
[22] Dawson Engler and Ken Ashcraft. 2003. RacerX: Effective, Static Detection of Race Conditions and Deadlocks. In Symposium on Operating Systems Principles (SOSP). 237–252.
[23] Milos Gligoric, Lamyaa Eloussi, and Darko Marinov. 2015. Practical Regression Test Selection with Dynamic File Dependencies. In International Symposium on Software Testing and Analysis (ISSTA). 211–222.
[24] Mário Luís Guimarães and António Rito Silva. 2012. Improving Early Detection of Software Merge Conflicts. In International Conference on Software Engineering (ICSE). 342–352.
[25] Isabelle Guyon, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. 2002. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning 46, 1 (Jan. 2002), 389–422.
[26] Jeff Huang, Patrick O'Neil Meredith, and Grigore Rosu. 2014. Maximal Sound Predictive Race Detection with Control Flow Abstraction. In Conference on Programming Language Design and Implementation (PLDI). 337–348.
[27] Ciera Jaspan, Matthew Jorde, Andrea Knight, Caitlin Sadowski, Edward K. Smith, Collin Winter, and Emerson Murphy-Hill. 2018. Advantages and Disadvantages of a Monolithic Repository: A Case Study at Google. In International Conference on Software Engineering (ICSE). 225–234.
[28] Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi. 2013. A large-scale empirical study of just-in-time quality assurance. IEEE Transactions on Software Engineering 39, 6 (2013), 757–773.
[29] Sunghun Kim, E. James Whitehead, Jr., and Yi Zhang. 2008. Classifying Software Changes: Clean or Buggy? IEEE Transactions on Software Engineering 34, 2 (2008), 181–196.
[30] S. Kim, T. Zimmermann, K. Pan, and E. J. Whitehead, Jr. 2006. Automatic Identification of Bug-Introducing Changes. In International Conference on Automated Software Engineering (ASE). 81–90.
[31] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. 2005. Scalable Statistical Bug Isolation. In Conference on Programming Language Design and Implementation (PLDI). 15–26.
[32] Mateusz Machalica, Alex Samylkin, Meredith Porth, and Satish Chandra. 2018. Predictive Test Selection. Computing Research Repository (CoRR) abs/1810.05286 (2018). arXiv:1810.05286 http://arxiv.org/abs/1810.05286
[33] Atif Memon, Zebao Gao, Bao Nguyen, Sanjeev Dhanda, Eric Nickell, Rob Siemborski, and John Micco. 2017. Taming Google-scale Continuous Testing. In International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). 233–242.
[34] Shuai Mu, Yang Cui, Yang Zhang, Wyatt Lloyd, and Jinyang Li. 2014. Extracting More Concurrency from Distributed Transactions. In Symposium on Operating Systems Design and Implementation (OSDI). 479–494.
[35] Mayur Naik, Alex Aiken, and John Whaley. 2006. Effective Static Race Detection for Java. In Conference on Programming Language Design and Implementation (PLDI). 308–319.
[36] Daniel Peng and Frank Dabek. 2010. Large-scale incremental processing using distributed transactions and notifications. In Symposium on Operating Systems Design and Implementation (OSDI). 251–264.
[37] Dewayne E. Perry, Harvey P. Siy, and Lawrence G. Votta. 2001. Parallel Changes in Large-scale Software Development: An Observational Case Study. ACM Transactions on Software Engineering and Methodology 10, 3 (July 2001), 308–337.
[38] Rachel Potvin and Josh Levenberg. 2016. Why Google Stores Billions of Lines of Code in a Single Repository. Commun. ACM 59 (2016), 78–87.
[39] Gregg Rothermel and Mary Jean Harrold. 1997. A Safe, Efficient Regression Test Selection Technique. ACM Transactions on Software Engineering and Methodology 6, 2 (April 1997), 173–210.
[40] Barbara G. Ryder and Frank Tip. 2001. Change Impact Analysis for Object-oriented Programs. In Workshop on Program Analysis for Software Tools and Engineering (PASTE). 46–53.
[41] Anita Sarma, Gerald Bortis, and Andre van der Hoek. 2007. Towards Supporting Awareness of Indirect Conflicts Across Software Configuration Management Workspaces. In International Conference on Automated Software Engineering (ASE). 94–103.
[42] Chunqiang Tang, Thawan Kooburat, Pradeep Venkatachalam, Akshay Chander, Zhe Wen, Aravind Narayanan, Patrick Dowell, and Robert Karl. 2015. Holistic Configuration Management at Facebook. In Symposium on Operating Systems Principles (SOSP). 328–343.
[43] Alexander Thomson, Thaddeus Diamond, Philip Shao, and Daniel J. Abadi. 2012. Calvin: Fast Distributed Transactions for Partitioned Database Systems. In International Conference on the Management of Data (SIGMOD). 1–12.
[44] Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou. 2007. Triage: Diagnosing Production Run Failures at the User's Site. In Symposium on Operating Systems Principles (SOSP). 131–144.
[45] X. Yang, D. Lo, X. Xia, Y. Zhang, and J. Sun. 2015. Deep Learning for Just-in-Time Defect Prediction. In International Conference on Software Quality, Reliability and Security (QRS). 17–26.
[46] Lingming Zhang. 2018. Hybrid Regression Test Selection. In International Conference on Software Engineering (ICSE). 199–209.
[47] Celal Ziftci and Jim Reardon. 2017. Who Broke the Build?: Automatically Identifying Changes That Induce Test Failures in Continuous Integration at Google Scale. In International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP). 113–122.