
Chapter 1. Section 1.1, "Understanding Improvement".

This is the "Introduction to DevOps" course, and
I'm the instructor, John Willis, @botchagalupe on Twitter.
So, understanding improvement - a lot of what we do in DevOps is about improvement.
In fact, the first chapter here, I really wanted to focus on improvement as a
concept, because
throughout this course we're going to understand that continuous improvement is a
cornerstone in DevOps and,
there's a story, this Knight Capital story, and it's a cautionary tale
in that we want to say that the consequences of failure
have never been greater.
The Knight Capital story is ... this was a high frequency trading company,
it was listed as the most highly active company on the NYSE and,
basically what happened, the short story was that they had some bad code get put
into production,
that made erroneous trades and they lost about four hundred million dollars in
about 45 minutes
and, within 24 hours,
went pretty much out of business.
The longer story is, and there are really good links:
we have a link to the Wikipedia page, but the SEC filing is fascinating,
and there's a link to that, and then
John Allspaw, the CTO of Etsy, did a great write-up, kind of a criticism of the
SEC write-up. And
one of the things I want to be cautious about here is that, as we use Knight Capital as a
cautionary tale,
we need to understand that we are applying hindsight bias.
So, think more about the kind of meta of what happened than about what really happened
in this particular case.
Anyway, the story goes that they had an application that they were going to deploy.
It had to be deployed on eight servers, a cluster of eight servers
and, what happened is seven of the servers got updated and the eighth did not.
The new application re-purposed a flag that an old application used.
The old application was still alive on all eight servers,
but dormant, because the flag had been turned off, but when the new application got
deployed to the seven servers, everything was fine.
Unfortunately, the eighth server activated the old application.
The old application arose and basically started making stock trades
that were not real stock trades, and the disaster happened for Knight Capital.
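One way to picture the kind of check that catches a partial rollout like this is sketched below. It is a minimal Python sketch under stated assumptions, not a description of what Knight Capital actually ran: the server names, the version labels, and the functions `find_stale_servers` and `safe_to_enable_flag` are all hypothetical. The idea is simply that a shared flag stays off until every server in the cluster reports the new version.

```python
def find_stale_servers(deployed_versions, expected_version):
    """Return the sorted names of servers not running expected_version."""
    return sorted(
        name
        for name, version in deployed_versions.items()
        if version != expected_version
    )


def safe_to_enable_flag(deployed_versions, expected_version):
    """Allow the flag only when every server runs the expected version."""
    return not find_stale_servers(deployed_versions, expected_version)


# Hypothetical cluster of eight servers; the eighth missed the update.
cluster = {f"server-{i}": "new" for i in range(1, 8)}
cluster["server-8"] = "old"

print(find_stale_servers(cluster, "new"))   # ['server-8']
print(safe_to_enable_flag(cluster, "new"))  # False
```

With a gate like this in the deployment path, the flag would not be turned on while one server still carried the old application.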
So, if we want to use this as a cautionary tale, some of the
hindsight we can apply is that it looks like they were doing manual installs
and
not taking advantage of automation,
things like Chef, or Puppet, or CFEngine.
We talk often about this concept of "pets vs. cattle" in
DevOps:
"Do we treat our servers as pets?" or "Do we treat our servers as cattle?"
One could argue it sounds like Knight Capital treated their servers as pets,
because it looks like they didn't rebuild them frequently.
There was an old application still on them; in fact, if I remember the SEC
filing, that old application was seven years old.
So, it looked like they were not rebuilding servers.
So, there are all sorts of, you know, what I would call operations or systems
administration hygiene problems here.
But we know that the consequences of failure apply across the board. In fact,
just the other day, Slack was down, and
in the organization I work in, it kind of crippled us.
And if you follow Twitter, you saw that a lot of people who rely on Slack for
their business
were kind of disrupted.
You know, we know when cloud providers like Amazon go down, it affects all sorts of
commerce.
So, the idea is that today IT is extremely important, and
if we don't understand improvement as it relates to failure,
again, the consequences have never been greater.
So, the key to understanding performance, which is basically what we'll look at
in this chapter, is this:
if we look at what we would consider high-performing organizations versus
low-performing organizations,
we would say that companies that use DevOps patterns, if you will,
would be the high-performing organizations.
The cloud titans are the ones that would be obvious:
Amazon, Google, Facebook, Etsy, and Netflix.
These are companies that routinely deploy hundreds of times a day to production.
There are stories of Amazon actually deploying thousands of times per hour.
A lot of these companies will deploy ... they'll have their employees, first day
employees, deploy to production.
You know, Facebook brags about some of their employees who actually put code in
production before they finish their paperwork - the employee paperwork.
Etsy has some videos where their Board members come in and take a story off the
board and put it in production.
And then, we contrast this with organizations that struggle to deploy
more than maybe twice a year and have kind of a waterfall deployment model.
And most of those, we'll see, through some survey data and other areas, tend to be
low-performing organizations.
The other thing we really need to think about when we talk about understanding
improvement,
is the classic core conflict in IT
that you can get reliability or you can get speed.
And here, furthermore we take the kind of Dev and Ops, right?
Developers typically drive speed. The Agile Movement was all about speed.
DevOps actually was born out of Agile: a fair number of people who were familiar with
and worked in an Agile mode took on the responsibility of operations,
and wound up wanting to move deployment as fast as they did development,
because, in essence, a lot of them were developers.
So, DevOps is a lot about changing that, about being able to show that we can get, you know,
speed and reliability.
We also have what we call the Iron Triangle on the right, right? This is out of
service management. I tell ... You can get two: you
can get speed and cost, but you can't get reliability, and you can get reliability
and speed, but it's going to cost you a lot.
But, what we find with DevOps is, there really is no conflict.
There is no conflict in the speed and reliability, we'll show this in a little bit
in a bunch of examples,
and, in fact, Adrian Cockcroft, one of the primary architects of Netflix,
has presentations where he calls it "faster, cheaper, safer".
So, you can get it all: you get speed, you get reduced cost, and you get
reliability.
And we see this time after time, and not only in what I
would call the web titans, the large, massively scalable web organizations.
We also see it now in enterprises: legacy enterprises, hundred-year-old companies.
So, one of the most interesting things that has happened over the last four or five
years is the DevOps survey,
the "State of DevOps Report".
It's run by IT Revolution and Puppet Labs.
It's been running for about four years, and about 20,000 DevOps professionals
have been surveyed.
The 2015 report probably had the most significant data.
What we find is this contrast between high-performing organizations and
low-performing organizations.
And this is a statistically sound survey.
Nicole Forsgren, she's a PhD Statistics and Psychometrics, so the data is valid.
And what we found in 2015
is that high-performing organizations have certain cultural
behavior patterns
that allow them to deploy software 30 times more frequently than
low-performing organizations.
These same organizations have 200 times shorter Lead Time.
Now, Lead Time: there are different ways that
people define and measure Lead Time.
I like to say it's from the Whiteboard to the Ka-ching,
or to production.
But, in any case, even if it's just from a story to deployment,
it measures an amount of time, and it really measures how fast you are.
So, these first two metrics say that high-performing organizations are certainly faster:
they're faster in how often they deploy, and they're faster in the time
it takes them to put a new fix, a change, or a new idea into production.
But here's the ... the interesting thing: the same metrics for speed also correlate
with metrics for reliability.
So, high-performing organizations also have 60 times fewer failures when they deploy
code and make changes.
And then, my favorite metric in the survey: high-performing
organizations have 166 times faster MTTR, mean time to resolve.
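To make the four measurements concrete, here is a minimal Python sketch of how a team might compute them from its own deployment log. It is an illustration under stated assumptions, not the survey's methodology: the record fields (`committed`, `deployed`, `failed`, `restore`), the sample data, and the function names are all hypothetical, and Lead Time is taken here as commit-to-deploy time, which is only one of the possible definitions.

```python
from datetime import datetime, timedelta

# Hypothetical deployment log covering two days.
deployments = [
    {"committed": datetime(2015, 6, 1, 9, 0), "deployed": datetime(2015, 6, 1, 11, 0),
     "failed": False, "restore": None},
    {"committed": datetime(2015, 6, 1, 10, 0), "deployed": datetime(2015, 6, 1, 14, 0),
     "failed": True, "restore": timedelta(minutes=30)},
    {"committed": datetime(2015, 6, 2, 9, 0), "deployed": datetime(2015, 6, 2, 10, 0),
     "failed": False, "restore": None},
    {"committed": datetime(2015, 6, 2, 13, 0), "deployed": datetime(2015, 6, 2, 14, 0),
     "failed": False, "restore": None},
]


def deployment_frequency(records, days):
    """Deployments per day over the observed period."""
    return len(records) / days


def mean_lead_time(records):
    """Average time from commit to production deployment."""
    total = sum((r["deployed"] - r["committed"] for r in records), timedelta())
    return total / len(records)


def change_failure_rate(records):
    """Fraction of deployments that caused a failure."""
    return sum(1 for r in records if r["failed"]) / len(records)


def mean_time_to_restore(records):
    """Average time to restore service after a failed deployment."""
    restores = [r["restore"] for r in records if r["failed"]]
    if not restores:
        return timedelta()
    return sum(restores, timedelta()) / len(restores)


print(deployment_frequency(deployments, days=2))  # 2.0
print(mean_lead_time(deployments))                # 2:00:00
print(change_failure_rate(deployments))           # 0.25
print(mean_time_to_restore(deployments))          # 0:30:00
```

The first two numbers describe speed and the last two describe reliability, which is the pairing the survey compares between high and low performers.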
So, the survey says, if you will, that "You can go fast":
high-performing organizations deploy 30 times more frequently and have 200 times
shorter Lead Time,
and they have the reliability.
So, in other words, they're faster than low-performing organizations
and they are more reliable. We'll show you story after story; the
survey is statistically sound, but we'll also show you enterprises
that have spoken at conferences and proved this out.
And I told you that high-performing organizations have cultural
behavior patterns that are different from low-performing organizations.
So, part of the core backdrop of the DevOps survey
is something called the Ron Westrum typology of organizational culture.
And Ron Westrum really categorizes three types of organizations, but, for the
purpose of this discussion, we'll focus on two.
The Generative ones, on the right-hand side, are, in the survey,
the high-performing organizations,
so these are the ones that, you know, have the faster speed and better reliability.
They tend to have high cooperation, messengers get trained,
risks are shared, bridging is encouraged, they have a healthy attitude
towards failure, they're inquisitive, and novelty is implemented.
Low-performing organizations in this survey, the ones that, you know, are slower
and less reliable, are
what Ron Westrum calls Pathological.
They have low cooperation, messengers are shot, responsibilities are shirked,
bridging is discouraged, they have negative attitudes towards failure,
they try to avoid failure at all costs, there's scapegoating, and novelty is crushed.
