Linux, Apache, MySQL, PHP Performance End to End
Ebook, 605 pages, 6 hours


About this ebook

LAMP Performance End To End is a guide to delivering great page speed while reducing server load and increasing capacity. The book covers the entire journey of data from your server's disk to the mind of the end-user, explaining the critical bottlenecks along the way and providing practical solutions to performance problems.
Discover
- how SaaS/backoffice systems need a different performance strategy from public facing websites
- what the (currently undocumented) Zend Opcode Optimizer flags actually do
- how to automate content optimization
- how to tune your TCP stack for mobile clients
- which MySQL architecture is right for you
and more.

112,000 words
Links to 240 web published articles and videos
368 pages (PDF version)

Language: English
Release date: Jan 10, 2015
ISBN: 9781311044747
Author

Colin McKinnon

Colin has been playing with computers since he was a teenager. Since leaving university with a Master's degree in Information Technology Systems (shortly after computers were invented) people have been paying him to do this. Having previously worked in software development and systems administration, he now focuses on Service Management, particularly in areas of Security, Performance and Data Integrity.


    Book preview

    Linux, Apache, MySQL, PHP Performance End to End - Colin McKinnon

    LAMP (Linux, Apache, MySQL, PHP) Performance End to End

    ISBN: 9781311044747

    Colin McKinnon

    Distributed by Smashwords

    Content Copyright Colin McKinnon 2014

    Cover photograph (Pygoscelis papua -Nagasaki Penguin Aquarium -swimming underwater-8a) Copyright Ken Funakoshi 2008, (https://www.flickr.com/photos/trektrack/2388729621/in/photolist-4D5SrH-4D5S6t-4DoSNg-4D5SwR-4D5Sh2-2DpiA-2qX17) reproduced under the Creative Commons Attribution-Share Alike 2.0 Generic licence (http://creativecommons.org/licenses/by-sa/2.0/)

    Smashwords Edition, Licence Notes

    This ebook is licensed for your personal enjoyment only. This ebook may not be re-sold or given away to other people. If you would like to share this book with another person, please purchase an additional copy for each recipient. If you’re reading this book and did not purchase it, or it was not purchased for your use only, then please return to your favorite ebook retailer and purchase your own copy. Thank you for respecting the hard work of this author.

    Disclaimer

    While the author has endeavoured to ensure the accuracy of the information contained herein, no warranty is implied. The author shall have no responsibility nor liability with respect to loss or damages arising from the information contained in this book.


    Foreword

    In 2011, someone suggested that I put together a quick 'howto' on optimizing LAMP. Now, 3 years later, like many of the simple tasks I start, it has proved to demand more effort than I had initially thought!

    A lot of performance tuning is based on received wisdom and informed guesswork, but I would not be comfortable expecting people to pay for a list of unjustified recommendations, or for a rehash of documentation already available in a coherent form in the manuals. Hence behind this book lies a lot of research and testing – I could have written a much longer book, but edited out a lot of content that might simply be too ephemeral, or off topic. There is more coverage of networking and TCP/IP than I had originally intended; while most developers and systems administrators have limited scope for improving performance through tinkering with the network, it is a fundamental constraint on the performance of web browsing.

    I have included references to source material found on the internet and resources for further reading. There is a wealth of great material on the internet (and some very bad information too); however, most of the good stuff written about web performance comes from companies like Google and Yahoo who have characterized and challenged the limitations of the World Wide Web. But they have started this journey from a mature skill base in managing backend systems – standing on the shoulders of giants, if you like. There seems to be a paucity of information on optimizing servers. A further gap I have tried to address here is to look at the problems of delivering small-scale and/or business applications over HTTP, as opposed to massive public-facing information and retail sites. These have different objectives and subtly different performance problems.

    In researching and writing the book, I have gained new knowledge and several times found that my own preconceptions did not stand up to scrutiny. But despite my efforts, and the efforts of other people I have cited here, the best advice I can give to you is to experiment and measure – don't take my word for it. Knowledge only teaches you the right questions to ask.

    About the book

    While print layouts afford a degree of precision over layout, ebooks are a different story. And I have learnt that a word processor is a surprisingly poor tool for writing one in – particularly when it includes tables, charts and diagrams. To try to ensure portability I have not used many of the typographical tricks common in print publishing such as callouts and footnotes, and restricted the fonts to sans-serif for headings and serif for body text. Code snippets are presented in monospace font, on a grey background inside a box. Each line is initially indented by 2 spaces so that wrapping text can be distinguished from new lines:

      This is a short line.

      This is a very long line of text and should, at some point wrap. But wrapping is controlled by your device, and indeed may even change with its configuration or orientation.

    Citations are referenced by chapter number and serial number in square brackets e.g. [2.5] is the 5th reference document in chapter 2. The full citations appear at the end of each chapter.

    Figures are numbered similarly and each has a legend (in italics) describing its content. Tables are implemented as images.

    Many diagrams are presented as hand-drawn sketches. While I do have access to some excellent tools for producing high quality diagrams, this is intended to encourage you to think about sketching out your ideas on paper, both as a tool for gaining insight and as a means of recording and communicating your design decisions.

    Of course all documentation is out of date as soon as it is saved. For updates and more performance tips, please visit the LAMP E2E blog at http://lampe2e.blogspot.co.uk/

    Even though the book is reflowable, it will be difficult to read on a screen smaller than 7 inches. The quality of eBook readers varies greatly - for Android, the best free eBook readers I have encountered are Universal Book Reader by Mobile Systems and Gitden Reader by Gitden Inc. Both are available from the Google Play Store. See the LAMP E2E blog for more info.


    Chapter 1: A Performance Strategy

    I should start with a definition of what I mean by the term performance. There are two sides to performance: how quickly a user perceives a web page as loading, and how much infrastructure is required to deliver pages in an acceptable/optimal time to a growing number of clients. The latter is often referred to as capacity. Improving the former drives engagement and improves sales/productivity, while optimizing the latter should improve the former and reduce operational costs.

    Since you're already reading this book, you're concerned about service performance. But where does this fit into the bigger picture of service provision? How do you assess the importance of performance and convince others? And how do you achieve great website performance? These are some of the questions this book will attempt to answer.

    For a computer system to have value, it needs to be able to carry out some function; it needs to be accessible in order to carry out that function; it should not be used for functions it is not intended for; and it has to carry out that function quickly enough that it adds value.

    Figure 1.1: Utility (value) of a service depends on functionality, availability, security and performance

    There are very immediate costs if the system is not capable of performing its designated function. So it is perhaps not surprising that functionality is usually very well defined at an early stage of development, as are the processes for ensuring compliance of the system with the functional requirements. Typically compliance is established via testing (unit, integration, system through user-acceptance).

    Security is the flip-side of functionality – while there is a finite and well-defined set of things the system should do, there are a lot of things the system shouldn't do. Indeed, it's not practical to enumerate all the things a system shouldn't do, hence documented requirements for system security are often a lot less specific. Compliance is typically addressed via parallel routes to functional testing, however increasingly, service providers are looking at ways of integrating ongoing security monitoring within services; SIEM, intrusion detection systems and data honeypots being examples. While the costs incurred via a failure of security can be greater than the overall utility of a system, these are offset by the likelihood of those costs not being realised or at least delayed.

    The intrinsic functionality and security of a system are unlikely to change over time without specific intervention. Although these are unlikely to be proven to be complete at handover, they are well defined. Some sort of warranty on these is common.

    Availability and performance are similar in that they will change over time but unlike functionality/security are relatively easy to measure.

    Functionality and security are strongly dependent on, and aligned with, software, while availability and performance are often perceived to be more associated with hardware. This is only partially true.

    If you buy shrink-wrapped software then the contract will usually only address the functionality and security of the product. Unfortunately, when buying hardware, you will rarely get any measure of the performance in terms relevant to the user experience, and only with very expensive hardware are you likely to get an indication of its availability.

    If the value of a system is in its utility, then it is clearly in the customer's interest to pay for a service which addresses all four aspects. This should be set out in the Service Level Agreement (SLA): a detailed document which sets out not only what targets will be set for the four components, but how data will be collected and processed to measure those targets.

    Note, in ITIL terminology, SLA explicitly relates to a binding agreement between a service provider and an external service consumer – here I'm using the term to describe any formal statement regarding the expected quality of the service. This may include internal and/or informal agreements over the quality of the service.

    Having clearly defined boundaries is also in the supplier's interest, regardless of the size of the supplier's organization. Even if there is no requirement to provide a Service Level Agreement to the users, establishing one is a basis for assigning resources, dividing responsibilities, setting targets, managing priorities and establishing expectations. Internal IT departments are often perceived as cost centres – increasingly, organisations are looking beyond cash flow to measure effectiveness. Being able to demonstrate your success and quantify change can put IT in a strong position within an organization. But even if the entire business is operated by a single person, an SLA provides a tool for controlling your workflow.

    1.1 How fast is fast enough?

    Jakob Nielsen has published a great deal of research on web usability, and much of it is accessible on the internet. In 1993 [1.1] he set out 3 intervals which are the milestones in the users' perception of performance:

    0.1 second is about the limit for having the user feel that the system is reacting instantaneously, meaning that no special feedback is necessary except to display the result.

    1 second allows the user’s flow of thought to stay uninterrupted, even though the user will notice the delay. Normally, no special feedback is necessary during delays of more than 0.1 but less than 1.0 second, but the user does lose the feeling of operating directly on the data.

    up to 10 seconds is the limit for keeping the user’s attention focused on the dialogue. For longer delays, users will want to perform other tasks while waiting for the computer to finish, so they should be given feedback indicating when the computer expects to be done.

    1.2 Impact on customer facing sites

    In 2000 [1.2] Google carried out usability research on their search results which indicated that users wanted more results on a page – but when they implemented the recommendation, they found they were losing customers. Google identified that it was not the amount of content but the speed at which it was delivered to the customers that affected loyalty/satisfaction. Increasing page load times from 0.5 to 0.9 seconds resulted in a 20% drop in traffic.

    In a later study in 2009 [1.3], they found that adding 2% latency to Google Checkout resulted in a 2% drop in traffic; a similar rate to that reported by Amazon in 2006 [1.4], who found that revenues dropped by 1% for each 100ms added to response times.

    AOL report that the fastest 10% of pages get 3 times more views per visit than the slowest 10%. [1.5]

    In 2010, Akamai found that 57% of customers will abandon a site taking more than 3 seconds to load. [1.6]

    Page performance directly affects customer loyalty, satisfaction and conversion rates. Making your site faster gives you a competitive advantage. [1.7] If you provide a service to customers who can obtain a comparable service from other organisations, the answer to the question of how fast is simple: faster than your competitors.

    1.3 Impact on B2B, SaaS and web-delivered applications

    The web offers a great platform for delivering enterprise applications: low deployment costs, rich interaction and well-defined security. Most such applications are all about data entry and processing. However, simplistic time-and-motion studies give a rather distorted picture.

    In 2008 I ran a study of such an application for a large financial institution. This was a web application used by staff to process loan applications and payments. I saw an average delay between page requests of 23 seconds. Page loads were taking about 2.5 seconds – so about 11% of the users' time was spent navigating between pages. Employees felt that the system was often slow and difficult to work with. Hence the key metric was the percentage of time spent in navigation. Some relatively simple changes to the caching could reduce page load times to under 1 second.

    If the users were taking 21 seconds to fill in the forms and the response times dropped from 2.5 to 1 second, then navigation time should have fallen to approximately 4.5% and productivity should have increased by about 6%. But when the change was applied, the time spent in navigation only went down to 5.8%.

    What went wrong? It was a big improvement, certainly, but not what I had expected. But when I looked at the raw data, it showed that navigation was indeed taking 1 second. Was it my calculations which were wrong?

    In fact the average time spent on a page dropped to around 17 seconds. Users were spending less time waiting for pages to load and completing the data entry on each page faster, giving an overall 23% improvement in productivity!

    This application did have a Service Level Agreement for performance – 5 second average page load time. Although the original performance was well within this boundary, and the caching changes alone only down-shifted the curve by a second or two, there was a 78% drop in reports of performance issues by the users, resulting in reduced support costs.

    Published research on response times and productivity for such web-based applications seems to be very rare in comparison to the data available for public-facing websites; however, there is a large body of earlier research on computer interaction available. These portray a very similar picture to that of customer-facing websites – and both exhibit the same milestones that are described by psychological and neurological studies of short-term and working memory.

    In 1984, in a review of the literature of the time, Ben Shneiderman [1.8] reported a significant impact on productivity and accuracy as a direct result of varying response times. In particular he cites work from 1978 where users were asked to perform a simple manipulation of an image (make dots fit inside a shape) with delays of 0.16, 0.72 and 1.49 seconds in providing feedback. There was no measurable difference between user performance at 0.16 and 0.72 seconds, but at 1.49 seconds it took 50% longer to complete the task. He cites many further papers, all showing that the user needs to spend more time thinking about a task as the delay time increases. In a 1982 study, error rates were 18% higher with a six-second response time compared to a two-second response time.

    The evidence clearly supports Jakob Nielsen's 1 second target time.

    In 2010, Foviance / Glasgow Caledonian University measured the level of users' stress when interacting with websites using electroencephalography (EEG). Decreasing the bandwidth of the internet connection from 5 Mbit/s to 2 Mbit/s increased the users' stress levels by 50%. [1.9]

    Where you effectively have a captive market, be it the public sector, an enterprise application or you simply don't have any competition, customer satisfaction is still an important goal; it drives engagement, improves productivity, improves accuracy and reduces complaints.

    1.4 Losing battle? Arms race?

    The trend from the various published studies on customer facing sites shows that the time users consider acceptable for page loading times is decreasing towards around 0.5 seconds. At the same time, page sizes are growing.

    Figure 1.2: Average page size in kB vs time (from http://www.httparchive.org/interesting.php)

    While the speed of domestic broadband is also increasing, it's not doing so at the same rate (and taking into account the growth of mobile internet, the trend flatlines). But while it's always possible to add more bandwidth, as we will see later, latency and packet loss have much more impact on web throughput and responsiveness. These are not so easy to change.

    1.5 Make a plan

    The first step in solving any problem is to make a plan and set targets. There are lots of methodologies for strategic and tactical planning; however, in the author's experience the SMART mnemonic provides a generally applicable checklist for verifying the quality of any plan.

    The outputs of the performance plan should be stated within the Service Level Agreement (SLA), but importantly the SLA should also formally define how these outputs will be measured and assessed, and how non-compliance will be addressed. An SLA can be written in plain English, but there are suggestions for computer-readable formats (e.g. http://www.ibm.com/wsla). Examples of the former are available on the internet, but you may need to dig deep through the search results to find good ones addressing performance.

    The SMART letters (usually) stand for:

        Specific

        Measurable

        Attainable

        Relevant

        Timely

    1.5.1 Specific

    It's not enough to say the system should respond promptly. This should be quantified. Often it is unavoidable that some parts of a system will go faster than others – do you apply an average across all transactions? Across all transaction types? Across a defined subset of interactions? Do you subdivide the system into high-level components? If so, what are they, and what operations take place in which components? What parts of the system lie outwith the SLA (hint: for a service delivered over the internet, the internet itself) and how is the behaviour of these external components accounted for?

    Any assumptions or exceptions in an SLA should be clearly stated, for example:

    Performance metrics will only be collected / reported for the time periods where the availability of the system is guaranteed in section 3 paragraph 2

    1.5.2 Measurable

    You are going to need the machinery to capture and report on performance metrics. For the service provider this machinery needs to go much further than just producing the agreed metrics. Capturing more detailed information will allow for faster, more effective diagnosis of performance issues and capacity planning. On the other hand, a service consumer shouldn't have to rely on the provider for authoritative data.

    1.5.3 Attainable

    An SLA is a guarantee of the quality of service – if a provider cannot deliver then the consumer is compromised. Indeed most bilateral agreements will stipulate penalties to deal with cases where a supplier does not deliver on his/her promises, and where penalties are not an explicit part of the contract, they are provided by the wider legal framework. An SLA is not about the aspirations of the service provider, nor what the service provider would like the service to be; it must be realistic.

    1.5.4 Relevant

    This seems to be a rather elusive one for many IT departments. There is no end of tools available off the shelf which will tell you what your CPU usage is, how full your disks are (perhaps even when you will run out of space) and how much memory is used. Service consumers don't care about this. Service consumers should not care about this. You must be able to measure the speed of transactions processed by the system, preferably you should be measuring the responsiveness of the user-interface.

    1.5.5 Timely

    How long should you wait before you are aware of a problem? Do the company directors need to know immediately when a service breaches SLA thresholds? Should you be proactive in letting your customers know that the service is degraded (but that you're working to fix the problem)?

    And it's not just about the real-time notification and escalation. Is historical data available when you need it – to turn in your monthly report, or to back a case for expenditure on a new server?

    1.6 What should you measure?

    There are many web analytics packages available; these are starting to address performance measurement, but it's far from a mature market.

    In 2009, the Aberdeen Group reported that Best-in-Class organizations are six times more likely to measure application response times for every business-critical application as compared to Laggards. [1.10]

    1.6.1 Webserver response times

    The log formats pre-configured on Apache (referer, agent, vhost, common, combined and combinedio) don't log the time taken to service a request. Prior to 2.0, the %T log config directive was available to log the time on the server but only in integer seconds; ideally your responses should be handled in much shorter time frames. Since Apache 2.0, mod_log_config added support for microsecond logging. I would recommend adding '%D' to your access_log format string as a starting point.
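
    For example, a minimal sketch of such a configuration (the format nickname and log path here are illustrative; %D and the 'combined' field list are standard mod_log_config features):

      # Standard 'combined' fields plus the request service time in
      # microseconds (%D), available since Apache 2.0.
      LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_time
      CustomLog /var/log/apache2/access_log combined_time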

    It is important to understand that the time logged is the interval between the request headers arriving at the server and the body of the response being written out – i.e. it's the request response time. There is some additional network overhead at both ends of this interval. We'll look at this in more detail when discussing networks.

    However even taking this into consideration, this interval is still much less than the time it takes for the page to load and render at the client. And that is the user's experience of performance. If you want to measure the performance as perceived by the customer then you need to look at page loading times.

    It's impractical to measure page loading times from the server. But by using javascript you can collect data on the client and send it back to a server for analysis.

    Addressing performance at the client not only addresses the customer experience directly but also offers greater scope for improving functionality – allowing you to prioritize performance improvements based on the effort required to achieve them as well as the performance gain.

    On the other hand, measuring the webserver request times behind a caching reverse proxy will show how well your PHP code, webserver and database are performing. For larger organisations, responsibility for performance may be split between HTML designers, PHP developers, Javascript developers, server and database administrators. This architecture separates at least the network and browser part from backend system delays, even if it is not completely aligned with the organisational structure of the service provision team.

    1.6.2 Real or artificial?

    While it may seem a simple option to test your site using your own scripts (it's relatively simple to script complex interaction, e.g. using HTTP::Recorder and WWW::Mechanize, or JMeter), this can give a rather distorted picture. You need to test the full range of functionality but without irreversibly carrying out operations like processing payments against real credit cards, adjusting stock levels and sending physical goods out the door. But from a performance point of view, it is difficult to recreate the impact of the physical network on performance, to measure the behaviour of Javascript components and to avoid repeatedly hitting the same data items (which will therefore end up being cached at the various tiers). Ideally you want to measure and report on the response times for all real interactions. Note the use of the word 'all' here – maintaining a truly representative sample is difficult and often requires more effort than analysing all the data. The overhead of processing this in near to real time is rarely more than a fraction of the effort involved in serving up the content – both in computing and human resources.

    Prior to go-live, you can only use synthetic testing to ensure performance. But for a system which is in production, Real User Measurement (RUM) provides better data and shows you the problems which are really affecting your organisation.

    Synthetic testing does however eliminate a lot of other variables which can be hard to capture and model. While there are lots of open source web load generators available most are relatively crude in terms of the complexity of interaction they can recreate. The 'ab' tool included with the Apache webserver is only really suitable for benchmarking primarily static content. For an OLTP site, benchmarking requires complex scripting to get realistic results. In addition, your performance capacity testing framework needs to be able to both capture and present relevant metrics – again this is a common weak point in load generators. ShowSlow [1.11] is a notable exception. This uses publicly available testing services to create the requests, collects the data and provides an effective front end for interacting with the data.

    Your performance/capacity testing should not ignore the function of the website. One company I worked for had spent a significant amount of money and effort building a dedicated capacity testing environment for their OLTP system. Despite using the best commercial testing software available, they started seeing problems on their production systems which were not occurring in test. It didn't take long to find out the reason; due to a configuration issue in the test environment, every transaction in every test run over a 6 month period had failed with an error. But because the environment was dedicated to capacity testing and the test scripts checked only the throughput and not the outcome, nobody knew!

    1.6.3 Real User Measurement

    This approach, sometimes also described as 'last-mile' performance, is based on the idea of collecting information about the delay between the initiation of a navigation event and the content being presented to the user. It relies on a Javascript agent in the browser. As the most valuable measure of performance and its relationship to engagement, we will look at this in a lot more depth later in the book.

    1.6.4 Lies, Damned Lies and Statistics

    While I don't expect you to read every entry in your webserver logs in realtime, there are pitfalls in how you aggregate the data into something more meaningful. An average is a poor indicator of performance metrics.

    Figure 1.3: Two populations with the same average

    Figure 1.3 shows two response time populations with a near normal distribution, both with the same average – but while one population has no members over the threshold, the other has around 20% taking too long. Using the maximum value gives an even more distorted picture of the performance of a system.

    While you could quote an average and a standard deviation (or at least a standard error), it seems that using a single figure reduces confusion – a common choice is the mean plus twice the standard deviation, which for a near-normal distribution covers well over 95% of requests.

    Another important consideration is the period over which you aggregate data to create a single data point. When I first started getting involved in web performance analysis, my employers at the time already had processes in place for gathering and analysing data – stats were aggregated by hour (at approximately 50,000 hits/hour). This provided visibility of some performance issues, but it was only when I started looking at data aggregated over a minute or less that the majority of the problems users were complaining about became apparent; database locking, firewalls dropping connections triggering network timeouts, DNS issues... there are lots of causes of transient performance issues.
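
    To illustrate, here is a rough sketch in PHP – assuming the %D field recommended in 1.6.1 is the last space-separated field of each access_log line, and with an illustrative log path – which buckets requests per minute and reports a 95th percentile response time for each bucket:

      <?php
      // Sketch: aggregate an access log whose final field is %D (microseconds)
      // into a 95th percentile response time per minute.
      $buckets = [];
      foreach (file('/var/log/apache2/access_log') as $line) {
          // Apache timestamps look like [10/Jan/2015:13:55:36 +0000]
          if (!preg_match('/\[([^ ]+) [^\]]+\]/', $line, $m)) {
              continue;                                // skip malformed lines
          }
          $minute = substr($m[1], 0, 17);              // truncate to the minute
          $fields = explode(' ', trim($line));
          $buckets[$minute][] = (int) end($fields);    // the %D value
      }
      foreach ($buckets as $minute => $times) {
          sort($times);
          $p95 = $times[(int) floor(0.95 * (count($times) - 1))];
          printf("%s  p95 = %.1f ms (%d requests)\n", $minute, $p95 / 1000, count($times));
      }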

    That's not to say that you shouldn't look at stats averaged over longer periods – this is where you will see capacity problems, issues slipping through from development, changes in usage trends...

    1.6.5 Measuring change

    Change happens – new functionality is added to your site, hardware needs to be replaced, data accumulates, and (hopefully) performance is tuned. Tracking the impact on performance needs to be an ongoing process.

    Comparing the same metrics before and after is a good way to get an indicator of the impact on performance, but a better approach (which reduces the contribution of external factors such as marketing campaigns, large client-side outages etc.) is to compare both setups at the same time by splitting your clients into two or more groups – and not by a dependent variable such as location! i.e. A/B testing.
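
    As a rough sketch of such a split (in PHP; the cookie name and lifetime are arbitrary choices for illustration), each client can be assigned to a group by hashing a persistent random identifier rather than anything tied to location or device:

      <?php
      // Sketch: stable A/B group assignment based on a random per-client cookie.
      function ab_group(): string {
          if (isset($_COOKIE['ab_client'])) {
              $id = $_COOKIE['ab_client'];
          } else {
              $id = bin2hex(random_bytes(8));          // random, not location-based
              setcookie('ab_client', $id, time() + 86400 * 30, '/');
          }
          // Hash the identifier so the split is repeatable for each client
          // and independent of geography, device or time of day.
          return (hexdec(substr(md5($id), 0, 8)) % 2) === 0 ? 'A' : 'B';
      }
      // Tag each measurement you collect with the group label, e.g.:
      // error_log('group=' . ab_group() . ' url=' . $_SERVER['REQUEST_URI']);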

    A good system/network administrator should be able to take an informed guess about what is a sensible range for a specific configuration setting. However the optimal setting can usually only be found by experimentation. With the complexity of interacting components on a website, small changes can have a large impact – and local minima are a common feature. Tuning is not easy.

    1.7 How do I make my site go faster?

    Identify the limiting factors and change them one at a time. Of course this makes it sound much simpler than the reality of finding and solving a performance problem, but it does illustrate again that consistent measurement is key to achieving good performance.

    Some things require more effort to change, and different components of the system will have different scope for improving performance. Increasing the capacity of your DNS server will have little impact on a public-facing internet webserver – few clients will ever fetch DNS data directly from your systems. On the other hand, how your content is structured and the caching instructions it was served up with will affect performance at the client even when the client doesn't fetch the item directly from the origin server.

    Figure 1.4: Benefits and effort

    The diagram shown in figure 1.4 compares the effort involved in tuning different parts of a system with the typical impact on performance. This is based purely on long experience with fixing performance problems rather than collated metrics from different (and possibly not really comparable) sources.

    To use the image above, draw a vertical line to reflect your current page load times, then starting with the topmost span, investigate how that aspect of your system is optimized. If this is over 10 seconds, then you should probably start by looking at the queries running on your underlying database. If you are currently loading your whole page in 0.5 seconds then well done – you're outperforming most of the Alexa top 100 – but if you want to go faster still, then begin by looking at your webserver config.

    The sequence of chapters in the book is not aligned with either axis of the diagram. I have tried to order it in a way that tells a progressive story and builds on topics already covered.

    1.8 High level hardware architecture

    If the mean time before failure (MTBF) for a server is 500 days, and the time it takes to fix it when it breaks is 10 hours, then a good estimate of the probability that it will be in a failed state at any time is 10/(500*24) = 0.000833. Or to put it another way, the availability is 99.92%. That's not bad for availability.

    There are things which can be done during the design and manufacture to increase the MTBF and decrease the Mean Time To Recovery (MTTR) but they do add cost – redundant power supplies, using high quality components (e.g. polymer based rather than electrolytic capacitors). However when you start trying to manufacture PCs based on high quality components, with higher quality assurance testing, what seems like small spec changes result in rapid cost escalations as your product prices itself outside of the mainstream and you lose economies of scale. i.e. consumers end up paying a lot more for a product which is only slightly better.

    1.8.1 Availability

    Going back to our 3-9's computer: a fairly obvious way to increase capacity is to use 2 computers instead of just one. Suppose we run a webserver with some sort of content management system on one, and a database server on the other. Now if either box fails, then the service is lost. The probability of failure of either node is 0.000833 + 0.000833 = 0.001667; the availability has dropped to 99.8%.

    Now suppose that instead of the serial arrangement we set them up in parallel, both running a webserver and a database. For the service to be unavailable, both nodes must fail. The probability of this occurring is 0.000833 x 0.000833 = 0.0000007, or 99.99993% availability. By simply arranging the components differently, the service is more than 2000 times more reliable!
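
    The arithmetic above is easy to check in a few lines of PHP (a sketch, using the same 500-day MTBF and 10-hour repair time):

      <?php
      // Probability that a single node is down at any given moment.
      $down = 10 / (500 * 24);                              // ~0.000833
      printf("single node:           %.3f%%\n", 100 * (1 - $down));
      // Serial: the service fails if either node fails.
      printf("serial (web + db box): %.3f%%\n", 100 * (1 - 2 * $down));
      // Parallel: the service fails only if both nodes fail.
      printf("parallel (two stacks): %.5f%%\n", 100 * (1 - $down * $down));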

    The other approach to high availability is to switch in a backup component when the live one fails. There are some problems with this:

    you don't get the benefit of increased capacity/performance compared with distributing the workload across multiple devices

    you can't be sure the mechanism is going to work until a failover occurs (this may sound a trivial argument, however in the author's experience it is alarming how infrequently such configurations are properly tested and how frequently failover fails).

    you need additional components to detect the failure and implement the switching.

    1.8.2 Performance

    From a performance point of view, stacking machines to create services instead of running them in parallel introduces a further hop on the network; even between a reverse proxy and a webserver running on the same machine there will be some delay. However this is usually relatively small compared to the delays in the network between the server stack and the client.

    Much of what applies to the economies of building more reliable computers also applies to the problem of building faster
