
Enterprise Systems Architecture (ESA)

Definition - What does Enterprise Systems Architecture (ESA) mean?


Enterprise system architecture (ESA) is the overall IT system architecture of an
organization.
This architecture is the key part of managing and evolving IT systems, and therefore the
business operations, of an organization.
It consists of the architectures of individual systems and their relationships in the
perspective of an organization.
An organization's enterprise system architecture must not be a monolithic illustration of the
structure of its IT systems. Instead, it must be organized to mirror the dynamic and static
structure of an organization in order to assist in every aspect of the organization's business tasks.
Enterprise system architecture corresponds to organizational entities at different levels of granularity, such as individual information systems, enterprise units and the enterprise as a whole.
The advantages of adopting an efficient enterprise system architecture include:
Architecture Analysis: Assists in performing system analysis at the architectural level.
This helps to support the system design process.
Business/System Understanding: Offers a concrete foundation for effectively
understanding the business operations of an enterprise, which results in improved
business management.
Business/System Planning: Offers a useful tool to plan numerous business activities,
from strategic directions to local enhancement.
Restructuring and System Integration: Helps to make restructuring and system
integration possible whenever a change in business operations happens in the
organization, for example, at the time of mergers and diversification.
System Evolution: Offers required grounds for evaluating the outcome of major
transformations in an organization, such as by replacing old systems with new
systems, adding brand-new systems and decommissioning of outdated systems.

Enterprise architecture
Enterprise Architecture (EA) regards the enterprise as a large and complex system, which
ought to be well planned and well described from an abstract level to a detailed level.
The term "enterprise architecture" has been and still is used with various meanings.
For a history of the term, showing how modern enterprise architecture (EA) frameworks emerged from information system architecture frameworks, see the history section of Enterprise Architecture framework.
The Enterprise Architecture Research Forum defines EA as the continuous practice of
describing the essential elements of a socio-technical organization, their relationships to each
other and to the environment, in order to understand complexity and manage change.
The MIT Center for Information Systems Research (MIT CISR) defines enterprise architecture in terms of the specific aspects of a business that are under examination:
Enterprise architecture is the organizing logic for business processes and IT
infrastructure reflecting the integration and standardization requirements of the
company's operating model. The operating model is the desired state of business
process integration and business process standardization for delivering goods and
services to customers.
Scope
The term enterprise covers all kinds of business organization, public or private, large or
small, including
Public or private sector organizations
An entire business or corporation
A part of a larger enterprise (such as a business unit)
A conglomerate of several organizations, such as a joint venture or partnership
A multiply outsourced business operation
Many collaborating public and/or private organizations in multiple countries
The term enterprise includes the whole complex, socio-technical system including people,
information, processes and technologies.
The term architecture refers to a high-level or abstract description of the enterprise as a
system - its boundary, the products and services it provides, and its internal structures and
behaviors, both human and technical.
It is assumed that designers, developers or engineers will complete the most detailed and
concrete descriptions of specific enterprise systems and the architect will retain responsibility
for governing that lower level work.
Developing and using an Enterprise Architecture Description
An enterprise architecture description contains a variety of lists, tables and diagrams known
as artifacts.
These artifacts describe the logical business functions or capabilities, business processes,
human roles and actors, the physical organization structure, data flows and data stores,
business applications and platform applications, hardware and communications infrastructure.
The architecture of an enterprise is described with a view to improving the manageability,
effectiveness, efficiency or agility of the business, and ensuring that money spent on
information technology (IT) is justified.
Paramount to changing the enterprise architecture is the identification of a sponsor, his/her
mission, vision and strategy and the governance framework to define all roles, responsibilities
and relationships involved in the anticipated transformation.
Changes considered by enterprise architects typically include:
innovations in the structure or processes of an organization
innovations in the use of information systems or technologies
the integration and/or standardization of business processes,
improving the quality and timeliness of business information.
A methodology for developing and using an enterprise architecture to guide the
transformation of a business from a baseline state to a target state, sometimes through several
transition states, is usually known as an enterprise architecture framework.
An Enterprise Architecture framework provides a structured collection of processes,
techniques, artifact descriptions, reference models and guidance for the production and use of
an enterprise-specific architecture description.
Benefits of enterprise architecture
As new technologies arise and are implemented, the benefits of enterprise architecture
continue to grow.
Enterprise architecture defines what an organization does; who performs individual
functions within the organization, and within the market value chain; how the organizational
functions are performed; and how information is used and stored.
IT costs are reduced and the responsiveness of IT systems is improved. However, to be successful, continual development and periodic maintenance of the enterprise architecture are essential.
Building an enterprise architecture can take considerable time, and proper planning is essential, including phasing the project in slowly prior to implementation.
If the enterprise architecture is not kept up to date, the aforementioned benefits will be lost.
Examples of enterprise architecture use
Documenting the architecture of enterprises is done within the U.S. Federal Government in the context of the Capital Planning and Investment Control (CPIC) process.
The Federal Enterprise Architecture (FEA) reference models guide federal agencies in the development of their architectures.

Companies such as Independence Blue Cross, Intel, Volkswagen AG and InterContinental Hotels Group use enterprise architecture to improve their business architectures as well as to
improve business performance and productivity.
For various understandable reasons, commercial organizations rarely publish substantial
enterprise architecture descriptions. However, government agencies have begun to publish
architectural descriptions they have developed. Examples include:
the US Department of the Interior
the US Department of Defense Business Enterprise Architecture (for example, the 2008 BEAv5.0 version)
the Treasury Enterprise Architecture Framework
Relationship to other disciplines
Enterprise architecture is a key component of the information technology governance
process in many organizations, which have implemented a formal enterprise architecture
process as part of their IT management strategy.
While this may imply that enterprise architecture is closely tied to IT, it should be viewed in
the broader context of business optimization in that it addresses business architecture,
performance management and process architecture as well as more technical subjects.
Depending on the organization, enterprise architecture teams may also be responsible for
some aspects of performance engineering, IT portfolio management and metadata
management.
Recently, proponents such as Gartner and Forrester have stressed the important relationship of Enterprise Architecture with emerging holistic design practices such as Design Thinking and User Experience Design.
The analyst firm Real Story Group suggested that Enterprise Architecture and the emerging concept of the Digital workplace were "two sides to the same coin."
The 2006 FEA Practice Guidance of the US OMB illustrates the relationship between enterprise architecture and segment (BPR) or solution architectures.

Reliability
Reliability refers to the consistency of a measure. A test is considered reliable if we get the
same result repeatedly.
For example, if a test is designed to measure a trait (such as introversion), then each time the
test is administered to a subject, the results should be approximately the same.
Unfortunately, it is impossible to calculate reliability exactly, but it can be estimated in a
number of different ways.
1. Test-Retest Reliability
To gauge test-retest reliability, the test is administered twice at two different points in time.
This kind of reliability is used to assess the consistency of a test across time.
This type of reliability assumes that there will be no change in the quality or construct being
measured.
Test-retest reliability is best used for things that are stable over time, such as intelligence.
Generally, reliability will be higher when little time has passed between tests.
2. Inter-rater Reliability
This type of reliability is assessed by having two or more independent judges score the test.
The scores are then compared to determine the consistency of the raters' estimates.
One way to test inter-rater reliability is to have each rater assign each test item a score. For
example, each rater might score items on a scale from 1 to 10.
Next, you would calculate the correlation between the two ratings to determine the level of
inter-rater reliability.
Another means of testing inter-rater reliability is to have raters determine which category
each observation falls into and then calculate the percentage of agreement between the raters.
So, if the raters agree 8 out of 10 times, the test has an 80% inter-rater reliability rate.
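A minimal Python sketch of both approaches, using made-up scores from two hypothetical raters, might look like this:

```python
# Minimal sketch: two hypothetical raters scoring the same 10 test items.
# Both the percentage-agreement and the correlation approaches described
# above are shown; the ratings are invented for illustration.

rater_a = [7, 5, 9, 6, 8, 4, 7, 6, 9, 5]
rater_b = [7, 6, 9, 6, 7, 4, 7, 6, 8, 5]

# Percentage agreement: how often the two raters assign the identical score.
agreements = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
percent_agreement = 100 * agreements / len(rater_a)

def pearson(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x ** 0.5 * var_y ** 0.5)

print(f"Agreement: {percent_agreement:.0f}%")            # 70% for these ratings
print(f"Correlation: {pearson(rater_a, rater_b):.2f}")   # close to 1 => consistent raters
```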
3. Parallel-Forms Reliability
Parallel-forms reliability is gauged by comparing two different tests that were created using the same content.
This is accomplished by creating a large pool of test items that measure the same quality
and then randomly dividing the items into two separate tests.
The two tests should then be administered to the same subjects at the same time.
4. Internal Consistency Reliability
This form of reliability is used to judge the consistency of results across items on the same
test.
Essentially, you are comparing test items that measure the same construct to determine the test's internal consistency.
When you see a question that seems very similar to another test question, it may indicate
that the two questions are being used to gauge reliability.
Because the two questions are similar and designed to measure the same thing, the test taker
should answer both questions the same, which would indicate that the test has internal
consistency.
In general terms (the systemic definition), reliability is the ability of a person or system to perform and maintain its functions in routine circumstances, as well as in hostile or unexpected circumstances.
Reliability may refer to:
Reliability (engineering) and also a branch of statistics, the ability of a system or
component to perform its required functions under stated conditions for a specified
period of time.
Reliability (psychometrics), of a set of data and experiments
High reliability is informally reported in "nines"
Reliabilism in philosophy and epistemology
Data reliability, a property of some disk arrays in computer storage
Reliability theory, as a theoretical concept, to explain biological aging and species
longevity
Reliability (computer networking), a category used to describe protocols
Reliability (semiconductor), outline of semiconductor device reliability drivers

Availability
In reliability theory and reliability engineering, the term availability has the following
meanings:
The degree to which a system, subsystem or equipment is in a specified operable and
committable state at the start of a mission, when the mission is called for at an
unknown, i.e. a random, time. Simply put, availability is the proportion of time a
system is in a functioning condition. This is often described as a mission capable
rate. Mathematically, this is expressed as 1 minus unavailability.
The ratio of (a) the total time a functional unit is capable of being used during a given
interval to (b) the length of the interval.
Introduction
Availability of a system is typically measured as a factor of its reliability - as reliability
increases, so does availability.
Availability of a system may also be increased by a strategy that focuses on increasing testability and maintainability rather than reliability. Improving maintainability is generally easier than improving reliability.
Maintainability estimates (repair rates) are also generally more accurate.
However, because the uncertainties in the reliability estimates are in most cases very large, they are likely to dominate the uncertainty in the availability prediction, even when maintainability levels are very high.
When reliability is not under control, more complicated issues may arise, such as shortages of manpower (maintainers / customer service capability), spare-part availability, logistic delays, lack of repair facilities, extensive retrofit work and complex configuration management costs.
The problem of unreliability may also be increased by the "domino effect" of maintenance-induced failures after repairs.
Focusing only on maintainability is therefore not enough. If failures are prevented, none of the other issues matter, which is why reliability is generally regarded as the most important part of availability.
Reliability needs to be evaluated and improved related to both availability and the cost of
ownership (due to cost of spare parts, maintenance man-hours, transport costs, storage cost,
part obsolete risks etc.).
Often a trade-off is needed between the two. There might be a maximum ratio between
availability and cost of ownership.
Testability of a system should also be addressed in the availability plan as this is the link
between reliability and maintainability.
The maintenance strategy can influence the reliability of a system (e.g. by preventive and/or
predictive maintenance), although it can never bring it above the inherent reliability.
Maintainability and maintenance strategies therefore influence the availability of a system. In theory availability could approach 100% if any fault could always be repaired in an infinitely short time, but in practice this is impossible.
Repairability is always limited by testability, manpower and logistic considerations.
An availability plan should clearly provide a strategy for availability control. Whether availability alone or also cost of ownership is more important depends on the use of the system.
For example, a system that is a critical link in a production system (e.g. a big oil platform) is normally allowed to have a very high cost of ownership if this translates to even a minor increase in availability, as the unavailability of the platform results in a massive loss of revenue which can easily exceed the high cost of ownership.
A proper reliability plan should always address RAMT analysis in its total context. RAMT stands in this case for Reliability, Availability, Maintainability/Maintenance and Testability in the context of the customer's needs.

Representation
The simplest representation of availability is the ratio of the expected value of the uptime of a system to the sum of the expected values of up and down time, or

A = \frac{E[\text{uptime}]}{E[\text{uptime}] + E[\text{downtime}]}

If we define the status function

X(t) = \begin{cases} 1, & \text{if the system functions at time } t \\ 0, & \text{otherwise} \end{cases}

then the availability A(t) at time t > 0 is represented by

A(t) = \Pr[X(t) = 1] = E[X(t)]

Average availability must be defined on an interval of the real line. If we consider an arbitrary constant c > 0, then average availability is represented as

A_c = \frac{1}{c} \int_0^c A(t)\,dt

Limiting (or steady-state) availability is represented by

A = \lim_{t \to \infty} A(t)

Limiting average availability is also defined on an interval [0, c] as

A_\infty = \lim_{c \to \infty} \frac{1}{c} \int_0^c A(t)\,dt
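As a small illustration of the average-availability formula, the following Python sketch estimates it from a hypothetical operating log (all figures are invented):

```python
# Minimal sketch: estimating average availability from a hypothetical
# operating log. Each entry is (state, duration_in_hours); "up" corresponds
# to the status function X(t) = 1 and "down" to X(t) = 0.

log = [("up", 120.0), ("down", 1.5), ("up", 200.0), ("down", 0.5), ("up", 80.0)]

uptime = sum(d for state, d in log if state == "up")
total = sum(d for _, d in log)

# Corresponds to (1/c) * integral of X(t) over [0, c] with c = total hours.
average_availability = uptime / total
print(f"Average availability over {total:.0f} h: {average_availability:.4f}")  # ~0.9950
```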
Availability
The probability that an item will be in an operable and committable state at the start of a mission when the mission is called for at a random time. Availability is generally defined as uptime divided by total time (uptime plus downtime).
Definitions within Systems Engineering
Availability, Inherent (Ai)
The probability that an item will operate satisfactorily at a given point in time when used
under stated conditions in an ideal support environment.
It excludes logistics time, waiting or administrative downtime, and preventive maintenance
downtime. It includes corrective maintenance downtime.
Inherent availability is generally derived from analysis of an engineering design and is
calculated as the mean time to failure (MTTF) divided by the mean time to failure plus the
mean time to repair (MTTR).
It is based on quantities under control of the designer.
Availability, Achieved (Aa)
The probability that an item will operate satisfactorily at a given point in time when used
under stated conditions in an ideal support environment (i.e., that personnel, tools, spares, etc.
are instantaneously available).
It excludes logistics time and waiting or administrative downtime. It includes active
preventive and corrective maintenance downtime.
Availability, Operational (Ao)
The probability that an item will operate satisfactorily at a given point in time when used in
an actual or realistic operating and support environment.
It includes logistics time, ready time, and waiting or administrative downtime, and both
preventive and corrective maintenance downtime.
This value is equal to the mean time between failure (MTBF) divided by the mean time
between failure plus the mean downtime (MDT).
This measure extends the definition of availability to elements controlled by the logisticians
and mission planners such as quantity and proximity of spares, tools and manpower to the
hardware item.
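As a rough numerical illustration of the two definitions (all figures below are assumed, not taken from any source), inherent and operational availability can be computed directly from the formulas above:

```python
# Minimal sketch with made-up figures: inherent availability uses MTTF and
# MTTR only (quantities under the designer's control), while operational
# availability uses MTBF and the mean downtime (MDT), which also includes
# logistics and administrative delays.

mttf = 1000.0   # mean time to failure, hours (assumed)
mttr = 2.0      # mean time to repair, hours (assumed)
mtbf = 1000.0   # mean time between failures, hours (assumed)
mdt = 10.0      # mean downtime including logistics/admin delays, hours (assumed)

ai = mttf / (mttf + mttr)   # inherent availability, ideal support environment
ao = mtbf / (mtbf + mdt)    # operational availability, realistic environment

print(f"Ai = {ai:.4f}")     # ~0.9980
print(f"Ao = {ao:.4f}")     # ~0.9901, lower because MDT > MTTR
```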
Literature
Availability is well established in the literature of stochastic modeling and optimal
maintenance.
Barlow and Proschan [1975] define availability of a repairable system as "the probability
that the system is operating at a specified time t." Blanchard [1998] gives a qualitative
definition of availability as "a measure of the degree of a system which is in the operable and
committable state at the start of mission when the mission is called for at an unknown random
point in time."
This definition comes from the MIL-STD-721. Lie, Hwang, and Tillman [1977] developed
a complete survey along with a systematic classification of availability.
Availability measures are classified by either the time interval of interest or the mechanisms
for the system downtime.
If the time interval of interest is the primary concern, we consider instantaneous, limiting,
average, and limiting average availability.

Replication
Replication in computing involves sharing information so as to ensure consistency between redundant resources, such as software or hardware components, to improve reliability, fault-tolerance, or accessibility.
Terminology
One speaks of:
data replication if the same data is stored on multiple storage devices,
computation replication if the same computing task is executed many times.
A computational task is typically replicated in space, i.e. executed on separate devices, or it
could be replicated in time, if it is executed repeatedly on a single device.
The access to a replicated entity is typically uniform with access to a single, non-replicated
entity.
The replication itself should be transparent to an external user. Also, in a failure scenario, a
failover of replicas is hidden as much as possible. The latter refers to data replication with
respect to Quality of Service (QoS) aspects.
Computer scientists talk about active and passive replication in systems that replicate data
or services:
active replication is performed by processing the same request at every replica.
passive replication involves processing each single request on a single replica and then transferring its resultant state to the other replicas (both styles are sketched below).
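A minimal Python sketch of the two styles on a toy key-value replica (the class and function names are invented for illustration):

```python
# Minimal sketch: active vs. passive replication on a toy key-value store.
# In active replication every replica executes the request; in passive
# replication only the primary executes it and then ships the resulting
# state to the backups.

class Replica:
    def __init__(self):
        self.state = {}

    def apply(self, key, value):              # a deterministic operation
        self.state[key] = value

def active_replication(replicas, key, value):
    for r in replicas:                        # same request processed everywhere
        r.apply(key, value)

def passive_replication(primary, backups, key, value):
    primary.apply(key, value)                 # only the primary processes it
    for b in backups:                         # then the resultant state is transferred
        b.state = dict(primary.state)

replicas = [Replica() for _ in range(3)]
active_replication(replicas, "x", 1)

primary, *backups = [Replica() for _ in range(3)]
passive_replication(primary, backups, "y", 2)
print(replicas[2].state, backups[1].state)    # {'x': 1} {'y': 2}
```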
If at any time one master replica is designated to process all the requests, then we are talking about the primary-backup scheme (master-slave scheme), predominant in high-availability clusters.
On the other hand, if any replica can process a request and then distribute a new state, then this is a multi-primary scheme (called multi-master in the database field).
In the multi-primary scheme, some form of distributed concurrency control must be used, such as a distributed lock manager.

Load balancing differs from task replication, since it distributes a load of different (not the
same) computations across machines, and allows a single computation to be dropped in case
of failure.
Load balancing, however, sometimes uses data replication (especially multi-master
replication) internally, to distribute its data among machines.
Backup differs from replication in that it saves a copy of data unchanged for a long period
of time.
Replicas, on the other hand, undergo frequent updates and quickly lose any historical state.
Replication in distributed systems
Replication is one of the oldest and most important topics in the overall area of distributed
systems.
Whether one replicates data or computation, the objective is to have some group of
processes that handle incoming events.
If we replicate data, these processes are passive and operate only to maintain the stored
data, reply to read requests, and apply updates. When we replicate computation, the usual
goal is to provide fault-tolerance.
Replication models in distributed systems
A number of widely cited models exist for data replication, each having its own properties
and performance:
1. Transactional replication. This is the model for replicating transactional data, for example a database or some other form of transactional storage structure. The one-copy serializability model is employed in this case, which defines legal outcomes of a transaction on replicated data in accordance with the overall ACID properties that transactional systems seek to guarantee.
2. State machine replication. This model assumes that a replicated process is a deterministic finite automaton and that atomic broadcast of every event is possible. It is based on a distributed computing problem called distributed consensus and has a great deal in common with the transactional replication model. It is sometimes mistakenly used as a synonym of active replication. State machine replication is usually implemented by a replicated log consisting of multiple subsequent rounds of the Paxos algorithm. This was popularized by Google's Chubby system, and is the core behind the open-source Keyspace data store.[3][4] A minimal sketch of the ordered-log idea follows this list.
3. Virtual synchrony. This computational model is used when a group of processes
cooperate to replicate in-memory data or to coordinate actions. The model defines a
distributed entity called a process group. A process can join a group, and is provided
with a checkpoint containing the current state of the data replicated by group
members. Processes can then send multicasts to the group and will see incoming
multicasts in the identical order. Membership changes are handled as a special
multicast that delivers a new membership view to the processes in the group.
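The state-machine idea from item 2 can be sketched in a few lines of Python. The "replicated log" here is just a list standing in for the agreed command order that a real system would obtain from atomic broadcast or a consensus protocol such as Paxos, and the CounterMachine class is invented for illustration:

```python
# Minimal sketch: if every replica is the same deterministic automaton and
# all replicas apply the same commands in the same total order, their states
# stay identical. The shared list below stands in for the replicated log.

class CounterMachine:
    """A trivial deterministic state machine."""
    def __init__(self):
        self.value = 0

    def apply(self, command):
        if command == "inc":
            self.value += 1
        elif command == "double":
            self.value *= 2

replicated_log = ["inc", "inc", "double", "inc"]   # agreed-upon total order

replicas = [CounterMachine() for _ in range(3)]
for command in replicated_log:
    for replica in replicas:
        replica.apply(command)

assert len({r.value for r in replicas}) == 1       # all replicas agree
print([r.value for r in replicas])                 # [5, 5, 5]
```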
Database replication
Database replication can be used on many database management systems, usually with a
master/slave relationship between the original and the copies.
The master logs the updates, which then ripple through to the slaves.
The slave outputs a message stating that it has received the update successfully, thus
allowing the sending (and potentially re-sending until successfully applied) of subsequent
updates.
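A toy sketch of this flow in Python, with invented class names and a simulated lossy network (it is not the protocol of any particular DBMS):

```python
# Toy sketch of master/slave log shipping: the master logs each update and
# keeps re-sending it until the slave acknowledges it. Message loss is
# simulated with a random drop; all names are invented for illustration.

import random

class Slave:
    def __init__(self):
        self.data = {}

    def receive(self, update):
        key, value = update
        self.data[key] = value
        return "ACK"                       # confirm the update was applied

class Master:
    def __init__(self, slave):
        self.data = {}
        self.log = []
        self.slave = slave

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))      # log the update first
        self.replicate((key, value))

    def replicate(self, update):
        while True:                        # re-send until acknowledged
            delivered = random.random() < 0.7   # 30% simulated message loss
            if delivered and self.slave.receive(update) == "ACK":
                return

slave = Slave()
master = Master(slave)
master.write("balance", 100)
print(slave.data)                          # {'balance': 100}
```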
Database replication becomes difficult when it scales up. Usually the scale-up proceeds along two dimensions, horizontal and vertical: horizontal scale-up means more data replicas, while vertical scale-up means data replicas located further away from each other.
Problems raised by horizontal scale-up can be alleviated by a multi-layer, multi-view access protocol.
Vertical scale-up causes fewer problems as Internet reliability and performance improve.


When data is replicated between database servers, so that the information remains
consistent throughout the database system and users cannot tell or even know which server in
the DBMS they are using, the system is said to exhibit replication transparency.
Disk storage replication
Active (real-time) storage replication is usually implemented by distributing updates of a
block device to several physical hard disks.
This way, any file system supported by the operating system can be replicated without
modification, as the file system code works on a level above the block device driver layer.
It is implemented either in hardware (in a disk array controller) or in software (in a device
driver).
The most basic method is disk mirroring, typical for locally-connected disks.
The storage industry narrows the definitions, so mirroring is a local (short-distance)
operation.
Replication, by contrast, is extendable across a computer network, so the disks can be located in physically distant locations, and the master-slave database replication model is usually applied.
The purpose of replication is to prevent damage from failures or disasters that may occur in
one location, or in case such events do occur, improve the ability to recover.
For replication, latency is the key factor because it determines either how far apart the sites
can be or the type of replication that can be employed.
The main characteristic of such cross-site replication is how write operations are handled; a code sketch of the three modes follows the list:

Synchronous replication - guarantees "zero data loss" by means of an atomic write operation, i.e. the write either completes on both sides or not at all. A write is not considered complete until acknowledgement by both the local and the remote storage. Most applications wait for a write transaction to complete before proceeding with further work, hence overall performance decreases considerably. Inherently, performance drops proportionally to distance, as latency is caused by the speed of light. For a 10 km distance, the fastest possible round trip takes 67 μs, whereas nowadays a whole local cached write completes in about 10-20 μs.
o An often-overlooked aspect of synchronous replication is the fact that failure of the remote replica, or even just of the interconnection, by definition stops any and all writes (freezing the local storage system). This is the behaviour that guarantees zero data loss. However, many commercial systems at such a potentially dangerous point do not freeze, but just proceed with local writes, losing the desired zero recovery point objective.
o The main difference between synchronous and asynchronous volume replication is that synchronous replication needs to wait for the destination server in any write operation.[6]
Asynchronous replication - a write is considered complete as soon as the local storage acknowledges it. The remote storage is updated, but probably with a small lag. Performance is greatly increased, but if the local storage is lost, the remote storage is not guaranteed to have the current copy of the data and the most recent data may be lost.
Semi-synchronous replication - this usually means that a write is considered complete as soon as the local storage acknowledges it and a remote server acknowledges that it has received the write either into memory or to a dedicated log file. The actual remote write is not performed immediately but is performed asynchronously, resulting in better performance than synchronous replication but with increased risk of the remote write failing.
o Point-in-time replication - introduces periodic snapshots that are replicated
instead of the primary storage. If the replicated snapshots are pointer-based, then during replication only the changed data is moved, not the entire volume.
Using this method, replication can occur over smaller, less expensive
bandwidth links such as iSCSI or T1 instead of fiber optic lines.
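A minimal sketch of the difference between the three modes, using plain Python lists as stand-ins for the local and remote storage (all names are purely illustrative):

```python
# Sketch: what must happen before the application sees a write as complete
# under synchronous, asynchronous and semi-synchronous replication.

local, remote, remote_log, pending = [], [], [], []

def synchronous_write(block):
    local.append(block)
    remote.append(block)          # must reach the remote storage...
    return "complete"             # ...before the write is acknowledged

def asynchronous_write(block):
    local.append(block)
    pending.append(block)         # shipped to the remote side later, with a lag
    return "complete"             # acknowledged after the local write only

def semi_synchronous_write(block):
    local.append(block)
    remote_log.append(block)      # remote has received (logged) the write...
    return "complete"             # ...but applies it to its storage asynchronously

synchronous_write("b1")
asynchronous_write("b2")
semi_synchronous_write("b3")
print(local, remote, remote_log, pending)
# ['b1', 'b2', 'b3'] ['b1'] ['b3'] ['b2']
```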

To address the limits imposed by latency, techniques of WAN optimization can be applied to
the link.

Notable implementations
Many distributed filesystems use replication to ensure fault tolerance and avoid a single
point of failure.
See the lists of distributed fault-tolerant file systems and distributed parallel fault-tolerant
file systems.
Other notable storage replication software includes:
Dell - AppAssure Backup, replication and disaster recovery
Dell - Compellent Remote Instant Replay
EMC - EMC RecoverPoint
EMC - EMC SRDF
EMC - EMC VPLEX
DataCore SANsymphony & SANmelody
StarWind iSCSI SAN & NAS
FalconStor Replication & Mirroring (sub-block heterogeneous point-in-time, async, sync)[7]
FreeNas - Replication handled by ssh + zfs file system
Hitachi TrueCopy
Hewlett-Packard - Continuous Access (HP CA)
IBM - Peer to Peer Remote Copy (PPRC) and Global Mirror (known together as IBM
Copy Services)
Linux - DRBD - open source module
HAST DRBD-like Open Source solution for FreeBSD.
MapR volume mirroring
NetApp SyncMirror
NetApp SnapMirror
Symantec Veritas Volume Replicator (VVR)
VMware - Site Recovery Manager (SRM)
File-based replication
File-based replication is replicating files at a logical level rather than replicating at the
storage block level.
There are many different ways of performing this. Unlike with storage-level replication, the
solutions almost exclusively rely on software.
Capture with a kernel driver
With the use of a kernel driver (specifically a filter driver) that intercepts calls to the filesystem functions, any activity is captured immediately as it occurs.
This utilises the same type of technology that real-time active virus checkers employ.
At this level, logical file operations are captured like file open, write, delete, etc.
The kernel driver transmits these commands to another process, generally over a network to
a different machine, which will mimic the operations of the source machine.

Like block-level storage replication, the file-level replication allows both synchronous and
asynchronous modes.
In synchronous mode, write operations on the source machine are held and not allowed to
occur until the destination machine has acknowledged the successful replication.
Synchronous mode is less common with file replication products although a few solutions
exist.
File-level replication solutions yield a few benefits. Firstly, because data is captured at a file level, the solution can make an informed decision on whether to replicate based on the location of the file and the type of file.
Hence unlike block-level storage replication where a whole volume needs to be replicated,
file replication products have the ability to exclude temporary files or parts of a filesystem
that hold no business value.
This can substantially reduce the amount of data sent from the source machine as well as
decrease the storage burden on the destination machine.
A further benefit, besides decreasing bandwidth, is that the data transmitted can be more granular than with block-level replication. If an application writes 100 bytes, only those 100 bytes are transmitted, not a complete disk block, which is generally 4096 bytes.
On the negative side, as this is a software-only solution, it requires implementation and maintenance at the operating system level, and uses some of the machine's processing power (CPU).
Notable implementations:
Cofio Software AIMstor Replication
Double-Take Software Availability
Filesystem journal replication
In many ways working like a database journal, many filesystems have the ability to journal
their activity.
The journal can be sent to another machine, either periodically or in real time. It can be used
there to play back events.
Notable implementations:
Microsoft DPM (periodical updates, not in real time)
Batch replication
This is the process of comparing the source and destination filesystems and ensuring that
the destination matches the source.
The key benefit is that such solutions are generally free or inexpensive.
The downside is that the process of synchronizing them is quite system-intensive, and
consequently this process generally runs infrequently.
Notable implementations:
rsync
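A minimal Python sketch of the comparison step such tools perform, assuming hypothetical source and destination paths and using only file size and modification time as the change test:

```python
# Minimal sketch of batch replication: walk the source tree and copy a file
# only when it is missing on the destination or its size/mtime differ.
# The paths in the commented call are hypothetical.

import os
import shutil

def batch_replicate(src_root, dst_root):
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        dst_dir = os.path.join(dst_root, rel)
        os.makedirs(dst_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(dst_dir, name)
            src_stat = os.stat(src)
            if (not os.path.exists(dst)
                    or os.stat(dst).st_size != src_stat.st_size
                    or os.stat(dst).st_mtime < src_stat.st_mtime):
                shutil.copy2(src, dst)     # copy data and timestamps

# batch_replicate("/data/source", "/data/mirror")   # typically run infrequently, e.g. from cron
```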
Distributed shared memory replication
Another example of using replication appears in distributed shared memory systems, where
it may happen that many nodes of the system share the same page of memory, which usually means that each node has a separate copy (replica) of this page.
Primary-backup and multi-primary replication
Many classical approaches to replication are based on a primary/backup model where one
device or process has unilateral control over one or more other processes or devices.
For example, the primary might perform some computation, streaming a log of updates to a
backup (standby) process, which can then take over if the primary fails.
This approach is the most common one for replicating databases, despite the risk that if a
portion of the log is lost during a failure, the backup might not be in a state identical to the
one the primary was in, and transactions could then be lost.
A weakness of primary/backup schemes is that in settings where both processes could have
been active, only one is actually performing operations.
We're gaining fault-tolerance but spending twice as much money to get this property.

For this reason, starting in the period around 1985, the distributed systems research
community began to explore alternative methods of replicating data.
An outgrowth of this work was the emergence of schemes in which a group of replicas could cooperate, with each process backing up the others, and each handling some share of the workload.
A number of modern products support similar schemes.
For example, the Spread Toolkit supports this same virtual synchrony model and can be used to implement a multi-primary replication scheme; it would also be possible to use C-Ensemble or Quicksilver in this manner.
WANdisco permits active replication where every node on a network is an exact copy or
replica and hence every node on the network is active at one time; this scheme is optimized
for use in a wide area network.

Performance and Scalability


In the A Word On Scalability posting I tried to write down a more precise definition of scalability than is commonly used.
There were good comments about the definition at the posting as well as in a discussion at
The ServerSide.
To recap in a less precise manner I stated that
A service is said to be scalable if when we increase the resources in a system, it
results in increased performance in a manner proportional to resources added
An always-on service is said to be scalable if adding resources to facilitate
redundancy does not result in a loss of performance.
A scalable service needs to be able to handle heterogeneity of resources.
There were quite a few comments about the use of performance in the definition.
This is how I reason about performance in this context: I am assuming that each service has
an SLA contract that defines what the expectations of your clients/customers are (SLA =
Service Level Agreement).
What exactly is in that SLA depends on the kind of service business you are in; quite a few
of the services that contribute to an Amazon.com website have an SLA that is latency driven.
This latency will have a certain distribution and you pick a number of points on the
distribution as representatives for measuring your SLA.
For example at Amazon we also track the latency at the 99.9% mark to make sure all of our customers are getting an experience at SLA or better.
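As a small illustration, tracking a latency percentile against an SLA threshold can be sketched as follows; the latencies are synthetic and the 300 ms threshold is an assumed value, not an actual figure:

```python
# Sketch: measuring p50 and p99.9 latency against a latency-driven SLA.
# The request latencies are synthetic; a real service would feed in
# measured values.

import random

latencies_ms = [random.expovariate(1 / 20) for _ in range(100_000)]  # synthetic data

def percentile(samples, pct):
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]

p50 = percentile(latencies_ms, 50)
p999 = percentile(latencies_ms, 99.9)
print(f"p50 = {p50:.1f} ms, p99.9 = {p999:.1f} ms")

SLA_P999_MS = 300                      # assumed contract value
print("within SLA" if p999 <= SLA_P999_MS else "SLA violated")
```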
This SLA needs to be maintained if you grow your business. Growing can mean increasing
the number of requests, increasing the number of items you serve, increasing the amount of
work you do for each request, etc.
But no matter along which axis you grow, you will need to make sure you can always meet
your SLA.
Growth along some axis can be served by scaling up to faster CPUs and larger memories,
but if you keep growing there is an end to what you can buy and you will need to scale out.
Given that scaling up is often not cost effective, you might as well start by working on
scaling out, as you will have to go that path eventually.
I have not seen many SLAs that are purely throughput driven.
It is often a combination of the amount of work that needs to be done, the distribution in
which it will arrive and when that work needs to be finished, that will lead to a throughput
driven SLA.
Latency does play a role here as it is often a driver for what throughput is necessary to
achieve the output distribution.
If you have a request arrival distribution that is non-uniform you can play various games with buffering and capping the throughput at lower than your peak load, as long as you are willing to accept longer latencies.
Often it is the latency distribution that you try to achieve that drives your throughput requirements.

There were some other points made with respect to what should be part of a scalability definition, among others by Gideon Low @ the ServerSide thread (I tried to link to his individual response but that seems to fail), who makes some good points:
Operationally efficient: it takes fewer human resources to manage the system as the number of hardware resources scales up.
Resilient: increasing the number of resources will also increase the probability of failure of one of those resources, but the impact of such a failure should be reduced as the number of resources grows.
These two points, combined with a discussion about cost/capacity/efficiency, should be part of a definition of a scalable service.
I'll be thinking a bit about what the right wording should be and will post a proposal later.

Scalability
In electronics (including hardware, communication and software), scalability is the ability
of a system, network, or process to handle a growing amount of work in a capable manner or
its ability to be enlarged to accommodate that growth.
For example, it can refer to the capability of a system to increase total throughput under an
increased load when resources (typically hardware) are added.
An analogous meaning is implied when the word is used in an economic context, where
scalability of a company implies that the underlying business model offers the potential for
economic growth within the company.
Scalability, as a property of systems, is generally difficult to define and in any particular
case it is necessary to define the specific requirements for scalability on those dimensions that
are deemed important.
It is a highly significant issue in electronics systems, databases, routers, and networking.
A system whose performance improves after adding hardware, proportionally to the capacity added, is said to be a scalable system.

Measures
Scalability can be measured in various dimensions, such as:
Administrative scalability: The ability for an increasing number of organizations or
users to easily share a single distributed system.
Functional scalability: The ability to enhance the system by adding new functionality
at minimal effort.
Geographic scalability: The ability to maintain performance, usefulness, or usability
regardless of expansion from concentration in a local area to a more distributed
geographic pattern.
Load scalability: The ability for a distributed system to easily expand and contract its
resource pool to accommodate heavier or lighter loads or number of inputs.
Alternatively, the ease with which a system or component can be modified, added, or
removed, to accommodate changing load.
Examples
A routing protocol is considered scalable with respect to network size, if the size of
the necessary routing table on each node grows as O(log N), where N is the number of
nodes in the network.
A scalable online transaction processing system or database management system is
one that can be upgraded to process more transactions by adding new processors,
devices and storage, and which can be upgraded easily and transparently without
shutting it down.
Some early peer-to-peer (P2P) implementations of Gnutella had scaling issues. Each
node query flooded its requests to all peers. The demand on each peer would increase
in proportion to the total number of peers, quickly overrunning the peers' limited
capacity. Other P2P systems like BitTorrent scale well because the demand on each
peer is independent of the total number of peers. There is no centralized bottleneck, so

the system may expand indefinitely without the addition of supporting resources
(other than the peers themselves).
The distributed nature of the Domain Name System allows it to work efficiently even
when all hosts on the worldwide Internet are served, so it is said to "scale well".
Horizontal and vertical scaling
Methods of adding more resources for a particular application fall into two broad
categories: horizontal and vertical scaling.
To scale horizontally (or scale out) means to add more nodes to a system, such as adding a
new computer to a distributed software application.
An example might be scaling out from one Web server system to three.
To scale vertically (or scale up) means to add resources to a single node in a system,
typically involving the addition of CPUs or memory to a single computer. Such vertical
scaling of existing systems also enables them to use virtualization technology more
effectively, as it provides more resources for the hosted set of operating system and
application modules to share.
Database scalability
A number of different approaches enable databases to grow to very large size while
supporting an ever-increasing rate of transactions per second.
One technique supported by most of the major database management system (DBMS)
products is the partitioning of large tables, based on ranges of values in a key field.
In this manner, the database can be scaled out across a cluster of separate database servers.
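A minimal sketch of such range-based routing, with invented shard names and key ranges:

```python
# Minimal sketch of range partitioning: rows are routed to a database server
# according to the range their key falls into. Server names and boundaries
# are invented for illustration.

# (upper_bound_exclusive, server) pairs, ordered by key range
PARTITIONS = [
    (1_000_000, "db-shard-1"),
    (2_000_000, "db-shard-2"),
    (float("inf"), "db-shard-3"),
]

def shard_for(customer_id):
    for upper_bound, server in PARTITIONS:
        if customer_id < upper_bound:
            return server

print(shard_for(42))          # db-shard-1
print(shard_for(1_500_000))   # db-shard-2
print(shard_for(7_000_000))   # db-shard-3
```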
Also, with the advent of 64-bit microprocessors, multi-core CPUs, and large SMP
multiprocessors, DBMS vendors have been at the forefront of supporting multi-threaded
implementations that substantially scale up transaction processing capacity.
Design for scalability
It is often advised to focus system design on hardware scalability rather than on capacity.
It is typically cheaper to add a new node to a system in order to achieve improved
performance than to partake in performance tuning to improve the capacity that each node
can handle. But this approach can have diminishing returns (as discussed in performance
engineering).
Weak versus strong scaling
In the context of high performance computing there are two common notions of scalability.
The first is strong scaling, which is defined as how the solution time varies with the
number of processors for a fixed total problem size.
The second is weak scaling, which is defined as how the solution time varies with the
number of processors for a fixed problem size per processor.
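A small worked example with invented timings may make the distinction concrete:

```python
# Strong scaling: the total problem size stays fixed while processors are added.
t1, t16 = 320.0, 25.0                    # runtime in seconds on 1 and on 16 processors (assumed)
speedup = t1 / t16                       # 12.8x
strong_efficiency = speedup / 16         # 0.80 => 80% strong-scaling efficiency

# Weak scaling: the problem size grows with the processor count, so ideally
# the runtime stays constant.
t1_small, t16_large = 40.0, 46.0         # runtimes in seconds (assumed)
weak_efficiency = t1_small / t16_large   # ~0.87 => 87% weak-scaling efficiency

print(f"strong: {speedup:.1f}x speedup, {strong_efficiency:.0%} efficiency")
print(f"weak:   {weak_efficiency:.0%} efficiency")
```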

Best practices for disaster recovery


Use Emergency Management Services if applicable.
To ensure access to the server regardless of the condition of the network drivers or operating
system files, consider setting up Emergency Management Services, a new feature in the
Windows Server 2003 family. With Emergency Management Services, you can remotely
manage a server in emergency situations that would typically require a local keyboard,
mouse, and monitor, such as when the network is unavailable or the server is not functioning
properly. Emergency Management Services has specific hardware requirements, and is
available only for products in the Windows Server 2003 family.
Create a plan for performing regular backup operations.
Review and incorporate the best practices for Backup into a plan for backing up all of your
files on a regular basis.
Keep the installation CD where you can easily find it.
If needed, you can start the computer from the installation CD and use the Recovery Console
or Automated System Recovery.


Install the Recovery Console as a startup option.


You can install the Recovery Console on your computer to make it available in case you are
unable to restart Windows. You can then select the Recovery Console option from the list of
available operating systems on startup. You cannot install the Recovery Console on an
Itanium-based computer.
Specify startup and recovery options.
Specify what you want the operating system to do if the computer stops unexpectedly. For
example, you can specify that you want your computer to restart automatically and you can
control logging options.
Create a backup and restore plan
Be sure your backup plan specifies:
o The computer where backups will be stored
o The programs that you will use to back up your system
o The computers you want to back up
o The schedule when backups will occur
o The offsite location where you will archive backups

Backup
In information technology, a backup, or the process of backing up, refers to the copying

and archiving of computer data so it may be used to restore the original after a data loss
event.
Backups have two distinct purposes. The primary purpose is to recover data after its loss, be
it by data deletion or corruption.
Data loss can be a common experience of computer users.
The secondary purpose of backups is to recover data from an earlier time, according to a
user-defined data retention policy, typically configured within a backup application for how
long copies of data are required.
Storage, the base of a backup system
Data repository models
Any backup strategy starts with a concept of a data repository. The backup data needs to be
stored somehow and probably should be organized to a degree. It can be as simple as a sheet
of paper with a list of all backup tapes and the dates they were written or a more sophisticated
setup with a computerized index, catalog, or relational database. Different repository models
have different advantages. This is closely related to choosing a backup rotation scheme.
1. Unstructured
An unstructured repository may simply be a stack of floppy disks or CD-R/DVD-R
media with minimal information about what was backed up and when. This is the
easiest to implement, but probably the least likely to achieve a high level of
recoverability.
2. Full only / System imaging
A repository of this type contains complete system images from one or more specific
points in time. This technology is frequently used by computer technicians to record
known good configurations. Imaging is generally more useful for deploying a
standard configuration to many systems rather than as a tool for making ongoing
backups of diverse systems.
3. Incremental
An incremental style repository aims to make it more feasible to store backups from
more points in time by organizing the data into increments of change between points
in time. This eliminates the need to store duplicate copies of unchanged data, as
would be the case with a portion of the data of subsequent full backups. Typically, a
full backup (of all files) is made which serves as the reference point for an
incremental backup set. After that, any number of incremental backups are made.
Restoring the whole system to a certain point in time would require locating the last full backup taken previous to the data loss plus each and all of the incremental backups that cover the period of time between the full backup and the point in time to which the system is supposed to be restored. Additionally, some backup systems can reorganize the repository to synthesize full backups from a series of incrementals.
4. Differential
A differential style repository saves only the data changed since the last full backup. It has the advantage that only a maximum of two data sets is needed to restore the data. One disadvantage, at least as compared to the incremental backup method, is that as the time since the last full backup (and thus the accumulated data changes) increases, so does the time to perform the differential backup. To perform a differential backup, it is first necessary to perform a full backup. After that, each differential backup made will contain all the changes since the last full backup. Restoring an entire system to a certain point in time would require locating the last full backup taken previous to the point of the failure or loss plus the last differential backup made since that full backup (both restore chains are sketched after this list).
5. Reverse delta
A reverse delta type repository stores a recent "mirror" of the source data and a series
of differences between the mirror in its current state and its previous states. A reverse
delta backup will start with a normal full backup. After the full backup is performed,
the system will periodically synchronize the full backup with the live copy, while
storing the data necessary to reconstruct older versions. This can either be done using
hard links, or using binary diffs. This system works particularly well for large, slowly
changing, data sets. Examples of programs that use this method are rdiff-backup and
Time Machine.
6. Continuous data protection
Instead of scheduling periodic backups, the system immediately logs every change on
the host system. This is generally done by saving byte or block-level differences
rather than file-level differences.[5] It differs from simple disk mirroring in that it
enables a roll-back of the log and thus restoration of an old image of the data.
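A minimal sketch contrasting the incremental and differential restore chains described above, using hypothetical (day, kind) tuples for the backup sets:

```python
# Sketch: which backup sets a restore needs under the incremental and the
# differential schemes. The backup data is hypothetical.

def restore_chain(backups, target_day, scheme):
    """Return the backups needed to restore the system to `target_day`."""
    taken = [b for b in backups if b[0] <= target_day]
    last_full = max(day for day, kind in taken if kind == "full")
    partials = [(day, kind) for day, kind in taken if day > last_full]
    if scheme == "incremental":
        # the last full backup plus every incremental made since it
        return [(last_full, "full")] + partials
    if scheme == "differential":
        # the last full backup plus only the most recent differential
        return [(last_full, "full")] + partials[-1:]

incrementals = [(1, "full"), (2, "inc"), (3, "inc"), (4, "inc")]
differentials = [(1, "full"), (2, "diff"), (3, "diff"), (4, "diff")]

print(restore_chain(incrementals, 4, "incremental"))
# [(1, 'full'), (2, 'inc'), (3, 'inc'), (4, 'inc')]
print(restore_chain(differentials, 4, "differential"))
# [(1, 'full'), (4, 'diff')]
```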
Storage media
Regardless of the repository model that is used, the data has to be stored on some data storage
medium somewhere.
1. Magnetic tape
Magnetic tape has long been the most commonly used medium for bulk data storage,
backup, archiving, and interchange. Tape has typically had an order of magnitude
better capacity/price ratio when compared to hard disk, but recently the ratios for tape
and hard disk have become a lot closer. [6] There are myriad formats, many of which
are proprietary or specific to certain markets like mainframes or a particular brand of
personal computer. Tape is a sequential access medium, so even though access times
may be poor, the rate of continuously writing or reading data can actually be very fast.
Some new tape drives are even faster than modern hard disks. A principal advantage
of tape is that it has been used for this purpose for decades (much longer than any
alternative) and its characteristics are well understood.
2. Hard disk
The capacity/price ratio of hard disk has been rapidly improving for many years. This
is making it more competitive with magnetic tape as a bulk storage medium. The
main advantages of hard disk storage are low access times, availability, capacity and
ease of use. External disks can be connected via local interfaces like SCSI, USB,
FireWire, or eSATA, or via longer distance technologies like Ethernet, iSCSI, or

Fibre Channel. Some disk-based backup systems, such as Virtual Tape Libraries,
support data deduplication which can dramatically reduce the amount of disk storage
capacity consumed by daily and weekly backup data. The main disadvantages of hard
disk backups are that they are easily damaged, especially while being transported
(e.g., for off-site backups), and that their stability over periods of years is a relative
unknown.
3. Optical storage
Recordable CDs, DVDs, and Blu-ray Discs are commonly used with personal
computers and generally have low media unit costs. However, the capacities and
speeds of these and other optical discs are typically an order of magnitude lower than
hard disk or tape. Many optical disk formats are WORM type, which makes them
useful for archival purposes since the data cannot be changed. The use of an autochanger or jukebox can make optical discs a feasible option for larger-scale backup
systems. Some optical storage systems allow for cataloged data backups without
human contact with the discs, allowing for longer data integrity.
4. Floppy disk
During the 1980s and early 1990s, many personal/home computer users associated
backing up mostly with copying to floppy disks. However, the data capacity of floppy
disks failed to catch up with growing demands, rendering them unpopular and
obsolete.[8]
5. Solid state storage
Also known as flash memory, thumb drives, USB flash drives, CompactFlash,
SmartMedia, Memory Stick, Secure Digital cards, etc., these devices are relatively
expensive for their low capacity. A solid-state drive does not contain any moving parts, unlike its magnetic-drive counterpart, and can have throughput on the order of 500 Mbit/s to 6 Gbit/s. SSDs are now available with capacities on the order of 500 GB to several TB.
6. Remote backup service
As broadband internet access becomes more widespread, remote backup services are
gaining in popularity. Backing up via the internet to a remote location can protect
against some worst-case scenarios such as fires, floods, or earthquakes which would
destroy any backups in the immediate vicinity along with everything else. There are,
however, a number of drawbacks to remote backup services. First, Internet
connections are usually slower than local data storage devices. Residential broadband
is especially problematic as routine backups must use an upstream link that's usually
much slower than the downstream link used only occasionally to retrieve a file from
backup. This tends to limit the use of such services to relatively small amounts of high
value data. Secondly, users must trust a third party service provider to maintain the
privacy and integrity of their data, although confidentiality can be assured by
encrypting the data before transmission to the backup service with an encryption key
known only to the user. Ultimately the backup service must itself use one of the above
methods so this could be seen as a more complex way of doing traditional backups.
Managing the data repository
Regardless of the data repository model or data storage media used for backups, a balance
needs to be struck between accessibility, security and cost.
These media management methods are not mutually exclusive and are frequently combined
to meet the needs of the situation.
Using on-line disks for staging data before it is sent to a near-line tape library is a common
example.
1. On-line

On-line backup storage is typically the most accessible type of data storage, which
can begin a restore in milliseconds. A good example would be an internal hard
disk or a disk array (maybe connected to SAN). This type of storage is very
convenient and speedy, but is relatively expensive. On-line storage is quite vulnerable
to being deleted or overwritten, either by accident, by intentional malevolent action,
or in the wake of a data-deleting virus payload.
2. Near-line
Near-line storage is typically less accessible and less expensive than on-line storage,
but still useful for backup data storage. A good example would be a tape library with
restore times ranging from seconds to a few minutes. A mechanical device is usually
involved in moving media units from storage into a drive where the data can be read
or written. Generally it has safety properties similar to on-line storage.
3. Off-line
Off-line storage requires some direct human action in order to make access to the
storage media physically possible. This action is typically inserting a tape into a tape
drive or plugging in a cable that allows a device to be accessed. Because the data is
not accessible via any computer except during limited periods in which it is written or
read back, it is largely immune to a whole class of on-line backup failure modes.
Access time will vary depending on whether the media is on-site or off-site.
4. Off-site data protection
To protect against a disaster or other site-specific problem, many people choose to
send backup media to an off-site vault. The vault can be as simple as a system
administrator's home office or as sophisticated as a disaster-hardened, temperature-controlled, high-security bunker that has facilities for backup media storage.
Importantly a data replica can be off-site but also on-line (e.g., an off-site RAID
mirror). Such a replica has fairly limited value as a backup, and should not be
confused with an off-line backup.
5. Backup site or disaster recovery center (DR center)
In the event of a disaster, the data on backup media will not be sufficient to recover.
Computer systems onto which the data can be restored and properly configured
networks are necessary too. Some organizations have their own data recovery centers
that are equipped for this scenario. Other organizations contract this out to a third-party recovery center. Because a DR site is itself a huge investment, backing up is
very rarely considered the preferred method of moving data to a DR site. A more
typical way would be remote disk mirroring, which keeps the DR data as up to date as possible.
Selection and extraction of data
A successful backup job starts with selecting and extracting coherent units of data. Most
data on modern computer systems is stored in discrete units, known as files.
These files are organized into filesystems.
Files that are actively being updated can be thought of as "live" and present a challenge to
back up. It is also useful to save metadata that describes the computer or the filesystem being
backed up.
Deciding what to back up at any given time is a harder process than it seems. By backing up
too much redundant data, the data repository will fill up too quickly. Backing up an
insufficient amount of data can eventually lead to the loss of critical information.
Files
Copying files
Making copies of files is the simplest and most common way to perform a backup. A
means to perform this basic function is included in all backup software and all
operating systems.
Partial file copying
Instead of copying whole files, one can limit the backup to only the blocks or bytes
within a file that have changed in a given period of time. This technique can use
substantially less storage space on the backup medium, but requires a high level of
sophistication to reconstruct files in a restore situation. Some implementations require
integration with the source file system.
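As an illustration of the idea rather than of any particular product, the following Python sketch finds the fixed-size blocks of a file that changed since the previous run by hashing each block and comparing it with a saved index; the block size, file paths, and index format are arbitrary choices.

import hashlib
import json
import os

BLOCK_SIZE = 64 * 1024  # 64 KiB blocks

def changed_blocks(path, index_path):
    """Return {block_number: data} for blocks that changed since the last run."""
    old = {}
    if os.path.exists(index_path):
        with open(index_path) as f:
            old = json.load(f)            # {block number (as str): hex digest}
    new_index, changed = {}, {}
    with open(path, "rb") as f:
        block_no = 0
        while True:
            data = f.read(BLOCK_SIZE)
            if not data:
                break
            digest = hashlib.sha256(data).hexdigest()
            new_index[str(block_no)] = digest
            if old.get(str(block_no)) != digest:
                changed[block_no] = data  # only this block needs to be stored
            block_no += 1
    with open(index_path, "w") as f:
        json.dump(new_index, f)
    return changed

Only the returned blocks would be written to the backup medium, together with enough bookkeeping to reassemble the file on restore.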
Filesystems
Filesystem dump
Instead of copying files within a filesystem, a copy of the whole filesystem itself can
be made. This is also known as a raw partition backup and is related to disk imaging.
The process usually involves unmounting the filesystem and running a program like
dd (Unix). Because the disk is read sequentially and with large buffers, this type of
backup can be much faster than reading every file normally, especially when the
filesystem contains many small files, is highly fragmented, or is nearly full. But
because this method also reads the free disk blocks that contain no useful data, this
method can also be slower than conventional reading, especially when the filesystem
is nearly empty. Some filesystems, such as XFS, provide a "dump" utility that reads
the disk sequentially for high performance while skipping unused sections. The
corresponding restore utility can selectively restore individual files or the entire
volume at the operator's choice.
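A minimal sketch of such a raw partition backup, driving dd (Unix) from Python; the device and image paths are hypothetical, and the filesystem is assumed to be unmounted or otherwise quiesced first, as described above.

import subprocess

def dump_partition(device="/dev/sdb1", image="/backup/sdb1.img"):
    # Sequential raw copy of the whole block device into an image file.
    subprocess.run(
        ["dd", f"if={device}", f"of={image}", "bs=4M", "conv=sync,noerror"],
        check=True,  # raise an error if dd fails
    )

# dump_partition()  # typically requires root privileges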
Identification of changes
Some filesystems have an archive bit for each file that says it was recently changed.
Some backup software looks at the date of the file and compares it with the last
backup to determine whether the file was changed.
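A minimal sketch of the date-comparison approach in Python, assuming the time of the last backup is available as a Unix timestamp; real backup software may also consult metadata such as the archive bit mentioned above.

import os

def files_changed_since(root, last_backup_time):
    """Yield paths under `root` modified after `last_backup_time` (a Unix timestamp)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > last_backup_time:
                    yield path
            except OSError:
                pass  # file disappeared between listing and stat; skip it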
Versioning file system
A versioning filesystem keeps track of all changes to a file and makes those changes
accessible to the user. Generally this gives access to any previous version, all the way
back to the file's creation time. An example of this is the Wayback versioning
filesystem for Linux.[9]
Live data
If a computer system is in use while it is being backed up, the possibility of files being open
for reading or writing is real.
If a file is open, the contents on disk may not correctly represent what the owner of the file
intends.
This is especially true for database files of all kinds. The term fuzzy backup can be used to
describe a backup of live data that looks like it ran correctly, but does not represent the state
of the data at any single point in time.
This is because the data being backed up changed in the period of time between when the
backup started and when it finished. For databases in particular, fuzzy backups are worthless.
Snapshot backup
A snapshot is an instantaneous function of some storage systems that presents
a copy of the file system as if it were frozen at a specific point in time, often
by a copy-on-write mechanism. An effective way to back up live data is to
temporarily quiesce it (e.g. close all files), take a snapshot, and then resume
live operations. At this point the snapshot can be backed up through normal
methods.[10] While a snapshot is very handy for viewing a filesystem as it was
at a different point in time, it is hardly an effective backup mechanism by
itself.
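The quiesce, snapshot, back up, resume sequence might look roughly like the following Python sketch built on LVM copy-on-write snapshots; the volume names, mount point, and archive path are hypothetical, and the quiesce step is entirely application-specific, so it is only indicated by a comment.

import subprocess

def backup_via_snapshot():
    # 1. Quiesce the application (flush buffers, pause writers) -- omitted here
    #    because this step is entirely application-specific.
    # 2. Take a copy-on-write snapshot of the logical volume.
    subprocess.run(["lvcreate", "--snapshot", "--size", "2G",
                    "--name", "data_snap", "/dev/vg0/data"], check=True)
    try:
        # 3. Live operations resume immediately; the snapshot stays frozen.
        #    Mount the frozen view (assumes /mnt/snap exists).
        subprocess.run(["mount", "/dev/vg0/data_snap", "/mnt/snap"], check=True)
        # 4. Back up the frozen view through normal means.
        subprocess.run(["tar", "-czf", "/backup/data.tar.gz",
                        "-C", "/mnt/snap", "."], check=True)
    finally:
        subprocess.run(["umount", "/mnt/snap"], check=False)
        subprocess.run(["lvremove", "-f", "/dev/vg0/data_snap"], check=False)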
Open file backup
Many backup software packages feature the ability to handle open files in
backup operations. Some simply check for openness and try again later. File
locking is useful for regulating access to open files.
When attempting to understand the logistics of backing up open files, one
must consider that the backup process could take several minutes to back up a
large file such as a database. In order to back up a file that is in use, it is vital
that the entire backup represent a single-moment snapshot of the file, rather
than a simple copy of a read-through. This represents a challenge when
backing up a file that is constantly changing.
Cold database backup
During a cold backup, the database is closed or locked and not available to
users. The datafiles do not change during the backup process so the database is
in a consistent state when it is returned to normal operation.
Hot database backup
Some database management systems offer a means to generate a backup
image of the database while it is online and usable ("hot"). This usually
includes an inconsistent image of the data files plus a log of changes made
while the procedure is running. Upon a restore, the changes in the log files are
reapplied to bring the copy of the database up-to-date (the point in time at
which the initial hot backup ended).
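As one small, concrete example of a hot backup facility, SQLite exposes an online backup API through Python's sqlite3 module (Python 3.7 or later): a consistent copy is produced while the live database remains in use. The file names below are placeholders.

import sqlite3

src = sqlite3.connect("live.db")      # the database that stays in use
dst = sqlite3.connect("backup.db")    # the consistent copy being produced
src.backup(dst)                       # SQLite copies pages while writers continue
dst.close()
src.close()

Client-server database systems typically offer analogous tools that combine a base image with the change log described above.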
Metadata
Not all information stored on the computer is stored in files. Accurately recovering a
complete system from scratch requires keeping track of this non-file data too.
System description
System specifications are needed to procure an exact replacement after a disaster.
Boot sector
The boot sector can sometimes be recreated more easily than saving it. Still, it usually
isn't a normal file and the system won't boot without it.
Partition layout
The layout of the original disk, as well as partition tables and filesystem settings, is
needed to properly recreate the original system.
File metadata
Each file's permissions, owner, group, ACLs, and any other metadata need to be
backed up for a restore to properly recreate the original environment.
System metadata
Different operating systems have different ways of storing configuration information.
Microsoft Windows keeps a registry of system information that is more difficult to
restore than a typical file.
Manipulation of data and dataset optimization
It is frequently useful or required to manipulate the data being backed up to optimize the
backup process. These manipulations can provide many benefits including improved backup
speed, restore speed, data security, media usage, and/or reduced bandwidth requirements.
1. Compression
Various schemes can be employed to shrink the size of the source data to be stored so
that it uses less storage space. Compression is frequently a built-in feature of tape
drive hardware.
2. Deduplication
When multiple similar systems are backed up to the same destination storage device,
there exists the potential for much redundancy within the backed up data. For
example, if 20 Windows workstations were backed up to the same data repository,
they might share a common set of system files. The data repository only needs to store
one copy of those files to be able to restore any one of those workstations. This
technique can be applied at the file level or even on raw blocks of data, potentially
resulting in a massive reduction in required storage space. Deduplication can occur on
a server before any data moves to backup media, sometimes referred to as
source/client side deduplication. This approach also reduces bandwidth required to
send backup data to its target media. The process can also occur at the target storage
device, sometimes referred to as inline or back-end deduplication; a short sketch of the content-addressing idea follows this list.
3. Duplication
Sometimes backup jobs are duplicated to a second set of storage media. This can be
done to rearrange the backup images to optimize restore speed or to have a second
copy at a different location or on a different storage medium.
4. Encryption
High capacity removable storage media such as backup tapes present a data security
risk if they are lost or stolen.[13] Encrypting the data on these media can mitigate this
problem, but presents new problems. Encryption is a CPU intensive process that can
slow down backup speeds, and the security of the encrypted backups is only as
effective as the security of the key management policy.
5. Multiplexing
When there are many more computers to be backed up than there are destination
storage devices, the ability to use a single storage device with several simultaneous
backups can be useful.
6. Refactoring
The process of rearranging the backup sets in a data repository is known as
refactoring. For example, if a backup system uses a single tape each day to store the
incremental backups for all the protected computers, restoring one of the computers
could potentially require many tapes. Refactoring could be used to consolidate all the
backups for a single computer onto a single tape. This is especially useful for backup
systems that do "incremental forever" style backups.
7. Staging
Sometimes backup jobs are copied to a staging disk before being copied to tape. This
process is sometimes referred to as D2D2T, an acronym for Disk to Disk to Tape.
This can be useful if there is a problem matching the speed of the final destination
device with the source device as is frequently faced in network-based backup systems.
It can also serve as a centralized location for applying other data manipulation
techniques.
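Returning to deduplication (item 2 above), the following Python sketch shows the content-addressing idea at the file level: each unique file body is stored once under its SHA-256 digest, so identical files from many machines occupy space only once. The store path and layout are illustrative and not those of any real product.

import hashlib
import os
import shutil

STORE = "/backup/chunkstore"          # one stored copy per unique content hash
os.makedirs(STORE, exist_ok=True)

def store_file(path):
    """Copy `path` into the store only if its content is not already there."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    digest = h.hexdigest()
    target = os.path.join(STORE, digest)
    if not os.path.exists(target):    # identical content from any machine is skipped
        shutil.copyfile(path, target)
    return digest                     # recorded in the backup's path-to-digest index

A backup then only needs to record a path-to-digest index; block-level deduplication applies the same idea to fixed- or variable-size chunks instead of whole files.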
Managing the backup process
It is important to understand that backing up is a process. As long as new data is being
created and changes are being made, backups will need to be updated. Individuals and
organizations with anything from one computer to many thousands of computer
systems all have requirements for protecting data. While the scale is different, the objectives
and limitations are essentially the same. Likewise, those who perform backups need to know
to what extent they were successful, regardless of scale.
Objectives
Recovery point objective (RPO)
The point in time that the restarted infrastructure will reflect. Essentially, this is the
roll-back that will be experienced as a result of the recovery. The most desirable RPO
would be the point just prior to the data loss event. Making a more recent recovery
point achievable requires increasing the frequency of synchronization between the
source data and the backup repository.[14]
Recovery time objective (RTO)
The amount of time elapsed between disaster and restoration of business functions.
Data security
In addition to preserving access to data for its owners, data must be restricted from
unauthorized access. Backups must be performed in a manner that does not
compromise the original owner's undertaking. This can be achieved with data
encryption and proper media handling policies.
Limitations
An effective backup scheme will take into consideration the limitations of the situation.
1. Backup window
The period of time when backups are permitted to run on a system is called the
backup window. This is typically the time when the system sees the least usage and
the backup process will have the least amount of interference with normal operations.
The backup window is usually planned with users' convenience in mind. If a backup
extends past the defined backup window, a decision is made whether it is more
beneficial to abort the backup or to lengthen the backup window.
2. Performance impact
All backup schemes have some performance impact on the system being backed up.
For example, for the period of time that a computer system is being backed up, the
hard drive is busy reading files for the purpose of backing up, and its full bandwidth is
no longer available for other tasks. Such impacts should be analyzed.
3. Costs of hardware, software, labor
All types of storage media have a finite capacity with a real cost. Matching the correct
amount of storage capacity (over time) with the backup needs is an important part of
the design of a backup scheme. Any backup scheme has some labor requirement, but
complicated schemes have considerably higher labor requirements. The cost of
commercial backup software can also be considerable.
4. Network bandwidth
Distributed backup systems can be affected by limited network bandwidth.
Implementation
Meeting the defined objectives in the face of the above limitations can be a difficult task. The
tools and concepts below can make that task more achievable.
1. Scheduling
Using a job scheduler can greatly improve the reliability and consistency of backups
by removing part of the human element. Many backup software packages include this
functionality.
2. Authentication
Over the course of regular operations, the user accounts and/or system agents that
perform the backups need to be authenticated at some level. The power to copy all
data off of or onto a system requires unrestricted access. Using an authentication
mechanism is a good way to prevent the backup scheme from being used for
unauthorized activity.
3. Chain of trust
Removable storage media are physical items and must only be handled by trusted
individuals. Establishing a chain of trusted individuals (and vendors) is critical to
defining the security of the data.
Measuring the process
To ensure that the backup scheme is working as expected, the process needs to include
monitoring key factors and maintaining historical data.
1. Backup validation
(also known as "backup success validation") The process by which owners of data can
get information about how their data was backed up. This same process is also used to
prove compliance to regulatory bodies outside of the organization, for example, an
insurance company might be required under HIPAA to show "proof" that their patient
data are meeting records retention requirements. [16] Disaster, data complexity, data
value and increasing dependence upon ever-growing volumes of data all contribute to
the anxiety around and dependence upon successful backups to ensure business
continuity. For that reason, many organizations rely on third-party or "independent"
solutions to test, validate, and optimize their backup operations (backup reporting).
2. Reporting
In larger configurations, reports are useful for monitoring media usage, device status,
errors, vault coordination and other information about the backup process.
3. Logging
In addition to the history of computer generated reports, activity and change logs are
useful for monitoring backup system events.
4. Validation
Many backup programs make use of checksums or hashes to validate that the data was
accurately copied. These offer several advantages. First, they allow data integrity to
be verified without reference to the original file: if the file as stored on the backup
medium has the same checksum as the saved value, then it is very probably correct.
Second, some backup programs can use checksums to avoid making redundant copies
of files, to improve backup speed. This is particularly useful for the de-duplication
process. A short checksum sketch follows this list.
5. Monitored backup
Backup processes are monitored by a third party monitoring center. This center alerts
users to any errors that occur during automated backups. Monitored backup requires
software capable of pinging the monitoring center's servers in the case of errors. Some
monitoring services also allow collection of historical metadata that can be used for
Storage Resource Management purposes like projection of data growth, locating
redundant primary storage capacity and reclaimable backup capacity.
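To illustrate the checksum-based validation described in item 4, here is a minimal Python sketch that recomputes the SHA-256 digest of the copy on the backup medium and compares it with the digest recorded at backup time; the file names are placeholders.

import hashlib

def sha256_of(path, block_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()

saved_digest = open("backup.tar.sha256").read().strip()   # recorded at backup time
if sha256_of("backup.tar") == saved_digest:
    print("backup copy verified")
else:
    print("checksum mismatch: copy is corrupt or incomplete")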
Confusion
Because of a considerable overlap in technology, backups and backup systems are frequently
confused with archives and fault-tolerant systems. Backups differ from archives in the sense
that archives are the primary copy of data, usually put away for future use, while backups are
a secondary copy of data, kept on hand to replace the original item. Backup systems differ
from fault-tolerant systems in the sense that backup systems assume that a fault will cause a
data loss event, while fault-tolerant systems are designed to ensure that a fault will not.
Advice
The more important the data that is stored on the computer, the greater is the need for
backing up this data.
A backup is only as useful as its associated restore strategy. For critical systems and
data, the restoration process must be tested.
Storing the copy near the original is unwise, since many disasters such as fire, flood,
theft, and electrical surges are likely to cause damage to the backup at the same time.
In these cases, both the original and the backup medium are likely to be lost.
Automated backup and scheduling should be considered, as manual backups can be
affected by human error.
Incremental backups should be considered to save the amount of storage space and to
avoid redundancy.
Backups can fail for a wide variety of reasons. Verification or monitoring strategy is
an important part of a successful backup plan.
Multiple backups on different media, stored in different locations, should be used for
all critical information.
Backed up archives should be stored in open and standard formats, especially when
the goal is long-term archiving. Recovery software and processes may have changed,
and software may not be available to restore data saved in proprietary formats.
System administrators and others working in the information technology field are
routinely fired for not devising and maintaining backup processes suitable to their
organization.
Remember that changes made after a backup completes are not reflected in that
backup. Either back up data that will not change, or refresh the backup whenever
the data has changed.
Disaster recovery
Disaster recovery (DR) is the process, policies and procedures that are related to preparing
for recovery or continuation of technology infrastructure which are vital to an organization
after a natural or human-induced disaster.
Disaster recovery is a subset of business continuity. While business continuity involves
planning for keeping all aspects of a business functioning in the midst of disruptive events,
disaster recovery focuses on the IT or technology systems that support business functions.
Classification of disasters
Disasters can be classified into two broad categories. The first is natural disasters such as
floods, hurricanes, tornadoes or earthquakes.
While preventing a natural disaster is very difficult, measures such as good planning which
includes mitigation measures can help reduce or avoid losses.
The second category is man-made disasters.
These include hazardous material spills, infrastructure failure, or bio-terrorism.
In these instances, surveillance and mitigation planning are invaluable for avoiding or
lessening losses from these events.
Importance of disaster recovery planning
Recent research supports the idea that implementing a more holistic pre-disaster planning
approach is more cost-effective in the long run.
Every $1 spent on hazard mitigation (such as a disaster recovery plan) saves society $4 in
response and recovery costs.
As IT systems have become increasingly critical to the smooth operation of a company, and
arguably the economy as a whole, the importance of ensuring the continued operation of
those systems, and their rapid recovery, has increased.
For example, of companies that had a major loss of business data, 43% never reopen and
29% close within two years.
As a result, preparation for continuation or recovery of systems needs to be taken very
seriously.
This involves a significant investment of time and money with the aim of ensuring minimal
losses in the event of a disruptive event.
Control measures
Control measures are steps or mechanisms that can reduce or eliminate various threats for
organizations. Different types of measures can be included in a disaster recovery plan (DRP).
Disaster recovery planning is a subset of a larger process known as business continuity
planning and includes planning for resumption of applications, data, hardware, electronic
communications (such as networking) and other IT infrastructure.
A business continuity plan (BCP) includes planning for non-IT related aspects such as key
personnel, facilities, crisis communication and reputation protection, and should refer to the
disaster recovery plan (DRP) for IT related infrastructure recovery / continuity.
IT disaster recovery control measures can be classified into the following three types:
1. Preventive measures - Controls aimed at preventing an event from occurring.
2. Detective measures - Controls aimed at detecting or discovering unwanted events.
3. Corrective measures - Controls aimed at correcting or restoring the system after a
disaster or an event.
Good disaster recovery plan measures dictate that these three types of controls be
documented and tested regularly.
Strategies
Prior to selecting a disaster recovery strategy, a disaster recovery planner first refers to their
organization's business continuity plan which should indicate the key metrics of recovery
point objective (RPO) and recovery time objective (RTO) for various business processes
(such as the process to run payroll, generate an order, etc.).
The metrics specified for the business processes are then mapped to the underlying IT
systems and infrastructure that support those processes.
Some of the most common strategies for data protection include:
backups made to tape and sent off-site at regular intervals
backups made to disk on-site and automatically copied to off-site disk, or made
directly to off-site disk
replication of data to an off-site location, which overcomes the need to restore the
data (only the systems then need to be restored or synchronized), often making use of
storage area network (SAN) technology
Hybrid Cloud solutions that replicate to both on-site 'appliances' and off-site data
centers. These solutions provide the ability to instantly fail-over to local on-site
hardware, but in the event of a physical disaster, servers can be brought up in the
cloud data centers as well. Two such examples are Quorum[13] and EverSafe.[14]
the use of high availability systems which keep both the data and system replicated
off-site, enabling continuous access to systems and data, even after a disaster (often
associated with cloud storage)
In many cases, an organization may elect to use an outsourced disaster recovery provider to
provide a stand-by site and systems rather than using their own remote facilities, increasingly
via cloud computing.
In addition to preparing for the need to recover systems, organizations also implement
precautionary measures with the objective of preventing a disaster in the first place. These
may include:
local mirrors of systems and/or data and use of disk protection technology such as
RAID
surge protectors to minimize the effect of power surges on delicate electronic
equipment
use of an uninterruptible power supply (UPS) and/or backup generator to keep
systems going in the event of a power failure
fire prevention/mitigation systems such as alarms and fire extinguishers
anti-virus software and other security measures