
Abstraction (computer science)

In computer science, abstraction is a mechanism and practice to reduce and factor out details so that one can focus on a
few concepts at a time.

The concept is by analogy with abstraction in mathematics. The mathematical technique of abstraction begins with
mathematical definitions; this has the fortunate effect of finessing some of the vexing philosophical issues of abstraction.
For example, numbers are concepts in programming languages just as they are in mathematics.
Implementation details depend on the hardware and software, but this is not a restriction because the
computing concept of number is still based on the mathematical concept.

Roughly speaking, abstraction can be either that of control or data. Control abstraction is the abstraction of actions while
data abstraction is that of data structures. For example, control abstraction in structured programming is the use of
subprograms and formatted control flows. Data abstraction allows handling pieces of data in meaningful
ways; for example, it is the basic motivation behind the concept of datatype. Object-oriented programming can be seen as an attempt to abstract
both data and code.

Contents

 1 Rationale
 2 Language features
 2.1 Programming languages
 2.2 Specification languages
 3 Control abstraction
 3.1 Structured programming
 4 Data abstraction
 5 Abstraction in object oriented programming
 5.1 Object-oriented design
 6 Considerations
 7 Levels of abstraction
 7.1 Database systems
 7.2 Layered architecture
 8 See also

 9 Further reading

Rationale
Computing is mostly independent of the concrete world: The hardware implements a model of
computation that is interchangeable with others. The software is structured in architectures to
enable humans to create enormous systems by concentrating on a few issues at a time.
These architectures are made of specific choices of abstractions. Greenspun's Tenth Rule is an
aphorism on how such an architecture is both inevitable and complex.
A central form of abstraction in computing is the language abstraction: new artificial languages
are developed to express specific aspects of a system. Modelling languages help in planning.
Computer languages can be processed with a computer. An example of this abstraction process
is the generational development of programming languages from the machine language to the
assembly language and the high-level language. Each stage can be used as a stepping stone for
the next stage. Language abstraction continues, for example, in scripting languages and
domain-specific programming languages.

Within a programming language, some features let the programmer create new abstractions.
These include the subroutine, the module, and the software component. Some other abstractions
such as software design patterns and architectural styles are not visible to a programming
language but only in the design of a system.

Some abstractions try to limit the breadth of concepts a programmer needs by completely hiding
the abstractions they in turn are built on. Joel Spolsky has criticised these efforts by claiming that
all abstractions are leaky — that they are never able to completely hide the details below. Some
abstractions are designed to interoperate with others, for example a programming language may
contain a foreign function interface for making calls to the lower-level language.

Language features
Programming languages

Different programming languages provide different types of abstraction, depending on the
applications for which the language is intended. For example:

 In object-oriented programming languages such as C++ or Java, the concept of abstraction is
expressed declaratively, using the keywords virtual or abstract respectively. After such a
declaration, it is the responsibility of the programmer to implement a class that realizes the
abstract declaration.
 In functional programming languages, it is common to find abstractions related to functions,
such as lambda abstractions (making a term into a function of some variable), higher-order
functions (functions whose parameters are themselves functions), and bracket abstraction
(making a term into a function of a variable).
 The Linda coordination language abstracts the concepts of server and shared data-space to
facilitate distributed programming.
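
For illustration (this sketch is not from the original article and the class names are invented), the
abstract keyword in Java declares an abstraction that concrete subclasses are responsible for
implementing:

abstract class Shape {
    // Subclasses must provide the concrete behaviour.
    abstract double area();
}

class Circle extends Shape {
    private final double radius;

    Circle(double radius) {
        this.radius = radius;
    }

    @Override
    double area() {
        return Math.PI * radius * radius;
    }
}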

Specification languages

Specification languages generally rely on abstractions of one kind or another, since specifications
are typically defined earlier in a project, and at a more abstract level, than an eventual
implementation. The UML specification language, for example, allows the definition of abstract
classes, which are simply left abstract during the architecture and specification phase of the
project.

Control abstraction
Control abstraction is one of the main purposes of using programming languages. Computer
machines understand operations at the very low level such as moving some bits from one
location of the memory to another location and producing the sum of two sequences of bits.
Programming languages allow this to be done at a higher level. For example, consider the high-
level expression/program statement:

a := (1 + 2) * 5

To a human, this is a fairly simple and obvious calculation ("one plus two is three, times five is
fifteen"). However, the low-level steps necessary to carry out this evaluation, and return the value
"15", and then assign that value to the variable "a", are actually quite subtle and complex. The
values need to be converted to binary representation (often a much more complicated task than
one would think) and the calculations decomposed (by the compiler or interpreter) into assembly
instructions (again, which are much less intuitive to the programmer: operations such as shifting a
binary register left, or adding the binary complement of the contents of one register to another,
are simply not how humans think about the abstract arithmetical operations of addition or
multiplication). Finally, assigning the resulting value of "15" to the variable labeled "a", so that "a"
can be used later, involves additional 'behind-the-scenes' steps of looking up a variable's label
and the resultant location in physical or virtual memory, storing the binary representation of "15"
to that memory location, and so on.
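
As a hedged sketch (the instruction names below are illustrative, and a real compiler may simply
constant-fold the expression), the single Java statement hides a sequence of stack-machine steps
comparable to those described above:

int a = (1 + 2) * 5;    // one high-level statement

// Conceptually, a stack-based translation performs steps such as:
//   push 1       load the constant 1
//   push 2       load the constant 2
//   add          pop two values, push their sum (3)
//   push 5       load the constant 5
//   mul          pop two values, push their product (15)
//   store a      pop the result into the memory slot reserved for "a"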

Without control abstraction, a programmer would need to specify all the register/binary-level steps
each time she simply wanted to add or multiply a couple of numbers and assign the result to a
variable. This duplication of effort has two serious negative consequences: (a) it forces the
programmer to constantly repeat fairly common tasks every time a similar operation is needed;
and (b) it forces the programmer to program for the particular hardware and instruction set.

Structured programming

Structured programming involves the splitting of complex program tasks into smaller pieces with
clear flow control and interfaces between components, reducing complexity and the potential for
side-effects.

In a simple program, this may be trying to ensure that loops have single or obvious exit points
and trying, where it's most clear to do so, to have single exit points from functions and
procedures.

In a larger system, it may involve breaking down complex tasks into many different modules.
Consider a system handling payroll on ships and at shore offices:

 The uppermost level may be a menu of typical end user operations.


 Within that could be standalone executables or libraries for tasks such as signing on and
off employees or printing checks.
 Within each of those standalone components there could be many different source files,
each containing the program code to handle a part of the problem, with only selected
interfaces available to other parts of the program. A sign on program could have source
files for each data entry screen and the database interface (which may itself be a
standalone third party library or a statically linked set of library routines).
 Either the database or the payroll application also has to initiate the process of exchanging
data between ship and shore, and that data transfer task will often contain many other
components.

These layers produce the effect of isolating the implementation details of one component and its
assorted internal methods from the others. This concept was embraced and extended in object-
oriented programming.

Data abstraction
Data abstraction is the enforcement of a clear separation between the abstract properties of a
data type and the concrete details of its implementation. The abstract properties are those that
are visible to client code that makes use of the data type--the interface to the data type--while the
concrete implementation is kept entirely private, and indeed can change, for example to
incorporate efficiency improvements over time. The idea is that such changes are not supposed
to have any impact on client code, since they involve no difference in the abstract behaviour.

For example, one could define an abstract data type called lookup table, where keys are uniquely
associated with values, and values may be retrieved by specifying their corresponding keys. Such
a lookup table may be implemented in various ways: as a hash table, a binary search tree, or
even a simple linear list. As far as client code is concerned, the abstract properties of the type are
the same in each case.
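
As an illustrative sketch (the interface and class names below are assumptions, not part of the
original text), such a lookup table abstraction in Java could expose only the abstract operations,
while the concrete representation stays interchangeable behind the interface:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// The abstract properties visible to client code: keys map uniquely to values.
interface LookupTable<K, V> {
    void put(K key, V value);
    V get(K key);
}

// One possible implementation, backed by a hash table.
class HashLookupTable<K, V> implements LookupTable<K, V> {
    private final Map<K, V> data = new HashMap<>();
    public void put(K key, V value) { data.put(key, value); }
    public V get(K key) { return data.get(key); }
}

// Another implementation, backed by a balanced search tree; client code is unaffected by the swap.
class TreeLookupTable<K extends Comparable<K>, V> implements LookupTable<K, V> {
    private final Map<K, V> data = new TreeMap<>();
    public void put(K key, V value) { data.put(key, value); }
    public V get(K key) { return data.get(key); }
}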

Of course, this all relies on getting the details of the interface right in the first place, since any
changes there can have major impacts on client code. Another way to look at this is that the
interface forms a contract on agreed behaviour between the data type and client code; anything
not spelled out in the contract is subject to change without notice.

Languages that implement data abstraction include Ada and Modula-2. Object-oriented
languages are commonly claimed to offer data abstraction; however, their inheritance concept
tends to put information in the interface that more properly belongs in the implementation; thus,
changes to such information end up impacting client code, leading directly to the fragile base
class problem.

Abstraction in object oriented programming


In object-oriented programming theory, abstraction is the facility to define objects that represent
abstract "actors" that can perform work, report on and change their state, and "communicate" with
other objects in the system. The term encapsulation refers to the hiding of state details, but
extending the concept of data type from earlier programming languages to associate behavior
most strongly with the data, and standardizing the way that different data types interact, is the
beginning of abstraction. When abstraction proceeds into the operations defined, enabling
objects of different types to be substituted, it is called polymorphism. When it proceeds in the
opposite direction, inside the types or classes, structuring them to simplify a complex set of
relationships, it is called delegation or inheritance.

Various object-oriented programming languages offer similar facilities for abstraction, all to
support a general strategy of polymorphism in object-oriented programming, which includes the
substitution of one type for another in the same or similar role. Although it is not as generally
supported, a configuration or image or package may predetermine a great many of these
bindings at compile-time, link-time, or loadtime. This would leave only a minimum of such
bindings to change at run-time.

In CLOS or Self, for example, there is less of a class-instance distinction, more use of delegation
for polymorphism, and individual objects and functions are abstracted more flexibly to better fit
with a shared functional heritage from Lisp.

Another extreme is C++, which relies heavily on templates and overloading and other static
bindings at compile-time, which in turn has certain flexibility problems.

Although these are alternate strategies for achieving the same abstraction, they do not
fundamentally alter the need to support abstract nouns in code - all programming relies on an
ability to abstract verbs as functions, nouns as data structures, and either as processes.

For example, here is a sample Java fragment to represent some common farm "animals" to a
level of abstraction suitable to model simple aspects of their hunger and feeding. It defines an
Animal class to represent both the state of the animal and its functions:

class Animal extends LivingThing {

    Location loc;
    double energyReserves;

    boolean isHungry() {
        // The animal is hungry once its energy reserves run low.
        return energyReserves < 2.5;
    }

    void eat(Food f) {
        // Consume food
        energyReserves += f.getCalories();
    }

    void moveTo(Location l) {
        // Move to new location
        loc = l;
    }
}

With the above definition, one could create objects of type Animal and call their methods like this:

Animal thePig = new Animal();
Animal theCow = new Animal();
if (thePig.isHungry()) { thePig.eat(tableScraps); }
if (theCow.isHungry()) { theCow.eat(grass); }
theCow.moveTo(theBarn);

In the above example, the class Animal is an abstraction used in place of an actual animal, and
LivingThing is a further abstraction (in this case a generalisation) of Animal.

If a finer-grained hierarchy of animals is required to differentiate, say, those who provide
milk from those who provide nothing except meat at the end of their lives, an intermediary
level of abstraction is needed: probably a DairyAnimal class (cows, goats) which would eat foods
suitable for giving good milk, and Animal (pigs, steers) which would eat foods to give the best
meat quality.
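
A minimal sketch of such an intermediate abstraction, assuming the Animal class defined above
(the field and method below are invented for illustration):

class DairyAnimal extends Animal {
    // Dairy animals additionally track how much milk they can currently give.
    double milkReserves;

    boolean canBeMilked() {
        return milkReserves > 0.0 && !isHungry();
    }
}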

Such an abstraction could remove the need for the application coder to specify the type of food,
so s/he could concentrate instead on the feeding schedule. The two classes could be related
using inheritance or stand alone, and varying degrees of polymorphism between the two types
could be defined. These facilities tend to vary drastically between languages, but in general each
can achieve anything that is possible with any of the others. A great many operation overloads,
data type by data type, can have the same effect at compile-time as any degree of inheritance or
other means to achieve polymorphism. The class notation is simply a coder's convenience.

Object-oriented design

Decisions regarding what to abstract and what to keep under the control of the coder are the
major concern of object-oriented design and domain analysis—actually determining the relevant
relationships in the real world is the concern of object-oriented analysis or legacy analysis.

In general, to determine appropriate abstraction, one must make many small decisions about
scope (domain analysis), determine what other systems one must cooperate with (legacy analysis),
and then perform a detailed object-oriented analysis which is expressed within project time and
budget constraints as an object-oriented design. In our simple example, the domain is the
barnyard, the live pigs and cows and their eating habits are the legacy constraints, the detailed
analysis is that coders must have the flexibility to feed the animals what is available and thus
there is no reason to code the type of food into the class itself, and the design is a single simple
Animal class of which pigs and cows are instances with the same functions. A decision to
differentiate DairyAnimal would change the detailed analysis but the domain and legacy analysis
would be unchanged—thus it is entirely under the control of the programmer, and we refer to
abstraction in object-oriented programming as distinct from abstraction in domain or legacy
analysis.

Considerations

When discussing formal semantics of programming languages, formal methods or abstract
interpretation, abstraction refers to the act of considering a less precise, but safe, definition of
the observed program behaviors. For instance, one may observe only the final result of program
executions instead of considering all the intermediate steps of executions. Abstraction is defined
relative to a concrete (more precise) model of execution.

Abstraction may be exact or faithful with respect to a property if it is possible to answer a
question about the property equally well on the concrete or abstract model. For instance, if we
wish to know what the result of the evaluation of a mathematical expression involving only
integers and the operations +, -, × is worth modulo n, it is sufficient to perform all operations
modulo n (a familiar form of this abstraction is casting out nines).
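
A small, hedged Java illustration of this exact abstraction (the numbers are arbitrary): evaluating
the expression exactly and then reducing modulo n gives the same answer as working modulo n
throughout.

int n = 9;

// Evaluate exactly, then reduce modulo n.
int exact = ((17 + 25) * 13) % n;

// Perform every operation modulo n instead.
int abstracted = (((17 % n + 25 % n) % n) * (13 % n)) % n;

assert exact == abstracted;   // both are 6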

Abstractions, however, are not necessarily exact, but one requires that they should be sound.
That is, it should be possible to get sound answers from them—even though the abstraction may
simply yield a result of undecidability. For instance, we may abstract the students in a class by
their minimal and maximal ages; if one asks whether a certain person belongs to that class, one
may simply compare that person's age with the minimal and maximal ages; if his age lies outside
the range, one may safely answer that the person does not belong to the class; if it lies within the
range, one may only answer "I don't know".
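
A minimal Java sketch of this sound but inexact abstraction (the class and method names are
invented for the example):

// Abstract a group of students by the minimal and maximal age of its members.
class AgeRange {
    final int minAge;
    final int maxAge;

    AgeRange(int minAge, int maxAge) {
        this.minAge = minAge;
        this.maxAge = maxAge;
    }

    // A "no" is definite and therefore sound; otherwise only "I don't know" is safe.
    String mayContainPersonOfAge(int age) {
        if (age < minAge || age > maxAge) {
            return "no";
        }
        return "I don't know";
    }
}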

Abstractions are useful when dealing with computer programs, because non-trivial properties of
computer programs are essentially undecidable (see Rice's theorem). As a consequence,
automatic methods for deriving information on the behavior of computer programs either have to
drop termination (on some occasions, they may fail, crash or never yield a result), soundness
(they may provide false information), or precision (they may answer "I don't know" to some
questions).

Abstraction is the core concept of abstract interpretation. Model checking is generally performed
on abstract versions of the studied systems.

Levels of abstraction

A common concept in computer science is levels (or, less commonly, layers) of abstraction,
wherein each level represents a different model of the same information and processes, but uses
a system of expression involving a unique set of objects and compositions that are applicable
only to a particular domain. Each relatively abstract, "higher" level builds on a relatively concrete,
"lower" level, which tends to provide an increasingly "granular" representation. For example,
gates build on electronic circuits, binary on gates, machine language on binary, programming
language on machine language, applications and operating systems on programming languages.
Each level is embodied, but not determined, by the level beneath it, making it a language of
description that is somewhat self-contained.

Database systems

Since many users of database systems are not deeply familiar with computer data structures,
database developers often hide complexity through the following levels:

Data abstraction levels of a database system


Physical level: The lowest level of abstraction describes how the data is actually stored. The
physical level describes complex low-level data structures in detail.

Logical level: The next higher level of abstraction describes what data are stored in the
database, and what relationships exist among those data. The logical level thus describes an
entire database in terms of a small number of relatively simple structures. Although
implementation of the simple structures at the logical level may involve complex physical level
structures, the user of the logical level does not need to be aware of this complexity. Database
administrators, who must decide what information to keep in a database, use the logical level of
abstraction.

View level: The highest level of abstraction describes only part of the entire database. Even
though the logical level uses simpler structures, complexity remains because of the variety of
information stored in a large database. Many users of a database system do not need all this
information; instead, they need to access only a part of the database. The view level of
abstraction exists to simplify their interaction with the system. The system may provide many
views for the same database.
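
As a hedged illustration of the view level (the table, columns, and the embedded H2 database used
here are assumptions of the example, not part of the original text), a view exposes only part of the
logical schema to a given class of users:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

class ViewLevelExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:payroll");
             Statement stmt = conn.createStatement()) {
            // Logical level: the full employee relation, including sensitive columns.
            stmt.execute("CREATE TABLE employee (id INT PRIMARY KEY, "
                    + "name VARCHAR(100), salary DECIMAL(10,2))");
            // View level: a restricted window that hides the salary column.
            stmt.execute("CREATE VIEW employee_directory AS SELECT id, name FROM employee");
        }
    }
}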

Layered architecture

The ability to provide a design of different levels of abstraction can

 simplify the design considerably, and


 enable different role players to effectively work at various levels of abstraction.

This can be used in both system and business process design. Some design processes
specifically generate designs that contain various levels of abstraction.

See also

 Inheritance semantics
 Algorithm for an abstract description of a computational procedure
 Abstract data type for an abstract description of a set of data
 Lambda abstraction for making a term into a function of some variable
 Higher-order function for abstraction of functions as parameters
 Bracket abstraction for making a term into a function of a variable
 Data modeling for structuring data independent of the processes that use it
 Refinement for the opposite of abstraction in computing
 Encapsulation for the categorical dual (other side) of abstraction
 Substitution for the categorical left adjoint (inverse) of abstraction
 Abstraction inversion for an anti-pattern of one danger in abstraction
 Greenspun's Tenth Rule for an aphorism about abstracting too much yourself

Data modeling
In computer science, data modeling is the process of creating a data model by applying a data
model theory to create a data model instance. A data model theory is a formal data model
description. See database model for a list of current data model theories.

When data modelling, we are structuring and organizing data. These data structures are then
typically implemented in a database management system. In addition to defining and organizing
the data, data modeling will impose (implicitly or explicitly) constraints or limitations on the data
placed within the structure.

Managing large quantities of structured and unstructured data is a primary function of information
systems. Data models describe structured data for storage in data management systems such as
relational databases. They typically do not describe unstructured data, such as word processing
documents, email messages, pictures, digital audio, and video. Early phases of many software
development projects emphasize the design of a conceptual data model. Such a design can be
detailed into a logical data model. In later stages, this model may be translated into a physical
data model.

Contents

 1 Data model
 1.1 Data structure
 2 Generic Data Modeling
 2.1 Data organization
 3 Techniques
 4 See also

 5 External links

Data model

A data model instance may be described in two ways:

 a logical description of the data model instance - concentrating on the generic features of
the model, independent of any particular implementation.
 a physical description of the data model instance - concentrating on the implementation
features of the particular database hosting the model.

Data structure

A data model describes the structure of the data within a given domain and, by implication, the
underlying structure of that domain itself. This means that a data model in fact specifies a
dedicated 'grammar' for a dedicated artificial language for that domain.

A data model represents classes of entities (kinds of things) about which a company wishes to
hold information, the attributes of that information, and relationships among those entities and
(often implicit) relationships among those attributes. The model describes the organization of the
data to some extent irrespective of how data might be represented in a computer system.
The entities represented by a data model can be tangible entities, but models that include
such concrete entity classes tend to change over time. Robust data models often identify
abstractions of such entities. For example, a data model might include an entity class called
"Person", representing all the people who interact with an organization. Such an abstract entity
class is typically more appropriate than ones called "Vendor" or "Employee", which identify
specific roles played by those people.

A proper conceptual data model describes the semantics of a subject area. It is a collection of
assertions about the nature of the information that is used by one or more organizations. Proper
entity classes are named with natural language words instead of technical jargon. Likewise,
properly named relationships form concrete assertions about the subject area.

There are several versions of this. For example, a relationship called "is composed of" that is
defined to operate on the entity classes ORDER and LINE ITEM forms the following concrete
assertion: "Each ORDER is composed of one or more LINE ITEMS." A more rigorous
approach is to force all relationship names to be prepositions, gerunds, or participles, with verbs
being simply "must be" or "may be". This way, both cardinality and optionality can be handled
semantically. This would mean that the relationship just cited would read in one direction, "Each
ORDER may be composed of one or more LINE ITEMS" and in the other "Each LINE ITEM must
be part of one and only one ORDER."

Note that this illustrates that often generic terms, such as 'is composed of', are defined to be
limited in their use for a relationship between specific kinds of things, such as an order and an
order line. This constraint is eliminated in the generic data modeling methodologies.

Generic Data Modeling

Different modelers may well produce different models of the same domain. This can lead to
difficulty in bringing the models of different people together. Invariably, however, this difference is
attributable to different levels of abstraction in the models. If the modelers agree on certain
elements which are to be rendered more concretely, then the differences become less significant.

There are generic patterns that can be used to advantage for modeling business. These include
the concepts PARTY (with included PERSON and ORGANIZATION), PRODUCT TYPE,
PRODUCT INSTANCE, ACTIVITY TYPE, ACTIVITY INSTANCE, CONTRACT, GEOGRAPHIC
AREA, and SITE. A model which explicitly includes versions of these entity classes will be both
reasonably robust and reasonably easy to understand.

More abstract models are suitable for general purpose tools, and consist of variations on THING
and THING TYPE, with all actual data being instances of these. Such abstract models are
significantly more difficult to manage, since they are not very expressive of real world things.
More concrete and specific data models will risk having to change as the environment changes.

One approach to generic data modeling has the following characteristics:


 A generic data model shall consist of generic entity types, such as 'individual thing',
'class', 'relationship', and possibly a number of their subtypes.
 Every individual thing is an instance of a generic entity called 'individual thing' or one of
its subtypes.
 Every individual thing is explicitly classified by a kind of thing ('class') using an explicit
classification relationship.
 The classes used for that classification are separately defined as standard instances of
the entity 'class' or one of its subtypes, such as 'class of relationship'. These standard
classes are usually called 'reference data'. This means that domain specific knowledge is
captured in those standard instances and not as entity types. For example, concepts
such as car, wheel, building, ship, and also temperature, length, etc. are standard
instances. But also standard types of relationship, such as 'is composed of' and 'is
involved in' can be defined as standard instances.

This way of modeling allows standard classes and standard relation types to be added as data
(instances), which makes the data model flexible and prevents data model changes when the
scope of the application changes.

A generic data model obeys the following rules:

1. Candidate attributes are treated as representing relationships to other entity types.
2. Entity types are represented, and are named after, the underlying nature of a thing, not
the role it plays in a particular context. Entity types are chosen.
3. Entities have a local identifier within a database or exchange file. These should be
artificial and managed to be unique. Relationships are not used as part of the local
identifier.
4. Activities, relationships and event-effects are represented by entity types (not attributes).
5. Entity types are part of a sub-type/super-type hierarchy of entity types, in order to define
a universal context for the model. As types of relationships are also entity types they are
also arranged in a sub-type/super-type hierarchy of types of relationship.
6. Types of relationships are defined on a high (generic) level, being the highest level where
the type of relationship is still valid. For example, a composition relationship (indicated by
the phrase: 'is composed of') is defined as a relationship between an 'individual thing' and
another 'individual thing' (and not just between e.g. an order and an order line). This
generic level means that the type of relation may in principle be applied between any
individual thing and any other individual thing. Additional constraints are defined in the
'reference data', being standard instances of relationships between kinds of things.

Examples of generic data models are ISO 10303-221, ISO 15926 and Gellish.
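
A hedged Java sketch of the structure described above (all type and field names are invented for
illustration): domain knowledge lives in a few generic entity types plus reference-data instances,
rather than in dedicated entity types.

// Generic entity type: every individual thing in the domain is an instance of this (or a subtype).
class IndividualThing {
    final String identifier;   // artificial, locally unique identifier (rule 3)

    IndividualThing(String identifier) {
        this.identifier = identifier;
    }
}

// A 'class' (kind of thing) is itself held as data; these instances form the reference data.
class Kind extends IndividualThing {
    Kind(String identifier) {
        super(identifier);
    }
}

// Relationships are represented as entities as well (rule 4), typed by a kind of relationship.
class Relationship extends IndividualThing {
    final IndividualThing left;
    final Kind relationType;   // e.g. "is classified as" or "is composed of"
    final IndividualThing right;

    Relationship(String identifier, IndividualThing left, Kind relationType, IndividualThing right) {
        super(identifier);
        this.left = left;
        this.relationType = relationType;
        this.right = right;
    }
}

Classifying a particular car, for instance, becomes a Relationship between an IndividualThing
("car 123") and the reference-data Kind ("car") via the relation type "is classified as", instead of
introducing a dedicated Car entity type; adding new kinds then requires new data, not a schema change.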

Data organization

Another kind of data model describes how to organize data using a database management
system or other data management technology. It describes, for example, relational tables and
columns or object-oriented classes and attributes. Such a data model is sometimes referred to as
the physical data model, but in the original ANSI three schema architecture, it is called "logical".
In that architecture, the physical model describes the storage media (cylinders, tracks, and
tablespaces). Ideally, this model is derived from the more conceptual data model described
above. It may differ, however, to account for constraints like processing capacity and usage
patterns.
While data analysis is a common term for data modeling, the activity actually has more in
common with the ideas and methods of synthesis (inferring general concepts from particular
instances) than it does with analysis (identifying component concepts from more general ones).
{Presumably we call ourselves systems analysts because no one can say systems synthesists.}
Data modeling strives to bring the data structures of interest together into a cohesive,
inseparable, whole by eliminating unnecessary data redundancies and by relating data structures
with relationships.

A different approach is through the use of adaptive systems such as artificial neural networks that
can autonomously create implicit models of data.

Techniques

Several techniques have been developed for the design of data models. While these
methodologies guide data modelers in their work, two different people using the same
methodology will often come up with very different results. Most notable are:

 Entity-relationship model
 IDEF
 Object Role Modeling (ORM) or Nijssen's Information Analysis Method (NIAM)
 Business rules or business rules approach
 RM/T
 Bachman diagrams
 Object-relational mapping
 Barker's Notation
 EBNF Grammars

See also

 Abstraction (computer science)

External links

 Article Database Modelling in UML from Methods & Tools


 Data Modelling Dictionary
 Data modeling articles
 Notes on System Development, Methodologies and Modeling by Tony Drewry

Entity-relationship model
Databases are used to store structured data. The structure of this data, together with other
constraints, can be designed using a variety of techniques, one of which is called entity-
relationship modeling or ERM. The end-product of the ERM process is an entity-relationship
diagram or ERD. Data modeling requires a graphical notation for representing such data models.
An ERD is a type of conceptual data model or semantic data model.
The first stage of information system design uses these models to describe information needs or
the type of information that is to be stored in a database during the requirements analysis. The
data modeling technique can be used to describe any ontology (i.e. an overview and
classifications of used terms and their relationships) for a certain universe of discourse (i.e. area
of interest). In the case of the design of an information system that is based on a database, the
conceptual data model is, at a later stage (usually called logical design), mapped to a logical data
model, such as the relational model; this in turn is mapped to a physical model during physical
design. Note that sometimes, both of these phases are referred to as "physical design".

There are a number of conventions for entity-relationship diagrams (ERDs). The classical
notation is described in the remainder of this article, and mainly relates to conceptual modelling.
There are a range of notations more typically employed in logical and physical database design,
including information engineering, IDEF1x (ICAM DEFinition Language) and dimensional
modelling.

Contents

 1 Common symbols
 2 Less common symbols
 3 Alternative diagramming conventions
 3.1 Crow's Feet
 4 Classification
 5 See also
 6 ER diagramming tools
 7 References

 8 External links

Common symbols

(Figures from the original article are omitted here; their captions were: two related entities; an
entity with an attribute; a relationship with an attribute; primary key; a sample ER diagram.)

An entity represents a discrete object. Entities can be thought of as nouns. Examples: a
computer, an employee, a song, a mathematical theorem. A relationship captures how two or
more entities are related to one another. Relationships can be thought of as verbs. Examples: an
owns relationship between a company and a computer, a supervises relationship between an
employee and a department, a performs relationship between an artist and a song, a proved
relationship between a mathematician and a theorem. Entities are drawn as rectangles,
relationships as diamonds.

Entities and relationships can both have attributes. Examples: an employee entity might have a
social security number attribute (in the US); the proved relationship may have a date attribute.
Attributes are drawn as ovals connected to their owning entity sets by a line.
Every entity (unless it is a weak entity) must have a minimal set of uniquely identifying attributes,
which is called the entity's primary key.

Entity-relationship diagrams don't show single entities or single instances of relations. Rather,
they show entity sets and relationship sets (displayed as rectangles and diamonds respectively).
Example: a particular song is an entity. The collection of all songs in a database is an entity set.
The proved relationship between Andrew Wiles and Fermat's last theorem is a single relationship.
The set of all such mathematician-theorem relationships in a database is a relationship set.

Lines are drawn between entity sets and the relationship sets they are involved in. If all entities in
an entity set must participate in the relationship set, a thick or double line is drawn. This is called
a participation constraint. If each entity of the entity set can participate in at most one relationship
in the relationship set, an arrow is drawn from the entity set to the relationship set. This is called a
key constraint. To indicate that each entity in the entity set is involved in exactly one relationship,
a thick arrow is drawn.

An associative entity is used to resolve a many-to-many relationship between two entities [1].

Unary Relationships - a unary relationship is a relationship between the rows of a single table.

Less common symbols

A weak entity is an entity that can't be uniquely identified by its own attributes alone, and
therefore must use as its primary key both its own attributes and the primary key of an entity it is
related to. A weak entity set is indicated by a bold rectangle (the entity) connected by a bold
arrow to a bold diamond (the relationship). Double lines can be used instead of bold ones.

Attributes in an ER model may be further described as multi-valued, composite, or derived. A
multi-valued attribute, illustrated with a double-line ellipse, may have more than one value for at
least one instance of its entity. For example, a piece of software (entity=application) may have the
multivalued attribute "platform" because at least one instance of that entity runs on more than one
operating system. A composite attribute may itself contain two or more attributes and is indicated
as having at least two contributing attributes of its own. For example, addresses usually are
composite attributes, composed of attributes such as street address, city, and so forth. Derived
attributes are attributes whose value is entirely dependent on another attribute and are indicated
by dashed ellipses. For example, if we have an employee database with an employee entity
along with an age attribute, the age attribute would be derived from a birth date attribute.

Sometimes two entities are more specific subtypes of a more general type of entity. For example,
programmers and marketers might both be types of employees at a software company. To
indicate this, a triangle with "ISA" on the inside is drawn. The superclass is connected to the point
on top and the two (or more) subclasses are connected to the base.
A relation and all its participating entity sets can be treated as a single entity set for the purpose
of taking part in another relation through aggregation, indicated by drawing a dotted rectangle
around all aggregated entities and relationships.

Alternative diagramming conventions

Crow's Feet

Two related entities shown using Crow's Feet notation

The "Crow's Feet" notation is named for the symbol used to denote the many sides of a
relationship, which resembles the forward digits of a bird's claw. You can see this claw shape in
the diagram to the right, representing the same relationship depicted in the Common symbols
section above.

In the diagram, the following facts are detailed:

 An Artist can perform many Songs, identified by the crow's foot.


 An Artist must perform at least one Song, shown by the perpendicular line.
 A Song may or may not be performed by any Artist, as indicated by the open circle.

This notation is gaining acceptance through common usage in Oracle texts, and in tools such as
Visio and PowerDesigner, with the following benefits:

 Clarity in identifying the many, or child, side of the relationship, using the crow's foot.
 Concise notation for identifying mandatory relationship, using a perpendicular bar, or an
optional relationship, using an open circle.

Classification

Entity relationship models can be classified into BERMs (Binary Entity Relationship Models) and
GERMs (General Entity Relationship Models) according to whether only binary relationships are allowed. A
binary relationship is a relationship between two entities. Thus, in a GERM, relationships between
three or more entities are also allowed.

See also

 Data model
 Data structure diagram
 Object Role Modeling (ORM)
 Unified Modeling Language (UML)

ER diagramming tools
 AllFusion ERwin Data Modeler - an ERD tool that can generate HTML reports
 ConceptDraw - cross-platform software for creating ER diagrams
 DB Visual ARCHITECT - supports UML class diagrams and ERDs
 Dia - a free software program to draw ER diagrams
 Ferret (software) - a free software ER drawing tool
(http://www.gnu.org/software/ferret/project/what.html)
 ER/Studio - a robust, easy-to-use ER modeling tool from Embarcadero
 Kivio - a free software flowcharting program that supports ER Diagrams
 Microsoft Visio - diagramming software, some versions can auto-generate an ERD from a
database
 PowerDesigner - modeling suite from Sybase which includes Data Architect for
constructing or reverse engineering conceptual, logical and physical models with many of
the leading RDBMS brands.
 SILVERRUN ModelSphere - supporting conceptual, logical and physical data modeling
including interfaces for multiple target systems.
 SmartDraw - point and click drawing method combined with many templates creates
professional diagrams.

References

 Chen, Peter P. (1976). "The Entity-Relationship Model - Toward a Unified View of Data".
ACM Transactions on Database Systems 1 (1): 9-36.

This paper is one of the most cited papers in the computer field. It was selected as one of the
most influential papers in computer science in a recent survey of over 1,000 computer science
professors. The citation is listed, for example, in DBLP: http://dblp.uni-trier.de/ [2]

External links

 Peter Chen home page at Louisiana State University


http://bit.csc.lsu.edu/~chen/chen.html
 Origins of ER model pioneering
 more deepened analysis of Chinese language
 The Entity-Relationship Model--Toward a Unified View of Data
 Case study: E-R diagram for Acme Fashion Supplies by Mark H. Ridley
 IDEF1X
 Notes: Logical Data Structures (LDSs) - Getting started by Tony Drewry
 Introduction to Data Modeling

Data Definition Language


A Data Definition Language (DDL) is a computer language for defining data. XML Schema is an
example of a pure DDL (although only relevant in the context of XML). A subset of SQL's
instructions form another DDL.

These SQL statements define the structure of a database, including rows, columns, tables,
indexes, and database specifics such as file locations. DDL SQL statements are more part of the
DBMS and have large differences between the SQL variations. DDL SQL commands include the
following:
• Create - To make a new database, table, index, or stored query.
• Drop - To destroy an existing database, table, index, or view.
• Alter - To modify an existing database object.
• Truncate - To irreversibly clear a table.

DBCC (Database Console Commands) statements check the physical and logical consistency
of a database.
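
As a hedged example (the table, columns, and the embedded H2 database are invented for the
sketch), these are DDL statements issued through JDBC; they define structure rather than
manipulate rows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

class DdlExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement stmt = conn.createStatement()) {
            // Create: make a new table and an index on it.
            stmt.execute("CREATE TABLE orders (id INT PRIMARY KEY, customer VARCHAR(100))");
            stmt.execute("CREATE INDEX idx_orders_customer ON orders(customer)");
            // Alter: modify the existing object.
            stmt.execute("ALTER TABLE orders ADD COLUMN total DECIMAL(10,2)");
            // Truncate: irreversibly clear the table's rows.
            stmt.execute("TRUNCATE TABLE orders");
            // Drop: destroy the table.
            stmt.execute("DROP TABLE orders");
        }
    }
}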

Data Manipulation Language


Data Manipulation Language (DML) is a family of computer languages used by computer
programs or database users to retrieve, insert, delete and update data in a database.

Currently, the most popular data manipulation language is that of SQL, which is used to retrieve
and manipulate data in a Relational database. Other forms of DML are those used by IMS/DL1,
CODASYL databases (such as IDMS), and others.

Data manipulation languages were initially only used by computer programs, but (with the advent
of SQL) have come to be used by people, as well.

Data manipulation languages have their functional capability organized by the initial word in a
statement, which is almost always a verb. In the case of SQL, these verbs are "select", "insert",
"update", and "delete". This makes the nature of the language into a set of imperative statements
(commands) to the database.

Data manipulation languages tend to have many different "flavors" and capabilities between
database vendors. There has been a standard established for SQL by ANSI, but vendors still
"exceed" the standard and provide their own extensions. Data manipulation language is basically
of two types: 1) Procedural DMLs 2) Declarative DMLs
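
A hedged sketch of the four SQL DML verbs issued through JDBC (the table, its columns, and the
embedded H2 database are assumptions of the example):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

class DmlExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:demo");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE employee (id INT PRIMARY KEY, name VARCHAR(100))");

            // Insert: add rows.
            stmt.executeUpdate("INSERT INTO employee VALUES (1, 'Ada'), (2, 'Grace')");
            // Update: change existing rows.
            stmt.executeUpdate("UPDATE employee SET name = 'Ada Lovelace' WHERE id = 1");
            // Delete: remove rows.
            stmt.executeUpdate("DELETE FROM employee WHERE id = 2");

            // Select: retrieve rows.
            try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM employee")) {
                while (rs.next()) {
                    System.out.println(rs.getInt("id") + " " + rs.getString("name"));
                }
            }
        }
    }
}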

Database administrator
A database administrator (DBA) is a person who is responsible for the environmental aspects
of a database. In general, these include:

 Recoverability - Creating and testing Backups


 Integrity - Verifying or helping to verify data integrity
 Security - Defining and/or implementing access controls to the data
 Availability - Ensuring maximum uptime
 Performance - Ensuring maximum performance given budgetary constraints
 Development and testing support - Helping programmers and engineers to efficiently
utilize the database.
The role of a database administrator has changed according to the technology of database
management systems (DBMSs) as well as the needs of the owners of the databases.

Contents

 1 Duties
 2 Definition of Database
 3 Recoverability
 4 Integrity
 5 Security
 6 Availability
 7 Performance
 8 Development/Testing Support

 9 See also

Duties

The duties of a database administrator vary and depend on the job description, corporate and
Information Technology (IT) policies and the technical features and capabilities of the DBMS
being administered. They nearly always include disaster recovery (backups and testing of
backups), performance analysis and tuning, and some database design.

Definition of Database

A database is a collection of related information, accessed and managed by its DBMS. After
experimenting with hierarchical and networked DBMSs during the 1970s, the IT industry became
dominated by relational DBMSs (Or Object-Relational Database Management System) such as
Oracle, Sybase, and, later on, Microsoft SQL Server and the like.

In a strictly technical sense, for any database to be defined as a "Truly Relational Model
Database Management System," it should, ideally, adhere to the twelve rules defined by Edgar F.
Codd, pioneer in the field of relational databases. To date, while many come close, it is admitted
that nothing on the market adheres 100% to those rules, any more than they are 100% ANSI-
SQL compliant.

While IBM and Oracle technically were the earliest on the RDBMS scene, many others have
followed, and while it is unlikely that miniSQL still exists in its original form, Monty's MySQL is
still extant and thriving, along with the Ingres-descended PostgreSQL. Alpha Five and Microsoft
Access (the 1995+ versions, not the prior versions) were, despite various limitations, technically
the closest thing to 'truly relational' DBMSs for the desktop PC, with Visual FoxPro and
many other desktop products marketed at that time far less compliant with Codd's rules.

A relational DBMS manages information about types of real-world things (entities) in the form of
tables that represent the entities. A table is like a spreadsheet; each row represents a particular
entity (instance), and each column represents a type of information about the entity (domain).
Sometimes entities are made up of smaller related entities, such as orders and order lines; and
so one of the challenges of a multi-user DBMS is to provide data about related entities from the
standpoint of an instant of logical consistency.

Properly managed relational databases minimize the need for application programs to contain
information about the physical storage of the data they access. To maximize the isolation of
programs from data structures, relational DBMSs restrict data access to the messaging protocol
SQL, a nonprocedural language that limits the programmer to specifying desired results. This
message-based interface was a building block for the decentralization of computer hardware,
because a program and data structure with such a minimal point of contact become feasible to
reside on separate computers.

Recoverability

Recoverability means that, if a data entry error, program bug or hardware failure occurs, the DBA
can bring the database backward in time to its state at an instant of logical consistency before the
damage was done. Recoverability activities include making database backups and storing them
in ways that minimize the risk that they will be damaged or lost, such as placing multiple copies
on removable media and storing them outside the affected area of an anticipated disaster.
Recoverability is the DBA’s most important concern.

Recoverability, also sometimes called "disaster recovery," takes two primary forms. First the
backup, then recovery tests.

The backup of the database consists of data with timestamps combined with database logs to
change the data to be consistent to a particular moment in time. It is possible to make a backup
of the database containing only data without timestamps or logs, but the DBA must take the
database offline to do such a backup.

The recovery tests of the database consist of restoring the data, then applying logs against that
data to bring the database backup to consistency at a particular point in time up to the last
transaction in the logs. Alternatively, an offline database backup can be restored simply by
placing the data in-place on another copy of the database.

If a DBA (or any administrator) attempts to implement a recoverability plan without the recovery
tests, there is no guarantee that the backups are at all valid. In practice, in all but the most mature
RDBMS packages, backups rarely are valid without extensive testing to be sure that no bugs or
human error have corrupted the backups.

Integrity

Integrity means that the database, or the programs that create its content, embody means of
preventing users who provide data from breaking the system’s business rules. For example, a
retailer may have a business rule that only individual customers can place orders; and so every
order must identify one and only one customer. Oracle Server and other relational DBMSs
enforce this type of business rule with constraints, which are configurable implicit queries. To
continue the example, in the process of inserting a new order the database may query its
customer table to make sure that the customer identified by the order exists.
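
A hedged sketch of such a constraint (the tables and the embedded H2 database are invented for the
example): the foreign key makes the database itself reject an order that does not identify an
existing customer.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

class IntegrityExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:h2:mem:shop");
             Statement stmt = conn.createStatement()) {
            stmt.execute("CREATE TABLE customer (id INT PRIMARY KEY)");
            // Every order must identify one and only one existing customer.
            stmt.execute("CREATE TABLE orders (id INT PRIMARY KEY, "
                    + "customer_id INT NOT NULL REFERENCES customer(id))");

            stmt.execute("INSERT INTO customer VALUES (1)");
            stmt.execute("INSERT INTO orders VALUES (100, 1)");      // accepted
            try {
                stmt.execute("INSERT INTO orders VALUES (101, 99)"); // no such customer
            } catch (SQLException e) {
                System.out.println("Rejected by the constraint: " + e.getMessage());
            }
        }
    }
}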

Security

Security means that users’ ability to access and change data conforms to the policies of the
business and the delegation decisions of its managers. Like other metadata, a relational DBMS
manages security information in the form of tables. These tables are the “keys to the kingdom”
and so it is important to protect them from intruders.

Availability

Availability means that authorized users can access and change data as needed to support the
business. Increasingly, businesses are coming to expect their data to be available at all times
(“24x7”, or 24 hours a day, 7 days a week). The IT industry has responded to the availability
challenge with hardware and network redundancy and increasing online administrative
capabilities.

Performance

Performance means that the database does not cause unreasonable online response times, and
it does not cause unattended programs to run for an unworkable period of time. In complex
client/server and three-tier systems, the database is just one of many elements that determine the
performance that online users and unattended programs experience. Performance is a major
motivation for the DBA to become a generalist and coordinate with specialists in other parts of the
system outside of traditional bureaucratic reporting lines.

Techniques for database performance tuning have changed as DBAs have become more
sophisticated in their understanding of what causes performance problems and their ability to
diagnose the problem.

In the 1990s, DBAs often focused on the database as a whole, and looked at database-wide
statistics for clues that might help them find out why the system was slow. Also, the actions DBAs
took in their attempts to solve performance problems were often at the global, database level,
such as changing the amount of computer memory available to the database, or changing the
amount of memory available to any database program that needed to sort data.

Around the year 2000, many of the most fundamental assumptions about database performance
tuning were discovered to be myths. Most famously, the database buffer cache hit ratio, once
thought to be the most reliable way to measure database performance, was found to be a
completely meaningless statistic.

As of 2005, the fog has lifted. DBAs understand that performance problems initially must be
diagnosed, and this is best done by examining individual SQL programs, not the database as a
whole. Various tools, some included with the database and some available from third parties,
provide a behind the scenes look at how the database is handling the SQL program, shedding
light on what's taking so long.

Having identified the problem, the individual SQL statement can be tuned, and this is usually
done by either rewriting it, using hints, adding or modifying indexes, or sometimes modifying the
database tables themselves.

Development/Testing Support

Development and testing support is typically what the database administrator regards as his or
her least important duty, while results-oriented managers consider it the DBA’s most important
duty. Support activities include collecting sample production data for testing new and changed
programs and loading it into test databases; consulting with programmers about performance
tuning; and making table design changes to provide new kinds of storage for new program
functions.

Here are some IT roles that are related to the role of database administrator:

 Application programmer or software engineer


 System administrator
 Data administrator
 Data architect

SQL

Paradigm: multi-paradigm (object-oriented, functional, procedural)
Appeared in: 1974
Designed by: Donald D. Chamberlin and Raymond F. Boyce
Developer: IBM
Typing discipline: static, strong
Major implementations: many

SQL (commonly expanded to Structured Query Language — see History for the term's
derivation) is the most popular computer language used to create, retrieve, update and delete
(see also: CRUD) data from relational database management systems. The language has
evolved beyond its original purpose, and now supports object-relational database management
systems. SQL has been standardized by both ANSI and ISO.

Contents

 1 Pronunciation
 2 History
 2.1 Standardization
 3 Scope
 3.1 Reasons for lack of portability
 4 SQL keywords
 4.1 Data retrieval
 4.2 Data manipulation
 4.3 Transaction Controls
 4.4 Data definition
 4.5 Data control
 4.6 Other
 5 Criticisms of SQL
 6 Alternatives to SQL
 7 See also
 7.1 Database systems using SQL
 7.2 SQL variants
 8 References

 9 External links

Pronunciation

SQL is commonly spoken either as the names of the letters ess-cue-el (IPA: [ˈɛsˈkjuˈɛl]), or like
the word sequel (IPA: [ˈsiːkwəl]). The official pronunciation of SQL according to ANSI is ess-cue-
el. However, each of the major database products (or projects) containing the letters SQL has its
own convention: MySQL is officially and commonly pronounced "My Ess Cue El"; PostgreSQL is
expediently pronounced postgres (being the name of the predecessor to PostgreSQL); and
Microsoft SQL Server is commonly spoken as Microsoft-sequel-server.

History

An influential paper, "A Relational Model of Data for Large Shared Data Banks", by Dr. Edgar F.
Codd, was published in June, 1970 in the Association for Computing Machinery (ACM) journal,
Communications of the ACM, although drafts of it were circulated internally within IBM in 1969.
Codd's model became widely accepted as the definitive model for relational database
management systems (RDBMS or RDMS).

During the 1970s, a group at IBM's San Jose research center developed a database system
"System R" based upon, but not strictly faithful to, Codd's model. Structured English Query
(SQL) Language ("SEQUEL") was designed to manipulate and retrieve data stored in System R.
The acronym SEQUEL was later condensed to SQL because the word 'SEQUEL' was held as a
trademark by the Hawker Siddeley aircraft company of the UK. Although SQL was influenced by
Codd's work, Donald D. Chamberlin and Raymond F. Boyce at IBM were the authors of the
SEQUEL language design.[1] Their concepts were published to increase interest in SQL.

The first non-commercial, relational, non-SQL database, Ingres, was developed in 1974 at U.C.
Berkeley.

In 1978, methodical testing commenced at customer test sites. Demonstrating both the
usefulness and practicality of the system, this testing proved to be a success for IBM. As a result,
IBM began to develop commercial products based on their System R prototype that implemented
SQL, including the System/38 (announced in 1978 and commercially available in August 1979),
SQL/DS (introduced in 1981), and DB2 (in 1983).[1]

At the same time Relational Software, Inc. (now Oracle Corporation) saw the potential of the
concepts described by Chamberlin and Boyce and developed their own version of a RDBMS for
the Navy, CIA and others. In the summer of 1979 Relational Software, Inc. introduced Oracle V2
(Version2) for VAX computers as the first commercially available implementation of SQL. Oracle
is often incorrectly cited as beating IBM to market by two years, when in fact they only beat IBM's
release of the System/38 by a few weeks. Considerable public interest then developed; soon
many other vendors developed versions, and Oracle's future was ensured.

Standardization

SQL was adopted as a standard by ANSI (American National Standards Institute) in 1986 and by
ISO (International Organization for Standardization) in 1987.

The SQL standard has gone through a number of revisions:

Year  Name       Alias    Comments

1986  SQL-86     SQL-87   First published by ANSI. Ratified by ISO in 1987.

1989  SQL-89              Minor revision.

1992  SQL-92     SQL2     Major revision (ISO 9075).

1999  SQL:1999   SQL3     Added regular expression matching, recursive queries, triggers,
                          non-scalar types and some object-oriented features. (The last two are
                          somewhat controversial and not yet widely supported.)

2003  SQL:2003            Introduced XML-related features, window functions, standardized
                          sequences and columns with auto-generated values (including
                          identity columns).

2006  SQL:2006            ISO/IEC 9075-14:2006 defines ways in which SQL can be used in
                          conjunction with XML. It defines ways of importing and storing XML data
                          in an SQL database, manipulating it within the database and publishing
                          both XML and conventional SQL-data in XML form. In addition, it provides
                          facilities that permit applications to integrate into their SQL code the
                          use of XQuery, the XML Query Language published by the World Wide Web
                          Consortium (W3C), to concurrently access ordinary SQL-data and XML
                          documents.

The SQL standard is not freely available. SQL:2003 and SQL:2006 may be purchased from ISO
or ANSI. A late draft of SQL:2003 is available as a zip archive from Whitemarsh Information
Systems Corporation. The zip archive contains a number of PDF files that define the parts of the
SQL:2003 specification.

Scope


SQL is designed for a specific purpose: to query data contained in a relational database. SQL is a
set-based, declarative computer language, not an imperative language such as C or BASIC.

Language extensions such as PL/SQL bridge this gap to some extent by adding procedural
elements, such as flow-of-control constructs. Another approach is to allow programming language
code to be embedded in and interact with the database. For example, Oracle and others include
Java in the database, and SQL Server 2005 allows any .NET language to be hosted within the
database server process, while PostgreSQL allows functions to be written in a wide variety of
languages, including Perl, Tcl, and C.
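
As an illustration of procedural code living inside the database, the following sketch defines a
trivial PostgreSQL function in PL/pgSQL (the function name and body are invented for the
example; other systems use different procedural languages and syntax):

CREATE FUNCTION add_one(i integer) RETURNS integer AS $$
BEGIN
-- purely illustrative: return the argument incremented by one
RETURN i + 1;
END;
$$ LANGUAGE plpgsql;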

Extensions to and variations of the standards exist. Commercial implementations commonly omit
support for basic features of the standard, such as the DATE or TIME data types, preferring
variations of their own. SQL code can rarely be ported between database systems without major
modifications, in contrast to ANSI C or ANSI Fortran, which can usually be ported from platform to
platform without major structural changes.

Oracle Corporation's PL/SQL, IBM's SQL PL (SQL Procedural Language) and Sybase /
Microsoft's Transact-SQL are proprietary in nature, because the procedural programming
languages they provide are not standardized.

Reasons for lack of portability


There are several reasons for this lack of portability between database systems:

 The complexity and size of the SQL standard means that most databases do not
implement the entire standard.
 The standard does not specify database behavior in several important areas (e.g.
indexes), leaving it up to implementations of the database to decide how to behave.
 The SQL standard precisely specifies the syntax that a conforming database system
must implement. However, the standard's specification of the semantics of language
constructs is less well-defined, leading to areas of ambiguity.
 Many database vendors have large existing customer bases; where the SQL standard
conflicts with the prior behavior of the vendor's database, the vendor may be unwilling to
break backward compatibility.

SQL keywords

SQL keywords fall into several groups.

Data retrieval

The most frequently used operation in transactional databases is the data retrieval operation.
When restricted to data retrieval commands, SQL acts as a declarative language.

 SELECT is used to retrieve zero or more rows from one or more tables in a database. In
most applications, SELECT is the most commonly used Data Manipulation Language
command. In specifying a SELECT query, the user specifies a description of the desired
result set, but they do not specify what physical operations must be executed to produce
that result set. Translating the query into an efficient query plan is left to the database
system, more specifically to the query optimizer.
 Commonly available keywords related to SELECT include:
 FROM is used to indicate from which tables the data is to be taken, as
well as how the tables JOIN to each other.
 WHERE is used to identify which rows are to be retrieved, or which rows are fed into
the GROUP BY. WHERE is evaluated before the GROUP BY.
 GROUP BY is used to combine rows with related values into elements of
a smaller set of rows.
 HAVING is used to identify which of the "combined rows" (combined
rows are produced when the query has a GROUP BY keyword or when
the SELECT part contains aggregates), are to be retrieved. HAVING acts
much like a WHERE, but it operates on the results of the GROUP BY
and hence can use aggregate functions.
 ORDER BY is used to identify which columns are used to sort the
resulting data.

Data retrieval is very often combined with data projection; usually it is not the verbatim data stored
in primitive data types that a user or a query needs. Often the data must be expressed differently
from how it is stored. SQL allows a wide variety of expressions in the select list to project data.

Example 1:
SELECT * FROM books
WHERE price > 100.00
ORDER BY title

This is an example that could be used to get a list of expensive books. It retrieves the records
from the books table that have a price field which is greater than 100.00. The result is sorted
alphabetically by book title. The asterisk (*) means to show all columns of the books table.
Alternatively, specific columns could be named.

Example 2:
SELECT books.title, count(*) AS Authors
FROM books
JOIN book_authors
ON books.book_number = book_authors.book_number
GROUP BY books.title

Example 2 shows both the use of multiple tables in a join, and aggregation (grouping). This
example shows how many authors there are per book. Example output may resemble:

Title Authors
---------------------- -------
SQL Examples and Guide 3
The Joy of SQL 1
How to use Wikipedia 2
Pitfalls of SQL 1
How SQL Saved my Dog 1
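
Example 3 shows HAVING, which is not demonstrated above. It is a sketch that reuses the
hypothetical books and book_authors tables of Example 2 and keeps only the books written by
more than one author:

Example 3:
SELECT books.title, count(*) AS Authors
FROM books
JOIN book_authors
ON books.book_number = book_authors.book_number
GROUP BY books.title
HAVING count(*) > 1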

Data manipulation

First, there are the standard Data Manipulation Language (DML) elements. DML is the subset of
the language used to add, update and delete data.

 INSERT is used to add zero or more rows (formally tuples) to an existing table.
 UPDATE is used to modify the values of a set of existing table rows.
 MERGE is used to combine the data of multiple tables. It is something of a combination
of the INSERT and UPDATE elements. It is defined in the SQL:2003 standard; prior to
that, some databases provided similar functionality via different syntax, sometimes called
an "upsert".
 DELETE removes zero or more existing rows from a table.

INSERT Example:
INSERT INTO my_table (field1, field2, field3) VALUES ('test', 'N', NULL);
UPDATE Example:
UPDATE my_table SET field1 = 'updated value' WHERE field2 = 'N';
DELETE Example:
DELETE FROM my_table WHERE field2 = 'N';
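MERGE is described above but not shown. The following is a minimal sketch of the SQL:2003
syntax, assuming a hypothetical staging table with the same field names as my_table; the exact
syntax and the level of support vary by vendor:
MERGE Example:
MERGE INTO my_table t
USING staging s
ON (t.field1 = s.field1)
WHEN MATCHED THEN UPDATE SET t.field2 = s.field2
WHEN NOT MATCHED THEN INSERT (field1, field2) VALUES (s.field1, s.field2);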

Transaction Controls

Transactions, where available, can be used to wrap the DML operations.

 BEGIN WORK (or START TRANSACTION, depending on SQL dialect) can be used to
mark the start of a database transaction, which either completes completely or not at all.
 COMMIT causes all data changes in a transaction to be made permanent.
 ROLLBACK causes all data changes since the last COMMIT or ROLLBACK to be
discarded, so that the state of the data is "rolled back" to the way it was prior to those
changes being requested.

COMMIT and ROLLBACK interact with areas such as transaction control and locking. Strictly,
both terminate any open transaction and release any locks held on data. In the absence of a
BEGIN WORK or similar statement, the semantics of SQL are implementation-dependent.

Example:
BEGIN WORK;
UPDATE inventory SET quantity = quantity - 3 WHERE item = 'pants';
COMMIT;
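
A companion sketch shows ROLLBACK discarding the same kind of change instead of making it
permanent; it assumes the same hypothetical inventory table as above:

Example:
BEGIN WORK;
UPDATE inventory SET quantity = quantity - 3 WHERE item = 'pants';
ROLLBACK;
-- the UPDATE is discarded; inventory is left as it was before BEGIN WORK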

Data definition

The second group of keywords is the Data Definition Language (DDL). DDL allows the user to
define new tables and associated elements. Most commercial SQL databases have proprietary
extensions in their DDL, which allow control over nonstandard features of the database system.
The most basic items of DDL are the CREATE, ALTER, RENAME, TRUNCATE and DROP
commands.

 CREATE causes an object (a table, for example) to be created within the database.
 DROP causes an existing object within the database to be deleted, usually irretrievably.
 TRUNCATE deletes all data from a table (non-standard, but common SQL command).
 The ALTER command permits the user to modify an existing object in various ways -- for
example, adding a column to an existing table (see the ALTER sketch below).

Example:
CREATE TABLE my_table (
my_field1 INT,
my_field2 VARCHAR (50),
my_field3 DATE NOT NULL,
PRIMARY KEY (my_field1, my_field2)
);
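
ALTER and DROP are described above but not shown. A minimal sketch, reusing the hypothetical
my_table just defined (the new column name and type are assumptions, and some vendors accept
slightly different ALTER syntax):

Example:
ALTER TABLE my_table ADD COLUMN my_field4 INT; -- add a new nullable column
DROP TABLE my_table;                           -- remove the table and its data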

Data control

The third group of SQL keywords is the Data Control Language (DCL). DCL handles the
authorization aspects of data and permits the user to control who has access to see or
manipulate data within the database. Its two main keywords are:
 GRANT — authorizes one or more users to perform an operation or a set of operations
on an object.
 REVOKE — removes or restricts the capability of a user to perform an operation or a set
of operations.

Example:
GRANT SELECT, UPDATE ON my_table TO some_user, another_user;
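
REVOKE can undo part of such a grant. A minimal sketch, reusing the hypothetical table and
users from the GRANT example above:

Example:
REVOKE UPDATE ON my_table FROM another_user;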

Other

 ANSI-standard SQL supports double dash, --, as a single line comment identifier (some
extensions also support curly brackets or C-style /* comments */ for multi-line comments).

Example:
SELECT * FROM inventory -- Retrieve everything from inventory table
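
Where the dialect supports C-style bracketed comments, a multi-line comment might look like the
following sketch:

Example:
/* Retrieve everything from the inventory table;
   bracketed comments may span several lines */
SELECT * FROM inventory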

 Some SQL servers allow user-defined functions (UDFs).

Criticisms of SQL
Technically, SQL is a declarative computer language for use with "SQL databases". Theorists
and some practitioners note that many of the original SQL features were inspired by, but in
violation of, the relational model for database management and its tuple calculus realization.
Recent extensions to SQL achieved relational completeness, but have worsened the violations,
as documented in The Third Manifesto.

In addition, there are also some criticisms about the practical use of SQL:

 Implementations are inconsistent and, usually, incompatible between vendors. In
particular, date and time syntax, string concatenation, NULL handling, and comparison case
sensitivity often vary from vendor to vendor.

 The language makes it too easy to do a Cartesian join, which results in "run-away" result
sets when WHERE clauses are mistyped. Cartesian joins are so rarely used in practice
that requiring an explicit CARTESIAN keyword may be warranted.

 SQL - and the relational model as it is - offers no standard way of handling tree
structures, i.e. rows recursively referring to other rows of the same table. Oracle offers a
proprietary "CONNECT BY" clause; other solutions are database functions which use recursion
and return a row set, as is possible in PostgreSQL with PL/pgSQL and in other databases.
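
To illustrate the Cartesian join pitfall, a sketch with two hypothetical tables, orders and customers
(the table names and the joining column are assumptions): the first query omits the join condition
and silently returns the Cartesian product of the two tables, while the second returns only the
matching rows.

-- every order is paired with every customer: a "run-away" result set
SELECT * FROM orders, customers
-- each order is paired only with its own customer
SELECT * FROM orders, customers WHERE orders.customer_id = customers.customer_id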

Alternatives to SQL
A distinction should be made between alternatives to relational query languages and alternatives
to SQL. The list below contains proposed alternatives to SQL that are still (nominally) relational.
See navigational database for alternatives to the relational model.

 IBM Business System 12 (IBM BS12)


 Tutorial D
 TQL - Luca Cardelli
 Top's Query Language - A draft language influenced by IBM BS12. Tentatively renamed
to SMEQL to avoid confusion with similar projects called TQL.
 Hibernate Query Language[2] (HQL) - A Java-based tool that uses modified SQL
 EJB-QL (Enterprise Java Bean Query Language/Java Persistence Query Language)[3] -
An object-based query language, which allows objects to be retrieved using a syntax
similar to SQL. It is used within the Java Persistence framework, and formerly within the
J2EE/JEE Enterprise Java Bean framework with Entity Beans.
 Quel introduced in 1974 by the U.C. Berkeley Ingres project.
 Object Query Language - Object Data Management Group.
 Datalog
 LINQ
 NoSQL

See also

Database systems using SQL

 Comparison of relational database management systems


 Comparison of truly relational database management systems
 Comparison of object-relational database management systems
 List of relational database management systems
 List of object-relational database management systems
 List of hierarchical database management systems

SQL variants

 Comparison of SQL syntax

References

1. ^ Donald D. Chamberlin and Raymond F. Boyce, 1974. "SEQUEL: A structured English
query language", International Conference on Management of Data, Proceedings of the
1974 ACM SIGFIDET (now SIGMOD) workshop on Data description, access and control,
Ann Arbor, Michigan, pp. 249–264

1. Discussion on alleged SQL flaws (C2 wiki)


2. Web page about FSQL: References and links.
3. Galindo J., Urrutia A., Piattini M., "Fuzzy Databases: Modeling, Design and
Implementation". Idea Group Publishing Hershey, USA, 2005.

External links

 SQL Basics
 The 1995 SQL Reunion: People, Projects, and Politics (early history of SQL)
 SQL:2003, SQL/XML and the Future of SQL (webcast and podcast with Jim Melton,
editor of the SQL standard)
 A Gentle Introduction to SQL at SQLzoo
 SQL Help and Tutorial
 The SQL Language (PostgreSQL specific details included)


 SQL Exercises. SQL DML Help and Tutorial


 SQL Tutorial.
 SQL Recipes - A free SQL cookbook full of practical examples and queries for all dialects
 Online Interactive SQL Tutorials
 How well Oracle, DB2, MSSQL support the SQL Standard
 SQL Tutorial
 SQL Tutorial with examples
 The sbVB DataBase course - A free course on software development using cross-
platform C++ and SQL (for any Relational Database, such as Oracle, MSSQL,
PostgreSQL, MySQL, DB2, Informix and others)


QUEL
QUEL is a relational database access language, similar in most ways to SQL, but somewhat
better arranged and easier to use. It was created as a part of the groundbreaking Ingres effort at
University of California, Berkeley, based on Codd's earlier suggested but not implemented Data
Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely-
available Ingres source code, most notably Informix. As Oracle and DB2 started taking over the
market in the early 1980s, most companies then supporting QUEL moved to SQL instead.

In many ways QUEL is similar to SQL. One difference is that QUEL statements are always
defined by tuple variables, which can be used to limit queries or return result sets. Consider this
example, taken from the original Ingres paper:

range of e is employee retrieve (comp = e.salary/ (e.age - 18)) where e.name = "Jones"

e is a tuple, defining a set of data, in this case all the rows in the employee table that have the
first name "Jones". In SQL the statement is very similar, and arguably cleaner:

select (e.salary / (e.age - 18)) as comp from employee as e where e.name = 'Jones'

QUEL is generally more "normalized" than SQL. Whereas every major SQL command has a
format that is at least somewhat different from the others, in QUEL a single syntax is used for all
commands.

For instance, here is a sample of a simple session that creates a table, inserts a row into it, and
then retrieves and modifies the data inside it.

create student(name = c10, age = i4, sex = c1, state = c2)
append to student(name = "philip", age = 17, sex = "m", state = "FL")
range of s is student retrieve (s.all) where s.state = "FL"
print s
range of s is student replace s(age=s.age+1)
print s

Here is a similar set of SQL statements:

create table student(name char(10), age int, sex char(1), state char(2))
insert into student (name, age, sex, state) values ('philip', 17, 'm', 'FL')
select * from student where state = 'FL'
update student set age = age + 1

Note that every command uses a unique syntax, and that even similar commands like INSERT
and UPDATE use completely different styles.

Another advantage of QUEL was a built-in system for moving records en masse into and out of
the system. Consider this command:

copy student(name=c0, comma=d1, age=c0, comma=d1, sex=c0, comma=d1, address=c0, nl=d1)
into "/student.txt"

which creates a comma-delimited file of all the records in the student table. The d1 indicates a
delimiter, as opposed to a data type. Changing the into to a from reverses the process. Similar
commands are available in many SQL systems, but usually as external tools, as opposed to
being internal to the SQL language. This makes them unavailable to stored procedures.
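
For comparison, one SQL system, PostgreSQL, offers a non-standard COPY command inside the
SQL language itself; the following is only a sketch (the file path simply mirrors the QUEL example,
and the option syntax is specific to recent PostgreSQL versions):

COPY student TO '/student.txt' WITH (FORMAT csv);
COPY student FROM '/student.txt' WITH (FORMAT csv);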

With these differences, however, the two languages are largely the same.

Query by Example
Query by Example (QBE) is a database query language for relational databases. It was devised
by Moshé M. Zloof at IBM Research during the mid 1970s, in parallel to the development of SQL.
It is the first graphical query language, using visual tables where the user would enter commands,
example elements and conditions. Many graphical front-ends for databases use the ideas from
QBE today.

QBE is based on the notion of Domain relational calculus.

Contents

 1 Example
 2 See also
 3 References
 4 QBE
 5 Sources

 6 External links

Example

A simple example using the Suppliers and Parts database is given here, just to give you a feel for
how QBE works.

This "query" selects all supplier numbers (S#) where the owner of the supplier company is "J.
Doe" and the supplier is located in "Rome".

Other commands like the "P." (print) command are: "U." (update), "I." (insert) and "D." (delete).

The result of this query depends on what the values are in your Suppliers and Parts
database.
See also

 Microsoft Query by Example

References

 M. Zloof. Query by Example. AFIPS, 44, 1975.


 Raghu Ramakrishnan, Johannes Gehrke. Database Management Systems 3rd edition.
Chapter 6.
 Date, C.J. (2004). "8 Relational Calculus", in Maite Suarez-Rivas; Katherine Harutunian:
An Introduction to Database Systems. Pearson Education Inc.. ISBN 0-321-18956-6.

QBE
Query by Example (QBE) is a powerful search tool that allows anyone to search a system for
documents by entering an element such as a text string or document name; the system then
quickly searches through its documents for matches to the entered criteria. It is commonly
believed that QBE is far easier to learn than other, more formal query languages (such as SQL),
while still allowing people to perform powerful searches.

Searching for documents based on matching text is easy with QBE; the user simply enters (or
copies and pastes) the target text into the search form field. When the user clicks search (or hits
enter) the input is passed to the QBE parser for processing. The query is created and the
search begins, using key words from the input the user provided. Mundane words such as "and",
"is", "or" and "the" are automatically eliminated to make the search more efficient and to avoid
barraging the user with results. However, when compared with a formal query, the results in the
QBE system will be more variable.

The user can also search for similar documents based on the text of a full document that he or
she may have. This is accomplished by submitting the document (or several documents) to the
QBE results template. The QBE parser analyses the submitted documents, generates the
required query, and submits it to the search engine, which then searches for relevant and similar
material matching the submitted documents.

Database normalization

Database normalization is a design technique by which relational database tables are structured
in such a way as to make them invulnerable to certain types of logical inconsistencies and
anomalies. Tables can be normalized to varying degrees: relational database theory defines
"normal forms" of successively higher degrees of stringency, so, for example, a table in third
normal form is less open to logical inconsistencies and anomalies than a table that is only in
second normal form. Although the normal forms are often defined (informally) in terms of the
characteristics of tables, rigorous definitions of the normal forms are concerned with the
characteristics of mathematical constructs known as relations. Whenever information is
represented relationally—that is, roughly speaking, as values within rows beneath fixed column
headings—it makes sense to ask to what extent the representation is normalized.

Contents

 1 Problems addressed by normalization


 2 Background to normalization: definitions
 3 History
 4 Normal forms
 4.1 First normal form
 4.2 Second normal form
 4.3 Third normal form
 4.4 Boyce-Codd normal form
 4.5 Fourth normal form
 4.6 Fifth normal form
 4.7 Domain/key normal form
 4.8 Sixth normal form
 5 Example Of The Process
 5.1 Starting Point
 5.2 1NF
 5.3 2NF
 5.4 3NF and BCNF
 5.5 4NF
 5.6 5NF
 6 Denormalization
 6.1 Non-first normal form (NF²)
 7 Further reading
 8 References
 9 See also

 10 External links

Problems addressed by normalization


A table that is not sufficiently normalized can suffer from logical inconsistencies of various types, and from anomalies
involving data operations. In such a table:

 The same fact can be expressed on multiple records; therefore updates to the table may result in logical
inconsistencies. For example, each record in an unnormalized "DVD Rentals" table might contain a DVD ID,
Member ID, and Member Address; thus a change of address for a particular member will potentially need to be
applied to multiple records. If the update is not carried through successfully—if, that is, the member's address is
updated on some records but not others—then the table is left in an inconsistent state. Specifically, the table
provides conflicting answers to the question of what this particular member's address is. This phenomenon is
known as an update anomaly.
 There are circumstances in which certain facts cannot be recorded at all. In the above example, if it is the case
that Member Address is held only in the "DVD Rentals" table, then we cannot record the address of a member
who has not yet rented any DVDs. This phenomenon is known as an insertion anomaly.
 There are circumstances in which the deletion of data representing certain facts necessitates the deletion of
data representing completely different facts. For example, suppose a table has the attributes Student ID,
Course ID, and Lecturer ID (a given student is enrolled in a given course, which is taught by a given lecturer). If
the number of students enrolled in the course temporarily drops to zero, the last of the records referencing that
course must be deleted—meaning, as a side-effect, that the table no longer tells us which lecturer has been
assigned to teach the course. This phenomenon is known as a deletion anomaly.
Ideally, a relational database should be designed in such a way as to exclude the possibility of update, insertion, and
deletion anomalies. The normal forms of relational database theory provide guidelines for deciding whether a particular
design will be vulnerable to such anomalies. It is possible to correct an unnormalized design so as to make it adhere to
the demands of the normal forms: this is normalization. Normalization typically involves decomposing an unnormalized
table into two or more tables which, were they to be combined (joined), would convey exactly the same information as the
original table.
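
A minimal sketch in SQL DDL of such a decomposition for the "DVD Rentals" example above (all table and column
names are assumptions made for illustration): the member's address is stored exactly once, and rentals refer to
members by key.

-- one fact in one place: the member's address lives only here
CREATE TABLE members (
member_id INT PRIMARY KEY,
member_address VARCHAR (100)
);
-- rentals reference members instead of repeating their address
CREATE TABLE dvd_rentals (
dvd_id INT,
member_id INT REFERENCES members (member_id),
PRIMARY KEY (dvd_id, member_id)
);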

Background to normalization: definitions

 Functional dependency: Attribute B has a functional dependency on attribute A if, for each value of attribute A,
there is exactly one value of attribute B. For example, Member Address has a functional dependency on
Member ID, because each Member ID value corresponds to exactly one Member Address value. An attribute
may be functionally dependent either on a single attribute or on a combination of attributes. It is not possible to
determine the extent to which a design is normalized without understanding what functional dependencies apply
to the attributes within its tables; understanding this, in turn, requires knowledge of the problem domain.
 Trivial functional dependency: A trivial functional dependency is a functional dependency of an attribute on a
superset of itself. {Member ID, Member Address} → {Member Address} is trivial, as is {Member Address} →
{Member Address}.
 Full functional dependency: An attribute is fully functionally dependent on a set of attributes X if it is a)
functionally dependent on X, and b) not functionally dependent on any proper subset of X. {Member Address}
has a functional dependency on {DVD ID, Member ID}, but not a full functional dependency, for it is also
dependent on {Member ID}.
 Multivalued dependency: A multivalued dependency is a constraint according to which the presence of certain
rows in a table implies the presence of certain other rows: see the Multivalued Dependency article for a rigorous
definition.
 Superkey: A superkey is an attribute or set of attributes that uniquely identifies rows within a table; in other
words, two distinct rows are always guaranteed to have distinct superkeys. {DVD ID, Member ID, Member
Address} would be a superkey for the "DVD Rentals" table; {DVD ID, Member ID} would also be a superkey.
 Candidate key: A candidate key is a minimal superkey, that is, a superkey for which we can say that no proper
subset of it is also a superkey. {DVD ID, Member ID} would be a candidate key for the "DVD Rentals" table.
 Non-prime attribute: A non-prime attribute is an attribute that does not occur in any candidate key. Member
Address would be a non-prime attribute in the "DVD Rentals" table.
 Primary key: Most DBMSs require a table to be defined as having a single unique key, rather than a number of
possible unique keys. A primary key is a candidate key which the database designer has designated for this
purpose.

History

Edgar F. Codd first proposed the process of normalization and what came to be known as the 1st normal form:

“ There is, in fact, a very simple elimination[1] procedure which we shall call normalization. Through decomposition
non-simple domains are replaced by "domains whose elements are atomic (non-decomposable) values." „

—Edgar F. Codd, A Relational Model of Data for Large Shared Data Banks[2]

In his paper, Edgar F. Codd used the term "non-simple" domains to describe a heterogeneous data structure, but later
researchers would refer to such a structure as an abstract data type.

Normal forms

The normal forms (abbrev. NF) of relational database theory provide criteria for determining a table's degree of
vulnerability to logical inconsistencies and anomalies. The higher the normal form applicable to a table, the less
vulnerable it is to such inconsistencies and anomalies. Each table has a "highest normal form" (HNF): by definition, a
table always meets the requirements of its HNF and of all normal forms lower than its HNF; also by definition, a table fails
to meet the requirements of any normal form higher than its HNF.

The normal forms are applicable to individual tables; to say that an entire database is in normal form n is to say that all of
its tables are in normal form n.
Newcomers to database design sometimes suppose that normalization proceeds in an iterative fashion, i.e. a 1NF design
is first normalized to 2NF, then to 3NF, and so on. This is not an accurate description of how normalization typically works.
A sensibly designed table is likely to be in 3NF on the first attempt; furthermore, if it is 3NF, it is overwhelmingly likely to
have an HNF of 5NF. Achieving the "higher" normal forms (above 3NF) does not usually require an extra expenditure of
effort on the part of the designer, because 3NF tables usually need no modification to meet the requirements of these
higher normal forms.

Edgar F. Codd originally defined the first three normal forms (1NF, 2NF, and 3NF). These normal forms have been
summarized as requiring that all non-key attributes be dependent on "the key, the whole key and nothing but the key".
The fourth and fifth normal forms (4NF and 5NF) deal specifically with the representation of many-to-many and one-to-
many relationships among attributes. Sixth normal form (6NF) incorporates considerations relevant to temporal
databases.

First normal form

The criteria for first normal form (1NF) are:

 A table must be guaranteed not to have any duplicate records; therefore it must have at
least one candidate key.
 There must be no duplicate groups, i.e. no attributes which occur a different number of
times on different records. For example, suppose that an employee can have multiple skills: a
possible representation of employees' skills is {Employee ID, Skill1, Skill2, Skill3 ...}, where
{Employee ID} is the unique identifier for a record. This representation would not be in 1NF.
 Note that all relations are in 1NF. The question of whether a given representation is in 1NF
is equivalent to the question of whether it is a relation.

Second normal form

The criteria for second normal form (2NF) are:

 The table must be in 1NF.


 None of the non-prime attributes of the table are functionally dependent on a part (proper
subset) of a candidate key; in other words, all functional dependencies of non-prime attributes on
candidate keys are full functional dependencies. For example, consider a "Department Members"
table whose attributes are Department ID, Employee ID, and Employee Date of Birth; and suppose
that an employee works in one or more departments. The combination of Department ID and
Employee ID uniquely identifies records within the table. Given that Employee Date of Birth depends
on only one of those attributes – namely, Employee ID – the table is not in 2NF.
 Note that if none of a 1NF table's candidate keys are composite – i.e. every candidate key
consists of just one attribute – then we can say immediately that the table is in 2NF.

Third normal form

The criteria for third normal form (3NF) are:

 The table must be in 2NF.


 There are no non-trivial functional dependencies between non-prime attributes. A violation
of 3NF would mean that at least one non-prime attribute is only indirectly dependent (transitively
dependent) on a candidate key, by virtue of being functionally dependent on another non-prime
attribute. For example, consider a "Departments" table whose attributes are Department ID,
Department Name, Manager ID, and Manager Hire Date; and suppose that each manager can
manage one or more departments. {Department ID} is a candidate key. Although Manager Hire Date
is functionally dependent on {Department ID}, it is also functionally dependent on the non-prime
attribute Manager ID. This means the table is not in 3NF.

Boyce-Codd normal form

The criteria for Boyce-Codd normal form (BCNF) are:

 The table must be in 3NF.


 Every non-trivial functional dependency must be a dependency on a superkey.

Fourth normal form

The criteria for fourth normal form (4NF) are:

 The table must be in BCNF.


 There must be no non-trivial multivalued dependencies on something other than a
superkey. A BCNF table is said to be in 4NF if and only if all of its multivalued dependencies are
functional dependencies.

Fifth normal form

The criteria for fifth normal form (5NF and also PJ/NF) are:

 The table must be in 4NF.


 There must be no non-trivial join dependencies that do not follow from the key constraints.
A 4NF table is said to be in the 5NF if and only if every join dependency in it is implied by the
candidate keys.

Domain/key normal form

Domain/key normal form (or DKNF) requires that a table not be subject to any constraints other than domain
constraints and key constraints.

Sixth normal form


This normal form was, as of 2005, only recently proposed: the sixth normal form (6NF) was only defined when extending
the relational model to take into account the temporal dimension. Unfortunately, most current SQL technologies as of
2005 do not take into account this work, and most temporal extensions to SQL are not relational. See work by Date,
Darwen and Lorentzos[3] for a relational temporal extension, or see TSQL2 for a different approach.

Example Of The Process

The following example illustrates how a database designer might employ his knowledge of the normal forms to make
progressive improvements to an initially unnormalized database design. The example is somewhat contrived: in practice,
few designs lend themselves to being normalized in strict stages in which the HNF increases at each stage.

The database in the example captures information about the suppliers with which various companies' divisions have
relationships – more specifically, it captures information about the types of parts which each division of each company
sources from its suppliers.

Starting Point

Information has been presented initially in a way that does not even meet 1NF. Every record is for a particular
Company/Division combination: for each of these combinations, repeating groups of part- and supplier-related information
occur. 1NF does not permit repeating groups.

Suppliers and Parts By Company Division

Company: Allied Clock and Watch | Company Founder: Horace Washington | Company Logo: Sundial | Division: Clocks

  Part Type     | Supplier              | Supplier Country | Supplier Continent
  Spring        | Tensile Globodynamics | USA              | N. Amer.
  Pendulum      | Tensile Globodynamics | USA              | N. Amer.
  Spring        | Pieza de Acero        | Mexico           | N. Amer.
  Toothed Wheel | Pieza de Acero        | Mexico           | N. Amer.

Company: Allied Clock and Watch | Company Founder: Horace Washington | Company Logo: Sundial | Division: Watches

  Part Type      | Supplier         | Supplier Country | Supplier Continent
  Quartz Crystal | Microflux        | Belgium          | Europe
  Tuning Fork    | Microflux        | Belgium          | Europe
  Battery        | Dakota Electrics | USA              | N. Amer.

Company: Global Robot | Company Founder: Nils Neumann | Company Logo: Gearbox | Division: Industrial Robots

  Part Type      | Supplier      | Supplier Country | Supplier Continent
  Flywheel       | Wheels 4 Less | USA              | N. Amer.
  Axle           | Wheels 4 Less | USA              | N. Amer.
  Axle           | TransEuropa   | Italy            | Europe
  Mechanical Arm | TransEuropa   | Italy            | Europe

Company: Global Robot | Company Founder: Nils Neumann | Company Logo: Gearbox | Division: Domestic Robots

  Part Type        | Supplier          | Supplier Country | Supplier Continent
  Artificial Brain | Prometheus Labs   | Luxembourg       | Europe
  Artificial Brain | Frankenstein Labs | Germany          | Europe
  Metal Housing    | Pieza de Acero    | Mexico           | N. Amer.
  Backplate        | Pieza de Acero    | Mexico           | N. Amer.

1NF

We eliminate the repeating groups by ensuring that each group appears on its own record. The unique identifier for a
record is now {Company, Division, Part Type, Supplier}.

Suppliers and Parts By Company Division

Company | Company Founder | Company Logo | Division | Part Type | Supplier | Supplier Country | Supplier Continent
Allied Clock and Watch | Horace Washington | Sundial | Clocks | Spring | Tensile Globodynamics | USA | N. Amer.
Allied Clock and Watch | Horace Washington | Sundial | Clocks | Pendulum | Tensile Globodynamics | USA | N. Amer.
Allied Clock and Watch | Horace Washington | Sundial | Clocks | Spring | Pieza de Acero | Mexico | N. Amer.
Allied Clock and Watch | Horace Washington | Sundial | Clocks | Toothed Wheel | Pieza de Acero | Mexico | N. Amer.
Allied Clock and Watch | Horace Washington | Sundial | Watches | Quartz Crystal | Microflux | Belgium | Europe
Allied Clock and Watch | Horace Washington | Sundial | Watches | Tuning Fork | Microflux | Belgium | Europe
Allied Clock and Watch | Horace Washington | Sundial | Watches | Battery | Dakota Electrics | USA | N. Amer.
Global Robot | Nils Neumann | Gearbox | Industrial Robots | Flywheel | Wheels 4 Less | USA | N. Amer.
Global Robot | Nils Neumann | Gearbox | Industrial Robots | Axle | Wheels 4 Less | USA | N. Amer.
Global Robot | Nils Neumann | Gearbox | Industrial Robots | Axle | TransEuropa | Italy | Europe
Global Robot | Nils Neumann | Gearbox | Industrial Robots | Mechanical Arm | TransEuropa | Italy | Europe
Global Robot | Nils Neumann | Gearbox | Domestic Robots | Artificial Brain | Prometheus Labs | Luxembourg | Europe
Global Robot | Nils Neumann | Gearbox | Domestic Robots | Artificial Brain | Frankenstein Labs | Germany | Europe
Global Robot | Nils Neumann | Gearbox | Domestic Robots | Metal Housing | Pieza de Acero | Mexico | N. Amer.
Global Robot | Nils Neumann | Gearbox | Domestic Robots | Backplate | Pieza de Acero | Mexico | N. Amer.

2NF

One problem with the design at this stage is that Company Founder and Company Logo details for a given company may
appear redundantly on more than one record; so may Supplier Countries and Continents. These phenomena arise from
the part-key dependencies of a) the Company Founder and Company Logo attributes on Company, and b) the Supplier
Country and Supplier Continent attributes on Supplier. 2NF does not permit part-key dependencies. We correct the
problem by splitting out the Company Founder and Company Logo details into their own table, called Companies, as well
as splitting out the Supplier Country and Supplier Continent Details into their own table, called Suppliers.

Suppliers and Parts By Company Division

Company Division Part Type Supplier

Allied Clock and Watch Clocks Spring Tensile Globodynamics

Allied Clock and Watch Clocks Pendulum Tensile Globodynamics

Allied Clock and Watch Clocks Spring Pieza de Acero

Allied Clock and Watch Clocks Toothed Wheel Pieza de Acero

Allied Clock and Watch Watches Quartz Crystal Microflux

Allied Clock and Watch Watches Tuning Fork Microflux

Allied Clock and Watch Watches Battery Dakota Electrics

Global Robot Industrial Robots Flywheel Wheels 4 Less

Global Robot Industrial Robots Axle Wheels 4 Less

Global Robot Industrial Robots Axle TransEuropa

Global Robot Industrial Robots Mechanical Arm TransEuropa

Global Robot Domestic Robots Artificial Brain Prometheus Labs

Global Robot Domestic Robots Artificial Brain Frankenstein Labs


Global Robot Domestic Robots Metal Housing Pieza de Acero

Global Robot Domestic Robots Backplate Pieza de Acero

Companies

Company Company Founder Company Logo

Allied Clock and Watch Horace Washington Sundial

Global Robot Nils Neumann Gearbox

Suppliers

Supplier Supplier Country Supplier Continent

Tensile Globodynamics USA N. Amer.

Pieza de Acero Mexico N. Amer.

Microflux Belgium Europe

Dakota Electrics USA N. Amer.

Wheels 4 Less USA N. Amer.

TransEuropa Italy Europe

Prometheus Labs Luxembourg Europe

Frankenstein Labs Germany Europe


3NF and BCNF

There is still, however, redundancy in the design. The Supplier Continent for a given Supplier Country may appear
redundantly on more than one record. This phenomenon arises from the dependency of non-key attribute Supplier
Continent on non-key attribute Supplier Country, and means that the design does not conform to 3NF. To achieve 3NF
(and, while we are at it, BCNF), we create a separate Countries table which tells us which continent a country belongs to.

Suppliers and Parts By Company Division

Company Division Part Type Supplier

Allied Clock and Watch Clocks Spring Tensile Globodynamics

Allied Clock and Watch Clocks Pendulum Tensile Globodynamics

Allied Clock and Watch Clocks Spring Pieza de Acero

Allied Clock and Watch Clocks Toothed Wheel Pieza de Acero

Allied Clock and Watch Watches Quartz Crystal Microflux

Allied Clock and Watch Watches Tuning Fork Microflux

Allied Clock and Watch Watches Battery Dakota Electrics

Global Robot Industrial Robots Flywheel Wheels 4 Less

Global Robot Industrial Robots Axle Wheels 4 Less

Global Robot Industrial Robots Axle TransEuropa

Global Robot Industrial Robots Mechanical Arm TransEuropa


Global Robot Domestic Robots Artificial Brain Prometheus Labs

Global Robot Domestic Robots Artificial Brain Frankenstein Labs

Global Robot Domestic Robots Metal Housing Pieza de Acero

Global Robot Domestic Robots Backplate Pieza de Acero

Suppliers

Supplier Supplier Country

Tensile Globodynamics USA

Pieza de Acero Mexico

Microflux Belgium

Dakota Electrics USA

Wheels 4 Less USA

TransEuropa Italy

Prometheus Labs Luxembourg

Frankenstein Labs Germany

Companies

Company Company Founder Company Logo


Allied Clock and Watch Horace Washington Sundial

Global Robot Nils Neumann Gearbox

Countries

Country Continent

USA N. Amer.

Mexico N. Amer.

Belgium Europe

Italy Europe

Luxembourg Europe

4NF
What happens if a company has more than one founder or more than one logo? (Let us assume for the sake of the
example that both of these things may happen.) One way of handling the situation would be to alter the primary key of our
Companies table to {Company, Company Founder, Company Logo}. Representing multiple founders and multiple logos
then becomes possible, but at the price of redundancy:
Companies
Company Company Founder Company Logo
Allied Clock and Watch Horace Washington Sundial
Global Robot Nils Neumann Gearbox
International Broom Gareth Patterson Whirlwind
International Broom Sandra Patterson Whirlwind
International Broom Gareth Patterson Sweeper
International Broom Sandra Patterson Sweeper
This type of redundancy reflects the fact that the design does not conform to 4NF. We correct the design by separating
facts about founders from facts about logos.
Suppliers and Parts By Company Division
Company Division Part Type Supplier
Allied Clock and Watch Clocks Spring Tensile Globodynamics
Allied Clock and Watch Clocks Pendulum Tensile Globodynamics
Allied Clock and Watch Clocks Spring Pieza de Acero
Allied Clock and Watch Clocks Toothed Wheel Pieza de Acero
Allied Clock and Watch Watches Quartz Crystal Microflux
Allied Clock and Watch Watches Tuning Fork Microflux
Allied Clock and Watch Watches Battery Dakota Electrics
Global Robot Industrial Robots Flywheel Wheels 4 Less
Global Robot Industrial Robots Axle Wheels 4 Less
Global Robot Industrial Robots Axle TransEuropa
Global Robot Industrial Robots Mechanical Arm TransEuropa
Global Robot Domestic Robots Artificial Brain Prometheus Labs
Global Robot Domestic Robots Artificial Brain Frankenstein Labs
Global Robot Domestic Robots Metal Housing Pieza de Acero
Global Robot Domestic Robots Backplate Pieza de Acero

Companies

Company

Allied Clock and Watch

Global Robot

International Broom

Company Logos

Company Company Logo

Allied Clock and Watch Sundial

Global Robot Gearbox

International Broom Whirlwind

International Broom Sweeper


Company Founders

Company Company Founder

Allied Clock and Watch Horace Washington

Global Robot Nils Neumann

International Broom Gareth Patterson

International Broom Sandra Patterson

Suppliers

Supplier Supplier Country

Tensile Globodynamics USA

Pieza de Acero Mexico

Microflux Belgium

Dakota Electrics USA

Wheels 4 Less USA

TransEuropa Italy

Prometheus Labs Luxembourg

Frankenstein Labs Germany


Countries

Country Continent

USA N. Amer.

Mexico N. Amer.

Belgium Europe

Italy Europe

Luxembourg Europe

5NF

We know that the Clocks division of Allied Clock and Watch relies upon its suppliers to provide springs, pendulums, and
toothed wheels. We also know that the Clocks division deals with suppliers Tensile Globodynamics and Pieza de Acero.
Let us suppose for the sake of the example that the following rule applies: if a supplier that a division deals with offers a
part that the division needs, the division will always purchase it. If, for example, Tensile Globodynamics start producing
Toothed Wheels, then Allied Clock and Watch will start purchasing them. This rule leads to redundancy in our design as it
stands, causing it to fall short of 5NF. We correct the design by recording part-types-by-company-division separately from
suppliers-by-company-division, and adding a further table that provides information as to which suppliers offer which
parts.

Part Types By Company Division

Company Division Part Type

Allied Clock and Watch Clocks Spring

Allied Clock and Watch Clocks Pendulum

Allied Clock and Watch Clocks Toothed Wheel

Allied Clock and Watch Watches Quartz Crystal


Allied Clock and Watch Watches Tuning Fork

Allied Clock and Watch Watches Battery

Global Robot Industrial Robots Flywheel

Global Robot Industrial Robots Axle

Global Robot Industrial Robots Mechanical Arm

Global Robot Domestic Robots Artificial Brain

Global Robot Domestic Robots Metal Housing

Global Robot Domestic Robots Backplate

Suppliers By Company Division

Company Division Supplier

Allied Clock and Watch Clocks Tensile Globodynamics

Allied Clock and Watch Clocks Pieza de Acero

Allied Clock and Watch Watches Microflux

Allied Clock and Watch Watches Dakota Electrics

Global Robot Industrial Robots Wheels 4 Less

Global Robot Industrial Robots TransEuropa

Global Robot Domestic Robots Prometheus Labs


Global Robot Domestic Robots Frankenstein Labs

Global Robot Domestic Robots Pieza de Acero

Parts By Supplier

Part Type Supplier

Spring Tensile Globodynamics

Pendulum Tensile Globodynamics

Spring Pieza de Acero

Toothed Wheel Pieza de Acero

Quartz Crystal Microflux

Tuning Fork Microflux

Battery Dakota Electrics

Flywheel Wheels 4 Less

Axle Wheels 4 Less

Axle TransEuropa

Mechanical Arm TransEuropa

Artificial Brain Prometheus Labs


Artificial Brain Frankenstein Labs

Metal Housing Pieza de Acero

Backplate Pieza de Acero

Companies

Company Company Logo

Allied Clock and Watch Sundial

Global Robot Gearbox

Company Founders

Company Company Founder

Allied Clock and Watch Horace Washington

Global Robot Nils Neumann

International Broom Gareth Patterson

International Broom Sandra Patterson

Suppliers

Supplier Supplier Country


Tensile Globodynamics USA

Pieza de Acero Mexico

Microflux Belgium

Dakota Electrics USA

Wheels 4 Less USA

TransEuropa Italy

Prometheus Labs Luxembourg

Frankenstein Labs Germany

Countries

Country Continent

USA N. Amer.

Mexico N. Amer.

Belgium Europe

Italy Europe

Luxembourg Europe
Denormalization

Databases intended for Online Transaction Processing (OLTP) are typically more normalized than databases intended for
On Line Analytical Processing (OLAP). OLTP applications are characterized by a high volume of small transactions such
as updating a sales record at a supermarket checkout counter. The expectation is that each transaction will leave the
database in a consistent state. By contrast, databases intended for OLAP operations are primarily "read only" databases.
OLAP applications tend to extract historical data that has accumulated over a long period of time. For such databases,
redundant or "denormalized" data may facilitate Business Intelligence applications. Specifically, dimensional tables in a
star schema often contain denormalized data. The denormalized or redundant data must be carefully controlled during
ETL processing, and users should not be permitted to see the data until it is in a consistent state. The normalized
alternative to the star schema is the snowflake schema.

Denormalization is also used to improve performance on smaller computers as in computerized cash-registers. Since
these use the data for look-up only (e.g. price lookups), no changes are to be made to the data and a swift response is
crucial.

Non-first normal form (NF²)

In recognition that denormalization can be deliberate and useful, the non-first normal form is a definition of database
designs which do not conform to the first normal form, by allowing "sets and sets of sets to be attribute domains" (Schek
1982). This extension introduces hierarchies in relations.

Consider the following table:

Non-First Normal Form

Person Favorite Colors

Bob blue, red

Jane green, yellow, red

Assume a person has several favorite colors. Obviously, favorite colors consist of a set of colors modeled by the given
table.

To transform this NF² table into 1NF, an "unnest" operator is required, which extends the relational algebra of the higher
normal forms. The reverse operator is called "nest". "Nest" is not always the mathematical inverse of "unnest", although
"unnest" is the mathematical inverse of "nest". A further constraint is that the operators be bijective, which is
covered by the Partitioned Normal Form (PNF).
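
A sketch of the unnested 1NF equivalent in SQL, using invented table and column names: each person/color pair
becomes its own row, so the attribute domain no longer contains sets.

CREATE TABLE favorite_colors (
person VARCHAR (20),
color VARCHAR (20),
PRIMARY KEY (person, color)
);
INSERT INTO favorite_colors VALUES ('Bob', 'blue');
INSERT INTO favorite_colors VALUES ('Bob', 'red');
INSERT INTO favorite_colors VALUES ('Jane', 'green');
INSERT INTO favorite_colors VALUES ('Jane', 'yellow');
INSERT INTO favorite_colors VALUES ('Jane', 'red');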

Further reading

 Litt's Tips: Normalization


 Date, C. J., & Lorentzos, N., & Darwen, H. (2002). Temporal Data & the Relational Model (1st ed.). Morgan
Kaufmann. ISBN 1-55860-855-9.
 Date, C. J. (1999), An Introduction to Database Systems (8th ed.). Addison-Wesley Longman. ISBN 0-321-
19784-4.
 Kent, W. (1983) A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the
ACM, vol. 26, pp. 120-125
 Date, C.J., & Darwen, H., & Pascal, F. Database Debunkings
 H.-J. Schek, P.Pistor Data Structures for an Integrated Data Base Management and Information Retrieval
System

References
1. ^ His term eliminate is misleading, as nothing is "lost" in normalization. He probably described eliminate in a
mathematical sense to mean elimination of complexity.
2. ^ Codd, Edgar F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of
the ACM 13 (6): 377-387.
3. ^ DBDebunk.

See also

 Aspect (computer science)


 Cross-cutting concern
 Inheritance semantics
 Functional normalization
 Orthogonalization
 Refactoring

External links

 Database Normalization Basics by Mike Chapple (About.com)


 Database Normalization Intro, Part 2
 An Introduction to Database Normalization by Mike Hillyer.
 Normalization by ITS, University of Texas.
 Rules of Data Normalization by Data Model.org
 A tutorial on the first 3 normal forms by Fred Coulson
 Free PDF poster available by Marc Rettig
 Description of the database normalization basics by Microsoft

Database Normalization Basics


If you've been working with databases for a while, chances are you've heard the term
normalization. Perhaps someone's asked you "Is that database normalized?" or "Is that in
BCNF?" All too often, the reply is "Uh, yeah." Normalization is often brushed aside as a luxury
that only academics have time for. However, knowing the principles of normalization and applying
them to your daily database design tasks really isn't all that complicated and it could drastically
improve the performance of your DBMS.

In this article, we'll introduce the concept of normalization and take a brief look at the most
common normal forms. Future articles will provide in-depth explorations of the normalization
process.

So, what is normalization? Basically, it's the process of efficiently organizing data in a database.
There are two goals of the normalization process: eliminate redundant data (for example, storing
the same data in more than one table) and ensure data dependencies make sense (only storing
related data in a table). Both of these are worthy goals as they reduce the amount of space a
database consumes and ensure that data is logically stored.

The database community has developed a series of guidelines for ensuring that databases are
normalized. These are referred to as normal forms and are numbered from one (the lowest form
of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In
practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth
normal form is very rarely seen and won't be discussed in this article.

Before we begin our discussion of the normal forms, it's important to point out that they are
guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet
practical business requirements. However, when variations take place, it's extremely important to
evaluate any possible ramifications they could have on your system and account for possible
inconsistencies. That said, let's explore the normal forms.

First normal form (1NF) sets the very basic rules for an organized database:

• Eliminate duplicative columns from the same table.


• Create separate tables for each group of related data and identify each row with a unique
column or set of columns (the primary key).

Second normal form (2NF) further addresses the concept of removing duplicative data:

• Meet all the requirements of the first normal form.


• Remove subsets of data that apply to multiple rows of a table and place them in separate
tables.
• Create relationships between these new tables and their predecessors through the use of
foreign keys.

Third normal form (3NF) goes one large step further:

• Meet all the requirements of the second normal form.


• Remove columns that are not dependent upon the primary key.

Finally, fourth normal form (4NF) has one additional requirement:

• Meet all the requirements of the third normal form.


• A relation is in 4NF if it has no non-trivial multi-valued dependencies other than those on a superkey.

Remember, these normalization guidelines are cumulative. For a database to be in 2NF, it must
first fulfill all the criteria of a 1NF database.

Network model
Database models

Common models

 Hierarchical
 Network
 Relational
 Object-relational

 Object
Other models

 Associative
 Concept-oriented
 Multi-dimensional
 Star schema

 XML database

The network model is a database model conceived as a flexible way of representing objects and
their relationships. Its original inventor was Charles Bachman, and it was developed into a
standard specification published in 1969 by the CODASYL Consortium. Where the hierarchical
model structures data as a tree of records, with each record having one parent record and many
children, the network model allows each record to have multiple parent and child records, forming
a lattice structure.

The chief argument in favour of the network model, in comparison to the hierarchic model, was
that it allowed a more natural modeling of relationships between entities. Although the model was
widely implemented and used, it failed to become dominant for two main reasons. Firstly, IBM
chose to stick to the hierarchical model with semi-network extensions in their established
products such as IMS and DL/I. Secondly, it was eventually displaced by the relational model,
which offered a higher-level, more declarative interface. Until the early 1980s the performance
benefits of the low-level navigational interfaces offered by hierarchical and network databases
were persuasive for many large-scale applications, but as hardware became faster, the extra
productivity and flexibility of the relational model led to the gradual replacement of the network
model in corporate enterprise usage.

The navigational interface offered by the network model bears some resemblance to the
hyperlink-based models that have become popular with the advent of the Internet and World
Wide Web. However, the network model (like the relational model) assumes that the entire
database has a centrally-managed schema, and as such it is not well suited to distributed,
heterogeneous environments.

Contents

 1 History
 2 See also
 3 References

 4 External links

History
In 1969, the Conference on Data Systems Languages (CODASYL) established the first
specification of the network database model. This was followed by a second publication in 1971,
which became the basis for most implementations. Subsequent work continued into the early
1980s, culminating in an ISO specification, but this had little influence on products.

Hierarchical model
Database models

Common models
 Hierarchical
 Network
 Relational
 Object-relational

 Object
Other models

 Associative
 Concept-oriented
 Multi-dimensional
 Star schema

 XML database

In a hierarchical data model, data are organized into a tree-like structure. The structure allows
repeating information using parent/child relationships: each parent can have many children but
each child only has one parent. All attributes of a specific record are listed under an entity type. In
a database, an entity type is the equivalent of a table; each individual record is represented as a
row and an attribute as a column. Entity types are related to each other using 1:N mapping, also
known as one-to-many relationships.

An example of a hierarchical data model would be if an organization had records of employees
in a table (entity type) called "Employees". In the table there would be attributes/columns such as
First Name, Last Name, Job Name and Wage. The company also has data about the employee’s
children in a separate table called "Children" with attributes such as First Name, Last Name, and
DOB. The Employee table represents a parent segment and the Children table represents a Child
segment. These two segments form a hierarchy where an employee may have many children, but
each child may only have one parent.
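
To make the parent/child segment idea concrete, here is a minimal sketch (not part of the original
example) of the "Employees"/"Children" hierarchy as nested records in Python; the names and
values are invented purely for illustration.

# A hypothetical Employees/Children hierarchy as nested records.
# Field names follow the example above; the data is made up.
employees = [
    {
        "first_name": "Alice", "last_name": "Smith",
        "job_name": "Engineer", "wage": 52000,
        "children": [  # child segments: each child belongs to exactly one parent
            {"first_name": "Ben", "last_name": "Smith", "dob": "2001-04-12"},
            {"first_name": "Cara", "last_name": "Smith", "dob": "2003-09-30"},
        ],
    },
]

# Navigation always starts at the parent segment and descends to its children.
for emp in employees:
    for child in emp["children"]:
        print(emp["first_name"], "->", child["first_name"])

Navigation always starting at the parent segment is exactly the property that makes some queries
easy and others hard, as the next paragraph notes.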

Hierarchical structures were widely used in the first mainframe database management systems.
Hierarchical relationships between different types of data can make it very easy to answer some
questions, but very difficult to answer others. If a one-to-many relationship is violated (e.g. a
patient can have more than one physician), then the hierarchy becomes a network.[1]

The most common form of hierarchical model used currently is the LDAP model. Other than that,
the hierarchical model is rare in modern databases. It is, however, common in many other means
of storing information, ranging from file systems to the Windows registry to XML documents.

Contents

 1 Tree Data structure in Relational Model
 2 Some Well-known Hierarchical Databases
 3 References

 4 External links
Tree Data structure in Relational Model
See Chapter 23, 'Logic-Based Databases', of An Introduction to Database Systems by C. J. Date
(seventh edition).

In the relational database model, an example of hierarchical data is the hierarchy of departmental
responsibility, or 'who reports to whom'.

Consider the following table:

Employee_Table

EmpNo   Designation
10      Director
20      Senior Manager
30      Typist
40      Programmer

Their hierarchy, in which EmpNo 10 is the boss of 20 and 30, and 40 reports to 20, is represented
by the following table:

WhoIsBoss_Table

BossEmpNo   ReportingEmpNo
10          20
10          30
20          40
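
One way to query such an adjacency table is with a recursive common table expression. The
sketch below uses Python's built-in sqlite3 module purely for illustration (it assumes a bundled
SQLite new enough to support WITH RECURSIVE, roughly 3.8.3 or later); the table and column
names follow the example above.

# Walk the WhoIsBoss_Table with a recursive CTE (illustrative sketch).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Employee_Table (EmpNo INTEGER PRIMARY KEY, Designation TEXT);
CREATE TABLE WhoIsBoss_Table (BossEmpNo INTEGER, ReportingEmpNo INTEGER);
INSERT INTO Employee_Table VALUES (10,'Director'),(20,'Senior Manager'),(30,'Typist'),(40,'Programmer');
INSERT INTO WhoIsBoss_Table VALUES (10,20),(10,30),(20,40);
""")

# Everyone who ultimately reports to EmpNo 10, directly or indirectly.
rows = con.execute("""
WITH RECURSIVE subordinates(EmpNo) AS (
    SELECT ReportingEmpNo FROM WhoIsBoss_Table WHERE BossEmpNo = 10
    UNION
    SELECT w.ReportingEmpNo
    FROM WhoIsBoss_Table w JOIN subordinates s ON w.BossEmpNo = s.EmpNo
)
SELECT EmpNo FROM subordinates ORDER BY EmpNo;
""").fetchall()
print(rows)   # [(20,), (30,), (40,)]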
In the example above, if a person does not report to two bosses, then the hierarchy tree is of the
type 'a child has only one parent'. Now let us look at a hierarchy of the type 'a child with many
parents'. A simple example is the bill of materials of an engineering assembly.

A car engine could have two different assemblies, both containing some of the same parts.

Consider the following table:

Engine_Part_Master

PartNum   Description
10        CrankAssembly
20        HeadAssembly
30        ConnectingRod
40        Crank Shaft
90        3/4 Dia Bolt

The assembly hierarchy is described in the following table:

Engine_Assembly

Parent_PartNum   Child_PartNum
10               30
10               90
20               40
20               90
...

PartNum 90 has 2 parents, as it is present in both assemblies.
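
A hedged sketch of how the 'many parents' case can be queried: the following Python/sqlite3
snippet loads the Engine_Assembly rows above and lists every child part that appears under more
than one parent assembly.

# Find parts with more than one parent in the bill of materials (illustrative).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Engine_Assembly (Parent_PartNum INTEGER, Child_PartNum INTEGER);
INSERT INTO Engine_Assembly VALUES (10,30),(10,90),(20,40),(20,90);
""")
multi_parent = con.execute("""
SELECT Child_PartNum, COUNT(*) AS parents
FROM Engine_Assembly
GROUP BY Child_PartNum
HAVING COUNT(*) > 1;
""").fetchall()
print(multi_parent)   # [(90, 2)] -- PartNum 90 is shared by both assemblies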

Some Well-known Hierarchical Databases

 Adabas
 GT.M
 IMS
 MUMPS
 Caché (software)
 Metakit
 Multidimensional hierarchical toolkit
 Mumps compiler
 DMSII

Data dictionary


A data dictionary is a set of metadata that contains definitions and representations of data
elements. Within the context of a DBMS, a data dictionary is a read-only set of tables and views.
Amongst other things, a data dictionary holds the following information:

 Precise definition of data elements
 Usernames, roles and privileges
 Schema objects
 Integrity constraints
 Stored procedures and triggers
 General database structure
 Space allocations

One benefit of a well-prepared data dictionary is consistency of data items across different tables.
For example, several tables may hold telephone numbers; using a data dictionary, the format of
the telephone number field will be consistent.

When an organization builds an enterprise-wide data dictionary, it may include both semantics
and representational definitions for data elements. The semantic components focus on creating
precise meaning of data elements. Representation definitions include how data elements are
stored in a computer structure such as an integer, string or date format (see data type). Data
dictionaries are one step along a pathway of creating precise semantic definitions for an
organization.
Initially, data dictionaries are sometimes simply a collection of database columns together with
definitions of the meaning and type of each column. Data dictionaries are more
precise than glossaries (terms and definitions) because they frequently have one or more
representations of how data is structured. Data dictionaries are usually separate from data
models since data models usually include complex relationships between data elements.

Data dictionaries can evolve into full ontologies (in the computer science sense) when discrete
logic is added to data element definitions.
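
As a concrete illustration, most database systems expose data-dictionary information through
catalog tables or pragmas. The sketch below is SQLite-specific (PRAGMA table_info and the
sqlite_master catalog); other systems use interfaces such as information_schema instead, and
the table here is invented for the example.

# Reading data-dictionary-style metadata from SQLite (illustrative sketch).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, phone TEXT)")

# Column definitions: name, declared type, NOT NULL flag, default value, PK flag.
for cid, name, ctype, notnull, default, pk in con.execute("PRAGMA table_info(customers)"):
    print(name, ctype, "PK" if pk else "")

# The schema objects themselves are catalogued in sqlite_master.
for name, sql in con.execute("SELECT name, sql FROM sqlite_master"):
    print(name, sql)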

B+ tree

[Figure: A simple B+ tree example linking the keys 1-7 to data values d1-d7; the linked list (red)
allows rapid in-order traversal.]

In computer science, a B+ tree is a type of tree data structure. It represents sorted data in a way
that allows for efficient insertion and removal of elements. It is a dynamic, multilevel index with
maximum and minimum bounds on the number of keys in each node. The NTFS filesystem for
Microsoft Windows, ReiserFS filesystem for Linux, XFS filesystem for IRIX and Linux and JFS2
filesystem for AIX, OS/2 and Linux use this type of tree.

A B+ tree is a variation on a B-tree. In a B+ tree, in contrast to a B-tree, all data is saved in the
leaves. Internal nodes contain only keys and tree pointers. All leaves are at the same lowest
level. Leaf nodes are also linked together as a linked list to make range queries easy.

The maximum number of pointers in a record is called the order of the B+ tree.

The minimum number of keys per record is 1/2 of the maximum number of keys. For example, if
the order of a B+ tree is n+1, each node (except for the root) must have between (n+1)/2 and n
keys. If n is an odd number, the minimum number of keys can be either (n + 1)/2 or (n - 1)/2, but it
must be the same in the whole tree.
The number of keys that may be indexed using a B+ tree is a function of the order of the tree and
its height.

For an n-order B+ tree with a height of h:

 maximum number of nodes is n^h
 minimum number of keys is 2(n/2)^(h-1).

The B+ tree was first described in Rudolf Bayer and Edward M. McCreight, "Organization and
Maintenance of Large Ordered Indices", Acta Informatica 1: 173-189 (1972).

An extension of the B+ tree is the B# tree, which uses the B+ tree structure and adds further
restrictions.
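
A small calculator for the bounds given above; it is only a sketch, and it rounds n/2 up in line with
the minimum-occupancy rule described earlier.

# Node and key bounds of a B+ tree of order n and height h (sketch).
import math

def bplus_bounds(n: int, h: int):
    max_nodes = n ** h                           # maximum number of nodes: n^h
    min_keys = 2 * math.ceil(n / 2) ** (h - 1)   # minimum number of keys: 2*ceil(n/2)^(h-1)
    return max_nodes, min_keys

print(bplus_bounds(100, 3))   # e.g. a 100-order tree of height 3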

Computer storage


[Image: 1 GiB of SDRAM mounted in a personal computer]

Computer storage, computer memory, and often casually memory refer to computer
components, devices and recording media that retain data for some interval of time. Computer
storage provides one of the core functions of the modern computer, that of information retention.
It is one of the fundamental components of all modern computers, and coupled with a central
processing unit (CPU), implements the basic Von Neumann computer model used since the
1940s.

In contemporary usage, memory usually refers to a form of solid state storage known as random
access memory (RAM) and sometimes other forms of fast but temporary storage. Similarly,
storage more commonly refers to mass storage - optical discs, forms of magnetic storage like
hard disks, and other types of storage which are slower than RAM, but of a more permanent
nature. These contemporary distinctions are helpful because they are also fundamental to the
architecture of computers in general. They also reflect a significant technical difference between
memory and mass storage devices, one that has been blurred by the historical usage of the terms
"main storage" (and sometimes "primary storage") for random access memory, and "secondary
storage" for mass storage devices. This is explained in the following sections, in which the
traditional "storage" terms are used as sub-headings for convenience.

Contents

 1 Purposes of storage
 1.1 Primary storage
 1.2 Secondary and off-line storage
 1.3 Tertiary and database storage
 1.4 Network storage
 2 Characteristics of storage
 2.1 Volatility of information
 2.2 Ability to access non-contiguous information
 2.3 Ability to change information
 2.4 Addressability of information
 2.5 Capacity and performance
 3 Technologies, devices and media
 3.1 Magnetic storage
 3.2 Semiconductor storage
 3.3 Optical disc storage
 3.3.1 Magneto-optical disc storage
 3.3.2 Ultra Density Optical disc storage
 3.3.3 Optical Jukebox storage
 3.4 Other early methods
 3.5 Other proposed methods
 3.6 Primary storage topics
 3.7 Secondary, tertiary and off-line storage topics
 3.8 Data storage conferences

 4 References

Purposes of storage

The fundamental components of a general-purpose computer are arithmetic and logic unit,
control circuitry, storage space, and input/output devices. If storage were removed, the device we
had would be a simple digital signal processing device (e.g. a calculator or media player) instead
of a computer. The ability to store both the instructions that form a computer program and the
information that those instructions manipulate is what makes stored-program computers versatile.

A digital computer represents information using the binary numeral system. Text, numbers,
pictures, audio, and nearly any other form of information can be converted into a string of bits, or
binary digits, each of which has a value of 1 or 0. The most common unit of storage is the byte,
equal to 8 bits. A piece of information can be manipulated by any computer whose storage space
is large enough to accommodate the corresponding data, or the binary representation of the
piece of information. For example, a computer with a storage space of eight million bits, or one
megabyte, could be used to edit a small novel.

[Diagram: Various forms of storage, divided according to their distance from the central processing
unit; the common technology and capacity found in home computers of 2005 is indicated next to
some items.]

Various forms of storage, based on various natural phenomena, have been invented. So far, no
practical universal storage medium exists, and all forms of storage have some drawbacks.
Therefore a computer system usually contains several kinds of storage, each with an individual
purpose, as shown in the diagram.

Primary storage

Primary storage is directly connected to the central processing unit of the computer. It must be
present for the CPU to function correctly, just as in a biological analogy the lungs must be present
(for oxygen storage) for the heart to function (to pump and oxygenate the blood). As shown in the
diagram, primary storage typically consists of three kinds of storage:

 Processor registers are internal to the central processing unit. Registers contain
information that the arithmetic and logic unit needs to carry out the current instruction.
They are technically the fastest of all forms of computer storage, being switching
transistors integrated on the CPU's silicon chip, and functioning as electronic "flip-flops".
 Cache memory is a special type of internal memory used by many central processing
units to increase their performance or "throughput". Some of the information in the main
memory is duplicated in the cache memory, which is slightly slower but of much greater
capacity than the processor registers, and faster but much smaller than main memory.
Multi-level cache memory is also commonly used—"primary cache" being smallest,
fastest and closest to the processing device; "secondary cache" being larger and slower,
but still faster and much smaller than main memory.
 Main memory contains the programs that are currently being run and the data the
programs are operating on. In modern computers, the main memory is the electronic
solid-state random access memory. It is directly connected to the CPU via a "memory
bus" (shown in the diagram) and a "data bus". The arithmetic and logic unit can very
quickly transfer information between a processor register and locations in main storage, also
known as "memory addresses". The memory bus is also called an address bus or
front side bus and both busses are high-speed digital "superhighways". Access methods
and speed are two of the fundamental technical differences between memory and mass
storage devices. (Note that all memory sizes and storage capacities shown in the
diagram will inevitably be exceeded with advances in technology over time.)

Secondary and off-line storage

Secondary storage requires the computer to use its input/output channels to access the
information, and is used for long-term storage of persistent information. However, most computer
operating systems also use secondary storage devices as virtual memory, to artificially increase
the apparent amount of main memory in the computer. Secondary storage is also known as
"mass storage", as shown in the diagram above. Secondary or mass storage is typically of much
greater capacity than primary storage (main memory), but it is also much slower. In modern
computers, hard disks are usually used for mass storage. The time taken to access a given byte
of information stored on a hard disk is typically a few thousandths of a second, or milliseconds.
By contrast, the time taken to access a given byte of information stored in random access
memory is measured in thousand-millionths of a second, or nanoseconds. This illustrates the very
significant speed difference which distinguishes solid-state memory from rotating magnetic
storage devices: hard disks are typically about a million times slower than memory. Rotating
optical storage devices, such as CD and DVD drives, are typically even slower than hard disks,
although their access speeds are likely to improve with advances in technology. Therefore, the
use of virtual memory, which is millions of times slower than "real" memory, significantly degrades
the performance of any computer. Virtual memory is implemented by many operating systems
using terms like swap file or "cache file". The main historical advantage of virtual memory was
that it was much less expensive than real memory. That advantage is less relevant today, yet
surprisingly most operating systems continue to implement it, despite the significant performance
penalties.

Off-line storage is a system where the storage medium can be easily removed from the storage
device. Off-line storage is used for data transfer and archival purposes. In modern computers,
CDs, DVDs, memory cards, flash memory devices including "USB drives", floppy disks, Zip disks
and magnetic tapes are commonly used for off-line mass storage purposes. "Hot-pluggable" USB
hard disks are also available. Off-line storage devices used in the past include punched cards,
microforms, and removable Winchester disk drums.
Tertiary and database storage

Tertiary storage is a system where a robotic arm will "mount" (connect) or "dismount" off-line
mass storage media (see the next item) according to the computer operating system's demands.
Tertiary storage is used in the realms of enterprise storage and scientific computing on large
computer systems and business computer networks, and is something a typical personal
computer user never sees firsthand.

Database storage is a system where information in computers is stored in large databases, data
banks, data warehouses, or data vaults. It involves packing and storing large amounts of storage
devices throughout a series of shelves in a room, usually an office, all linked together. The
information in database storage systems can be accessed by a supercomputer, mainframe
computer, or personal computer. Databases, data banks, and data warehouses, etc, can only be
accessed by authorized users.

Network storage

Network storage is any type of computer storage that involves accessing information over a
computer network. Network storage arguably allows information management to be centralized
in an organization, and reduces the duplication of information. Network storage includes:

 Network-attached storage is secondary or tertiary storage attached to a computer which
another computer can access at file level over a local-area network, a private wide-area
network, or in the case of online file storage, over the Internet.
 A storage area network provides other computers with storage capacity over a network.
The crucial difference between network-attached storage (NAS) and a storage area
network (SAN) is that the former presents and manages file systems to client computers,
whilst a SAN provides access to disks at block addressing level, leaving it to the attaching
systems to manage data or file systems within the provided capacity. See storage area
network for a fuller description.
 Network computers are computers that do not contain internal secondary storage
devices. Instead, documents and other data are stored on a network-attached storage.

Confusingly, these terms are sometimes used differently. Primary storage can be used to refer
to local random-access disk storage, which should properly be called secondary storage. If this
type of storage is called primary storage, then the term secondary storage would refer to offline,
sequential-access storage like tape media.

Characteristics of storage

The division into primary, secondary, tertiary and off-line storage is based on memory hierarchy,
or distance from the central processing unit. There are also other ways to characterize various
types of storage.

Volatility of information

 Volatile memory requires constant power to maintain the stored information. Volatile
memory is typically used only for primary storage. (Primary storage is not necessarily
volatile, even though today's most cost-effective primary storage technologies are. Non-
volatile technologies have been widely used for primary storage in the past and may
again be in the future.)
 Non-volatile memory will retain the stored information even if it is not constantly
supplied with electric power. It is suitable for long-term storage of information, and
therefore used for secondary, tertiary, and off-line storage.
 Dynamic memory is volatile memory which also requires that stored information is
periodically refreshed, or read and rewritten without modifications.

Ability to access non-contiguous information

 Random access means that any location in storage can be accessed at any moment in
the same, usually small, amount of time. This makes random access memory well suited
for primary storage.
 Sequential access means that accessing a piece of information will take a varying
amount of time, depending on which piece of information was accessed last. The device
may need to seek (e.g. to position the read/write head correctly), or cycle (e.g. to wait for
the correct location in a revolving medium to appear below the read/write head).

Ability to change information

 Read/write storage, or mutable storage, allows information to be overwritten at any time.
A computer without some amount of read/write storage for primary storage purposes
would be useless for many tasks. Modern computers typically use read/write storage also
for secondary storage.
 Read only storage retains the information stored at the time of manufacture, and write
once storage (WORM) allows the information to be written only once at some point after
manufacture. These are called immutable storage. Immutable storage is used for
tertiary and off-line storage. Examples include CD-R.
 Slow write, fast read storage is read/write storage which allows information to be
overwritten multiple times, but with the write operation being much slower than the read
operation. Examples include CD-RW.

Addressability of information

 In location-addressable storage, each individually accessible unit of information in storage
is selected with its numerical memory address. In modern computers, location-addressable
storage is usually limited to primary storage, accessed internally by computer programs,
since location-addressability is very efficient but burdensome for humans.
 In file system storage, information is divided into files of variable length, and a particular
file is selected with human-readable directory and file names. The underlying device is
still location-addressable, but the operating system of a computer provides the file system
abstraction to make the operation more understandable. In modern computers,
secondary, tertiary and off-line storage use file systems.
 In content-addressable storage, each individually accessible unit of information is selected
with a hash value, or a short identifier with no direct relation to the memory address the
information is stored on. Content-addressable storage can be implemented in software
(a computer program) or in hardware (a computer device), with hardware being the faster
but more expensive option.
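
The idea behind content addressing can be sketched in a few lines: the "address" of a block is a
hash of the block itself. The class and names below are invented for illustration; this is a toy
sketch, not a production design.

# A toy content-addressable store: data is stored and retrieved by its hash.
import hashlib

class ContentStore:
    def __init__(self):
        self._blocks = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()  # the "address" is derived from the content
        self._blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._blocks[key]

store = ContentStore()
addr = store.put(b"hello, storage")
print(addr[:12], store.get(addr))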

Capacity and performance


 Storage capacity is the total amount of stored information that a storage device or
medium can hold. It is expressed as a quantity of bits or bytes (e.g. 10.4 megabytes).
 Storage density refers to the compactness of stored information. It is the storage capacity
of a medium divided by a unit of length, area or volume (e.g. 1.2 megabytes per square
centimeter).
 Latency is the time it takes to access a particular location in storage. The relevant unit of
measurement is typically nanosecond for primary storage, millisecond for secondary
storage, and second for tertiary storage. It may make sense to separate read latency and
write latency, and in case of sequential access storage, minimum, maximum and
average latency.
 Throughput is the rate at which information can be read from or written to the storage. In
computer storage, throughput is usually expressed in terms of megabytes per second or
MB/s, though bit rate may also be used. As with latency, read rate and write rate may need
to be differentiated.
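
A rough back-of-the-envelope sketch of how latency and throughput combine: the time to read a
block is approximately the latency plus the size divided by the throughput. The figures used below
are assumptions, not measurements.

# Estimated time to read a block of data: latency + size / throughput (sketch).
def read_time_seconds(size_bytes, latency_s, throughput_bytes_per_s):
    return latency_s + size_bytes / throughput_bytes_per_s

one_mib = 1024 * 1024
# e.g. a hard disk with roughly 10 ms latency and 100 MB/s throughput
print(read_time_seconds(one_mib, 10e-3, 100e6))
# e.g. RAM with roughly 100 ns latency and 10 GB/s throughput
print(read_time_seconds(one_mib, 100e-9, 10e9))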

Technologies, devices and media

Magnetic storage

Magnetic storage uses different patterns of magnetization on a magnetically coated surface to
store information. Magnetic storage is non-volatile. The information is accessed using one or
more read/write heads. Since the read/write head only covers a part of the surface, magnetic
storage is sequential access and must seek, cycle or both. In modern computers, the magnetic
surface will take these forms:

 Magnetic disk
 Floppy disk, used for off-line storage
 Hard disk, used for secondary storage
 Magnetic tape data storage, used for tertiary and off-line storage

In early computers, magnetic storage was also used for primary storage, in the form of magnetic
drum memory, core memory, core rope memory, thin film memory, twistor memory or bubble
memory. Also unlike today, magnetic tape was often used for secondary storage.

Semiconductor storage

Semiconductor memory uses semiconductor-based integrated circuits to store information. A
semiconductor memory chip may contain millions of tiny transistors or capacitors. Both volatile
and non-volatile forms of semiconductor memory exist. In modern computers, primary storage
almost exclusively consists of dynamic volatile semiconductor memory or dynamic random
access memory. Since the turn of the century, a type of non-volatile semiconductor memory
known as flash memory has steadily gained share as off-line storage for home computers. Non-
volatile semiconductor memory is also used for secondary storage in various advanced electronic
devices and specialized computers.

Optical disc storage

Optical disc storage uses tiny pits etched on the surface of a circular disc to store information,
and reads this information by illuminating the surface with a laser diode and observing the
reflection. Optical disc storage is non-volatile and sequential access. The following forms are
currently in common use:

 CD, CD-ROM, DVD: Read only storage, used for mass distribution of digital information
(music, video, computer programs)
 CD-R, DVD-R, DVD+R: Write once storage, used for tertiary and off-line storage
 CD-RW, DVD-RW, DVD+RW, DVD-RAM: Slow write, fast read storage, used for tertiary
and off-line storage
 Blu-ray
 HD DVD

The following forms have also been proposed:

 Holographic Versatile Disc (HVD)
 Phase-change Dual

Magneto-optical disc storage

Magneto-optical disc storage is optical disc storage where the magnetic state on a
ferromagnetic surface stores information. The information is read optically and written by
combining magnetic and optical methods. Magneto-optical disc storage is non-volatile, sequential
access, slow write, fast read storage used for tertiary and off-line storage.

Ultra Density Optical disc storage

An Ultra Density Optical disc, or UDO, is a 5.25" ISO cartridge optical disc encased in a dust-proof
caddy which can store up to 30 GB of data. Using a design based on a magneto-optical disc, but
with phase-change technology combined with a blue-violet laser, a UDO disc can store
substantially more data than a magneto-optical disc (MO), because of the shorter wavelength
(405 nm) of the blue-violet laser employed. MOs use a 650 nm wavelength red laser. Because its
beam width is narrower when burning to a disc than that of the red laser used for MO, a blue-violet
laser allows more information to be stored digitally in the same amount of space.

Current generations of UDO store up to 30 GB, but 60 GB and 120 GB versions of UDO are in
development and are expected to arrive sometime in 2007 and beyond, though up to 500 GB has
been speculated as a possibility for UDO. [1]

Optical Jukebox storage

Optical jukebox storage is a robotic storage device that utilizes optical disc devices and can
automatically load and unload optical discs, providing terabytes of near-line information. The
devices are often called optical disk libraries, robotic drives, or autochangers. Jukebox devices
may have up to 1,000 slots for disks, and usually have a picking device that traverses the slots
and drives. The arrangement of the slots and picking devices affects performance, depending on
the space between a disk and the picking device. Seek times and transfer rates vary depending
upon the optical technology. Jukeboxes are used in high-capacity archive storage environments
such as imaging, medical, and video. Hierarchical storage management (HSM) is a strategy that
moves little-used or unused files from fast magnetic storage to optical jukebox devices in a
process called migration. If the files are needed, they are migrated back to magnetic disk.

Other early methods

Paper tape and punch cards have been used to store information for automatic processing
since the 1890s, long before general-purpose computers existed. Information was recorded by
punching holes into the paper or cardboard medium, and was read by electrically (or, later,
optically) sensing whether a particular location on the medium was solid or contained a hole.

The Williams tube used a cathode ray tube, and the Selectron tube used a large vacuum tube, to
store information. These primary storage devices were short-lived in the market, since the
Williams tube was unreliable and the Selectron tube was expensive.

Delay line memory used sound waves in a substance such as mercury to store information.
Delay line memory was dynamic volatile, cycle sequential read/write storage, and was used for
primary storage.

Other proposed methods

Phase-change memory uses different physical phases of a phase-change material to store
information, and reads the information by observing the varying electrical resistance of the material.
Phase-change memory would be non-volatile, random access read/write storage, and might be
used for primary, secondary and off-line storage.

Holographic storage stores information optically inside crystals or photopolymers. Holographic
storage can utilize the whole volume of the storage medium, unlike optical disc storage which is
limited to a small number of surface layers. Holographic storage would be non-volatile, sequential
access, and either write once or read/write storage. It might be used for secondary and off-line
storage.

Molecular memory stores information in polymers that can store electric charge. Molecular
memory might be especially suited for primary storage.

Primary storage topics

 Memory management
 Virtual memory
 Physical memory
 Memory allocation
 Dynamic memory
 Memory leak
 Memory protection
 Flash memory
 Solid state disk
 Dynamic random access memory
 Static random access memory

Secondary, tertiary and off-line storage topics

 List of file formats
 Wait state
 Write protection
 Virtual Tape Library

Data storage conferences

 Storage Decisions
 Storage Networking World
 Storage World Conference

Flat file database

[Diagram: Conversion of a CSV-format flat file database table into a relational database table, one
of several typical uses for a flat file database.]

A flat file database describes any of various means to encode a data model (most commonly a
table) as a plain text file.

Contents

 1 Flat files
 2 Implementation
 2.1 Historical implementations
 2.2 Contemporary implementations
 3 Terms
 4 Example database
 5 Flat-File relational database storage model
 5.1 Simple example name index on File1

 6 See also

Flat files

A flat file generally records one record per line. Fields may simply have a fixed width with
padding, or may be delimited by whitespace, tabs, commas (CSV) or other characters. Extra
formatting may be needed to avoid delimiter collision. There are no structural relationships. The
data are "flat" as in a sheet of paper, in contrast to more complex models such as a relational
database.

The classic example of a flat file database is a basic name-and-address list, where the database
consists of a small, fixed number of fields: Name, Address, and Phone Number. Another example
is a simple HTML table, consisting of rows and columns. This type of database is routinely
encountered, although often not expressly recognized as a database.

Implementation

It is possible to write out by hand, on a sheet of paper, a list of names, addresses, and phone
numbers; this is a flat file database. This can also be done with any typewriter or word processor.
But many pieces of computer software are designed to implement flat file databases.

Historical implementations

The first uses of computing machines were implementations of simple databases. Herman
Hollerith conceived the idea that any resident of the United States could be represented by a
string of exactly 80 digits and letters—name, age, and so forth, padded as needed with spaces to
make everyone's name the same length, so the database fields would "line up" properly. He sold
his concept, his machines, and the punched cards which both recorded and stored this data to
the US Census Bureau; thus, the Census of 1890 was the first ever computerized database—
consisting, in essence, of thousands of boxes full of punched cards.

Throughout the years following World War II, primitive electronic computers were run by
governments and corporations; these were very often used to implement flat file databases, the
most typical of which were accounting functions, such as payroll. Very quickly, though, these
wealthy customers demanded more from their extremely expensive machines, which led to early
relational databases. Amusingly enough, these early applications continued to use Hollerith
cards, slightly modified from the original design; Hollerith's enterprise grew into computer giant
IBM, which dominated the market of the time. The rigidity of the fixed-length field, 80-column
punch card driven database made the early computer a target of attack, and a mystery to the
common man.
In the 1980s, configurable flat-file database computer applications were popular on DOS and the
Macintosh. These programs were designed to make it easy for individuals to design and use their
own databases, and were almost on par with word processors and spreadsheets in popularity.
Examples of flat-file database products were early versions of FileMaker and the shareware PC-
File. Some of these offered limited relational capabilities, allowing some data to be shared
between files.

Contemporary implementations

Today, there are few programs designed to allow novices to create and use general-purpose flat
file databases. This function is implemented in Microsoft Works (available only for some versions
of Windows) and AppleWorks, sometimes named ClarisWorks (available for both Macintosh and
Windows platforms). Over time, products like Borland's Paradox, and Microsoft's Access started
offering some relational capabilities, as well as built-in programming languages. Database
Management Systems (DBMS) like MySQL or Oracle generally require programmers to build
applications.

Flat file databases are still used internally by many computer applications to store configuration
data. Many applications allow users to store and retrieve their own information from flat files using
a pre-defined set of fields. Examples are programs to manage collections of books or
appointments. Some small "contact" (name-and-address) database implementations essentially
use flat files.

XML is now a popular format for storing data in plain text files, but as XML allows very complex
nested data structures to be represented and contains the definition of the data, it would be
incorrect to describe this type of database as conforming to the flat-file model.

Terms

"Flat file database" may be defined very narrowly, or more broadly. The narrower interpretation is
correct in database theory; the broader covers the term as generally used.

Strictly, a flat file database should consist of nothing but data and delimiters. More broadly, the
term refers to any database which exists in a single file in the form of rows and columns, with no
relationships or links between records and fields except the table structure.

Terms used to describe different aspects of a database and its tools differ from one
implementation to the next, but the concepts remain the same. FileMaker uses the term "Find",
while MySQL uses the term "Query"; but the concept is the same. FileMaker "files" are equivalent
to MySQL "tables", and so forth. To avoid confusing the reader, one consistent set of terms is
used throughout this article.

However, the basic terms "record" and "field" are used in nearly every database implementation.

Example database
Consider a simple example database, storing a person's name, a numeric ID, and the team they
support. The data—the information itself—has simply been written out in table form:

id name team
1 Amy Blues
2 Bob Reds
3 Chuck Blues
4 Dick Blues
5 Ethel Reds
6 Fred Blues
7 Gilly Blues
8 Hank Reds

Note that the data in the first column is "all the same"—that is, they are all id numbers (serial
numbers). Likewise, the data in the second column is "all the same"—names. We have decided
that our users will gang up into teams, and some belong to the Reds and some to the Blues. All
these team designations are found in the third column only. These columns are called "fields".

Also note that all the information in, say, the third line from the top "belongs to" one person: Bob.
Bob's id is "2"; his name is "Bob" (no surprise!); and his team is the "Reds". Each line is called a
"record". (Sometimes, although this is not strictly correct, the word "field" refers to just one datum
within the file—the intersection of a field and a record.)

The first line is not a record at all, but a row of "field labels"—names that identify the contents of
the fields which they head. Some databases omit this, in which case the question is left open:
"What is in these fields?" The answer must be supplied elsewhere.

In this implementation, fields can be detected by the fact that they all "line up": each datum uses
up the same number of characters as all other data in the same column; extra spaces are added
to make them all the same length. This is a very primitive and brittle implementation, dating back
to the days of punch cards.

Today, the same effect is achieved by delimiting fields with a tab character; records are delimited
by a newline. This is "tab-separated" format. Other ways to implement the same database are:

"1","Amy","Blues"
"2","Bob","Reds"
"3","Chuck","Blues"
"4","Dick","Blues"
"5","Ethel","Reds"
"6","Fred","Blues"
"7","Gilly","Blues"
"8","Hank","Reds"

—which is "comma-separated" (CSV) format. We could also write:

1-Amy-Blues/2-Bob-Reds/3-Chuck-Blues/4-Dick-Blues/5-Ethel-Reds/6-Fred-Blues/7-Gilly-Blues/8-Hank-Reds/
All are equivalent databases.
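
Reading these encodings back is straightforward; the sketch below uses Python's standard csv
module for the comma-separated form and plain string splitting for the custom-delimited form
(only a fragment of the example data is included).

# Parsing two of the equivalent flat-file encodings shown above (sketch).
import csv, io

csv_text = '"1","Amy","Blues"\n"2","Bob","Reds"\n"3","Chuck","Blues"\n'
for row in csv.reader(io.StringIO(csv_text)):          # comma-separated (CSV)
    print(row)

dash_text = "1-Amy-Blues/2-Bob-Reds/3-Chuck-Blues/"
for record in dash_text.rstrip("/").split("/"):        # custom delimiters
    print(record.split("-"))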

There is not much we can do with such a simple database. We can look at it, and, depending on
the storage format, do a textual search for specific fields; if we can edit it at all, we can add new
records to it; we can edit the contents of any field. We can import the entire database into another
tool. Sometimes this is enough, but only for the most basic needs. Beyond those, we turn to a tool
designed for the task, a database management system.

An advantage of a database tool is that it is specifically designed for database management. We
can add, delete, or edit records or individual units of data. We can add additional records to the
file explicitly, via an 'insert' or equivalent command; we can define certain processes to take place
when this happens. We can add additional fields, too, extending the structure. We can choose to
control what kind of data may be stored in a given field. For instance, id is defined to hold only a
serial number, which is assigned automatically when a new record is created.

This is about the limit of what a simple flat file can do. For more advanced applications, relational
databases are usually used.

Flat-File relational database storage model

There is a difference between the concept of a single flat-file database as a flat database model
as used above, and multiple flat-file tables as a relational database model. In order for flat-files to
be part of a relational database the RDBMS must be able to recognize various foreign key
relationships between multiple flat-file tables.

File1

file-offset id name team
0x00 8 Hank Reds
0x13 1 Amy Blues
0x27 3 Chuck Blues
0x3B 4 Dick Blues
0x4F 5 Ethel Reds
0x62 7 Gilly Blues
0x76 6 Fred Blues
0x8A 2 Bob Reds

The file-offset isn't actually part of the database, rather it is only there for clarification.

File2

team arena
Blues le Grand Bleu
Reds Super Smirnoff Stadium

In this setting, flat files simply act as the data store of a modern relational database; all that is
additionally needed are separate files, supplied by the RDBMS, for storing indexes, constraints,
triggers, foreign key relationships, fragmentation plans, replication plans, and other modern
distributed relational database concepts.

Simple example name index on File1

0x00000013
0x0000008A
0x00000027
0x0000003B
0x0000004F
0x00000076
0x00000062
0x00000000
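
A hedged sketch of how such an index might be produced: collect the byte offset of each record,
then sort the offsets by the name field. The record tuples below simply restate the File1 listing;
the code itself is illustrative.

# Build a name index over File1 as a list of byte offsets sorted by name (sketch).
records = [
    (0x00, 8, "Hank",  "Reds"),
    (0x13, 1, "Amy",   "Blues"),
    (0x27, 3, "Chuck", "Blues"),
    (0x3B, 4, "Dick",  "Blues"),
    (0x4F, 5, "Ethel", "Reds"),
    (0x62, 7, "Gilly", "Blues"),
    (0x76, 6, "Fred",  "Blues"),
    (0x8A, 2, "Bob",   "Reds"),
]

# Sort the offsets by the name field; the result matches the index listing above.
name_index = [offset for offset, _id, name, _team in sorted(records, key=lambda r: r[2])]
print([f"0x{o:08X}" for o in name_index])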

Index (database)
A database index is a data structure that improves the speed of operations on a table. Indexes
can be created using one or more columns. The disk space required to store the index is typically
less than that required to store the table. In a relational database an index is a copy of part of a
table.

Some databases extend the power of indexes even further by allowing indexes to be created on
functions or expressions. For example, an index could be created on upper(last_name), which
would store only the uppercase versions of the last_name field in the index.

Indexes are defined as unique or non-unique. A unique index acts as a constraint on the table by
preventing identical rows in the index and thus, the original columns.

Contents

• 1 Architecture
• 2 Column order
• 3 Applications and limitations

• 4 See also

Architecture
Index architectures are classified as clustered or non-clustered. Clustered indexes are indexes that are built on the same
key by which the data is ordered on disk. In some relational database management systems such as Microsoft SQL Server,
the leaf node of the clustered index corresponds to the actual data, not simply a pointer to data that resides elsewhere, as is
the case with a non-clustered index. Because the clustered index corresponds (at the leaf level) to the actual data, the data
in the table is sorted according to the index, and therefore only one clustered index can exist in a given table (whereas many
non-clustered indexes can exist, limited by the particular RDBMS vendor). Unclustered indexes are indexes that are built on
any key. Each relation can have a single clustered index and many unclustered indexes. Clustered indexes usually store the
actual records within the data structure and as a result can be much faster than unclustered indexes. Unclustered indexes
are forced to store only record IDs in the data structure and require at least one additional I/O operation to retrieve the actual
record. 'Intrinsic' might be a better adjective than 'clustered', indicating that the index is an integral part of the data structure
storing the table.

Indexes can be implemented using a variety of data structures. Popular indexes include balanced trees, B+ trees and
hashes.

Column order
The order in which columns are listed in the index definition is important. It is possible to retrieve
a set of row identifiers using only the first indexed column. However, it is not possible or efficient
(on most databases) to retrieve the set of row identifiers using only the second or greater indexed
column.

For example, imagine a phone book that is organized by city first, then by last name, and then by
first name. If given the city, you can easily extract the list of all phone numbers for that city.
However, in this phone book it would be very tedious to find all the phone numbers for a given
last name. You would have to look within each city's section for the entries with that last name.
Some databases can do this, others just won’t use the index.
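
The phone-book situation can be reproduced with a composite index. The sketch below uses
SQLite's EXPLAIN QUERY PLAN, which typically reports an index search when the leading column
is constrained and a scan otherwise; the table and index names are invented, and other databases
report their plans differently.

# Column order in a composite index: leading column vs trailing column (sketch).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE phonebook (city TEXT, last_name TEXT, first_name TEXT, phone TEXT)")
con.execute("CREATE INDEX idx_city_last_first ON phonebook (city, last_name, first_name)")

# Leading column given: the index can be used (plan typically says SEARCH ... USING INDEX).
print(con.execute("EXPLAIN QUERY PLAN SELECT phone FROM phonebook WHERE city = 'Oslo'").fetchall())

# Only a trailing column given: SQLite typically falls back to scanning the table.
print(con.execute("EXPLAIN QUERY PLAN SELECT phone FROM phonebook WHERE last_name = 'Berg'").fetchall())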

Applications and limitations


Indexes are useful for many applications but come with some limitations. Consider the following
SQL statement: SELECT first_name FROM people WHERE last_name = 'Finkelstein';. To
process this statement without an index the database software must look at the last_name
column on every row in the table (this is known as a full table scan). With an index the database
simply follows the b-tree data structure until the Finkelstein entry has been found; this is much
less computationally expensive than a full table scan.

Consider this SQL statement: SELECT email_address FROM customers WHERE email_address
LIKE '%@yahoo.com';. This query would yield an email address for every customer whose email
address ends with "@yahoo.com", but even if the email_address column has been indexed the
database still must perform a full table scan. This is because the index is built with the
assumption that words go from left to right. With a wildcard at the beginning of the search-term
the database software is unable to use the underlying b-tree data structure. This problem can be
solved through the addition of another index created on reverse(email_address) and a SQL query
like this: select email_address from customers where reverse(email_address) like
reverse('%@yahoo.com');. This puts the wild-card at the right most part of the query (now
moc.oohay@%) which the index on reverse(email_address) can satisfy.
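
SQLite has no built-in reverse(), so the sketch below adapts the same idea by storing a
pre-reversed copy of the column and indexing that; whether the optimizer actually uses the index
for LIKE depends on the database's collation and LIKE-optimization rules, so treat this as an
illustration of the technique rather than a guaranteed plan.

# Reversed-column variant of the reverse() trick, adapted for SQLite (sketch).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (email_address TEXT, email_reversed TEXT)")
con.execute("CREATE INDEX idx_email_reversed ON customers (email_reversed)")

for email in ("amy@yahoo.com", "bob@example.org"):
    con.execute("INSERT INTO customers VALUES (?, ?)", (email, email[::-1]))

# Reversing the pattern moves the wildcard to the right-hand end: 'moc.oohay@%'.
pattern = "%@yahoo.com"[::-1]
rows = con.execute(
    "SELECT email_address FROM customers WHERE email_reversed LIKE ?", (pattern,)
).fetchall()
print(rows)   # [('amy@yahoo.com',)]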

File system
In computing, a file system (often also written as filesystem) is a method for storing and
organizing computer files and the data they contain to make it easy to find and access them. File
systems may use a storage device such as a hard disk or CD-ROM and involve maintaining the
physical location of the files, they might provide access to data on a file server by acting as clients
for a network protocol (e.g., NFS, SMB, or 9P clients), or they may be virtual and exist only as an
access method for virtual data (e.g. procfs).

More formally, a file system is a set of abstract data types that are implemented for the storage,
hierarchical organization, manipulation, navigation, access, and retrieval of data. File systems
share much in common with database technology, but it is debatable whether a file system can
be classified as a special-purpose database (DBMS).

Contents

 1 Aspects of file systems
 2 Types of file systems
 2.1 Disk file systems
 2.2 Database file systems
 2.3 Transactional file systems
 2.4 Network file systems
 2.5 Special purpose file systems
 3 File systems and operating systems
 3.1 Flat file systems
 3.2 File systems under Unix and Unix-like systems
 3.2.1 File systems under Mac OS X
 3.3 File systems under Plan 9 from Bell Labs
 3.4 File systems under Microsoft Windows
 3.5 File systems under OpenVMS
 3.6 File systems under MVS [IBM Mainframe]
 4 See also
 5 References
 5.1 Further reading

 6 External links

Aspects of file systems


The most familiar file systems make use of an underlying data storage device that offers access
to an array of fixed-size blocks, sometimes called sectors, generally 512 bytes each. The file
system software is responsible for organizing these sectors into files and directories, and keeping
track of which sectors belong to which file and which are not being used.

However, file systems need not make use of a storage device at all. A file system can be used to
organize and represent access to any data, whether it be stored or dynamically generated (e.g.,
from a network connection).

Whether the file system has an underlying storage device or not, file systems typically have
directories which associate file names with files, usually by connecting the file name to an index
into a file allocation table of some sort, such as the FAT in an MS-DOS file system, or an inode in
a Unix-like file system. Directory structures may be flat, or allow hierarchies where directories
may contain subdirectories. In some file systems, file names are structured, with special syntax
for filename extensions and version numbers. In others, file names are simple strings, and per-file
metadata is stored elsewhere.

Other bookkeeping information is typically associated with each file within a file system. The
length of the data contained in a file may be stored as the number of blocks allocated for the file
or as an exact byte count. The time that the file was last modified may be stored as the file's
timestamp. Some file systems also store the file creation time, the time it was last accessed, and
the time that the file's meta-data was changed. (Note that many early PC operating systems did
not keep track of file times.) Other information can include the file's device type (e.g., block,
character, socket, subdirectory, etc.), its owner user-ID and group-ID, and its access permission
settings (e.g., whether the file is read-only, executable, etc.).
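
Most of this bookkeeping information can be inspected directly; for example, Python's os.stat
exposes size, timestamps, ownership and permission bits (the exact fields available depend on
the operating system).

# Per-file bookkeeping information via os.stat, run on this script itself (sketch).
import os, stat, time

info = os.stat(__file__)
print("size in bytes:", info.st_size)
print("last modified:", time.ctime(info.st_mtime))
print("last accessed:", time.ctime(info.st_atime))
print("owner uid/gid:", info.st_uid, info.st_gid)      # meaningful on POSIX systems
print("permissions:  ", stat.filemode(info.st_mode))   # e.g. -rw-r--r--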

The hierarchical file system was an early research interest of Dennis Ritchie of Unix fame;
previous implementations were restricted to only a few levels, notably the IBM implementations,
even of their early databases like IMS. After the success of Unix, Ritchie extended the file system
concept to every object in his later operating system developments, such as Plan 9 and Inferno.

Traditional file systems offer facilities to create, move and delete both files and directories. They
lack facilities to create additional links to a directory (hard links in Unix), rename parent links (".."
in Unix-like OS), and create bidirectional links to files.

Traditional file systems also offer facilities to truncate, append to, create, move, delete and in-
place modify files. They do not offer facilities to prepend to or truncate from the beginning of a file,
let alone arbitrary insertion into or deletion from a file. The operations provided are highly
asymmetric and lack the generality to be useful in unexpected contexts. For example,
interprocess pipes in Unix have to be implemented outside of the file system because the pipes
concept does not offer truncation from the beginning of files.

Secure access to basic file system operations can be based on a scheme of access control lists
or capabilities. Research has shown access control lists to be difficult to secure properly, which is
why research operating systems tend to use capabilities. Commercial file systems still use access
control lists. See also: secure computing.

Arbitrary attributes can be associated on advanced file systems, such as XFS, ext2/ext3, some
versions of UFS, and HFS+, using extended file attributes. This feature is implemented in the
kernels of Linux, FreeBSD and Mac OS X operating systems, and allows metadata to be
associated with the file at the file system level. This, for example, could be the author of a
document, the character encoding of a plain-text document, or a checksum.
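
On Linux, these extended attributes can be read and written from Python's os module; the
attribute names below (user.author, user.charset) are illustrative, and the calls will fail on
platforms or file systems without xattr support.

# Extended file attributes on Linux via os.setxattr/os.getxattr (hedged sketch).
import os, tempfile

with tempfile.NamedTemporaryFile(dir=".", delete=False) as f:
    path = f.name

os.setxattr(path, "user.author", b"Alice")       # attach metadata to the file itself
os.setxattr(path, "user.charset", b"utf-8")
print(os.listxattr(path))                        # e.g. ['user.author', 'user.charset']
print(os.getxattr(path, "user.author"))          # b'Alice'
os.remove(path)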

Types of file systems

File system types can be classified into disk file systems, network file systems and special
purpose file systems.

Disk file systems

A disk file system is a file system designed for the storage of files on a data storage device, most
commonly a disk drive, which might be directly or indirectly connected to the computer. Examples
of disk file systems include FAT, NTFS, HFS and HFS+, ext2, ext3, ISO 9660, ODS-5, and UDF.
Some disk file systems are journaling file systems or versioning file systems.

Database file systems


A new concept for file management is the concept of a database-based file system. Instead of, or
in addition to, hierarchical structured management, files are identified by their characteristics, like
type of file, topic, author, or similar metadata.

Transactional file systems

This is a special kind of file system in that it logs events or transactions to files. Each operation
that you do may involve changes to a number of different files and disk structures. In many cases,
these changes are related, meaning that it is important that they all be executed at the same time.
Take for example a bank sending another bank some money electronically. The bank's computer
will "send" the transfer instruction to the other bank and also update its own records to indicate
the transfer has occurred. If for some reason the computer crashes before it has had a chance to
update its own records, then on reset, there will be no record of the transfer but the bank will be
missing some money. A transactional system can rebuild the actions by resynchronizing the
"transactions" on both ends to correct the failure. All transactions can be saved, as well, providing
a complete record of what was done and where. This type of file system is designed and intended
to be fault tolerant and, necessarily, incurs a high degree of overhead.

Network file systems

A "network file system" is a file system that acts as a client for a remote file access protocol,
providing access to files on a server. Examples of network file systems include clients for the
NFS, SMB, AFP, and 9P protocols, and file-system-like clients for FTP and WebDAV.

Special purpose file systems

A special purpose file system is basically any file system that is not a disk file system or network
file system. This includes systems where the files are arranged dynamically by software, intended
for such purposes as communication between computer processes or temporary file space.

Special purpose file systems are most commonly used by file-centric operating systems such as
Unix. Examples include the procfs (/proc) file system used by some Unix variants, which grants
access to information about processes and other operating system features.

Deep space exploration craft, like Voyager I and II, used digital tape-based special file systems.
Most modern space exploration craft, like Cassini-Huygens, use real-time operating system
(RTOS) file systems or RTOS-influenced file systems. The Mars rovers are one such example of
an RTOS file system, important in this case because the file system is implemented in flash memory.

File systems and operating systems

Most operating systems provide a file system, as a file system is an integral part of any modern
operating system. Early microcomputer operating systems' only real task was file management —
a fact reflected in their names (see DOS and QDOS). Some early operating systems had a
separate component for handling file systems which was called a disk operating system. On
some microcomputers, the disk operating system was loaded separately from the rest of the
operating system. On early operating systems, there was usually support for only one, native,
unnamed file system; for example, CP/M supports only its own file system, which might be called
"CP/M file system" if needed, but which didn't bear any official name at all.

Because of this, there needs to be an interface provided by the operating system software
between the user and the file system. This interface can be textual (such as provided by a
command line interface, such as the Unix shell, or OpenVMS DCL) or graphical (such as
provided by a graphical user interface, such as file browsers). If graphical, the metaphor of the
folder, containing documents, other files, and nested folders is often used (see also: directory and
folder).

Flat file systems

In a flat file system, there are no directories — everything is stored at the same (root) level on the
media, be it a hard disk, floppy disk, etc. While simple, this system rapidly becomes inefficient as
the number of files grows, and makes it difficult for users to organise data into related groups.

Like many small systems before it, the original Apple Macintosh featured a flat file system, called
Macintosh File System. Its version of Mac OS was unusual in that the file management software
(Macintosh Finder) created the illusion of a partially hierarchical filing system on top of MFS. MFS
was quickly replaced with Hierarchical File System, which supported real directories.

File systems under Unix and Unix-like systems


Unix and Unix-like operating systems assign a device name to each device, but this is not how
the files on that device are accessed. Instead, Unix creates a virtual file system, which makes all
the files on all the devices appear to exist under one hierarchy. This means, in Unix, there is one
root directory, and every file existing on the system is located under it somewhere. Furthermore,
the Unix root directory does not have to be in any physical place. It might not be on your first hard
drive - it might not even be on your computer. Unix can use a network shared resource as its root
directory.

To gain access to files on another device, you must first inform the operating system where in the
directory tree you would like those files to appear. This process is called mounting a file system.
For example, to access the files on a CD-ROM, one must tell the operating system "Take the file
system from this CD-ROM and make it appear under thus-and-such a directory". The directory
given to the operating system is called the mount point - it might, for example, be /mnt. The /mnt
directory exists on many Unix-like systems (as specified in the Filesystem Hierarchy Standard)
and is intended specifically for use as a mount point for temporary media like floppy disks or CDs.
It may be empty, or it may contain subdirectories for mounting individual devices. Generally, only
the administrator (i.e. root user) may authorize the mounting of file systems.

Unix-like operating systems often include software and tools that assist in the mounting process
and provide it new functionality. Some of these strategies have been coined "auto-mounting" as a
reflection of their purpose.

1. In many situations, file systems other than the root need to be available as soon as the
operating system has booted. All Unix-like systems therefore provide a facility for
mounting file systems at boot time. System administrators define these file systems in the
configuration file fstab, which also indicates options and mount points.
2. In some situations, there is no need to mount certain file systems at boot time, although
their use may be desired thereafter. There are some utilities for Unix-like systems that
allow the mounting of predefined file systems upon demand.
3. Removable media have become very common with microcomputer platforms. They allow
programs and data to be transferred between machines without a physical connection.
Two common examples include CD-ROMs and DVDs. Utilities have therefore been
developed to detect the presence and availability of a medium and then mount that
medium without any user intervention.
4. Progressive Unix-like systems have also introduced a concept called supermounting;
see, for example, the Linux supermount-ng project. For example, a floppy disk that has
been supermounted can be physically removed from the system. Under normal
circumstances, the disk should have been synchronised and then unmounted before its
removal. Provided synchronisation has occurred, a different disk can be inserted into the
drive. The system automatically notices that the disk has changed and updates the
mount point contents to reflect the new medium. Similar functionality is found on standard
Windows machines.
5. A similar facility preferred by some users is autofs, a system that, like supermounting,
eliminates the need for manual mounting commands. Unlike supermount, which relies on
events such as the insertion of media (appropriate for removable media), autofs mounts
devices transparently when requests to their file systems are made, which also makes it
suitable for a wider range of applications such as access to file systems on network
servers.

File systems under Mac OS X

Mac OS X uses a file system that it inherited from Mac OS called HFS Plus. HFS Plus is a
metadata-rich and case preserving file system. Due to the Unix roots of Mac OS X, Unix
permissions were added to HFS Plus. Later versions of HFS Plus added journaling to prevent
corruption of the file system structure and introduced a number of optimizations to the allocation
algorithms in an attempt to defragment files automatically without requiring an external
defragmenter.

Filenames can be up to 255 characters. HFS Plus uses Unicode to store filenames. On Mac OS
X, the filetype can come from the type code stored in the file's metadata, or from the filename.

HFS Plus has three kinds of links: Unix-style hard links, Unix-style symbolic links and aliases.
Aliases are designed to maintain a link to their original file even if they are moved or renamed;
they are not interpreted by the file system itself, but by the File Manager code in userland.

File systems under Plan 9 from Bell Labs

Plan 9 from Bell Labs was originally designed to extend some of Unix's good points, and to
introduce some new ideas of its own while fixing the shortcomings of Unix.

With respect to file systems, the Unix approach of treating things as files was continued, but in Plan
9, everything is treated as a file and accessed as a file would be (i.e., no ioctl or mmap). Perhaps
surprisingly, while the file interface is made universal it is also simplified considerably: for
example, symlinks, hard links and suid are made obsolete, and an atomic create/open operation is
introduced. More importantly, the set of file operations becomes well defined, and subversions of
this set, like ioctl, are eliminated.

Secondly, the underlying 9P protocol was used to remove the difference between local and
remote files (except for a possible difference in latency). This has the advantage that a device or
devices, represented by files, on a remote computer could be used as though it were the local
computer's own device(s). This means that under Plan 9, multiple file servers provide access to
devices, classing them as file systems. Servers for "synthetic" file systems can also run in user
space, bringing many of the advantages of microkernel systems while maintaining the simplicity
of the system.

Everything on a Plan 9 system has an abstraction as a file; networking, graphics, debugging,
authentication, capabilities, encryption, and other services are accessed via I/O operations on file
descriptors. For example, this allows the use of the IP stack of a gateway machine without the need
for NAT, or provides a network-transparent window system without the need for any extra code.

Another example: a Plan-9 application receives FTP service by opening an FTP site. The ftpfs
server handles the open by essentially mounting the remote FTP site as part of the local file
system. With ftpfs as an intermediary, the application can now use the usual file-system
operations to access the FTP site as if it were part of the local file system. A further example is
the mail system which uses file servers that synthesize virtual files and directories to represent a
user mailbox as /mail/fs/mbox. The wikifs provides a file system interface to a wiki.

These file systems are organized with the help of private, per-process namespaces, allowing
each process to have a different view of the many file systems that provide resources in a
distributed system.

The Inferno operating system shares these concepts with Plan 9.

File systems under Microsoft Windows

Microsoft Windows developed out of an earlier operating system, MS-DOS, which was in turn
based on QDOS, itself based on CP/M-80, which took many ideas from still earlier operating
systems, notably several from DEC. Windows has also added file systems from several other
sources since its first release (e.g. Unix). As such, Windows makes use of the FAT (File Allocation Table)
and NTFS (New Technology File System) file systems. Older versions of the FAT file system
(FAT12 and FAT16) had file name length limits, a limit on the number of entries in the root
directory of the file system and had restrictions on the maximum size of FAT-formatted disks or
partitions. Specifically, FAT12 and FAT16 had a limitation of 8 characters for the file name, and 3
characters for the extension. (This is commonly referred to as the 8.3 limit.) VFAT, which was an
extension to FAT12 and FAT16 introduced in Windows NT 3.5 and subsequently included in
Windows 95, allowed for long file names (LFN). FAT32 also addressed many of the limits in
FAT12 and FAT16, but remains limited compared to NTFS.

NTFS, introduced with the Windows NT operating system, allowed ACL-based permission
control. Hard links, multiple file streams, attribute indexing, quota tracking, compression and
mount-points for other file systems (called "junctions") are also supported, though not all these
features are well-documented.

Unlike many other operating systems, Windows uses a drive letter abstraction at the user level to
distinguish one disk or partition from another. For example, the path C:\WINDOWS\ represents a
directory WINDOWS on the partition represented by the letter C. The C drive is most commonly
used for the primary hard disk partition, on which Windows is installed and from which it boots.
This "tradition" has become so firmly ingrained that bugs came about in older versions of
Windows which made assumptions that the drive that the operating system was installed on was
C. The tradition of using "C" for the drive letter can be traced to MS-DOS, where the letters A and
B were reserved for up to two floppy disk drives; in a common configuration, A would be the 3½-
inch floppy drive, and B the 5¼-inch one. Network drives may also be mapped to drive letters.

Since Windows primarily interacts with the user via a graphical user interface, its documentation
refers to a directory as a folder, which contains files and is represented graphically with a folder
icon.

Database transaction
A database transaction is a unit of interaction with a database management system (or similar
system) that is treated in a coherent and reliable way, independent of other transactions, and that
must either be completed in its entirety or aborted. Ideally, a database system will guarantee all of the ACID
properties for each transaction. In practice, these properties are often relaxed somewhat to
provide better performance.

In some systems, transactions are also called LUWs for Logical Units of Work.

Contents

 1 Purpose of transaction
 2 Transactional databases
 3 Transactional filesystems
 4 See also

 5 External links

Purpose of transaction
In database products, the ability to handle transactions allows the user to ensure that the integrity of the
database is maintained.

A single transaction might require several queries, each reading and/or writing information in the
database. When this happens it is usually important to be sure that the database is not left with
only some of the queries carried out. For example, when doing a money transfer, if the money
was debited from one account, it is important that it also be credited to the depositing account.
Also, transactions should not interfere with each other. For more information about desirable
transaction properties, see ACID.

A simple transaction is usually issued to the database system in a language like SQL in this form:

1. Begin the transaction
2. Execute several queries (although any updates to the database aren't actually visible to
the outside world yet)
3. Commit the transaction (updates become visible if the transaction is successful)

If one of the queries fails the database system may rollback either the entire transaction or just
the failed query. This behaviour is dependent on the DBMS in use and how it is set up. The
transaction can also be rolled back manually at any time before the commit.
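A minimal sketch of this pattern, using Python's built-in sqlite3 module (the accounts table, names and amounts are invented for the example):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # 1. Begin the transaction (sqlite3 opens one implicitly on the first update).
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    # 2. The updates above are not yet visible to other connections.
    conn.commit()      # 3. Commit: the updates become visible.
except sqlite3.Error:
    conn.rollback()    # Undo every query issued since the transaction began.
    raise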

Transactional databases

Databases that support transactions are called transactional databases. Most modern relational
database management systems fall into this category.

Transactional filesystems
The Namesys Reiser4 filesystem for Linux [1] and the newest version of the Microsoft NTFS
filesystem both support transactions [2].

See also

 Distributed transaction
 Nested transaction
 ACID properties
 Atomic transaction
 Software transactional memory
 Long running transaction

Transaction processing
In computer science, transaction processing is information processing that is divided into
individual, indivisible operations, called transactions. Each transaction must succeed or fail as a
complete unit; it cannot remain in an intermediate state.

Contents

 1 Description
 2 Methodology
 2.1 Rollback
 2.2 Rollforward
 2.3 Deadlocks
 3 ACID criteria
 4 Implementations
 5 See also

 6 Books

Description

Transaction processing is designed to maintain a database in a known, consistent state, by
ensuring that any operations carried out on the database that are interdependent are either all
completed successfully or all cancelled successfully.

For example, consider a typical banking transaction that involves moving £500 from a customer's
savings account to a customer's checking account. This transaction is a single operation in the
eyes of the bank, but it involves at least two separate operations in computer terms: debiting the
savings account by £500, and crediting the checking account by £500. If the debit operation
succeeds but the credit does not (or vice versa), the books of the bank will not balance at the end
of the day. There must therefore be a way to ensure that either both operations succeed or both
fail, so that there is never any inconsistency in the bank's database as a whole. Transaction
processing is designed to provide this.
Transaction processing allows multiple individual operations on a database to be linked together
automatically as a single, indivisible transaction. The transaction-processing system ensures that
either all operations in a transaction are completed without error, or none of them are. If some of
the operations are completed but errors occur when the others are attempted, the transaction-
processing system “rolls back” all of the operations of the transaction (including the successful
ones), thereby erasing all traces of the transaction and restoring the database to the consistent,
known state that it was in before processing of the transaction began. If all operations of a
transaction are completed successfully, the transaction is “committed” by the system, and all
changes to the database are made permanent; the transaction cannot be rolled back once this is
done.

Transaction processing guards against hardware and software errors that might leave a
transaction partially completed, with a database left in an unknown, inconsistent state. If the
computer system crashes in the middle of a transaction, the transaction processing system
guarantees that all operations in any uncommitted (i.e., not completely processed) transactions
are cancelled.

Transactions are processed in a strict chronological order. If transaction n+1 touches the same
portion of the database as transaction n, transaction n+1 does not begin until transaction n is
committed. Before any transaction is committed, all other transactions affecting the same part of
the database must also be committed; there can be no “holes” in the sequence of preceding
transactions.

Methodology

The basic principles of all transaction-processing systems are the same. However, the
terminology may vary from one transaction-processing system to another, and the terms used
below are not necessarily universal.

Rollback

Transaction-processing systems ensure database integrity by recording intermediate states of the
database as it is modified, then using these records to restore the database to a known state if a
transaction cannot be committed. For example, copies of information on the database prior to its
modification by a transaction are set aside by the system before the transaction can make any
modifications (this is sometimes called a before image). If any part of the transaction fails before it
is committed, these copies are used to restore the database to the state it was in before the
transaction began (rollback).
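A toy sketch of the before-image idea, with a Python dictionary standing in for the database and invented account names (an illustration of the principle, not a real transaction manager):

# The "database" is a dict; the first time a transaction touches a key,
# its old value (the before image) is saved so the change can be undone.
database = {"savings": 1000, "checking": 200}

def run_transaction(operations):
    before_images = {}                            # key -> value before modification
    try:
        for key, delta in operations:
            if key not in before_images:
                before_images[key] = database[key]   # save the before image
            database[key] += delta                   # apply the change
            if database[key] < 0:
                raise ValueError("insufficient funds")
    except Exception:
        database.update(before_images)            # rollback: restore saved values
        raise

# Debit savings and credit checking as one unit of work.
run_transaction([("savings", -500), ("checking", +500)])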

Rollforward

It is also possible to keep a separate journal of all modifications to a database (sometimes called
after images); this is not required for rollback of failed transactions, but it is useful for updating the
database in the event of a database failure, so some transaction-processing systems provide it. If
the database fails entirely, it must be restored from the most recent back-up. The back-up will not
reflect transactions committed since the back-up was made. However, once the database is
restored, the journal of after images can be applied to the database (rollforward) to bring the
database up to date. Any transactions in progress at the time of the failure can then be rolled
back. The result is a database in a consistent, known state that includes the results of all
transactions committed up to the moment of failure.
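Continuing the same toy style, rollforward can be pictured as restoring the backup and replaying a journal of after images in commit order (the names and values are invented):

def roll_forward(backup, journal):
    # backup:  the database as of the most recent back-up
    # journal: after images of changes committed since the back-up, in commit order
    database = dict(backup)               # restore from the most recent back-up
    for key, after_value in journal:      # replay the committed after images
        database[key] = after_value
    return database                       # consistent state as of the failure

restored = roll_forward({"savings": 1000, "checking": 200},
                        [("savings", 500), ("checking", 700)])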

Deadlocks

In some cases, two transactions may, in the course of their processing, attempt to access the
same portion of a database at the same time, in a way that prevents them from proceeding. For
example, transaction A may access portion X of the database, and transaction B may access
portion Y of the database. If, at that point, transaction A then tries to access portion Y of the
database while transaction B tries to access portion X, a deadlock occurs, and neither transaction
can move forward. Transaction-processing systems are designed to detect these deadlocks when
they occur. Typically both transactions will be cancelled and rolled back, and then they will be
started again in a different order, automatically, so that the deadlock doesn't occur again.

ACID criteria
There are many minor variations on the exact methods used to protect database consistency in a
transaction-processing system, but the basic principles remain the same. All transaction-
processing systems support these functions, which are referred to as the ACID properties:
atomicity, consistency, isolation, and durability.

Implementations
Standard transaction-processing software, notably IBM's Information Management System, was
first developed in the 1960s, and was often closely coupled to particular database management
systems. Client-server computing implemented similar principles in the 1980s with mixed
success. However, in more recent years, the distributed client-server model has become
considerably more difficult to maintain. As the number of transactions grew in response to various
online services (especially the Web), a single distributed database was not a practical solution. In
addition, most online systems consist of a whole suite of programs operating together, as
opposed to a strict client-server model where the single server could handle the transaction
processing. Today a number of transaction processing systems are available that work at the
inter-program level and which scale to large systems, including mainframes.

An important open industry standard is the X/Open Distributed Transaction Processing (DTP)
model (see JTA). However, proprietary transaction-processing environments such as IBM's CICS
remain in widespread use.

Distributed transaction
A distributed transaction is a bundle of operations in which two or more network hosts are
involved. Usually, hosts provide transactional resources, while the transaction manager is
responsible for creating and managing a global transaction that encompasses all operations
against such resources. Distributed transactions, like any other transactions, must have all four
ACID properties, where atomicity guarantees all-or-nothing outcomes for the unit of work
(the operations bundle).

Open Group, a vendor consortium, proposed the X/Open Distributed Transaction Processing
(DTP) Model, which became a de-facto standard for behavior of transaction model components.

Databases are common transactional resources and, often, transactions span two or more such
databases. In this case, a distributed transaction can be seen as a database transaction that
must be synchronized (or provide ACID properties) among multiple participating databases which
are distributed among different physical locations. The isolation property poses a special
challenge for multidatabase transactions, since the (global) serializability property could be
violated even if each database provides it. In practice most commercial database systems use
strict two-phase locking for concurrency control, which ensures global serializability if all the
participating databases employ it. (See also commitment ordering for multidatabases.)

A common algorithm for ensuring correct completion of a distributed transaction is the two-phase
commit. This algorithm is usually applied for updates able to commit in a short period of time,
ranging from a couple of milliseconds to a couple of minutes.
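A highly simplified sketch of the two-phase commit idea, with in-process Python objects standing in for networked participants (no timeouts, logging or crash recovery, and the participant names are invented):

class Participant:
    """A transactional resource that can vote on and then apply a unit of work."""
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self):       # phase 1: vote yes/no
        return self.can_commit
    def commit(self):        # phase 2: make the tentative changes permanent
        print(self.name, "committed")
    def abort(self):         # phase 2: undo the tentative changes
        print(self.name, "aborted")

def two_phase_commit(participants):
    # Phase 1 (voting): the coordinator asks every participant to prepare.
    votes = [p.prepare() for p in participants]
    if all(votes):
        for p in participants:   # Phase 2: unanimous yes, so commit everywhere
            p.commit()
        return True
    for p in participants:       # at least one "no" vote: abort everywhere
        p.abort()
    return False

two_phase_commit([Participant("orders-db"), Participant("inventory-db")])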

There are also long-lived distributed transactions, for example a transaction to book a trip, which
consists of booking a flight, a rental car and a hotel. Since booking the flight might take up to a
day to get a confirmation, two-phase commit is not applicable here: it would lock the resources for
too long. In this case more sophisticated techniques that involve multiple levels of undo are used.
Just as you can undo a hotel booking by calling the desk and cancelling the reservation, a
system can be designed to undo certain operations (unless they have finished irreversibly).

In practice, long-lived distributed transactions are implemented in systems based on Web
Services. Usually these transactions utilize principles of Compensating Transactions, Optimism
and Isolation Without Locking. The X/Open standard does not cover long-lived DTP.

Technologies such as Enterprise JavaBeans (EJB) and Microsoft Transaction Server (MTS) fully
support distributed transaction standards.

Strict two-phase locking


In computer science, strict two-phase locking (Strict 2PL) is a locking method used in
concurrent systems.

The two rules of Strict 2PL are:

1. If a transaction T wants to read/write an object, it must request a shared/exclusive lock on
the object.
2. All exclusive locks held by transaction T are released when T commits (and not before).
Here is an example of Strict 2PL in action with interleaved actions, written in text form:

T1: S(A), R(A); T2: S(A), R(A), X(B), R(B), W(B), Commit; T1: X(C), R(C), W(C), Commit

where

• S(O) is a shared lock action on an object O
• X(O) is an exclusive lock action on an object O
• R(O) is a read action on an object O
• W(O) is a write action on an object O

Strict 2PL prevents transactions reading uncommitted data, overwriting uncommitted data, and
unrepeatable reads. Thus, it prevents cascading rollbacks, since eXclusive locks (for write
privileges) must be held until a transaction commits.
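A minimal sketch of the bookkeeping behind Strict 2PL (single-threaded and purely illustrative; a real lock manager would block or queue a conflicting request rather than raise an error):

class StrictTwoPhaseLocking:
    """Tracks S/X locks per object and releases a transaction's locks only at commit."""
    def __init__(self):
        self.locks = {}   # object -> (mode, set of transaction ids holding it)

    def _acquire(self, txn, obj, mode):
        held_mode, holders = self.locks.get(obj, (None, set()))
        compatible = (held_mode is None or holders == {txn}
                      or (mode == "S" and held_mode == "S"))
        if not compatible:
            raise RuntimeError(f"{txn} must wait for the lock on {obj}")
        self.locks[obj] = ("X" if mode == "X" or held_mode == "X" else "S",
                           holders | {txn})

    def read(self, txn, obj):  self._acquire(txn, obj, "S")   # rule 1: shared lock
    def write(self, txn, obj): self._acquire(txn, obj, "X")   # rule 1: exclusive lock

    def commit(self, txn):     # rule 2: locks are released only now
        for obj, (mode, holders) in list(self.locks.items()):
            holders.discard(txn)
            if not holders:
                del self.locks[obj]

# The example schedule above, replayed against the lock table.
lm = StrictTwoPhaseLocking()
lm.read("T1", "A"); lm.read("T2", "A")   # shared locks on A are compatible
lm.write("T2", "B"); lm.commit("T2")     # T2's locks are released at commit
lm.write("T1", "C"); lm.commit("T1")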

Strict 2PL does not guarantee a deadlock-free schedule

Avoiding deadlocks can be important in real-time systems, and may additionally be difficult to
enforce in distributed databases or fault-tolerant systems with multiple redundancy.

A deadlocked schedule allowed under Strict 2PL, in text form:

T1: X(A); T2: X(B); T1: X(B); T2: X(A)

T1 is waiting for T2's lock on B to be released, while T2 is waiting for T1's lock on A to be
released. These transactions cannot proceed and both are deadlocked.

There is no general solution to the problem of deadlocks in computing systems, so they must be
anticipated and dealt with accordingly. Nonetheless, several solutions such as the Banker's
algorithm or the imposition of a partial ordering on lock acquisition exist for avoiding deadlocks
under certain conditions.

Even more strict than strict two-phase locking is rigorous two-phase locking, in which transactions
can be serialized by the order in which they commit. Under rigorous 2PL, all locks (shared and
exclusive) must be held until a transaction commits. Most database systems use strict 2PL.

Concurrency control
In computer science — more specifically, in the field of databases and database theory —
concurrency control is a method used to ensure that database transactions are executed in a
safe manner (i.e., without data loss). Concurrency control is especially applicable to database
management systems (DBMS), which must ensure that transactions are executed safely and that
they follow the ACID rules, as described in the following section. The DBMS must be able to
ensure that only serializable, recoverable schedules are allowed, and that no actions of
committed transactions are lost while undoing aborted transactions.

In computer science — in the field of concurrent programming (see also parallel programming
and parallel computing on multiprocessor machines) — concurrency control is a method used to
ensure that correct results are generated, while getting those results as quickly as possible.

Several algorithms can be used for either type of concurrency control (i.e., with in-RAM data
structures on systems that have no database, or with on-disk databases).

Contents


• 1 Transaction ACID rules
• 2 Concurrency control mechanism
• 3 See also
• 4 External links

Transaction ACID rules

• Atomicity - Either all or no operations are completed - in other words to the outside world
the transaction appears to happen indivisibly (Undo)
• Consistency - All transactions must leave the database in a consistent state.
• Isolation - Transactions cannot interfere with each other.
• Durability - Successful transactions must persist through crashes. (Redo)

Concurrency control mechanism

The main categories of concurrency control mechanisms are:

• Optimistic - Delay the synchronization for transactions until the operations are
performed. Conflicts are less likely but won't be known until they happen.
• Pessimistic - The potentially concurrent executions of transactions are synchronized
early in their execution life cycle. Blocking is thus more likely but will be known earlier.

There are many methods for concurrency control, the majority of which use some form of two-phase locking:

• Strict two-phase locking
• Non-strict two-phase locking
• Conservative two-phase locking
• Index locking
• Multiple granularity locking

Locks are bookkeeping objects associated with a database object.

There are also non-lock concurrency control methods. All the currently implemented lock-based
and almost all the implemented non-lock based concurrency controls will guarantee that the
resultant schedule is conflict serializable; however, there are many academic texts encouraging
view serializable schedules for environments where gains due to improvement in concurrency
outstrip overheads in generating schedule plans.

Schedule (computer science)


In the field of databases, a schedule is a list of actions (i.e. reading, writing, aborting, committing) from a set of
transactions.

Here is a sample schedule D, written in text form:

D: T1:R(X), T1:W(X), T2:R(Y), T2:W(Y), T3:R(Z), T3:W(Z)

In this example, schedule D consists of 3 transactions T1, T2, T3. The schedule describes the actions of the transactions
as seen by the DBMS. T1 reads and writes to object X, then T2 reads and writes to object Y, and finally T3 reads
and writes to object Z. This is an example of a serial schedule, because the actions of the 3 transactions are not
interleaved.

Contents

• 1 Types of schedule
o 1.1 Serial
o 1.2 Serializable
 1.2.1 Conflicting actions
 1.2.2 Conflict equivalence
 1.2.3 Conflict-serializable
 1.2.4 Commitment-ordered
 1.2.5 View equivalence
 1.2.6 View-serializable
o 1.3 Recoverable
 1.3.1 Unrecoverable
 1.3.2 Avoids cascading aborts (rollbacks)
 1.3.3 Strict
• 2 Hierarchical relationship between serializability classes
• 3 Practical implementations

• 4 See also

Types of schedule
Serial

The transactions are executed one by one, non-interleaved. (see above)

Serializable
A schedule that is equivalent to a serial schedule has the serializability property.

In a schedule E that interleaves the same actions, for example
E: T1:R(X), T2:R(Y), T1:W(X), T3:R(Z), T2:W(Y), T3:W(Z), the order in which the actions of the
transactions are executed is not the same as in D, but in the end, E gives the same result as D.

Conflicting actions

Two or more actions are said to be in conflict if:

1. The actions belong to different transactions.
2. At least one of the actions is a write operation.
3. The actions access the same object (read or write).

The following set of actions is conflicting:

• T1:R(X), T2:W(X), T3:W(X)

While the following sets of actions are not:

• T1:R(X), T2:R(X), T3:R(X)
• T1:R(X), T2:W(Y), T3:R(X)
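These three rules translate directly into a small predicate; in this sketch an action is represented as a (transaction, operation, object) tuple, a representation chosen only for the example:

def in_conflict(a1, a2):
    """True if two schedule actions conflict."""
    (t1, op1, obj1), (t2, op2, obj2) = a1, a2
    return (t1 != t2                      # different transactions
            and obj1 == obj2              # same object
            and "W" in (op1, op2))        # at least one write

print(in_conflict(("T1", "R", "X"), ("T2", "W", "X")))   # True
print(in_conflict(("T1", "R", "X"), ("T2", "R", "X")))   # False
print(in_conflict(("T1", "R", "X"), ("T2", "W", "Y")))   # False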

Conflict equivalence

The schedules S1 and S2 are said to be conflict-equivalent if the following conditions are
satisfied:

1. Both schedules S1 and S2 involve the same set of actions of the same set of transactions
(informally speaking, both schedules contain and work on the same things).
2. The order of each pair of conflicting actions in S1 and S2 is the same.

Conflict-serializable

A schedule is said to be conflict-serializable when the schedule is conflict-equivalent to one or
more serial schedules.

Another definition for conflict-serializability is that a schedule is conflict-serializable if and only if
there exists an acyclic precedence graph/serializability graph for the schedule.


Commitment-ordered

A schedule is said to be commitment-ordered, or commitment-order-serializable, if it obeys the
Commitment ordering (commit-order-serializability) schedule property. This means that it is
conflict-serializable, and the precedence order of transactions' commitment events is identical to
the precedence (partial) order of the respective transactions, as induced by their schedule's
acyclic precedence graph/serializability graph.

View equivalence

Two schedules S1 and S2 are said to be view-equivalent when the following conditions are
satisfied:

1. If the transaction Ti in S1 reads an initial value for object X, so does the transaction Ti in
S2.
2. If the transaction Ti in S1 reads the value written by transaction Tj in S1 for object X, so
does the transaction Ti in S2.
3. If the transaction Ti in S1 is the final transaction to write the value for an object X, so is
the transaction Ti in S2.

View-serializable

A schedule is said to be view-serializable if it is view-equivalent to some serial schedule. Note
that, by definition, all conflict-serializable schedules are view-serializable.

There are, however, view-serializable schedules that are not conflict-serializable: those schedules
in which a transaction performs a blind write. For example, the schedule
T1:R(A), T2:W(A), T1:W(A), T3:W(A), in which T2 and T3 blind-write A, is not conflict-serializable,
but it is view-serializable since it has a view-equivalent serial schedule <T1, T2, T3>.

Since determining whether a schedule is view-serializable is NP-complete, view-serializability has
little practical interest.

Recoverable

Transactions commit only after all transactions whose changes they read have committed; such
schedules are recoverable. For example, in a schedule F in which T2 reads a value written by T1,
and T1 commits before T2 does, the value read by T2 is correct and T2 can then commit itself. In
a variant F2 in which T1 aborts before T2 commits, T2 has to abort as well, because the value of
A it read is incorrect. In both cases, the database is left in a consistent state.

Unrecoverable

If a transaction T1 aborts, and a transaction T2 commits, but T2 relied on T1, we have an
unrecoverable schedule.

In such a schedule G, T2 reads the value of A written by T1 and commits; T1 later aborts, so the
value read by T2 is wrong, but since T2 has already committed, the schedule is unrecoverable.

Avoids cascading aborts (rollbacks)

Also named cascadeless. In a cascading abort, a single transaction abort forces a series of further
rollbacks. The strategy to prevent cascading aborts is to disallow a transaction from reading
uncommitted changes of another transaction in the same schedule.

The examples here are the same as those from the discussion of recoverability. Although F2 is
recoverable, it does not avoid cascading aborts: if T1 aborts, T2 will have to be aborted too in
order to maintain the correctness of the schedule, as T2 has already read the uncommitted value
written by T1.

A schedule can be recoverable and avoid cascading aborts at the same time, for example if T2
writes A without first reading T1's uncommitted value and both transactions then commit; note,
however, that the update of A by T1 is then lost.

Cascading aborts avoidance is sufficient but not necessary for a schedule to be recoverable.

Strict

A schedule is strict if for any two transactions T1, T2, if a write operation of T1 precedes a
conflicting operation of T2 (either read or write), then the commit event of T1 also precedes that
conflicting operation of T2.

Any strict schedule is cascadeless, but not the converse.

Hierarchical relationship between serializability classes

The following subclass clauses illustrate the hierarchical relationships between serializability
classes:

• Serial ⊂ commitment-ordered ⊂ conflict-serializable ⊂ view-serializable ⊂ all schedules
• Serial ⊂ strict ⊂ avoids cascading aborts ⊂ recoverable ⊂ all schedules

Practical implementations

In practice, most businesses aim for conflict-serializable and recoverable (primarily strict)
schedules.

Serializability
In databases and transaction processing, serializability is the property of a schedule (history)
being serializable. It means equivalence (in its outcome, the resulting database state, the values
of the database's data) to a serial schedule (serial schedule: No overlap in two transactions'
execution time intervals; consecutive transaction execution). It relates to the isolation property of
a transaction, and plays an essential role in concurrency control. Transactions are usually
executed concurrently, since serial executions are typically extremely inefficient and thus
impractical.

Contents

 1 Correctness - Serializability
 2 Correctness - Recoverability
 3 Relaxing serializability
 4 View serializability and Conflict serializability
 5 Testing conflict serializability
 6 Common mechanism - (Strong) Strict Two Phase Locking

 7 Global serializability - Commitment ordering

Correctness - Serializability

Serializability is the major criterion for the correctness of concurrent transactions' executions (i.e.,
transactions that have overlapping execution time intervals, and possibly access same shared
resources), and a major goal for concurrency control. As such it is supported in all general
purpose database systems. The rationale behind it is the following: If each transaction is correct
by itself, then any serial execution of these transactions is correct. As a result, any execution that
is equivalent (in its outcome) to a serial execution, is correct.

Schedules that are not serializable are likely to generate erroneous outcomes. Well-known
examples involve transactions that debit and credit accounts with money. If the related
schedules are not serializable, then the total sum of money may not be preserved. Money could
disappear, or be generated from nowhere. This is caused by one transaction writing, and
"stepping on" and erasing what has been written by another transaction before it has become
permanent in the database. This does not happen if serializability is maintained.

Correctness - Recoverability

In systems where transactions can abort (virtually all real systems), serializability by itself is not
sufficient for correctness. Schedules also need to possess the Recoverability property.
Recoverability means that committed transactions have not read data written by aborted
transactions (whose effects do not exist in the resulting database states). While serializability can
be compromised in many applications, compromising recoverability always violates the
database's integrity.

Relaxing serializability

In many applications, unlike with finances, absolute correctness is not needed. For example,
when retrieving a list of products according to specification, in most cases it does not matter
much if a product, whose data was updated a short time ago, does not appear in the list, even if it
meets the specification. It will typically appear in such a list when tried again a short time later.
Commercial databases provide concurrency control with a whole range of (controlled)
serializability violations (see isolation levels) in order to achieve higher performance, when the
application can tolerate such violations. Higher performance means better transaction execution
rate and shorter transaction response time (transaction duration).

View serializability and Conflict serializability


Two major types of serializability exist: view serializability and conflict serializability. Any
schedule with the latter property also has the former. However, conflict serializability is
easier to achieve, and is widely utilized.

View serializability of a schedule is defined by equivalence to a serial schedule with the same
transactions, such that respective transactions in the two schedules read and write the same data
values ("view" the same data values).

Conflict serializability is defined by equivalence to a serial schedule with the same transactions,
such that both schedules have the same sets of respective ordered (by time) pairs of conflicting
operations (same precedence relations of respective conflicting operations). Two operations
(read or write) are conflicting if they are of different transactions, upon the same data item, and at
least one of them is write. A more general definition of conflicting operations (also for complex
operations, which may consist each of several "simple" read/write operations) requires that they
are noncommutative (changing their order also changes their combined result). Each such
operation needs to be atomic by itself (by proper system support) in order to be commutative
(nonconflicting) with the other. For example, the operations increment and decrement of a
counter are both write operations, but do not need to be considered conflicting since they are
commutative.

Testing conflict serializability

Schedule compliance with Conflict serializability can be tested as follows: The Conflict graph
(Serializability graph) of the schedule for committed transactions, the directed graph representing
precedence of transactions in the schedule, as reflected by precedence of conflicting operations
in the transactions (transactions are nodes, precedence relations are directed edges), needs to
be acyclic. This means that when a cycle of committed transactions is generated, serializability is
violated. Thus conflict serializability mechanisms prevent cycles of committed transactions by
aborting an undecided (neither committed, nor aborted) transaction (one is sufficient; at least one
is aborted) on each such cycle, when generated, in order to break it. The probability of cycle
generation is typically low, but nevertheless, such a situation is carefully handled, since
correctness is involved. Many mechanisms do not maintain a conflict graph as a data structure,
but rather prevent or break cycles implicitly (e.g., see SS2PL below). Transactions aborted due to
serializability violation prevention are executed again.
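A sketch of this test in Python: build the precedence graph from the ordered pairs of conflicting operations, then check it for cycles with a depth-first search (actions are represented as (transaction, operation, object) tuples, a representation chosen only for the example):

def is_conflict_serializable(schedule):
    """schedule: list of (transaction, operation, object) tuples in time order."""
    # Add an edge Ti -> Tj whenever an action of Ti precedes a conflicting action of Tj.
    edges = set()
    for i, (ti, op1, x) in enumerate(schedule):
        for tj, op2, y in schedule[i + 1:]:
            if ti != tj and x == y and "W" in (op1, op2):
                edges.add((ti, tj))

    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)

    # Depth-first search for a cycle in the precedence graph.
    def has_cycle(node, visiting, done):
        visiting.add(node)
        for nxt in graph.get(node, ()):
            if nxt in visiting or (nxt not in done and has_cycle(nxt, visiting, done)):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    done = set()
    return not any(has_cycle(n, set(), done) for n in graph if n not in done)

# Not conflict-serializable: T1 and T2 conflict on X in both directions, giving a cycle.
print(is_conflict_serializable([("T1", "R", "X"), ("T2", "W", "X"), ("T1", "W", "X")]))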

Common mechanism - (Strong) Strict Two Phase Locking

(Strong) Strict Two Phase Locking (SS2PL) is a common mechanism (and schedule property)
utilized to enforce in database systems both conflict serializability and Strictness, a special case
of recoverability. The related schedule property is also referred to as Rigorousness. In this
mechanism each data item is locked by a transaction before accessing it (any read or modify
operation): The item is marked by a lock of a certain type, depending on operation (and the
specific implementation; various models with different lock types exist). Access by another
transaction may be blocked, typically upon conflict, depending on lock type and the other
transaction's access operation type. All locked data on behalf of a transaction are released only
after the transaction has ended (either committed or aborted).

Mutual blocking of two transactions or more results in a deadlock, where execution of these
transactions is stalled, and no completion can be reached. A deadlock is a reflection of a potential
cycle in the conflict graph, that would occur without the blocking. Deadlocks are resolved by
aborting a transaction involved with such potential cycle (aborting one transaction per cycle is
sufficient). Transactions aborted due to deadlock resolution are executed again.

Global serializability - Commitment ordering

Enforcing global serializability in a multidatabase system (typically distributed), where
transactions span multiple databases (two or more), is problematic, since even if each database
enforces serializability, the global schedule of all the databases is not necessarily serializable,
and the needed communication between databases to reach conflict serializability using conflict
information is excessive and unfeasible. An effective way to enforce conflict serializability globally
in such a system is to enforce the Commitment ordering (CO, or Commit-order-serializability)
property in each database. CO is a broad special case of conflict serializability, and if enforced
locally in each database, also the global schedule possesses this property (CO). The only needed
communication between the databases for this purpose is the (unmodified) messages of an
atomic commitment protocol (e.g., the two phase commit protocol), already needed by each
distributed transaction to reach atomicity. An effective local (to any single database) CO algorithm
can run beside any local concurrency control mechanism (serializability enforcing mechanism)
without interfering with its resource access scheduling strategy. As such CO provides a general,
fully distributed solution (no central processing component or central data structure are needed)
for guaranteeing global serializability in heterogeneous environments with different database
system types and other multiple transactional objects (objects with states accessed and modified
only by transactions) that may employ different serializability mechanisms.

CO by itself is not sufficient as a concurrency control mechanism, since it lacks the recoverability
property, which should be supported as well.

SS2PL implies Commitment ordering, and any SS2PL compliant database can participate in
multidatabase systems that utilize the CO solution without any modification or addition of a CO
algorithm component.

With the Commitment ordering property the precedence (partial) order of transactions'
commitment events is identical to the precedence (partial) order of the respective transactions as
determined by their schedule's (acyclic) conflict graph. Any conflict serializable schedule can be
made a CO compliant one, without aborting any transaction in the schedule, by delaying
commitment events to comply with the needed partial order. The commitment event of a
distributed transaction is always generated by some atomic commitment protocol (utilized to
reach consensus among the transaction's components on whether to commit or abort it; this
procedure is always carried out for distributed transactions, independently of concurrency control
and CO). The atomic commitment protocol plays a central role in the distributed CO algorithm
which enforces CO globally. In case of incompatible local commitment orders in two or more
databases, which implies a global cycle (a cycle that spans two or more databases) in the global
conflict graph, the atomic commitment protocol breaks that cycle by aborting a transaction on the
cycle.

Deadlock

A deadlock is a situation wherein two or more competing actions are each waiting for the other to
finish, and thus neither ever does. It is often likened to the paradox of 'the chicken or the egg'.

In the computing world deadlock refers to a specific condition when two or more processes are
each waiting for another to release a resource, or more than two processes are waiting for
resources in a circular chain (see Necessary conditions). Deadlock is a common problem in
multiprocessing where many processes share a specific type of mutually exclusive resource
known as a software, or soft, lock. Computers intended for the time-sharing and/or real-time
markets are often equipped with a hardware lock (or hard lock) which guarantees exclusive
access to processes, forcing serialization. Deadlocks are particularly troubling because there is
no general solution to avoid (soft) deadlocks.

This situation may be likened to two people who are drawing diagrams, with only one pencil and
one ruler between them. If one person takes the pencil and the other takes the ruler, a deadlock
occurs when the person with the pencil needs the ruler and the person with the ruler needs the
pencil before giving up what they already hold. Neither request can be satisfied, so a deadlock occurs.

Contents

 1 Necessary conditions
 2 Examples of deadlock conditions
 3 Deadlock avoidance
 4 Deadlock prevention
 5 Deadlock detection
 6 Distributed deadlocks
 7 Livelock
 8 See also

 9 External links

Necessary conditions
There are four necessary conditions for a deadlock to occur, known as the Coffman conditions
from their first description in a 1971 article by E. G. Coffman.
1. Mutual exclusion condition: a resource is either assigned to one process or it is available
2. Hold and wait condition: processes already holding resources may request new
resources
3. No preemption condition: only a process holding a resource may release it
4. Circular wait condition: two or more processes form a circular chain where each process
waits for a resource that the next process in the chain holds

Deadlock only occurs in systems where all four conditions hold.

Examples of deadlock conditions

An example of a deadlock which may occur in database products is the following. Client
applications using the database may require exclusive access to a table, and in order to gain
exclusive access they ask for a lock. If one client application holds a lock on a table and attempts
to obtain the lock on a second table that is already held by a second client application, this may
lead to deadlock if the second application then attempts to obtain the lock that is held by the first
application. (But this particular type of deadlock is easily prevented, e.g., by using an all-or-none
resource allocation algorithm.)

Another example might be a text formatting program that accepts text sent to it to be processed
and then returns the results, but does so only after receiving "enough" text to work on (e.g. 1KB).
A text editor program is written that sends the formatter some text and then waits for the
results. In this case a deadlock may occur on the last block of text. Since the formatter may not
have sufficient text for processing, it will suspend itself while waiting for the additional text, which
will never arrive since the text editor has sent it all of the text it has. Meanwhile, the text editor is
itself suspended waiting for the last output from the formatter. This type of deadlock is sometimes
referred to as a deadly embrace (properly used only when only two applications are involved) or
starvation. However, this situation, too, is easily prevented by having the text editor send a
forcing message (e.g. EOF) with its last (partial) block of text, which forces the formatter to return
the last (partial) block after formatting rather than wait for additional text.

Nevertheless, since there is no general solution for deadlock prevention, each type of deadlock
must be anticipated and specially prevented. General algorithms can, however, be implemented
within the operating system so that if one or more applications becomes blocked, it will usually be
terminated after a time (and, in the meantime, allowed no other resources; it may also need to
surrender those it already holds and be rolled back to a state prior to obtaining them).

Deadlock avoidance

Deadlock can be avoided if certain information about processes is available in advance of
resource allocation. For every resource request, the system checks whether granting the request
would put the system into an unsafe state, meaning a state that could result in deadlock. The
system then only grants requests that lead to safe states. In order for the system to be able to
figure out whether the next state will be safe or unsafe, it must know in advance, at any time, the
number and type of all resources in existence, available, and requested. One well-known algorithm
used for deadlock avoidance is the Banker's algorithm, which requires resource usage limits
to be known in advance. However, for many systems it is impossible to know in advance what
every process will request. This means that deadlock avoidance is often impossible.

Two other algorithms are Wait/Die and Wound/Wait. In both these algorithms there exists an
older process (O) and a younger process (Y). Process age can be determined by a time stamp at
process creation time. Smaller time stamps are older processes, while larger timestamps
represent younger processes.

                                                      Wait/Die   Wound/Wait
O is waiting for a resource that is being held by Y   O waits    Y dies
Y is waiting for a resource that is being held by O   Y dies     Y waits
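The two policies reduce to a small decision function; in this sketch a smaller timestamp means an older process, and only the decision itself is modelled, not the surrounding lock manager:

def resolve(requester_ts, holder_ts, policy):
    """Return what the requesting process should do under the given policy."""
    requester_is_older = requester_ts < holder_ts
    if policy == "wait-die":
        # Older requesters wait; younger requesters die (abort and restart later).
        return "wait" if requester_is_older else "die"
    if policy == "wound-wait":
        # Older requesters wound (abort) the younger holder; younger requesters wait.
        return "wound holder" if requester_is_older else "wait"
    raise ValueError("unknown policy")

print(resolve(1, 2, "wait-die"))     # older O requests from younger Y -> O waits
print(resolve(2, 1, "wait-die"))     # younger Y requests from older O -> Y dies
print(resolve(2, 1, "wound-wait"))   # younger Y requests from older O -> Y waits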

Deadlock prevention

Deadlocks can be prevented by ensuring that at least one of the four necessary conditions never occurs:

 Removing the mutual exclusion condition means that no process may have exclusive
access to a resource. This proves impossible for resources that cannot be spooled, and
even with spooled resources deadlock could still occur. Algorithms that avoid mutual
exclusion are called non-blocking synchronization algorithms.
 The "hold and wait" conditions may be removed by requiring processes to request all the
resources they will need before starting up (or before embarking upon a particular set of
operations); this advance knowledge is frequently difficult to satisfy and, in any case, is
an inefficient use of resources. Another way is to require processes to release all their
resources before requesting all the resources they will need. This too is often impractical.
(Such algorithms, such as serializing tokens, are known as the all-or-none algorithms.)
 A "no preemption" (lockout) condition may also be difficult or impossible to avoid as a
process has to be able to have a resource for a certain amount of time, or the processing
outcome may be inconsistent or thrashing may occur. However, inability to enforce
preemption may interfere with a priority algorithm. (Note: Preemption of a "locked out"
resource generally implies a rollback, and is to be avoided, since it is very costly in
overhead.) Algorithms that allow preemption include lock-free and wait-free algorithms
and optimistic concurrency control.
 The circular wait condition: Algorithms that avoid circular waits include "disable interrupts
during critical sections", and "use a hierarchy to determine a partial ordering of
resources" (where no obvious hierarchy exists, even the memory address of resources
has been used to determine ordering; see the sketch after this list), and Dijkstra's solution.
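A sketch of the resource-ordering idea in Python: if every thread acquires its locks in one fixed global order (here an arbitrary but consistent ordering by id()), a circular wait cannot form:

import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def acquire_in_order(*locks):
    """Order the given locks by a fixed global ranking to rule out circular wait."""
    return sorted(locks, key=id)    # id() serves as an arbitrary but consistent ranking

def worker(first, second):
    ordered = acquire_in_order(first, second)
    with ordered[0], ordered[1]:    # both threads take the locks in the same order
        pass                        # ... critical section using both resources ...

# Even though the threads name the locks in opposite orders, no deadlock can occur.
t1 = threading.Thread(target=worker, args=(lock_a, lock_b))
t2 = threading.Thread(target=worker, args=(lock_b, lock_a))
t1.start(); t2.start(); t1.join(); t2.join()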

Deadlock detection

Often neither deadlock avoidance nor deadlock prevention may be used. Instead deadlock
detection and process restart are used by employing an algorithm that tracks resource allocation
and process states, and rolls back and restarts one or more of the processes in order to remove
the deadlock. Detecting a deadlock that has already occurred is easily possible since the
resources that each process has locked and/or currently requested are known to the resource
scheduler or OS.

Detecting the possibility of a deadlock before it occurs is much more difficult and is, in fact,
generally undecidable, because the halting problem can be rephrased as a deadlock scenario.
However, in specific environments, using specific means of locking resources, deadlock detection
may be decidable. In the general case, it is not possible to distinguish between algorithms that
are merely waiting for a very unlikely set of circumstances to occur and algorithms that will never
finish because of deadlock.

Distributed deadlocks

Distributed deadlocks can occur in distributed systems when distributed transactions or
concurrency control is being used. Distributed deadlocks can be detected either by constructing a
global wait-for graph from local wait-for graphs at a deadlock detector or by a distributed
algorithm like edge chasing.

Phantom deadlocks are deadlocks that are detected in a distributed system but don't actually
exist - they have either been already resolved or no longer exist due to transactions aborting.

Livelock

A livelock is similar to a deadlock, except that the state of the processes involved in the livelock
constantly changes with regards to each other, none progressing. [1] Livelock is a special case of
resource starvation; the general definition only states that a specific process is not progressing.
[2]

As a real-world example, livelock occurs when two people meet in a narrow corridor, and each
tries to be polite by moving aside to let the other pass, but they end up swaying from side to side
without making any progress because they always both move the same way at the same time.

Livelock is a risk with some algorithms that detect and recover from deadlock. If more than one
process takes action, the deadlock detection algorithm can repeatedly trigger. This can be
avoided by ensuring that only one process (chosen randomly or by priority) takes action. [3]

See also

 Banker's algorithm
 Computer bought the farm
 Deadlock provision
 Dining philosophers problem
 Gridlock (in vehicular traffic)
 Hang
 Infinite loop
 Mamihlapinatapai
 Race condition
 Sleeping barber problem
 Stalemate
 Synchronization
 the SPIN model checker can be used to formally verify that a system will never enter a
deadlock.
Atomicity

In database systems, atomicity is one of the ACID transaction properties. An atomic
transaction is a series of database operations which either all occur, or none occur ("fail",
although failure is not considered catastrophic). A guarantee of atomicity prevents updates to the
database occurring only partially, which can cause greater problems than rejecting the whole
series outright.

One example of atomicity is in ordering airline tickets. Tickets must either be paid for and
reserved on a flight, or neither paid for nor reserved. It is not acceptable for customers to pay for
tickets without securing their requested flight or to reserve tickets without payment succeeding.
One atomic transaction might include the booking not only of flights, but of hotels and transport, in
exchange for the right money at the exact current exchange rate.

Orthogonality

Atomicity is not completely orthogonal to the other ACID properties of transactions. For example,
isolation relies on atomicity to roll back changes in the event of isolation failures such as
deadlock; consistency also relies on rollback in the event of a consistency violation by an illegal
transaction. Finally, atomicity itself relies on durability to ensure transactions are atomic even in
the face of external failures.

As a result of this, failure to detect errors and manually roll back the enclosing transaction may
cause isolation and consistency failures.

Implementation

Typically, atomicity is implemented by providing some mechanism to indicate which transactions
have been started and which have finished, or by keeping a copy of the data before any changes
were made. Several filesystems have developed methods for avoiding the need to keep multiple
copies of data, using journaling (see journaling file system). Many databases also support a
commit-rollback mechanism that aids in the implementation of atomic transactions. These are
usually also implemented using some form of logging/journaling to track changes. These logs
(often the metadata) are synchronized as necessary once the actual changes have been
successfully made; unrecorded entries are simply ignored during crash recovery. Although
implementations vary depending on factors such as concurrency issues, the principle of atomicity
- i.e. complete success or complete failure - remains.

Ultimately, any application-level implementation relies on operating-system functionality, which in
turn makes use of specialized hardware, to guarantee that an operation is non-interruptible by
either software attempting to re-divert system resources (see pre-emptive multitasking) or
resource unavailability (e.g. power outages). For example, POSIX-compliant systems provide the
open(2) system call, which allows applications to atomically create and open a file (using the
O_CREAT and O_EXCL flags). Other popular system calls that may assist in achieving atomic
operations from userspace include mkdir(2), flock(2), fcntl(2), rasctl(2) (NetBSD restartable
sequences), semop(2), sem_wait(2), sem_post(2), fdatasync(2), fsync(2) and rename(2).
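For example, a lock file can be created atomically from Python using the same open(2) flags (the path is illustrative):

import os

# O_CREAT | O_EXCL makes creation atomic: the call either creates the file,
# or fails with FileExistsError if another process created it first.
try:
    fd = os.open("/tmp/myapp.lock", os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)
except FileExistsError:
    print("another instance already holds the lock")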

At the hardware level, atomic operations such as test-and-set (TAS), and/or atomic
increment/decrement operations are needed. When these are lacking, or when necessary, raising
the interrupt level to disable all possible interrupts (of hardware and software origin) may be used
to implement the atomic synchronization function primitives. These low-level operations are often
implemented in machine language or assembly language.

The etymology of the phrase originates in the Classical Greek concept of a fundamental and
indivisible component; see atom.

Atomic operation

An atomic operation in computer science refers to a set of operations that can be combined so
that they appear to the rest of the system to be a single operation with only two possible
outcomes: success or failure.

Contents

• 1 Conditions
• 2 Example
o 2.1 One process
o 2.2 Two processes
• 3 Locking

• 4 See also

Conditions

To accomplish this, two conditions must be met:

1. Until the entire set of operations completes, no other process can know about the
changes being made; and
2. If any of the operations fail then the entire set of operations fails, and the state of the
system is restored to the state it was in before any of the operations began.

To the rest of the system, it appears that the set of operations either succeeds or fails all at once.
No in-between state is accessible. This is an atomic operation.
Even without the complications of multiple processing units, this can be non-trivial to implement.
As long as there is the possibility of a change in the flow of control, without atomicity there is the
possibility that the system can enter a state that is invalid as defined by the program (i.e., one that
violates an invariant of the program).

Example

One process

For example, imagine a single process is running on a computer incrementing a memory location.
To increment that memory location:

1. the process reads the value in the memory location;
2. the process adds one to the value;
3. the process writes the new value back into the memory location.

Two processes

Now, imagine two processes are running, each incrementing a single, shared memory location:

1. the first process reads the value in the memory location;
2. the first process adds one to the value;

but before it can write the new value back to the memory location it is suspended, and the second process is allowed to run:

1. the second process reads the value in the memory location, the same value that the first process read;
2. the second process adds one to the value;
3. the second process writes the new value into the memory location.

The second process is suspended and the first process is allowed to run again:

1. the first process writes a now-wrong value into the memory location, unaware that the other process has already updated the value in the memory location.

This is a trivial example. In a real system, the operations can be more complex and the errors introduced extremely subtle. For example, reading a 64-bit value from memory may actually be implemented as two sequential reads of two 32-bit memory locations. If a process has read only the first 32 bits, and the value in memory changes before it reads the second 32 bits, it will have neither the original value nor the new value but a mixed-up garbage value.

Furthermore, the specific order in which the processes run can change the results, making such
an error difficult to detect and debug.
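The torn 64-bit read described above can be avoided by declaring the shared value with an atomic type, so that loads and stores are performed indivisibly. The following C11 sketch uses hypothetical names; on platforms without native 64-bit atomics the compiler may fall back to an internal lock, but the value still cannot be observed half-updated.

/* Sketch: an _Atomic 64-bit value cannot be read or written in halves,
 * so a reader never sees a mixture of the old and new values. */
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t shared_value = 0;   /* hypothetical shared value */

uint64_t read_value(void)
{
    return atomic_load(&shared_value);      /* indivisible load */
}

void write_value(uint64_t v)
{
    atomic_store(&shared_value, v);         /* indivisible store */
}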

Locking
A clever programmer might suggest placing a lock around this "critical section". However, without hardware support in the processor, a lock is nothing more than a memory location that must itself be read, inspected, and written. Algorithms such as spin locking have been devised to implement software-only locking, but these can be inefficient.

Most modern processors provide some facility that can be used to implement locking, such as an atomic test-and-set or compare-and-swap operation, or the ability to temporarily turn off interrupts so that the currently running process cannot be suspended.
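As one common way of using a compare-and-swap facility without taking a lock at all, the following C11 sketch retries an increment until no other thread has modified the value between the load and the swap; the counter name is illustrative.

/* Sketch: lock-free increment built on compare-and-swap. If another thread
 * changes the counter between the load and the exchange, the exchange fails,
 * "expected" is refreshed with the current value, and the loop retries. */
#include <stdatomic.h>

static _Atomic int counter = 0;

void cas_increment(void)
{
    int expected = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &expected, expected + 1))
        ;  /* retry with the refreshed value of "expected" */
}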

See also

Race condition

A race condition or race hazard is a flaw in a system or process whereby the output of the
process is unexpectedly and critically dependent on the sequence or timing of other events. The
term originates with the idea of two signals racing each other to influence the output first.

Race conditions can occur in poorly-designed electronics systems, especially logic circuits, but
they can and often do also arise in computer software.

Contents

 1 Electronics
 1.1 Types
 2 Computing
 2.1 Real life examples
 2.1.1 File systems
 2.1.2 Networking
 2.1.3 Life-critical systems
 2.2 Computer security
 2.3 Asynchronous finite state machines
 3 See also

 4 External links

Electronics

A typical example of a race condition may occur in a system of logic gates, where inputs vary. If a
particular output depends on the state of the inputs, it may only be defined for steady-state
signals. As the inputs change state, a finite delay will occur before the output changes, due to the
physical nature of the electronic system. For a brief period, the output may change to an
unwanted state before settling back to the designed state. Certain systems can tolerate such
glitches, but if for example this output signal functions as a clock for further systems that contain
memory, the system can rapidly depart from its designed behaviour (in effect, the temporary
glitch becomes permanent).

For example, consider a two input AND gate fed with a logic signal X on input A and its negation,
NOT X, on input B. In theory, the output (X AND NOT X) should never be high. However, if
changes in the value of X take longer to propagate to input B than to input A then when X
changes from false to true, a brief period will ensue during which both inputs are true, and so the
gate's output will also be true.

Proper design techniques, such as Karnaugh maps (the Karnaugh map article includes a concrete example of a race condition and how to eliminate it), encourage designers to recognise and eliminate race conditions before they cause problems.

As well as these problems, logic gates can enter metastable states, which create further
problems for circuit designers.

See critical race and non-critical race for more information on specific types of race conditions.

Types

Static race conditions
These are caused when a signal and its complement are combined.

Dynamic race conditions
These result in multiple transitions when only one is intended. They are due to interaction between gates; dynamic race conditions can be eliminated by using no more than two levels of gating.

Essential race conditions
These are caused when an input has two transitions in less than the total feedback propagation time. They are sometimes cured using inductive delay-line elements to effectively increase the duration of the input signal.

Computing

Race conditions may arise in software, especially when communicating between separate
processes or threads of execution.

Here is a simple example:

Let us assume that two threads, T1 and T2, each want to increment the value of a global integer by one. Ideally, the following sequence of operations would take place:

1. Integer i = 0;
2. T1 reads the value of i from memory into a register : 0
3. T1 increments the value of i in the register: (register contents) + 1 = 1
4. T1 stores the value of the register in memory : 1
5. T2 reads the value of i from memory into a register : 1
6. T2 increments the value of i in the register: (register contents) + 1 = 2
7. T2 stores the value of the register in memory : 2
8. Integer i = 2

In the case shown above, the final value of i is 2, as expected. However, if the two threads run
simultaneously without locking or synchronization, the outcome of the operation could be wrong.
The alternative sequence of operations below demonstrates this scenario:

1. Integer i = 0;
2. T1 reads the value of i from memory into a register : 0
3. T2 reads the value of i from memory into a register : 0
4. T1 increments the value of i in the register: (register contents) + 1 = 1
5. T2 increments the value of i in the register: (register contents) + 1 = 1
6. T1 stores the value of the register in memory : 1
7. T2 stores the value of the register in memory : 1
8. Integer i = 1

The final value of i is 1 instead of the expected 2. This occurs because in the second case the increment operations are not atomic. Atomic operations are those that cannot be interrupted partway through their access to a resource, such as a memory location. In the first case, T1 was not interrupted while it was accessing the variable i, so its increment behaved as an atomic operation.
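The lost update above is easy to reproduce with ordinary threads. The following C sketch (the iteration count and names are arbitrary) has two POSIX threads increment a shared, unsynchronized counter; on most systems the printed total is smaller than expected because updates are lost exactly as in the interleaving shown.

/* Sketch: two threads incrementing a shared, non-atomic counter. The
 * statement "counter = counter + 1" is a read-modify-write sequence, so
 * concurrent updates can be lost. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

static int counter = 0;    /* shared and unsynchronized on purpose */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        counter = counter + 1;     /* non-atomic read-modify-write */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("expected %d, got %d\n", 2 * ITERATIONS, counter);
    return 0;
}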

For another example, consider the following two tasks, in pseudocode:

global integer A = 0;

// increments the value of A and prints "RX"
// activated whenever an interrupt is received from the serial controller
task Received()
{
    A = A + 1;
    print "RX";
}

// prints out only the even numbers
// activated every second
task Timeout()
{
    if (A is divisible by 2)
    {
        print A;
    }
}

Output would look something like:

0
0
0
RX
RX
2
RX
RX
4
4

Now consider this chain of events, which might occur next:

1. a timeout occurs, activating task Timeout;
2. task Timeout evaluates A, finds it divisible by 2, and so elects to execute "print A" next;
3. data is received on the serial port, causing an interrupt and a switch to task Received;
4. task Received runs to completion, incrementing A and printing "RX";
5. control returns to task Timeout;
6. task Timeout executes "print A" using the current value of A, which is 5, an odd number.

Mutexes are used to address this problem in concurrent programming.
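A hedged sketch of that approach, loosely modelled on the Received/Timeout pseudocode above but using POSIX threads instead of interrupt-driven tasks, is shown below; because both routines hold the same mutex while touching A, the value can no longer change between the evenness test and the print.

/* Sketch: protecting the shared variable A with a mutex. The structure is
 * illustrative only: real interrupt handlers would not use a blocking mutex. */
#include <pthread.h>
#include <stdio.h>

static int A = 0;
static pthread_mutex_t A_lock = PTHREAD_MUTEX_INITIALIZER;

void received(void)                  /* called when data arrives */
{
    pthread_mutex_lock(&A_lock);
    A = A + 1;
    printf("RX\n");
    pthread_mutex_unlock(&A_lock);
}

void timeout(void)                   /* called once per second */
{
    pthread_mutex_lock(&A_lock);
    if (A % 2 == 0)                  /* A cannot change before the print */
        printf("%d\n", A);
    pthread_mutex_unlock(&A_lock);
}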

Real life examples

File systems

In file systems, two or more programs may "collide" in their attempts to modify or access a file,
which could result in data corruption. File locking provides a commonly-used solution. A more
cumbersome remedy involves reorganizing the system in such a way that one unique process
(running a daemon or the like) has exclusive access to the file, and all other processes that need
to access the data in that file do so only via interprocess communication with that one process
(which of course requires synchronization at the process level).
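As a minimal sketch of the file-locking approach, the following C program takes an exclusive advisory lock with flock(2) before touching a (hypothetical) shared file; any other cooperating program that requests the same lock will block until it is released. Advisory locks only protect against programs that also use them.

/* Sketch: cooperative file locking with flock(2). All programs that follow
 * the same protocol are serialized; the filename is illustrative. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

int main(void)
{
    int fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (flock(fd, LOCK_EX) < 0) {    /* blocks until the exclusive lock is held */
        perror("flock");
        return 1;
    }

    /* ... read and modify the file safely here ... */

    flock(fd, LOCK_UN);              /* release the lock */
    close(fd);
    return 0;
}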

A different form of race hazard exists in file systems where unrelated programs may affect each
other by suddenly using up available resources such as disk space (or memory, or processor
cycles). Software not carefully designed to anticipate and handle this rare situation may then
become quite fragile and unpredictable. Such a risk may be overlooked for a long time in a
system that seems very reliable. But eventually enough data may accumulate or enough other
software may be added to critically destabilize many parts of a system. Probably the best known
example of this occurred with the near-loss of the Mars Rover "Spirit" not long after landing, but
this is a commonly overlooked hazard in many computer systems. A solution is for software to
request and reserve all the resources it will need before beginning a task; if this request fails then
the task is postponed, avoiding the many points where failure could have occurred. (Alternately,
each of those points can be equipped with error handling, or the success of the entire task can be
verified before proceeding afterwards.) A more common but incorrect approach is to simply verify
that enough disk space (for example) is available before starting a task; this is not adequate
because in complex systems the actions of other running programs can be unpredictable.

Networking

In networking, consider a distributed chat network like IRC, where a user acquires channel-
operator privileges in any channel he starts. If two users on different servers, on different ends of
the same network, try to start the same-named channel at the same time, each user's respective
server will grant channel-operator privileges to each user, since neither server will yet have
received the other server's signal that it has allocated that channel. (Note that this problem has
been largely solved by various IRC server implementations.)
In this case of a race condition, the concept of the "shared resource" covers the state of the
network (what channels exist, as well as what users started them and therefore have what
privileges), which each server can freely change as long as it signals the other servers on the
network about the changes so that they can update their conception of the state of the network.
However, the latency across the network makes possible the kind of race condition described. In
this case, heading off race conditions by imposing a form of control over access to the shared
resource—say, appointing one server to control who holds what privileges—would mean turning
the distributed network into a centralized one (at least for that one part of the network operation).
Where users find such a solution unacceptable, a pragmatic solution can have the system 1)
recognize when a race condition has occurred; and 2) repair the ill effects.

Life-critical systems

Software flaws in life-critical systems can be disastrous. Race conditions were among the flaws
in the Therac-25 radiation therapy machine, which led to the death of five patients and injuries to
several more. Another example is the Energy Management System provided by GE Energy and
used by Ohio-based FirstEnergy Corp. (and by many other power facilities as well). A race
condition existed in the alarm subsystem; when three sagging power lines were tripped
simultaneously, the condition prevented alerts from being raised to the monitoring technicians,
delaying their awareness of the problem. This software flaw eventually led to the North American
Blackout of 2003. (GE Energy later developed a software patch to correct the previously
undiscovered error.)

Computer security

A specific kind of race condition involves checking a predicate (e.g. for authentication) and then acting on it, while the state can change between the time of check and the time of use. When this kind of bug exists in security-conscious code, it creates a security vulnerability called a time-of-check-to-time-of-use (TOCTTOU) bug.
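A classic illustration is the access(2)/open(2) pair sketched below: the check and the use are separate system calls, so another process can replace the file (for example with a symbolic link to a more sensitive file) in the window between them. The filename is hypothetical; the usual mitigations are to drop the separate check entirely or to perform all further operations on the already-open file descriptor.

/* Sketch: a time-of-check-to-time-of-use (TOCTTOU) bug. Between the
 * access(2) check and the open(2) call the file named by "path" can be
 * swapped, so the check no longer describes what is actually opened. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "report.txt";        /* hypothetical filename */

    if (access(path, W_OK) == 0) {          /* time of check */
        /* window of vulnerability: the file can be replaced here */
        int fd = open(path, O_WRONLY);      /* time of use */
        if (fd >= 0) {
            write(fd, "data\n", 5);
            close(fd);
        }
    } else {
        perror("access");
    }
    return 0;
}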

Asynchronous finite state machines

Even after ensuring that single-bit transitions occur between states, an asynchronous state machine will fail if multiple inputs change at the same time. The solution is to design the machine so that each state is sensitive to only one input change.
