

DISTRIBUTED DATABASES

Authored by: Sathyanarayana S.V, Assistant Professor, Department of Information Science & Engineering, J.N.N. College of Engineering, Shimoga – 577204. Email: sathya_sv@rediffmail.com

CONTENTS

1. Preface
2. UNIT – 1: Introduction to Computer Networks
3. UNIT – 2: Distributed Computing System – An Overview
4. UNIT – 3: Distributed Databases – An Overview
5. UNIT – 4: Levels of Distribution Transparency
6. UNIT – 5: Distributed Database Design
7. UNIT – 6: Overview of Query Processing
8. UNIT – 7: Transaction Management and Concurrency Control
9. UNIT – 8: Time and Synchronization
References

PREFACE
The nineteen seventies saw computers being used extensively for building powerful integrated database systems, and these systems found a large number of applications. The decade also witnessed the excellent price-performance ratio offered by microprocessor-based workstations over mainframe systems. The eighties saw the advent of computer networks, extensively allowing the connection of different computers, exchange of data, and resource sharing. This led to the distributed database, which is an integrated database built on top of a computer network rather than on a single computer. The data that form the database are stored at different sites of the computer network. A lot of research work has been done to solve the problems faced in building and implementing distributed databases. Knowledge of traditional databases and computer networks is necessary to integrate them into a new discipline – "Distributed Computing System". This has led to the concepts and design issues of Distributed Operating Systems, which are now commercially available.

We begin the discussion with an introduction to computer networking in UNIT-1, followed by a discussion on Distributed Computing Systems in UNIT-2, where we present several models. The differences between distributed and centralized databases are discussed in UNIT-3. UNIT-4 and UNIT-5 describe transparency and design issues of distributed databases. Query processing is explained with the help of relational algebra and relational calculus in UNIT-6. Finally, UNIT-7 and UNIT-8 explain relevant issues in distributed data processing, such as transaction management and time & synchronization.

The author depended heavily on the books written by Stefano Ceri, Giuseppe Pelagatti, Pradeep K. Sinha, and George Coulouris in preparing this study material, and he is indebted to the above authors. I wish to thank S.N. Jagadeesha and Suma S.S, who have carefully reviewed the manuscript and suggested many important improvements. The author would like to express his heartfelt thanks to the SOU for entrusting him with this work. Last but not the least, the author warmly thanks the technical staff of the CS & E department and all his friends for their help in preparing this course material.

- Sathyanarayana S.V

UNIT – 1: INTRODUCTION TO COMPUTER NETWORKS

Structure
1.0 Objectives
1.1 Introduction
1.2 Network Types
1.3 LAN Technologies
    1.3.1 LAN Topologies
    1.3.2 Medium Access Protocols
        1.3.2.1 CSMA/CD Protocol
        1.3.2.2 Token Ring Protocol
1.4 WAN Technologies
    1.4.1 Switching Techniques
        1.4.1.1 Circuit Switching
        1.4.1.2 Packet Switching
    1.4.2 Routing Techniques
        1.4.2.1 Static Routing
        1.4.2.2 Dynamic Routing
1.5 Communication Protocols
    1.5.1 The OSI Reference Model
1.6 Summary

1.0 Objectives:
Computer networks provide the necessary means for communication between the computing elements of a system. At the end of this unit, you will know the basic concepts of computer networking. We outline the characteristics of local and wide area networks, and we summarize the principles of protocols and protocol layering. In total, this unit will help you to understand the advanced concept of task execution called distributed computing.

1.1 Introduction:
A computer network is a communication system that links end systems by communication lines and software protocols in order to exchange data between two processes running on different end systems of the network. The end systems are often referred to as nodes, sites, hosts, computers, machines, and so on. The nodes may vary in size and function. Size-wise, a node may be a small microprocessor, a workstation, a minicomputer, or a large supercomputer. Function-wise, a node may be a dedicated system (such as a print server or a file server) without any capability for interactive users, a single personal computer, or a general-purpose time-sharing system.

A distributed computing system is basically a computer network whose nodes have their own local memory and also other hardware and software resources. A distributed system, therefore, relies entirely on the underlying computer network for the communication of data and control information between the nodes of which it is composed. The performance and reliability of a distributed system depend to a great extent on the performance and reliability of the underlying computer network. Hence, a basic knowledge of computer networks is required for the study of distributed operating systems. Therefore, this unit deals with important aspects of networking concepts and designs, emphasizing the aspects needed for designing distributed operating systems.

1.2 Network Types:
Networks are broadly classified into two types: local area networks (LANs) and wide-area networks (WANs). The WANs are also referred to as long-haul networks. The key characteristics that are often used to differentiate between these two types of networks are as follows:
1. Geographic distribution: The main difference between the two types of networks is the way in which they are geographically distributed. A LAN is restricted to a limited geographic coverage of a few kilometers. Therefore, LANs typically provide communication facilities within a building or a campus, whereas WANs may operate nationwide or even worldwide.
2. Data rate: Data transmission rates are usually much higher in LANs than in WANs. Transmission rates in LANs usually range from 0.2 megabit per second (Mbps) to 1 gigabit per second (Gbps), whereas transmission rates in WANs usually range from 1200 bits per second to slightly over 1 Mbps.
3. Error rate: Local area networks generally experience fewer data transmission errors than WANs do. Typically, bit error rates are in the range of 10^-8 to 10^-12 with LANs, as opposed to 10^-5 to 10^-7 with WANs.

4. Communication link: The most common communication links used in LANs are twisted pair, coaxial cable, and fiber optics. On the other hand, WANs are physically distributed over a large geographic area, and the communication links used are by default relatively slow and unreliable. Telephone lines, microwave links, and satellite channels are used as links in WANs.
5. Ownership: A LAN is typically owned by a single organization because of its limited geographic coverage. A WAN, however, is usually formed by interconnecting multiple LANs, each of which may belong to a different organization. Therefore, administrative and maintenance complexities and costs for LANs are usually much lower than for WANs.
6. Communication cost: The overall communication costs of a LAN are much lower than those of a WAN. The main reasons for this are lower error rates, simple (or absent) routing algorithms, and lower administrative and maintenance costs. Moreover, the cost to transmit data in a LAN is negligible since the transmission medium is usually owned by the user organization. With a WAN, however, this cost may be very high because the transmission media used are leased lines or public communication systems, such as telephone lines, microwave links, and satellite channels.

Networks that share some of the characteristics of both LANs and WANs are referred to as Metropolitan Area Networks (MANs). MANs usually cover a wider geographic area (up to about 50 km in diameter) than LANs and frequently operate at speeds very close to LAN speeds. A main objective of MANs is to interconnect LANs located in an entire city or metropolitan area. Communication links commonly used for MANs are coaxial cable and microwave links.

1.3 LAN TECHNOLOGIES:
This section presents a description of the topologies and principles of operation of LANs.

1.3.1 LAN Topologies:
The two commonly used network topologies for constructing LANs are multi-access bus and ring.

In a simple multi-access bus network, all sites are directly connected to a single transmission medium (called the bus) that spans the whole length of the network (Fig 1.1). The bus is passive and is shared by all the sites for any message transmission in the network. Each site is connected to the bus by a drop cable using a T-connection or tap. Broadcast communication is used for message transmission.

Fig 1.1: Simple multi-access bus network topology

That is, a message is transmitted from one site to another by placing it on the shared bus. An address designator is associated with the message. As the message travels on the bus, each site checks whether the message is addressed to it, and the addressed site picks up the message. A variant of the simple multi-access bus network topology is the multi-access branching bus network topology. In such a network, two or more simple multi-access bus networks are interconnected using repeaters (Fig 1.2). Repeaters are hardware devices used to connect cable segments. They amplify and copy electric signals from one segment of a network to the next segment.
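To make the address-designator idea concrete, the following sketch (not part of the original text; the class names Bus and Site are invented for illustration) shows broadcast delivery on a shared bus: every attached site sees every frame, and only the site whose address matches the designator keeps it.

```python
class Site:
    def __init__(self, address):
        self.address = address
        self.received = []

    def on_frame(self, frame):
        # Each site inspects the address designator; only the addressed
        # site picks the message up, every other site ignores it.
        if frame["to"] == self.address:
            self.received.append(frame["data"])


class Bus:
    """A passive shared medium: a frame placed on it is seen by all sites."""
    def __init__(self):
        self.sites = []

    def attach(self, site):
        self.sites.append(site)

    def transmit(self, frame):
        for site in self.sites:          # broadcast to every tap on the bus
            site.on_frame(frame)


if __name__ == "__main__":
    bus = Bus()
    a, b, c = Site("A"), Site("B"), Site("C")
    for s in (a, b, c):
        bus.attach(s)
    bus.transmit({"to": "C", "data": "hello"})
    print(c.received)   # ['hello'] -- only site C keeps the frame
```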


Fig 1.2: Multi-access branching bus network topology

In a ring network, each site is connected to exactly two other sites so that a loop is formed (Fig 1.3). A separate link is used to connect two adjacent sites, and the links are interconnected using repeaters. Data is transmitted in one direction around the ring by signaling between sites. That is, to send a message from one site to another, the source site writes the destination site's address in the message header and places the message on the ring. As the message circulates, each site checks whether the message is addressed to it; if not, the site passes the message on to its own neighbor. In this manner, the message circulates around the ring until some site (to which the message is addressed) removes the message from the ring; otherwise, it is removed by the source site (which sent the message). In the latter case, the message always circulates for one complete round on the ring. Generally, in ring networks, one of the sites acts as a monitor site to ensure that a message does not circulate indefinitely (that is, in case the source site or destination site fails). The monitor site also performs other jobs, such as housekeeping functions, ring utilization, and handling of other error conditions.

Fig 1.3: Ring network topology

1.3.2 Medium-Access Control Protocols:
In both multi-access bus and ring networks, all the sites of the network share a single channel, resulting in a multi-access environment. In such an environment, it is possible that several sites try to transmit information over the shared channel simultaneously. In this case, the transmitted information may become scrambled and must be discarded. The concerned sites must be notified about the discarded information, so that they can retransmit their information. If no special provisions are made, this situation may be repeated, degrading performance. Therefore, special schemes are needed in a multi-access environment to control access to the shared channel. These schemes are known as medium-access control protocols. Clearly, in a multi-access environment, the use of a medium having a high raw data rate alone is not sufficient. The medium-access control protocol used must also provide for efficient bandwidth use of the medium. Therefore, the medium-access control protocol has a significant effect on the overall performance of a computer network, and often it is by such protocols that networks differ the most. The three important performance objectives of a medium-access control protocol are high throughput, high channel utilization, and low message delay. In addition to meeting these performance objectives, other desirable characteristics of a medium-access protocol are:
- Fairness: unless a priority scheme is intentionally implemented, the protocol should provide equal opportunity to all sites in allowing them to transmit their information over the shared medium.
- Scalability: sites should require only a minimum knowledge of the network structure (topology, size, or relative location of other sites), and addition, removal, or movement of a site from one place to another in the network should be possible without the need to change the protocol. Furthermore, it should not be necessary to have knowledge of the exact value of the end-to-end propagation delay of the network for the protocol to function correctly.
- Reliability: centralized control should be avoided and the operation of the protocol should be completely distributed.
- Support for real-time applications: the protocol should exhibit bounded delay properties. That is, the maximum message transfer delay from one site to another in the network must be known and fixed.

Several protocols have been developed for medium-access control in a multi-access environment. Of these, the Carrier Sense Multiple Access with Collision Detection (CSMA/CD) protocol is popular for bus networks, and the token ring protocol is used for ring networks. These protocols are described in the following sections.


1.3.2.1 CSMA/CD Protocol:
The CSMA/CD scheme employs decentralized control of the shared medium. In this scheme, each site has equal status, in the sense that there is no central controller site. The sites contend with each other for use of the shared medium, and the site that first gains access during an idle period of the medium uses the medium for the transmission of its own message. Obviously, occasional collisions of messages may occur when more than one site senses the medium to be idle and transmits a message simultaneously. The scheme uses collision detection, recovery, and controlled retransmission mechanisms to deal with this problem. Therefore, the scheme comprises the following three mechanisms:

Carrier sense and defer mechanism. Whenever a site wishes to transmit a packet, it first listens for the presence of a signal (known as a carrier, by analogy with radio broadcasting) on the shared medium. If the medium is found to be free (no carrier is present on the medium), the site starts transmitting its packet. Otherwise, the site defers its packet transmission and waits (continues to listen) until the medium becomes free. The site initiates its packet transmission as soon as it senses the medium to be free.

Collision detection mechanism. Unfortunately, carrier sensing does not prevent all collisions because of the nonzero propagation delay of the shared medium. Obviously, collisions occur only within a short time interval following the start of transmission, since after this interval all sites will detect that the medium is not free and defer transmission. This time interval is called the collision window or collision interval and is equal to the amount of time required for a signal to propagate from one end of the shared medium to the other and back again. If a site transmits a packet, it must listen to the shared medium for a time period at least equal to the collision interval in order to detect whether its packet has suffered a collision. Collision avoidance by listening to the shared medium for at least the collision interval before initiating a packet transmission leads to inefficient utilization of the medium when collisions are rare. Therefore, instead of trying to avoid collisions, the CSMA/CD scheme allows collisions to occur, detects them, and then takes the necessary recovery actions: a site that detects a collision aborts its transmission and broadcasts a jamming signal so that all other sites also become aware of the collision.


Controlled retransmission mechanism. After a collision, the packets that became corrupted due to the collision must be retransmitted. If all the transmitting stations whose packets were corrupted by the collision attempt to retransmit their packets immediately after the jamming signal, a collision will probably occur again. To minimize repeated collisions and to achieve channel stability under overload conditions, a controlled retransmission strategy is used in which the competition for the shared medium is resolved using a suitable algorithm.
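The three mechanisms described above can be pictured with a small, time-stepped simulation. This is only an illustrative sketch, not the text's own algorithm: it assumes a slotted shared medium and uses binary exponential backoff as one possible "suitable algorithm" for controlled retransmission, which the text deliberately leaves unspecified.

```python
import random

def csma_cd(num_sites=4, max_slots=200, seed=1):
    """Toy slotted CSMA/CD: every site has exactly one frame to send."""
    random.seed(seed)
    backoff = {s: 0 for s in range(num_sites)}   # slots each site still defers
    attempts = {s: 0 for s in range(num_sites)}
    done = set()
    for slot in range(max_slots):
        # Carrier sense and defer: a site transmits only if it still has a
        # frame to send and its backoff has expired.
        ready = [s for s in range(num_sites) if s not in done and backoff[s] == 0]
        for s in backoff:
            if backoff[s] > 0:
                backoff[s] -= 1
        if len(ready) == 1:
            done.add(ready[0])                    # lone sender: success
        elif len(ready) > 1:
            # Collision detection + controlled retransmission: every
            # colliding site backs off a random number of slots
            # (binary exponential backoff).
            for s in ready:
                attempts[s] += 1
                backoff[s] = random.randint(0, 2 ** min(attempts[s], 10) - 1)
        if len(done) == num_sites:
            return slot + 1                       # slots needed to drain all frames
    return None

if __name__ == "__main__":
    print("all frames delivered after", csma_cd(), "slots")
```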

1.3.2.2 The Token Ring Protocol:
This scheme also employs decentralized control of the shared medium. In this scheme, a single token is circulated among the sites in the system, and the site in possession of the token has access to the shared medium. A token is a special type of message (having a unique bit pattern) that entitles its holder to use the shared medium for transmitting its messages. A special field in the token indicates whether it is free or busy. The token is passed from one site to the adjacent site around the ring in one direction. A site that has a message ready for transmission must wait until the token reaches it. On receiving the token, the site checks its status. If the token is busy, the site simply passes it on to the next site. If the token is free, the site sets it to busy, attaches its message to the token, and transmits its own message along with the token. A site receiving a busy token checks whether the message attached to it is addressed to that site. If it is, the site retrieves the message attached to the token and forwards the token, without the attached message, to the next site in the ring. When the busy token returns to the sending site after one complete round, the sending site sets it to free and passes it to the next site, allowing the next site to transmit its message (if it has any). The free token circulates from one site to another until it reaches a site that has some message to transmit. To prevent a site from holding the token for a very long time, a token-holding timer is used to control the length of time for which a site may occupy the token.
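A minimal sketch (illustrative only; the site names and messages are invented) of the token-passing discipline just described: a single token circulates around the ring, a site with a pending message seizes the token when it arrives free, the destination retrieves the attached message, and the token is freed when it returns to the sender.

```python
def token_ring(pending, rounds=3):
    """pending: dict site -> (destination, message) or None."""
    sites = sorted(pending)
    n = len(sites)
    token = {"busy": False, "src": None, "dst": None, "msg": None}
    log = []
    pos = 0
    for _ in range(rounds * n):
        site = sites[pos]
        if token["busy"]:
            if site == token["dst"] and token["msg"] is not None:
                # The destination retrieves the message and forwards the
                # (still busy) token without it.
                log.append(f"{site} received {token['msg']!r} from {token['src']}")
                token["msg"] = None
            elif site == token["src"]:
                # The busy token has come back round: the sender frees it.
                token = {"busy": False, "src": None, "dst": None, "msg": None}
        elif pending[site] is not None:
            # A site with a pending message seizes the free token.
            dst, msg = pending[site]
            token = {"busy": True, "src": site, "dst": dst, "msg": msg}
            pending[site] = None
            log.append(f"{site} sent {msg!r} to {dst}")
        pos = (pos + 1) % n            # pass the token to the adjacent site
    return log

if __name__ == "__main__":
    events = token_ring({"A": ("C", "ping"), "B": None, "C": None, "D": ("A", "pong")})
    for line in events:
        print(line)
```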


To guarantee reliable operation, the token has to be protected against loss or duplication. That is, if the token gets lost due to a site failure, the system must detect the loss and generate a new token; the monitor site usually does this. Moreover, if a site i crashes, the ring must be reconfigured so that site i-1 will send the token directly to site i+1. An advantage of the token protocol is that the message delay can be bounded because of the absence of collisions. Another advantage is that it can work with both large and small packet sizes, as well as variable-size packets; in principle, a message attached to the token may be of almost any length. A major disadvantage, however, is the initial waiting time to receive a free token, even at very light loads. This initial waiting time can be appreciable, especially in large rings.

1.4 WAN Technologies:
A WAN of computers is constructed by interconnecting computers that are separated by large distances; they may be located in different cities or even in different countries. In general, no fixed regular network topology is used for interconnecting the computers of a WAN. Moreover, different communication media may be used for different links of a WAN. For example, in a WAN, computers located in the same country may be interconnected by coaxial cables (telephone lines), but a communications satellite may be used to interconnect two computers that are located in different countries. The computers of a WAN are not connected directly to the communication channels but are connected to hardware devices called packet-switching exchanges (PSEs), which are special-purpose computers dedicated to the task of data communication. Therefore, the communication channels of the network interconnect the PSEs, which actually perform the task of data communication across the network (Fig. 1.4). To send a message packet to another computer on the network, a computer sends the packet to the PSE to which it is connected. The packet is transmitted from the sending computer's PSE to the receiving computer's PSE, possibly via other PSEs. The actual mode of packet transmission and the route used for forwarding a packet from its sending computer's PSE to its receiving computer's PSE depend on the switching and routing techniques used by the PSEs of the network. Various possible options for these techniques are described next. When the packet reaches its receiving computer's PSE, it is delivered to the receiving computer.

Fig 1.4: A WAN using packet-switching exchanges

1.4.1 Switching Techniques:
We saw that in a WAN, communication is achieved by transmitting a packet from its source computer to its destination computer through two or more PSEs. The PSEs provide a switching facility to move a packet from one PSE to another until the packet reaches its destination. That is, a PSE removes a packet from an input channel and places it on an output channel. Network latency is highly dependent on the switching technique used by the PSEs of the WAN. The two most commonly used schemes are circuit switching and packet switching. They are described next.

1.4.1.1 Circuit Switching:
This scheme is similar to that used in the public telephone system. In this system, when a telephone call is made, a dedicated circuit is established by the telephone switching office from the caller's telephone to the callee's telephone. Once this circuit is established, the only delay involved in the communication is the time required for the propagation of the electromagnetic signal through all the wires and switches. While it might be hard to obtain a circuit sometimes (such as during a busy hour), once the circuit is established, exclusive access to it is guaranteed until the call is terminated.

In this method, before data transmission starts, a physical circuit is constructed between the sender and receiver computers during the circuit establishment phase. During this phase, the channels constituting the circuit are reserved exclusively for the circuit; hence there is no need for buffers at the intermediate PSEs. Once the circuit is established, all packets of the data are transferred one after another through the dedicated circuit without being buffered at intermediate sites, so the packets appear to form a continuous data stream. Finally, in the circuit termination phase, the circuit is torn down as the last packet of the data is transmitted. As soon as the circuit is torn down, the channels that were reserved for the circuit become available for use by others. If a circuit cannot be established because a desired channel is busy (being used), the circuit is said to be blocked. Depending on the way blocked circuits are handled, the partial circuit may be torn down, with establishment to be attempted later.

The main advantage of the circuit-switching technique is that once the circuit is established, data is transmitted with no delay other than the propagation delay, which is negligible. Since the full capacity of the circuit is available for exclusive use by the connected pair of computers, the transmission time required to send a message can be known and guaranteed after the circuit has been successfully established. However, the method requires additional overhead during the circuit establishment and circuit disconnection phases, and channel bandwidth may be wasted if the connected pair of computers does not utilize the channel capacities of the path forming the circuit efficiently. Therefore, the method is considered suitable only for long continuous transmissions or for transmissions that require a guaranteed maximum transmission delay. It is the preferred method for transmission of voice and real-time data in distributed applications. Circuit switching is used in the Public Switched Telephone Network (PSTN).


1.4.1.2 Packet Switching:
In this method, instead of establishing a dedicated path between a sender and receiver pair (of computers), the channels are shared for transmitting packets of different sender-receiver pairs. That is, a channel is occupied by a sender-receiver pair only while a single packet of that pair's message is being transmitted; the channel may then be used for transmitting either another packet of the same sender-receiver pair or a packet of some other sender-receiver pair. In this method, each packet of a message contains the address of the destination computer, so that it can be sent to its destination independently of all other packets. Notice that different packets of the same message may take different paths through the network, and at the destination computer the receiver may get the packets in an order different from the order in which they were sent. Therefore, at the destination computer, the packets have to be properly reassembled into a message. When a packet reaches a PSE (refer to Fig. 1.4), the packet is temporarily stored there in a packet buffer. The packet is then forwarded to a selected neighboring PSE when the next channel becomes available and the neighboring PSE has an available packet buffer. Hence the actual path taken by a packet to its destination is dynamic, because the path is established as the packet travels along. The packet-switching technique is also known as store-and-forward communication because every packet is temporarily stored by each PSE along its route before it is forwarded to another PSE.

As compared to circuit switching, packet switching is suitable for transmitting small amounts of data that are bursty in nature. The method allows efficient usage of channels because the communication bandwidth of a channel is shared for transmitting several messages. Furthermore, the dynamic selection of the actual path to be taken by a packet gives the network considerable reliability, because failed PSEs or channels can be ignored and alternate paths may be used. For example, in the WAN of Figure 1.4, if channel 2 fails, the message can still be sent from computer A to D using the path 1-3. On the other hand, due to the need to buffer each packet at every PSE and to reassemble the packets at the destination computer, the overhead incurred is large. Therefore, the method is inefficient for transmitting large messages. Another drawback of the method is that there is no guarantee of how long it takes a message to go from its source computer to its destination computer, because the time taken for each packet depends on the route chosen for the packet, along with the volume of data being transferred along this route. Packet switching is used in the X.25 public packet network and the Internet.


1.4.2 Routing Techniques:
In a WAN, when multiple paths exist between the source and destination computers of a packet, any one of the paths may be used to transfer the packet. For example, in the WAN of Figure 1.4, there are two paths between computers E and F (3-4 and 1-2-4), and any one of the two may be used to transmit a packet from computer E to F. The selection of the actual path to be used for transmitting a packet is determined by the routing technique used. An efficient routing technique is crucial to the overall performance of the network. This requires that the routing decision process be as fast as possible, so as to keep network latency low; hence the routing decision process should be easily implementable in hardware. Furthermore, the decision process usually should not require global state information of the network, because gathering such information is a difficult task and creates additional traffic in the network. Routing algorithms are usually classified based on the following three attributes:
- Place where routing decisions are made
- Time constant of the information upon which the routing decisions are based
- Control mechanism used for dynamic routing

Note that routing techniques are not needed in LANs, because the sender of a message simply puts the message on the communication channel and the receiver takes it off the channel. There is no need to decide the path to be used for transmitting the message from the sender to the receiver. Out of the three attributes, let us consider only the second one, as it is very important as far as our requirement is concerned. According to this attribute, routing algorithms are classified as follows:

Static routing. In this method, routing tables (stored on PSEs) are set once and do not change for very long periods of time. They are changed only when the network undergoes major modifications. Static routing is also known as fixed or deterministic routing. Static routing is simple and easy to implement.


However, static routing makes poor use of network bandwidth and causes blocking of a packet even when alternative paths are available for its transmission. Hence, static routing schemes are susceptible to component failures.

Dynamic routing. In this method, routing tables are updated relatively frequently, reflecting shorter-term changes in the network environment. The dynamic routing strategy is also known as adaptive routing because it has a tendency to adapt to the dynamically changing state of the network, such as the presence of faulty or congested channels. Dynamic routing schemes can use alternative paths for packet transmissions, making more efficient use of network bandwidth and providing resilience to failures. The latter property is particularly important for large-scale architectures, since an expanding network size increases the probability of encountering a faulty network component. In dynamic routing, however, packets of a message may arrive out of order at the destination computer. This problem can be solved by appending a sequence number to each packet and properly reassembling the packets at the destination computer.

The path selection policy for dynamic routing may be either minimal or non-minimal. In the minimal policy, the selected path is one of the shortest paths between the source and destination pair of computers; therefore, every channel visited will bring the packet closer to the destination. On the other hand, in the non-minimal policy, a packet may follow a longer path, usually in response to current network conditions. If the non-minimal policy is used, care must be taken to avoid a situation in which the packet will continue to be routed through the network but never reach the destination.
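As a small illustration of the two classes of routing just described (a sketch under assumed data, not something from the original text): a static scheme is essentially a fixed next-hop table, while a dynamic, minimal-policy scheme can recompute a shortest path from the channel costs currently believed to be valid, here using Dijkstra's algorithm.

```python
import heapq

# Static routing: a fixed next-hop table, set once per PSE (values are invented).
STATIC_TABLE = {("E", "F"): "3", ("A", "D"): "1"}   # (source, destination) -> next PSE

def static_next_hop(src, dst):
    return STATIC_TABLE.get((src, dst))

# Dynamic, minimal-policy routing: recompute the shortest path from the
# channel costs currently believed valid (failed channels simply omitted).
def shortest_path(graph, src, dst):
    """graph: {node: {neighbour: cost}}; returns (cost, path) via Dijkstra."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, {}).items():
            if nxt not in visited:
                heapq.heappush(queue, (cost + w, nxt, path + [nxt]))
    return float("inf"), []

if __name__ == "__main__":
    # Hypothetical PSE topology; edge weights stand for current link delays.
    graph = {"1": {"2": 1, "3": 4}, "2": {"1": 1, "4": 2},
             "3": {"1": 4, "4": 1}, "4": {"2": 2, "3": 1}}
    print(static_next_hop("E", "F"))        # -> '3'
    print(shortest_path(graph, "1", "4"))   # -> (3, ['1', '2', '4'])
```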

1.5 Communication Protocols:
For transmission of message data comprised of multiple packets, the sender and receiver must also agree upon the method used for identifying the first packet and the last packet of a message. Moreover, agreement is also needed for handling duplicate messages, avoiding buffer overflows, and assuring proper message sequencing. Network designers define all such agreements, needed for communication between the communicating parties, in terms of rules and conventions. The term protocol is used to refer to a set of such rules and conventions.

Computer networks are implemented using the concept of layered protocols. According to this concept, the protocols of a network are organized into a series of layers in such a way that each layer contains protocols for exchanging data and providing functions in a logical sense with the peer entities at other sites in the network. Entities in adjacent layers interact in a physical sense through the common interface defined between the two layers, by passing parameters such as headers, trailers, and data parameters. The main reasons for using the concept of layered protocols in network design are as follows:
- The protocols of a network are fairly complex. Designing them in layers makes their implementation more manageable.
- Layering of protocols provides well-defined interfaces between the layers, so that a change in one layer does not affect an adjacent layer. That is, the various functionalities can be partitioned and implemented independently, so that each one can be changed as technology improves without the other ones being affected. For example, a change to a routing algorithm in a network control program should not affect the functions of message sequencing, which is located in another layer of the network architecture.
- Layering of protocols also allows interaction between functionally paired layers in different locations. This concept aids in permitting the distribution of functions to remote sites.
The terms protocol suite, protocol family, or protocol stack are used to refer to the collection of protocols (of all layers) of a particular network system.
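A minimal sketch of the layering idea (illustrative only, not the OSI specification itself): each layer on the sending side wraps the data with its own header on the way down, and each peer layer on the receiving side strips exactly its own header on the way up, so a change inside one layer does not disturb the others.

```python
LAYERS = ["application", "presentation", "session",
          "transport", "network", "data-link", "physical"]

def send(data):
    """Encapsulation: each layer prepends its own header on the way down."""
    pdu = data
    for layer in LAYERS:
        pdu = f"[{layer}-hdr]{pdu}"
    return pdu                          # what travels on the wire

def receive(pdu):
    """Decapsulation: each peer layer strips exactly its own header."""
    for layer in reversed(LAYERS):
        header = f"[{layer}-hdr]"
        assert pdu.startswith(header), f"malformed frame at the {layer} layer"
        pdu = pdu[len(header):]
    return pdu

if __name__ == "__main__":
    wire = send("hello")
    print(wire)            # physical header outermost, application header innermost
    print(receive(wire))   # -> 'hello'
```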

1.5.1 The OSI Reference Model:
The basic goal of communication protocols for network systems is to allow remote computers to communicate with each other and to allow users to access remote resources. On the other hand, the basic goal of communication protocols for distributed systems is not only to allow users to access remote resources but also to do so in a transparent manner. Several standards and protocols for network systems are already available.


The number of layers, the name of each layer, and the functions of each layer may differ from one network to another. To make the job of network communication protocol designers easier, the International Standardization Organization (ISO) has developed a reference model that identifies seven standard layers and defines the jobs to be performed at each layer. This model is called the Open System Interconnection Reference Model (OSI model). It is a guide, not a specification. It provides a framework in which standards can be developed for the services and protocols at each layer. To provide an understanding of the structure and functioning of layered network protocols, a brief description of the OSI model is given here, as shown in Fig 1.5. It is a seven-layer architecture in which a separate set of protocols is defined for each layer. Thus each layer has an independent function and deals with one or more specific aspects of the communication. The seven layers are Physical, Data link, Network, Transport, Session, Presentation, and Application. Let us deal with them one by one in detail.
o Physical layer: This specifies the physical link interconnection, including electrical/photonic characteristics.
o Data link layer: This specifies how data travels between two end points of a communication link (for example, a host and a packet switch). At this level, data is delivered in a frame, which consists of a stream of binary 0s and 1s to which checksum techniques are applied to detect errors. The Ethernet protocol is one example of this.
o Network layer: This defines the basic unit of transfer across the network and includes the concepts of multiplexing and routing. At this level the software assembles a packet in the form the network expects and uses layer 2 to transfer it over single links.
o Transport layer: This provides end-to-end reliability by having the destination host communicate with the source host, to compensate for the fact that multiple networks with different qualities of service may have been utilized.


o Session layer: This describes how protocol software can be organized to handle all the functionality needed by the application programs, particularly to maintain transfer-level synchronization.
[Fig 1.5: The architecture of the OSI model – Site 1 and Site 2 each run a seven-layer protocol stack (Layer 7 Application, Layer 6 Presentation, Layer 5 Session, Layer 4 Transport, Layer 3 Network, Layer 2 Data link, Layer 1 Physical). A process on Site 1 communicates with a process on Site 2; peer layers at the two sites exchange data through the corresponding application, presentation, session, transport, network, data-link, and physical protocols, adjacent layers interact through well-defined interfaces, and the two sites are joined at the bottom by the physical network.]


o Presentation layer: This includes functions required for the basic encoding rules used in transferring information, be it text, voice, video, or multimedia.
o Application layer: This includes application programs like electronic mail or file transfer programs.

Check Your Progress – 1: Fill up the blanks:
A. The end systems are called as __________.
B. Networks are broadly classified as __________ and __________.
C. The two commonly used network topologies for LANs are __________ and __________.
D. An example of a medium access protocol is __________.
E. The two types of switching techniques are __________ and __________.
F. An example of a communication protocol is __________.
G. A reference model suggested by ISO is known as __________.

Check Your Progress – 2: Answer the following questions:
1) Explain the different network types.
2) What are the key characteristics used to differentiate the different network types? Explain.
3) Describe the LAN topologies.
4) Discuss the importance of routing techniques and briefly explain the different types.
5) What do you mean by switching? Differentiate circuit and packet switching techniques.
6) Explain the different medium access protocols in brief.
7) Describe the seven-layer architecture suggested by ISO.

1.6 Summary:
A distributed system relies entirely on the underlying computer network for the communication of data and control information between the nodes of which it is composed. A computer network is a communication system that links the nodes by communication lines and software protocols to exchange data between two processes running on different nodes of the network.


Based on characteristics such as geographic distribution of nodes, data rate, error rate, and communication cost, networks are broadly classified into two types, LAN and WAN. Networks that share some of the characteristics of both LANs and WANs are sometimes referred to as MANs.

The two commonly used network topologies for constructing LANs are the multi-access bus and the ring. A wide area network of computers is constructed by interconnecting computers that are separated by large distances; special hardware devices called packet-switching exchanges are used to connect the computers to the communication channels.

The selection of the actual path to be used to transmit a packet in a WAN is determined by the routing strategy used. The path used to transmit a packet in a WAN can be either fixed statically or changed dynamically based on the network conditions, using suitable algorithms.

Computer networks are implemented using the concept of layered protocols. The OSI model provides a standard for layered protocols for WANs. The seven layers of the OSI model are Physical, Data-Link, Network, Transport, Session, Presentation, and Application.


UNIT – 2: DISTRIBUTED COMPUTING SYSTEM – AN INTRODUCTION

Structure
2.0 Objectives
2.1 Introduction
2.2 Distributed Computing System – An Outline
2.3 Evolution of Distributed Computing Systems
2.4 Distributed Computing System Models
    2.4.1 Minicomputer model
    2.4.2 Workstation model
    2.4.3 Workstation – Server model
    2.4.4 Processor – pool model
    2.4.5 Hybrid model
2.5 Uses of Distributed Computing System
2.6 Distributed Operating System
2.7 Issues in designing a Distributed Operating System
2.8 Introduction to Distributed Computing Environment
    2.8.1 DCE
    2.8.2 DCE Components
    2.8.3 DCE Cells
2.9 Summary

2.0 Objectives:
In this unit we will be learning a new task execution strategy called "Distributed Computing" and the following related terminologies:
- Distributed Computing System (DCS)
- Distributed Computing models
- Distributed Operating System
- Distributed Computing Environment (DCE)
By the end of this unit you will be able to understand this new approach.


2.1 Introduction:
Advancements in microelectronic technology have resulted in the availability of fast, inexpensive processors, and advancements in communication technology have resulted in the availability of cost-effective and highly efficient computer networks. The net result of the advancements in these two technologies is that the price-performance ratio has now changed to favor the use of interconnected, multiple processors in place of a single, high-speed processor. The merging of computer and networking technologies gave birth to distributed computing systems in the late 1970s. Therefore, starting from the late 1970s, a significant amount of research work was carried out in both universities and industries in the area of distributed operating systems. These research activities have provided us with the basic ideas of designing distributed operating systems. Although the field is still immature, with ongoing active research activities, commercial distributed operating systems have already started to emerge. These systems are based on already established basic concepts. This unit deals with these basic concepts and their use in the design and implementation of distributed operating systems. Finally, the unit will give a brief look at a complete system known as the Distributed Computing Environment.

2.2 Distributed Computing System – An Outline:
Computer architectures consisting of interconnected, multiple processors are basically of two types:
1. Tightly Coupled systems: In these systems, there is a single system-wide primary memory (address space) that is shared by all the processors (Fig 2.1(a)). If any processor writes a value to a shared memory location, that value can subsequently be read by all the other processors. Therefore, in these systems, any communication between the processors usually takes place through the shared memory.
2. Loosely Coupled systems: In these systems, the processors do not share memory, and each processor has its own local memory (Fig 2.1(b)). In these systems, all physical communication between the processors is done by passing messages across the network that interconnects the processors.


[Fig 2.1(a): A tightly coupled multiprocessor system – several CPUs connected through interconnection hardware to a single system-wide shared memory.]

[Fig 2.1(b): A loosely coupled multiprocessor system – each CPU has its own local memory, and the CPUs are interconnected by a communication network.]

Let us note some points with respect to tightly coupled and loosely coupled multiprocessor systems. Tightly coupled systems are referred to as parallel processing systems, and loosely coupled systems are referred to as distributed computing systems, or simply distributed systems. Unlike the processors of tightly coupled systems, the processors of distributed computing systems can be located far from each other, so as to cover a wider geographical area. Also, in tightly coupled systems, the number of processors that can be usefully deployed is usually small and limited by the bandwidth of the shared memory. Distributed computing systems, on the other hand, are more freely expandable and can have an almost unlimited number of processors.


In short, a distributed computing system is basically a collection of processors interconnected by a communication network, in which each processor has its own local memory and other peripherals, and the communication between any two processors of the system takes place by message passing over the communication network.

2.3 Evolution of Distributed Computing Systems:
Early computers were very expensive (they cost millions of dollars) and very large in size (they occupied a big room). There were very few computers, and they were available only in the research laboratories of universities and industries. These computers were run from a console by an operator and were not accessible to ordinary users. The programmers would write their programs and submit them to the computer center on some media, such as punched cards, for processing. Before processing a job, the operator would set up the necessary environment (mounting tapes, loading punched cards in a card reader, etc.) for processing the job. The job was then executed and the result, in the form of printed output, was later returned to the programmer. The job setup time was a real problem in early computers and wasted most of the valuable central processing unit (CPU) time. Several new concepts were introduced in the 1950s and 1960s to increase CPU utilization of these computers. Notable among these are batching together of jobs with similar needs before processing them, automatic sequencing of jobs, off-line processing by using the concepts of buffering and spooling, and multiprogramming. Automatic job sequencing with the use of control cards to define the beginning and end of a job improved CPU utilization by eliminating the need for human job sequencing. Off-line processing improved CPU utilization by allowing overlap of CPU and input/output (I/O) operations by executing those two actions on two independent machines (I/O devices are normally several orders of magnitude slower than the CPU). Finally, multiprogramming improved CPU utilization by organizing jobs so that the CPU always had something to execute. However, none of these ideas allowed multiple users to directly interact with a computer system and to share its resources simultaneously. Therefore, execution of interactive jobs that are composed of many short actions, in which the next action depends on the result of a previous action, was a tedious and time-consuming activity. Development and debugging of programs are examples of interactive jobs.


It was not until the early 1970s that computers started to use the concept of time-sharing to overcome this hurdle. Early time-sharing systems had several dumb terminals attached to the main computer. These terminals were placed in a room different from the main computer room. Using these terminals, multiple users could now simultaneously execute interactive jobs and share the resources of the computer system. In a time-sharing system, each user is given the impression that he or she has his or her own computer, because the system switches rapidly from one user's job to the next user's job, executing only a very small part of each job at a time. Although the idea of time-sharing was demonstrated as early as 1960, time-sharing computer systems were not common until the early 1970s because they were difficult and expensive to build. Parallel advancements in hardware technology allowed reduction in the size and increase in the processing speed of computers, causing large-sized computers to be gradually replaced by smaller and cheaper ones that had more processing capability than their predecessors. These systems were called minicomputers.

The advent of time-sharing systems was the first step toward distributed computing systems because it provided us with two important concepts used in distributed computing systems:
- The sharing of computer resources simultaneously by many users
- The accessing of computers from a place different from the main computer room
Initially, the terminals of a time-sharing system were dumb terminals, and all processing was done by the main computer system. Advancements in microprocessor technology in the 1970s allowed the dumb terminals to be replaced by intelligent terminals, so that the concepts of off-line processing and time sharing could be combined to have the advantages of both concepts in a single system. Microprocessor technology continued to advance rapidly, making available in the early 1980s single-user computers, called workstations, that had computing power almost equal to that of minicomputers but were available for only a small fraction of the price of a minicomputer. For example, the first workstation developed at Xerox PARC (called Alto) had a high-resolution monochrome display, a mouse, 128 kilobytes of main memory, a 2.5-megabyte hard disk, and a microprogrammed CPU that executed machine-level instructions at speeds of 2-6 microseconds.


These workstations were then used as terminals in the time-sharing systems. In these time-sharing systems, most of the processing of a user's job could be done at the user's own computer, allowing the main computer to be simultaneously shared by a larger number of users. Shared resources such as files, databases, and software libraries were placed on the main computer. The centralized time-sharing systems described above had a limitation in that the terminals could not be placed very far from the main computer room, since ordinary cables were used to connect the terminals to the main computer. In parallel, however, there were advancements in computer networking technology in the late 1960s and early 1970s, from which two key networking technologies emerged:
- LAN (Local Area Network): The LAN technology allowed several computers located within a building or a campus to be interconnected in such a way that these machines could exchange information with each other at data rates of about 10 megabits per second (Mbps). The first high-speed LAN was the Ethernet, developed at Xerox PARC in 1973.
- WAN (Wide Area Network): The WAN technology allowed computers located far from each other (possibly in different cities, countries, or continents) to be interconnected in such a way that these machines could exchange information with each other at data rates of about 56 kilobits per second (Kbps). The first WAN was the ARPANET (Advanced Research Projects Agency Network), developed by the U.S. Department of Defense in 1969.
The ATM technology: The data rates of networks continued to improve gradually in the 1980s, providing data rates of up to 100 Mbps for LANs and up to 64 Kbps for WANs. Recently (in the early 1990s) there has been another major advancement in networking technology – the ATM (Asynchronous Transfer Mode) technology. The ATM technology is an emerging technology that is still not very well established. It will make very high-speed networking possible, providing data transmission rates of up to 1.2 gigabits per second (Gbps) in both LAN and WAN environments. The availability of such high-bandwidth networks will allow future distributed computing systems to support a completely new class of distributed applications, called multimedia applications, that deal with the handling of a mixture of information, including voice, video, and ordinary data.


The merging of computer and networking technologies gave birth to distributed computing systems in the late 1970s.

2.4 Distributed Computing System Models:
Various models are used for building distributed computing systems. These models can be broadly classified into five categories – minicomputer, workstation, workstation-server, processor-pool, and hybrid. They are briefly described below.

2.4.1 Minicomputer Model:
The minicomputer model is a simple extension of the centralized time-sharing system. As shown in Fig 2.2, a distributed computing system based on this model consists of a few minicomputers (they may be large supercomputers as well) interconnected by a communication network. Each minicomputer usually has multiple users simultaneously logged on to it. For this, several interactive terminals are connected to each minicomputer. Each user is logged on to one specific minicomputer, with remote access to other minicomputers. The network allows a user to access remote resources that are available on some machine other than the one onto which the user is currently logged.

Fig 2.2: A distributed computing system based on the minicomputer model


The minicomputer model may be used when resource sharing (such as sharing of information databases of different types, with each type of database located on a different machine) with remote users is desired. The early ARPAnet is an example of a distributed computing system based on the minicomputer model.
2.4.2 Workstation Model:
As shown in Fig 2.3, a distributed computing system based on the workstation model consists of several workstations interconnected by a communication network. A company's office or a university department may have several workstations scattered throughout a building or campus, each workstation equipped with its own disk and serving as a single-user computer. It has often been found that in such an environment, at any one time (especially at night), a significant proportion of the workstations are idle (not being used), resulting in the waste of large amounts of CPU time. Therefore, the idea of the workstation model is to interconnect all these workstations by a high-speed LAN, so that idle workstations may be used to process jobs of users who are logged onto other workstations and do not have sufficient processing power at their own workstations to get their jobs processed efficiently.

Fig 2.3: A distributed computing system based on the workstation model


In this model, a user logs onto one of the workstations, called his or her home workstation, and submits jobs for execution. When the system finds that the user's workstation does not have sufficient processing power for executing the processes of the submitted jobs efficiently, it transfers one or more of the processes from the user's workstation to some other workstation that is currently idle and gets the process executed there; finally, the result of execution is returned to the user's workstation.

2.4.3 Workstation-Server Model:
The workstation model is a network of personal workstations, each with its own disk and a local file system. A workstation with its own local disk is usually called a diskful workstation, and a workstation without a local disk is called a diskless workstation. With the invention of high-speed networks, diskless workstations have become more popular in network environments than diskful workstations, making the workstation-server model more popular than the workstation model for building distributed computing systems. As shown in Fig 2.4, a distributed computing system based on the workstation-server model consists of a few minicomputers and several workstations (most of which are diskless, but a few of which may be diskful) interconnected by a communication network.

For a number of reasons, such as higher reliability and better scalability, multiple servers are often used for managing the resources of a particular type in a distributed computing system. For example, there may be multiple file servers, each running on a separate minicomputer and cooperating via the network, for managing the files of all the users in the system. Due to this reason, a distinction is often made between the services that are provided to clients and the servers that provide them. That is, a service is an abstract entity that is provided by one or more servers. For example, one or more file servers may be used in a distributed computing system to provide file service to the users.

In this model, a user logs onto a workstation called his or her home workstation. Normal computation activities required by the user's processes are performed at the user's home workstation, but requests for services provided by special servers (such as a file server or a database server) are sent to a server providing that type of service, which performs the user's requested activity and returns the result of the request processing to the user's workstation.


Therefore, in this model, the user's processes need not be migrated to the server machines for getting the work done by those machines.

Fig 2.4: A distributed computing system based on the workstation-server model

As compared to the workstation model, the workstation-server model has several advantages:
1. In general, it is much cheaper to use a few minicomputers equipped with large, fast disks that are accessed over the network than a large number of diskful workstations, with each workstation having a small, slow disk.
2. Diskless workstations are also preferred to diskful workstations from a system maintenance point of view. Backup and hardware maintenance are easier to perform with a few large disks than with many small disks scattered all over a building or campus. Furthermore, installing new releases of software (such as a file server with new functionalities) is easier when the software is to be installed on a few file server machines than on every workstation.


3. In the workstation-server model, since the file servers manage all files, users have the flexibility to use any workstation and access the files in the same manner, irrespective of which workstation the user is currently logged on to. Note that this is not true with the workstation model, in which each workstation has its own local file system, because different mechanisms are needed to access local and remote files.

4. In the workstation-server model, the request-response protocol is mainly used to access the services of the server machines. Therefore, unlike the workstation model, this model does not need a process migration facility, which is difficult to implement. The request-response protocol is known as the client-server model of communication. In this model, a client process (which in this case resides on a workstation) sends a request to a server process (which in this case resides on a minicomputer) for getting some service, such as reading a block of a file. The server executes the request and sends back a reply to the client that contains the result of request processing. A minimal sketch of this request-response exchange is given after this list.

5. A user has guaranteed response time because workstations are not used for executing remote processes. However, the model does not utilize the processing capability of idle workstations.
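The request-response exchange mentioned in point 4 can be sketched in a few lines of code. The following minimal Python example is only an illustration (the file name, block size and port number are invented for the sketch, not taken from any particular system): a client process sends a request to read one block of a file, and a server process executes the request and returns the block as its reply.

    import socket
    import threading

    FILES = {"notes.txt": b"block0block1block2"}        # hypothetical file contents
    BLOCK_SIZE = 6
    ready = threading.Event()

    def server(port):
        # Server process (minicomputer side): executes the request, sends the reply.
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        ready.set()                                     # tell the client we are listening
        conn, _ = srv.accept()
        _, name, block = conn.recv(1024).decode().split()   # e.g. "READ notes.txt 1"
        start = int(block) * BLOCK_SIZE
        conn.sendall(FILES.get(name, b"")[start:start + BLOCK_SIZE])
        conn.close()
        srv.close()

    def client(port):
        # Client process (workstation side): sends the request, waits for the reply.
        ready.wait()
        cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        cli.connect(("127.0.0.1", port))
        cli.sendall(b"READ notes.txt 1")
        print("client received:", cli.recv(1024))       # b'block1'
        cli.close()

    if __name__ == "__main__":
        t = threading.Thread(target=server, args=(50007,))
        t.start()
        client(50007)
        t.join()

In a real workstation-server system the server would, of course, loop over many requests from many clients; the sketch keeps exactly one request and one reply to mirror the protocol described above.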

2.4.4 Processor-Pool Model: The processor-pool model is based on the observation that most of the time a user does not need any computing power, but once in a while he or she may need a very large amount of computing power for a short time. Therefore, unlike the workstation-server model, in which a processor is allocated to each user, in the processor-pool model the processors are pooled together to be shared by the users as needed. The pool of processors consists of a large number of microcomputers and minicomputers attached to the network. Each processor in the pool has its own memory to load and run a system program or an application program of the distributed computing system. As shown in Fig. 2.5, in the pure processor-pool model the processors in the pool have no terminals attached directly to them, and users access the system from terminals that are attached to the network via special devices. A special server (called a run server) manages and allocates the processors in the pool to different users on a demand basis.

[Figure: terminals attached to a communication network, a run server, a file server, and a pool of processors]
Fig. 2.5 The processor-pool model.
When a user submits a job for computation, the run server temporarily assigns an appropriate number of processors to his or her job. For example, if the user's computation job is the compilation of a program having n segments, in which each of the segments can be compiled independently to produce separate relocatable object files, n processors from the pool can be allocated to this job to compile all the n segments in parallel. When the computation is completed, the processors are returned to the pool for use by other users. In the processor-pool model there is no concept of a home machine. That is, a user does not log onto a particular machine but onto the system as a whole. This is in contrast to the other models, in which each user has a home machine (e.g. a workstation or minicomputer) onto which he or she logs and runs most of his or her programs there by default. Amoeba and the Cambridge Distributed Computing System are examples of distributed computing systems based on the processor-pool model.
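The role of the run server described above can be illustrated with a small allocator sketch. The following Python fragment is purely illustrative (the class name, processor names and job size are invented): it hands out n pool processors to a compilation job with n independently compilable segments and reclaims them when the job completes.

    class RunServer:
        """Toy allocator for a pool of processors (illustrative only)."""

        def __init__(self, processors):
            self.free = list(processors)          # processors currently idle
            self.busy = {}                        # job id -> processors in use

        def allocate(self, job_id, n):
            if n > len(self.free):
                raise RuntimeError("not enough free processors in the pool")
            assigned = [self.free.pop() for _ in range(n)]
            self.busy[job_id] = assigned
            return assigned

        def release(self, job_id):
            self.free.extend(self.busy.pop(job_id, []))


    pool = RunServer(["p%d" % i for i in range(10)])

    # A compilation job with 4 independently compilable segments gets 4 processors,
    # one per segment; when the job finishes they go back to the pool.
    segments = ["seg1.c", "seg2.c", "seg3.c", "seg4.c"]
    cpus = pool.allocate("compile-job", len(segments))
    for seg, cpu in zip(segments, cpus):
        print("compiling", seg, "on", cpu)
    pool.release("compile-job")
    print("free processors after the job:", len(pool.free))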
2.4.5 Hybrid Model: Out of the four models described above, the workstation-server model is the most widely used model for building distributed computing systems. This is because a large number of computer users only perform simple interactive tasks such as editing jobs, sending electronic mail, and executing small programs. The workstation-server model is ideal for such simple usage. However, in a workstation environment that has groups of users who often perform jobs needing massive computation, the processor-pool model is more attractive and suitable. To combine the advantages of both the workstation-server and processor-pool models, a hybrid model may be used to build a distributed computing system. The hybrid model is based on the workstation-server model but with the addition of a pool of processors. The processors in the pool can be allocated dynamically for computations that are too large for workstations or that require several computers concurrently for efficient execution. In addition to efficient execution of computation-intensive jobs, the hybrid model gives guaranteed response to interactive jobs by allowing them to be processed on the local workstations of the users. However, the hybrid model is more expensive to implement than the workstation-server model or the processor-pool model.
From the models of distributed computing systems presented above, it is obvious that distributed computing systems are much more complex and difficult to build than traditional centralized systems (those consisting of a single CPU, its memory, peripherals, and one or more terminals). The increased complexity is mainly due to the fact that, in addition to being capable of effectively using and managing a very large number of distributed resources, the system software of a distributed computing system should also be capable of handling communication and security problems that are very different from those of centralized systems. For example, the performance and reliability of a distributed computing system depend to a great extent on the performance and reliability of the underlying communication network. Special software is usually needed to handle loss of messages during transmission across the network or to prevent overloading of the
network that degrades the performance and responsiveness to the users. Similarly, special software security measures are needed to protect the widely distributed shared resources and services against intentional or accidental violation of access control and privacy constraints.
Despite the increased complexity and the difficulty of building distributed computing systems, the advantages of installing and using them outweigh their disadvantages. The technical needs, the economic pressures, and the major advantages that have led to the emergence and popularity of distributed computing systems are described here.
Inherently Distributed Applications: Distributed computing systems come into existence in some very natural ways. For example, several applications are inherently distributed in nature and require a distributed computing system for their realization. For instance, in an employee database of a nationwide organization, the data pertaining to a particular employee are generated at the employee's branch office, and in addition to the global need to view the entire database, there is a local need for frequent and immediate access to locally generated data at each branch office. Such applications require that some processing power be available at the many distributed locations for collecting, preprocessing, and accessing data, resulting in the need for distributed computing systems. Other examples of inherently distributed applications are a computerized worldwide airline reservation system, a computerized banking system in which a customer can deposit/withdraw money from his or her account at any branch of the bank, and a factory automation system controlling robots and machines all along an assembly line.
Information Sharing among Distributed Users: Efficient person-to-person communication by sharing information over great distances is one more advantage. In a distributed computing system, the users working at other nodes of the system can easily and efficiently share information generated by one of the users. This facility may be useful in many ways. For example, a project can be performed by two or more users who are geographically far off from each other but whose computers are parts of the same distributed computing system.

The use of distributed computing systems by a group of users to work cooperatively is known as computer-supported cooperative working (CSCW), or groupware.
Resource Sharing: Information is not the only thing that can be shared in a distributed computing system. Sharing of software resources such as software libraries and databases, as well as hardware resources such as printers, hard disks, and plotters, can also be done in a very effective way among all the computers and users of a single distributed computing system.
Better Price-Performance Ratio: This is one of the most important reasons for the growing popularity of distributed computing systems. With the rapidly increasing power and reduction in the price of microprocessors, combined with the increasing speed of communication networks, distributed computing systems potentially have a much better price-performance ratio than a single large centralized system. Another reason for distributed computing systems to be more cost-effective than centralized systems is that they facilitate resource sharing among multiple computers.
Shorter Response Times and Higher Throughput: Due to the multiplicity of processors, distributed computing systems are expected to have better performance than single-processor centralized systems. The two most commonly used performance metrics are response time and throughput of user processes. That is, the multiple processors of a distributed computing system can be utilized properly for providing shorter response times and higher throughput than a single-processor centralized system. Another method often used in distributed computing systems for achieving better overall performance is to distribute the load more evenly among the multiple processors by moving jobs from currently overloaded processors to lightly loaded ones.
Higher Reliability: Reliability refers to the degree of tolerance against errors and component failures in a system. A reliable system prevents loss of information even in the event of component failures. The multiplicity of storage devices and processors in a distributed computing system allows the maintenance of multiple copies of critical information within the system. With this approach, if one of the processors fails, the computation can be successfully completed at another processor, and if one of the storage devices fails, the information can still be used from another storage device.

Availability: An important aspect of reliability is availability, which refers to the fraction of time for which a system is available for use. In comparison to a centralized system, a distributed computing system also enjoys the advantage of increased availability.
Extensibility and Incremental Growth: Another major advantage of distributed computing systems is that they are capable of incremental growth. That is, it is possible to gradually extend the power and functionality of a distributed computing system by simply adding additional resources (both hardware and software) to the system as and when the need arises. For example, additional processors can be easily added to the system to handle the increased workload of an organization that might have resulted from its expansion. Extensibility is also easier in a distributed computing system because the addition of new resources to an existing system can be performed without significant disruption of the normal functioning of the system. Properly designed distributed computing systems that have the property of extensibility and incremental growth are called open distributed systems.
Better Flexibility in Meeting Users' Needs: Different types of computers are usually more suitable for performing different types of computations. For example, computers with ordinary power are suitable for ordinary data processing jobs, whereas high-performance computers are more suitable for complex mathematical computations. In a centralized system, the users have to perform all types of computations on the only available computer.
2.5 Distributed Operating System (DOS):
Definition: It is defined as a program that controls the resources of a computer system and provides its users with an interface or virtual machine that is more convenient to use than the bare machine. According to this definition, the two primary tasks of an operating system are as follows:
o To present users with a virtual machine that is easier to program than the underlying hardware.

o To manage the various resources of the system. This involves performing such tasks as keeping track of who is using which resource, granting resource requests, accounting for resource usage, and mediating conflicting requests from different programs and users.
The Classification: The operating systems commonly used for distributed computing systems can be broadly classified into two types: network operating systems and distributed operating systems. The three most important features that differentiate them are described below.
System Image: The most important feature used to differentiate between the two types of operating systems is the image of the distributed computing system from the point of view of its users. In the case of a network operating system, the users view the distributed computing system as a collection of distinct machines connected by a communication subsystem. That is, the users are aware of the fact that multiple computers are being used. On the other hand, a distributed operating system hides the existence of multiple computers and provides a single-system image to its users. That is, it makes a collection of networked machines act as a virtual uniprocessor.
Autonomy: A network operating system is built on a set of existing centralized operating systems and handles the interfacing and coordination of remote operations and communications between these operating systems. That is, in the case of a network operating system, each computer of the distributed computing system has its own local operating system (the operating systems of different computers may be the same or different), and there is essentially no coordination at all among the computers except for the rule that when two processes of different computers communicate with each other, they must use a mutually agreed-on communication protocol. Each computer functions independently of other computers in the sense that each one makes independent decisions about the creation and termination of its own processes and the management of local resources. Notice that, due to the possibility of differences in local operating systems, the system calls for different computers of the same distributed computing system may be different in this case.

On the other hand, with a distributed operating system, there is a single systemwide operating system, and each computer of the distributed computing system runs a part of this global operating system. The distributed operating system tightly interweaves all the computers of the distributed computing system in the sense that they work in close cooperation with each other for the efficient and effective utilization of the various resources of the system. That is, processes and several resources are managed globally (some resources are managed locally). Moreover, there is a single set of globally valid system calls available on all computers of the distributed computing system. In short, it can be said that the degree of autonomy of each machine of a distributed computing system that uses a network operating system is considerably high as compared to that of machines of a distributed computing system that uses a distributed operating system.
Fault tolerance capability: A network operating system provides little or no fault tolerance capability, in the sense that if 10% of the machines of the entire distributed computing system are down at any moment, at least 10% of the users are unable to continue with their work. On the other hand, with a distributed operating system, most of the users are normally unaffected by the failed machines and can continue to perform their work normally, with only a 10% loss in performance of the entire distributed computing system. Therefore, the fault tolerance capability of a distributed operating system is usually very high as compared to that of a network operating system.
Some Important Points to be Noted with respect to DOS: A distributed operating system is one that looks to its users like an ordinary centralized operating system but runs on multiple, independent central processing units (CPUs). The key concept here is transparency. In other words, the multiple processors should be invisible (transparent) to the user. Another way of expressing the same idea is to say that the user views the system as a "virtual uniprocessor", not as a collection of distinct machines. A distributed computing system that uses a network operating system is usually referred to as a network system, whereas one that uses a distributed

operating system is usually referred to as a true distributed system (or simply a distributed system).
2.7 Issues in Designing a Distributed Operating System: In general, designing a distributed operating system is more difficult than designing a centralized operating system for several reasons. In the design of a centralized operating system, it is assumed that the operating system has access to complete and accurate information about the environment in which it is functioning. In a distributed system, the resources are physically separated, there is no common clock among the multiple processors, delivery of messages is delayed, and messages could even be lost. Due to all these reasons, a distributed operating system does not have up-to-date, consistent knowledge about the state of the various components of the underlying distributed system. Despite these complexities and difficulties, a distributed operating system must be designed to provide all the advantages of a distributed system to its users. That is, the users should be able to view a distributed system as a virtual centralized system that is flexible, efficient, reliable, secure and easy to use. To meet this requirement, the designers of a distributed operating system must deal with several design issues. Some of the key design issues are described below.

Transparency: We saw that one of the main goals of a distributed operating system is to make the existence of multiple computers invisible (transparent) and provide a single-system image to its users. That is, a distributed operating system must be designed in such a way that a collection of distinct machines connected by a communication subsystem appears to its users as a virtual uniprocessor. The eight forms of transparency identified by the International Standards Organization's Reference Model for Open Distributed Processing [ISO 1992] are access transparency, location transparency, replication transparency, failure transparency, migration transparency, concurrency transparency, performance transparency, and scaling transparency. These transparency aspects are described below.
o Access Transparency: It means that users should not need to, or be able to, recognize whether a resource (hardware or software) is remote or local. This implies that the distributed operating system should allow users to

access remote resources in the same way as local resources. That is, the user interface, which takes the form of a set of system calls, should not distinguish between local and remote resources, and it should be the responsibility of the distributed operating system to locate the resources and to arrange for servicing user requests in a user-transparent manner.
o Location Transparency: The two main aspects of location transparency are as follows:
Name transparency: This refers to the fact that the name of a resource (hardware or software) should not reveal any hint as to the physical location of the resource. That is, the name of a resource should be independent of the physical connectivity or topology of the system or the current location of the resource. Furthermore, resources that are capable of being moved from one node to another in a distributed system (such as a file) must be allowed to move without having their names changed. Therefore, resource names must be unique systemwide.
User mobility: This refers to the fact that no matter which machine a user is logged onto, he or she should be able to access a resource with the same name. That is, the user should not be required to use different names to access the same resource from two different nodes of the system.
o Replication Transparency: For better performance and reliability, almost all distributed operating systems have the provision to create replicas (additional copies) of files and other resources on different nodes of the distributed system. In these systems, both the existence of multiple copies of a replicated resource and the replication activity should be transparent to the users. That is, two important issues related to replication transparency are naming of replicas and replication control. It is the responsibility of the system to name the various copies of a resource and to map a user-supplied name of the resource to an appropriate replica of the resource.

o Failure Transparency: Failure transparency deals with masking from the users partial failures in the system, such as a communication link failure, a machine failure, or a storage device crash. A distributed operating system having the failure transparency property will continue to function, perhaps in a degraded form, in the face of partial failures. However, in this type of design, care should be taken to ensure that the cooperation among multiple servers does not add too much overhead to the system.
o Migration Transparency: For better performance, reliability and security reasons, an object that is capable of being moved (such as a process or a file) is often migrated from one node to another in a distributed system. The aim of migration transparency is to ensure that the movement of the object is handled automatically by the system in a user-transparent manner. Three important issues in achieving this goal are as follows:
Migration decisions, such as which object is to be moved from where to where, should be made automatically by the system.
Migration of an object from one node to another should not require any change in its name.
When the migrating object is a process, the inter-process communication mechanism should ensure that a message sent to the migrating process reaches it without the need for the sender process to resend it if the receiver process moves to another node before the message is received.
o Concurrency Transparency: In a distributed system, multiple users who are spatially separated use the system concurrently. In such a situation, it is economical to share the system resources (hardware or software) among the concurrently executing user processes. However, since the number of available resources in a computing system is restricted, one user process must necessarily influence the actions of other concurrently executing user processes, as it competes for resources. For providing concurrency

transparency, the resource sharing mechanisms of the distributed operating system must have the following four properties:
An event-ordering property ensures that all access requests to various system resources are properly ordered to provide a consistent view to all users of the system.
A mutual-exclusion property ensures that at any time at most one process accesses a shared resource, which must not be used simultaneously by multiple processes if program operation is to be correct.
A no-starvation property ensures that if every process that is granted a resource, which must not be used simultaneously by multiple processes, eventually releases it, every request for that resource is eventually granted.
A no-deadlock property ensures that a situation will never occur in which competing processes prevent their mutual progress even though no single one requests more resources than are available in the system.
o Performance Transparency: The aim of performance transparency is to allow the system to be automatically reconfigured to improve performance, as loads vary dynamically in the system. As far as practicable, a situation in which one processor of the system is overloaded with jobs while another processor is idle should not be allowed to occur. That is, the processing capability of the system should be uniformly distributed among the currently available jobs in the system. This requirement calls for the support of intelligent resource allocation and process migration facilities in distributed operating systems.
o Scaling Transparency: The aim of scaling transparency is to allow the system to expand in scale without disrupting the activities of the users. This requirement calls for an open-system architecture and the use of scalable algorithms for designing the distributed operating system components. On

the other hand, every component of a distributed operating system must itself use scalable algorithms.

Reliability: In general, distributed systems are expected to be more reliable than centralized systems due to the existence of multiple instances of resources. However, the existence of multiple instances of resources alone cannot increase the system's reliability. Rather, the distributed operating system, which manages these resources, must be designed properly to increase the system's reliability by taking full advantage of this characteristic feature of a distributed system. For higher reliability, the fault-handling mechanisms of a distributed operating system must be designed properly to avoid faults, to tolerate faults, and to detect and recover from faults. Commonly used methods for dealing with these issues are briefly described here.
o Fault Avoidance: Fault avoidance deals with designing the components of the system in such a way that the occurrence of faults is minimized. Conservative design practices, such as using high-reliability components, are often employed for improving the system's reliability based on the idea of fault avoidance. Although a distributed operating system often has little or no role to play in improving the fault avoidance capability of a hardware component, the designers of the various software components of the distributed operating system must test them thoroughly to make these components highly reliable.
o Fault Tolerance: Fault tolerance is the ability of a system to continue functioning in the event of partial system failure. The performance of the system might be degraded due to partial failure, but otherwise the system functions properly. Some of the important concepts that may be used to improve the fault tolerance ability of a distributed operating system are as follows:
Redundancy techniques: The basic idea behind redundancy techniques is to avoid single points of failure by replicating critical hardware and software components, so that if one of them fails, the

others can be used to continue. Obviously, having two or more copies of a critical component makes it possible, at least in principle, to continue operations in spite of occasional partial failures.
Distributed control: For better reliability, many of the particular algorithms or protocols used in a distributed operating system must employ a distributed control mechanism to avoid single points of failure.
o Fault Detection and Recovery: The fault detection and recovery method of improving reliability deals with the use of hardware and software mechanisms to determine the occurrence of a failure and then to correct the system to a state acceptable for continued operation. Some of the commonly used techniques for implementing this method in a distributed operating system are as follows:
Atomic transactions: An atomic transaction (or just transaction for short) is a computation consisting of a collection of operations that take place indivisibly in the presence of failures and concurrent computations. That is, either all of the operations are performed successfully or none of their effects prevails, and other processes executing concurrently cannot modify or observe intermediate states of the computation. Transactions help to preserve the consistency of a set of shared data objects (e.g. files) in the face of failures and concurrent access. They make crash recovery much easier, because a transaction can only end in two states: either all the operations of the transaction are performed or none of the operations of the transaction is performed.
Stateless servers: The client-server model is frequently used in distributed systems to service user requests. In this model, a server may be implemented by using any one of the following two service

paradigms - stateful or stateless. The two paradigms are distinguished by one aspect of the client-server relationship: whether or not the history of the serviced requests between a client and a server affects the execution of the next service request. The stateful approach does depend on the history of the serviced requests, but the stateless approach does not depend on it.
Acknowledgements and timeout-based retransmissions of messages: In a distributed system, events such as a node crash or a communication link failure may interrupt a communication that was in progress between two processes, resulting in the loss of a message. Therefore, a reliable inter-process communication mechanism must have ways to detect lost messages so that they can be retransmitted. Handling of lost messages usually involves returning an acknowledgement for every message received; if the sender does not receive any acknowledgement for a message within a fixed timeout period, it retransmits the message. Duplicate messages may be sent in the event of failures or because of timeouts.
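As a rough illustration of the acknowledgement-and-timeout mechanism just described, the following small Python simulation (the loss probability, retry limit and message text are invented for the example) retransmits a message until an acknowledgement is received or a retry limit is exhausted; because the acknowledgement itself can be lost, the receiver may see the same message more than once, which is exactly why duplicate messages have to be tolerated.

    import random

    received = []                               # messages actually seen by the receiver

    def unreliable_send(message, loss=0.4):
        """Deliver the message, then (maybe) deliver the acknowledgement back."""
        if random.random() < loss:
            return None                         # the message itself was lost
        received.append(message)                # receiver got it (possibly again)
        if random.random() < loss:
            return None                         # the acknowledgement was lost
        return "ACK"

    def send_reliably(message, max_retries=5):
        for attempt in range(1, max_retries + 1):
            if unreliable_send(message) == "ACK":
                print("acknowledged after", attempt, "attempt(s)")
                return True
            print("attempt", attempt, ": timeout, retransmitting")
        return False                            # give up after max_retries

    send_reliably("debit account 17 by 100")
    print("copies seen by the receiver:", received.count("debit account 17 by 100"))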

Flexibility: Another important issue in the design of distributed operating systems is flexibility. The design of a distributed operating system should be flexible for the following reasons:
o Ease of modification: From the experience of system designers, it has been found that some parts of the design often need to be replaced/modified either because some bug is detected in the design or because the design is no longer suitable for the changed system environment or new user requirements. Therefore, it should be easy to incorporate changes in the system in a user-transparent manner or with minimum interruption caused to the users.
o Ease of enhancement: In every system, new functionalities have to be added from time to time to make it more powerful and easy to use. Therefore, it should be easy to add new services to the system.
The most important design factor that influences the flexibility of a distributed operating system is the model used for designing its kernel. The kernel of an

operating system is its central controlling part that provides basic system facilities. It operates in a separate address space that a user cannot replace or modify. The two commonly used models for kernel design in distributed operating systems are the monolithic kernel and the microkernel.
o In the monolithic kernel model, the kernel provides most operating system services, such as process management and inter-process communication. As a result, the kernel has a large, monolithic structure. Many distributed operating systems that are extensions or imitations of the UNIX operating system use the monolithic kernel model. This is mainly because UNIX itself has a large, monolithic kernel.
o In the microkernel model, the main goal is to keep the kernel as small as possible. Therefore, in this model, the kernel is a very small nucleus of software that provides only the minimal facilities necessary for implementing additional operating system services. The only services provided by the kernel in this model are inter-process communication, low-level device management and some memory management. All other operating system services, such as file management, name management, additional process and memory management activities, and much system call handling, are implemented as user-level server processes.
[Figure: nodes 1 to n, each running user applications on top of a monolithic kernel (which includes most OS services), connected by network hardware]
Fig. 2.6(a) The monolithic kernel model.

Performance: If a distributed system is to be used, its performance must be at least as good as that of a centralized system. That is, when a particular application is run

on a distributed system, its overall performance should be better than or at least equal to that of running the same application on a single-processor system. However, to achieve this goal, it is important that the various components of the operating system of a distributed system be designed properly; otherwise, the overall performance of the distributed system may turn out to be worse than that of a centralized system.
[Figure: nodes 1 to n, each running user applications and server/manager modules on top of a microkernel (which has only minimal facilities), connected by network hardware]
Fig. 2.6(b) The microkernel model.

Scalability: Scalability refers to the capability of a system to adapt to increased service load. It is inevitable that a distributed system will grow with time, since it is very common to add new machines or an entire sub-network to the system to take care of increased workload or organizational changes in a company. Therefore, a distributed operating system should be designed to easily cope with the growth of nodes and users in the system. That is, such growth should not cause serious disruption of service or significant loss of performance to users.
Heterogeneity: A heterogeneous distributed system consists of interconnected sets of dissimilar hardware or software systems. Because of this diversity, designing heterogeneous distributed systems is far more difficult than designing homogeneous distributed systems, in which each system is based on the same, or

closely related, hardware and software. However, as a consequence of large scale, heterogeneity is often inevitable in distributed systems. Furthermore, many users often prefer heterogeneity because heterogeneous distributed systems provide their users with the flexibility of different computer platforms for different applications.
Security: In order that users can trust the system and rely on it, the various resources of a computer system must be protected against destruction and unauthorized access. Enforcing security in a distributed system is more difficult than in a centralized system because of the lack of a single point of control and the use of insecure networks for data communication. In a centralized system, all users are authenticated by the system at login time, and the system can easily check whether a user is authorized to perform the requested operation on an accessed resource. In a distributed system, however, since the client-server model is often used for requesting and providing services, when a client sends a request message to a server, the server must have some way of knowing who the client is. This is not as simple as it might appear, because any client identification field in the message cannot be trusted. This is because an intruder (a person or program trying to obtain unauthorized access to system resources) may pretend to be an authorized client or may change the message contents during transmission. Therefore, as compared to a centralized system, enforcement of security in a distributed system has the following additional requirements:
1. It should be possible for the sender of a message to know that the intended receiver received the message.
2. It should be possible for the receiver of a message to know that the message was sent by the genuine sender.
3. It should be possible for both the sender and receiver of a message to be guaranteed that the contents of the message were not changed while it was in transfer.

Emulation of Existing Operating Systems: For commercial success, it is important that a newly designed distributed operating system be able to emulate existing popular operating systems such as UNIX. With this property, new

software can be written using the system call interface of the new operating system to take full advantage of its special features of distribution, but a vast amount of already existing old software can also be run on the same system without the need to rewrite it. Therefore, moving to the new distributed operating system will allow both types of software to be run side by side.
2.8 Introduction to the Distributed Computing Environment (DCE): The Open Software Foundation (OSF), a consortium of computer manufacturers including IBM, DEC, and Hewlett-Packard, defined a vendor-independent distributed computing environment (DCE).
2.8.1 DCE: It is not an operating system, nor is it an application. To a certain extent, it is an integrated set of services and tools that can be installed as a coherent environment on top of existing operating systems and serve as a platform for building and running distributed applications. A primary goal of DCE is vendor independence. It runs on many different kinds of computers, operating systems, and networks produced by different vendors. For example, some operating systems to which DCE can be easily ported include OSF/1, AIX, DOMAIN OS, ULTRIX, HP-UX, SINIX, SunOS, UNIX System V, VMS, WINDOWS and OS/2. On the other hand, it can be used with any network hardware and transport software, including TCP/IP and X.25, as well as other similar products.
As shown in the figure below, DCE is middleware layered between the DCE applications layer and the operating system and networking layer. The basic idea is to take a collection of existing machines (possibly from different vendors), interconnect them by a communication network, add the DCE software platform on top of the native operating systems of the machines, and then be able to build and run distributed applications. Each machine has its own local operating system, which may be different from that of other machines. The DCE software layer on top of the operating system and networking layer hides the differences between machines by automatically performing data-type conversions when necessary.
2.8.2 DCE Components: DCE is a mix of various technologies developed independently and nicely integrated by OSF. Each of these technologies forms a component of DCE. The main components of DCE are as follows:

[Figure: three layers, with DCE applications on top, DCE software in the middle, and operating systems and networking at the bottom]

Fig. 2.7 Position of DCE software in a DCE-based distributed system.
1. Threads package: It provides a simple programming model for building concurrent applications. It includes operations to create and control multiple threads of execution in a single process and to synchronize access to global data within an application.
2. Remote Procedure Call (RPC) facility: It provides programmers with a number of powerful tools necessary to build client-server applications. In fact, the DCE RPC facility is the basis for all communication in DCE because the programming model underlying all of DCE is the client-server model.
3. Distributed Time Service (DTS): It closely synchronizes the clocks of all the computers in the system. It also permits the use of time values from external time sources, such as those of the U.S. National Institute of Standards and Technology (NIST), to synchronize the clocks of the computers in the system with external time. This facility can also be used to synchronize the clocks of the computers of one distributed environment with the clocks of the computers of another distributed environment.
4. Name Services: The name services of DCE include the Cell Directory Service (CDS), the Global Directory Service (GDS), and the Global Directory Agent (GDA). These services allow resources such as servers, files, devices, and so on, to be uniquely named and accessed in a location-transparent manner.
5. Security Service: It provides the tools needed for authentication and authorization to protect system resources against illegitimate access.
6. Distributed File Service (DFS): It provides a system-wide file system that has such characteristics as location transparency, high performance, and high

availability. A unique feature of DCE DFS is that it can also provide file services to clients of other file systems.
2.8.3 DCE Cells: The DCE system is highly scalable in the sense that a system running DCE can have thousands of computers and millions of users spread over a worldwide geographic area. To accommodate such large systems, DCE uses the concept of cells. This concept helps break down a large system into smaller, manageable units called cells. In a DCE system, a cell is a group of users, machines, or other resources that typically have a common purpose and share common DCE services. The minimum cell configuration requires a cell directory server, a security server, a distributed time server, and one or more client machines. Each DCE client machine has client processes for the security service, cell directory service, distributed time service, RPC facility, and threads facility. A DCE client machine may also have a process for the distributed file service if the cell configuration has a DCE distributed file server. Due to the use of the method of intersection for clock synchronization, it is recommended that each cell in a DCE system have at least three distributed time servers.
Check Your Progress - 1: Answer the following questions in one or two sentences.
a) Define a distributed computing system.
b) List the different computing system models.
c) What is a DCE?
d) Define transparency in a DCS.
e) List the various forms of transparency expected of a DCS.
Check Your Progress - 2: Answer the following questions.
1. Explain the different computing system models.
2. Why are distributed computing systems gaining popularity? Explain.
3. Discuss the various design issues of a distributed operating system.
4. Explain the Distributed Computing Environment.
5. Discuss the relative advantages and disadvantages of the various commonly used models for configuring distributed computing systems.

2.9 Summary: Let us sum up the different concepts we have studied till here. A distributed computing system is a collection of processors interconnected by a communication network in which each processor has its own local memory and other peripherals, and communication between any two processors of the system takes place by message passing over the communication network. The existing models for distributed computing systems can be broadly classified into five categories: minicomputer, workstation, workstation-server, processor-pool and hybrid. Distributed computing systems are much more complex and difficult to build than traditional centralized systems. Despite the increased complexity and the difficulty of building them, the installation and use of distributed computing systems are rapidly increasing. This is mainly because the advantages of distributed computing systems outweigh their disadvantages. The main advantages of distributed computing systems are (a) suitability for inherently distributed applications, (b) sharing of information among distributed users, (c) sharing of resources, (d) better price-performance ratio, (e) shorter response times and higher throughput, (f) higher reliability, (g) extensibility and incremental growth, and (h) better flexibility in meeting users' needs. The operating systems commonly used for distributed computing systems can be broadly classified into two types: network operating systems and distributed operating systems. As compared to a network operating system, a distributed operating system has better transparency and fault tolerance capability and provides the image of a virtual uniprocessor to the users. The main issues involved in the design of a distributed operating system are transparency, reliability, flexibility, performance, scalability, heterogeneity, security and emulation of existing operating systems.

UNIT - 3 DISTRIBUTED DATABASES - AN OVERVIEW
Structure
3.0 Objectives
3.1 Introduction
3.2 Review of Databases
3.2.1 The Relational Model
3.2.2 The Relational Operations
3.3 Distributed Processing - An Introduction
3.4 Features of Distributed Versus Centralized Databases
3.5 Uses of Distributed Databases
3.6 Distributed Database Management Systems
3.7 Summary
3.0 Objectives: The main objectives of this unit are:
Familiarization with basic database concepts
Distributed processing
Distributed databases
Distributed database management systems
3.1 Introduction: In recent years databases have become an important area of information processing, and it is easy to predict that their importance will rapidly grow. There are both organizational and technological reasons for this trend. Distributed databases eliminate many of the problems of centralized databases and fit more naturally into the decentralized structures of many organizations. A very vague definition of a distributed database is that "it is a collection of data which belong logically to the same system but are spread over the sites of a computer network". To understand this new way of data processing, this unit first introduces the basic concepts of databases. Later, an overview of distributed data processing is discussed. In the next section, we differentiate the conventional centralized

database system and the distributed database system. Finally, Distributed Database Management Systems (DDBMS) are discussed.
3.2 Review of Databases: In this section we discuss some basic concepts of databases, which will be very much necessary for understanding the further topics of this unit. A database model known as the Relational model is used. Basically, the relational model allows the use of powerful, set-oriented, associative expressions instead of the one-record-at-a-time primitives of more procedural models like the Codasyl model. We use in this course two different data manipulation languages: Relational algebra and SQL. SQL, which is a "user-friendly" language, is used for dealing with the problems of writing application programs for distributed databases. Relational algebra is instead used for describing and manipulating access strategies to distributed databases.
3.2.1 The Relational Model: Here some basic terminologies are defined. They are as follows:
o Relations: These are the tables that are used for storing the data in relational databases.
o Attributes: Each relation has a (fixed) number of columns, called attributes.
o Tuples: These are the dynamic, time-varying number of rows.
o The number of attributes of a relation is called its grade.
o The number of tuples is called its cardinality.
o The set of possible values of a given attribute is called its domain.
An example for demonstrating the above terminologies:

EMPNUM  NAME     AGE  DEPTNUM
2       Vasu     25   1
4       Varma    34   2
11      Rama     12   1
12      Krishna  22   3
14      Hari     30   1

(a) Relation EMP

EMPNUM  AGE  DEPTNUM  NAME
2       25   1        Vasu
4       34   2        Varma
11      12   1        Rama
12      22   3        Krishna
14      30   1        Hari

(b) Relation EMP'
Fig. 3.1 Example of relations
Figure 3.1(a) shows a relation EMP (employee) consisting of four attributes: EMPNUM, NAME, AGE and DEPTNUM. The relation has five tuples; for example, (2, Vasu, 25, 1) is a tuple of the EMP relation. The grade of relation EMP is 4 and the cardinality is 5.
o Relation Schema: The relation name and the names of the attributes appearing in it are called the relation schema of the relation; for example, EMP(EMPNUM, NAME, AGE, DEPTNUM) is the relation schema of relation EMP in the above example.
o Strictly speaking, relations are treated as sets of tuples. So, the following points are to be noted:
1. There cannot be two identical tuples in the same relation.
2. There is no defined order of the tuples of a relation.
o The keys of a relation are the subsets of the attributes of a relation schema whose values are unique within the relation, and thus can be used to uniquely identify the tuples of the relation. Given that relations are sets of tuples, a key must exist; at least, the set of all attributes of the relation constitutes a key. Thus, in the above example of Figure 3.1, EMPNUM is an appropriate key of EMP. In a relation there could also be more than one key, such as the pair of attributes EMPNUM, DEPTNUM. Typically, one of them is selected as the primary key.
o The relational algebra is a collection of operations on relations, each of which takes one or two relations as operands and produces one relation as result.

3.2.2 The Relational Operations: The operations of relational algebra can be composed into arbitrarily complex expressions. Expressions allow specifying the manipulations that are required on relations in order to retrieve information from them.
o The different operations that are possible on relations: Here five basic operations are defined: Selection, Projection, Cartesian product, Union, and Difference. From these operations, some other operations are derived, such as Intersection, Division, Join and Semi-join. We have highlighted those operations which have an important application in distributed databases, such as Join and Semi-join. The meaning of each operation is described referring to an example.
Unary operations take only one relation as operand; they include selection and projection.
The Selection SL F R, where R is the operand to which the selection is applied and F is a formula that expresses a selection predicate, produces a result relation with the same relation schema as the operand relation, containing the subset of the tuples of the operand which satisfy the predicate. The formula involves attribute names or constants as operands, arithmetic comparison operators, and logical operators.
The Projection PJ Attr, where Attr denotes a subset of the attributes of the operand relation, produces a result having these attributes as relation schema.
Binary operations take two relations as operands; we review union, difference, Cartesian product, join and semi-join.
The union R UN S is meaningful only between two relations R and S with the same relation schema; it produces a relation with the same relation schema as its operands and the union of the tuples of R and S (i.e. all tuples appearing either in R or in S or in both).
The difference R DF S is meaningful only between two relations R and S with the same relation schema; it produces a relation with the same relation schema as its operands and the difference between the tuples of R and S (i.e. all tuples appearing in R but not in S).

The Cartesian product R CP S produces a relation whose relation schema includes all the attributes of R and S. If two attributes with the same name appear in R and S, they are nevertheless considered as different attributes; in order to avoid ambiguity, the name of each attribute is prefixed with the name of its "original" relation. Every tuple of R is combined with every tuple of S to form one tuple of the result.
Example: The operations defined above are illustrated in Figure 3.2, which applies them to three operand relations R(A, B, C), S(A, B, C) and T(C, D).

The join of two relations R and S is denoted as R JN F S, where F is a formula which specifies the join predicate. A join is derived from selection and Cartesian product as follows: R JN F S = SL F (R CP S).
The natural join R NJN S of two relations R and S is a join in which all attributes with the same names in the two relations are compared. Since these attributes have both the same name and the same values in all the tuples of the result, one of the two attributes is omitted from the result.
[Figure 3.2 shows the following panels: (a) the operand relations R, S and T; (b) Selection SL A=a R; (c) Projection PJ A,B R; (d) Union R UN S; (e) Difference R DF S; (f) Cartesian product R CP S; (g) Join R JN R.C=T.C T; (h) Natural join R NJN T; (i) Semi-join R SJ R.C=T.C T; (j) Natural semi-join R NSJ T]
Figure 3.2 Operations of relational algebra.
The semi-join of two relations R and S is denoted as R SJ F S, where F is a formula that specifies a join predicate. A semi-join is derived from projection and join as follows: R SJ F S = PJ Attr(R) (R JN F S), where Attr(R) denotes the set of all attributes of R. The natural semi-join of two relations R and S, denoted R NSJ S, is obtained by considering a semi-join with the same join predicate as in the natural join.
o SQL Statement: A simple statement in SQL has the following structure:
Select (attribute list)
From (relation name)
Where (predicates)
The interpretation of this statement is equivalent to performing a selection operation, using the predicates of the "where clause", on the relation specified in

the "from clause" and then projecting the result on the attributes of the "select clause".
Example 1: Consider the relation EMP of Figure 3.1(a) and the statement
Select NAME, AGE
From EMP
Where AGE > 20 AND DEPTNUM = 1
It returns the following relation:
NAME  AGE
Vasu  25
Hari  30

Example 2: Referring to the relations of Figure 3.2, the SQL statement
Select A, R.C
From R, T
Where R.C = T.C AND D = 3
returns as a result the relation:
A  R.C
b  f
The execution of a statement of this type can be interpreted in the following way:
1. Perform the join of relations R and T using the join clause R.C = T.C.
2. Perform a selection with predicate D = 3 on the result of the join operation.
3. Project the result on the attributes A and R.C.
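Because the relational operations are defined on sets of tuples, the interpretation steps above are easy to mimic in code. The following Python sketch is only an illustration: the generic selection, projection and join operators implement the definitions given earlier, but the contents of R and T are invented sample tuples (T's join attribute is renamed C2 here to keep the two C attributes distinct), not the actual tuples of Figure 3.2.

    def select(relation, predicate):
        """Selection SL F R: keep the tuples that satisfy the predicate."""
        return [t for t in relation if predicate(t)]

    def project(relation, attributes):
        """Projection PJ Attr R: keep the named attributes, dropping duplicate tuples."""
        seen = []
        for t in relation:
            p = {a: t[a] for a in attributes}
            if p not in seen:
                seen.append(p)
        return seen

    def join(r, s, pred):
        """Join R JN F S: selection with predicate F over the Cartesian product."""
        return [dict(list(tr.items()) + list(ts.items()))
                for tr in r for ts in s if pred(tr, ts)]

    # Invented sample contents for R(A, B, C) and T(C, D); T's attribute C is
    # called C2 in this sketch so that the merged tuples keep both values.
    R = [{"A": "a", "B": 1, "C": "a"}, {"A": "b", "B": 2, "C": "f"}]
    T = [{"C2": "a", "D": 1}, {"C2": "f", "D": 3}]

    # The three steps used to interpret Example 2:
    step1 = join(R, T, lambda tr, ts: tr["C"] == ts["C2"])   # R JN R.C=T.C T
    step2 = select(step1, lambda t: t["D"] == 3)             # SL D=3 (...)
    step3 = project(step2, ["A", "C"])                       # PJ A, R.C (...)
    print(step3)                                             # [{'A': 'b', 'C': 'f'}]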

3.3 Distributed Processin!5 An Introduction: A .istributed processing system is one in which se eral autonomous processors and data stores supporting processes and@or databases interact in order to cooperate to achie e an o erall goal. The processes coordinate their acti ities and exchange information transferred o er a communication network.

53

Nowadays distributed processing is an important area of information processing, and it is implemented using the concept of distributed databases. A distributed database is a collection of data which belong logically to the same system but are spread over the sites of a computer network. The two important aspects of a distributed database are:
Distribution: The data do not reside at a single centralized site.
Logical correlation: The data have some properties that tie them together.
Let us understand the above terminologies by taking an example. Consider a bank transaction. Here the bank has three branches at different locations. At each branch a computer controls the teller terminals of the branch and the account database of the branch. Each local branch database is termed a SITE of the distributed database, connected by a communication network. All the local transactions are managed by these local computers and will therefore be called local applications. An example of a local application is a debit or a credit application performed on an account stored at the same branch at which the application is requested. There are also some applications which access data at more than one branch, known as global applications or distributed applications; for example, the transfer of funds from an account of one branch to an account of another branch. This is not that simple an issue, as it involves updating the databases at two different places. We can now summarize the above aspects and reform the definition of a distributed database as a collection of data which are distributed over different computers of a computer network. Each site of the network has autonomous processing capability and can perform local applications, and it also participates in at least one global application, which requires accessing data at several sites using a communication subsystem. The most important aspect expected from the system is the cooperation between the autonomous sites.
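The funds-transfer example can be made concrete with a small sketch. In the following Python fragment (the branch contents and account numbers are invented), each branch database is modelled as a local dictionary; a local application touches only one branch, while the global application must update two branches and must either complete both updates or leave both unchanged.

    branch_A = {"acc-101": 500, "acc-102": 75}     # database at branch/site A (invented)
    branch_B = {"acc-201": 300}                    # database at branch/site B (invented)

    def local_debit(branch, account, amount):
        """Local application: runs entirely at one site."""
        if branch[account] < amount:
            raise ValueError("insufficient funds")
        branch[account] -= amount

    def global_transfer(src_branch, src, dst_branch, dst, amount):
        """Global application: must update two sites or leave both unchanged."""
        local_debit(src_branch, src, amount)
        try:
            dst_branch[dst] += amount
        except Exception:
            src_branch[src] += amount              # undo the debit if the credit fails
            raise

    global_transfer(branch_A, "acc-101", branch_B, "acc-201", 120)
    print(branch_A["acc-101"], branch_B["acc-201"])   # 380 420

A real DDBMS would use a distributed commit protocol for this rather than the in-process undo shown here; the sketch only illustrates why the two updates have to be treated as a single unit.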

54

"l.*o %eatures 1 9entrali!ed control

9entrali!ed databases The idea of centrali!ation is much emphasi!ed here. (ere a .atabase administrator &.;A' takes care about the safety of the data .ata .ata independence means independence that the actual organi!ation of the data is transparent to the programmer. (ere a notion called 9onceptual schema is used which gi es the conceptual iew of the data. >eduction of (ere the redundancy is redundancy reduced as far as possible.

.istributed databases (ere the idea of distribution of the data is considered. A hierarchical approach of administration like 1lobal administrator, local administrator is incorporated. (ere a notion called .istribution transparency is used which makes the program unaffected by the mo ement of the data from one site to another.

9omplex physical structures and efficient access

The efficient access of the data is supported by the complex physical data structures like secondary indexes, interfile chains and so on.

7ntegrity, >eco ery and 9oncurrency control

Cri acy "ecurity

(ere this problem can be sol ed easily as it is controlled at one point. Transaction atomicity is the concept used for this purpose i.e the se)uence of operations performed either completely or not performed at all. and 0aintained by centrali!ed 0aintained by ,ocal .ata data base administrator. Administrators at different sites

(ere the redundancy is an added feature as the locality of applications is increased if the data is replicated at all sites where the applications needed it and also the reliability of the application can be increased . (ere 9omplex physical structures will not support the efficient data access as the data is distributed. This can be sol ed using the concepts like local and global optimi!ation, which determines the optimum procedure for accessing the data at different sites. (ere it is difficult as the transactions may be initiated at different sites simultaneously. "pecial algorithms ha e to be used to take care this aspect.

;ase
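The transaction atomicity mentioned above can be pictured with a small, purely illustrative SQL sketch (the ACCOUNT table, the account numbers and the transaction syntax are hypothetical, not taken from this text): a funds transfer consists of two updates that must succeed or fail together.

    BEGIN TRANSACTION;
    UPDATE ACCOUNT SET BALANCE = BALANCE - 500 WHERE ACCNUM = 'A100';  -- debit at branch 1
    UPDATE ACCOUNT SET BALANCE = BALANCE + 500 WHERE ACCNUM = 'B200';  -- credit at branch 2
    COMMIT;  -- if either update cannot be applied, the whole transaction is rolled back

In a distributed database the two accounts may be stored at different sites, which is why special distributed algorithms are needed to preserve this all-or-nothing behaviour.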

3.5 Uses of Distributed Databases:
Organizational and economic reasons
Usage and interconnection of existing databases
Incremental growth of an organization
Reduced communication overhead
Performance aspects
Increased reliability and availability

3.6 Distributed Database Management Systems (DDBMSs): A Distributed Database Management System supports the creation and management of distributed databases. The software components required for building a distributed database are:
The database management component (DB)
The data communication component (DC)
The data dictionary (DD), which gives the picture of the distribution of data in the entire network
The distributed database component (DDB)
Fig 3.3 shows the interaction between the above components.
The services supported by the above components are:
Remote database access by an application program
Some degree of distribution transparency
Support for database administration and control
Support for concurrency control and recovery of distributed transactions

The different types of distributed database access available are:
a. Remote access via DBMS primitives: Here the application issues a request which refers to remote data. Fig 3.4 shows this scenario. The request is automatically routed by the DDBMS to the site where the data is located; it is executed there and the corresponding result is returned.
b. Remote access via an auxiliary program: Here the application requires an auxiliary program to be executed at the remote site; the auxiliary program accesses the remote database and returns the result to the requesting application. Fig 3.5 gives the exact picture of this concept.
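As a hedged illustration of case (a), the application below issues an ordinary SQL request at site 1; the DDBMS, not the programmer, recognizes that the referenced data is stored at site 2, ships the request there and returns only the result. The table and column names are invented for the illustration:

    SELECT BALANCE
    FROM   ACCOUNT          -- this table is stored in the local database of site 2
    WHERE  ACCNUM = 'B200';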

Fig. 3.3 Components of a commercial DDBMS (the DB, DC, DD and DDB components at each site, connected over the network)

Fig 3.4 Remote access via DBMS primitives


Fig 3.5 Remote access via an auxiliary program

Check Your Progress - 1: Answer the following
A. In relational databases, the data are stored in tables called _____________.
B. The number of attributes of a relation is __________.
C. The number of tuples of a relation is ____________.
D. What is a query?
E. What is a transaction?

Check Your Progress - 2: Answer the following
1. The collection of data that belong to the same system but are spread over the sites of a computer network is _____________.
2. Who guarantees the safety of the data?
3. In a DDB, which are the two possible DBAs available?
4. Define global optimization and local optimization.
5. Give the reasons for the necessity of distributed databases.
6. Give the different components of a DDBMS.
Practical Exercise: Demonstrate all the relational operations using an organizational structure of LIC. Use SQL statements.
3.7 Summary: From the discussions made in this unit you have come to know the importance of data distribution and distributed processing. We have also discussed the software requirements for managing a distributed database. In the next unit we discuss the architectural and conceptual requirements of distributed databases.

UNIT - 4  LEVELS OF DISTRIBUTION TRANSPARENCY
Structure
4.0 Objectives
4.1 Introduction
4.2 Reference Architecture for Distributed Databases
4.3 Types of Data Fragmentation
4.3.1 Horizontal Fragmentation
4.3.2 Derived Horizontal Fragmentation
4.3.3 Vertical Fragmentation
4.3.4 Mixed Fragmentation
4.4 Integrity Constraints in Distributed Databases
4.5 Summary

4.0 Objectives: In this unit we will learn the following topics:
Reference Architecture for Distributed Databases
Types of Data Fragmentation
Integrity Constraints in Distributed Databases

4.1 Introduction: In this unit we suggest a reference architecture for the distributed database. This architecture allows us to determine the different levels of transparency which are conceptually relevant to understanding distributed databases. The mapping between the different levels is also defined. We use the relational model and relational algebra for this purpose. The distributed access primitives are represented using SQL statements, as SQL is user friendly. The emphasis is on how the SQL primitives reference the objects which constitute the database.

The different types of fragmentation methods are discussed, and the integrity constraints in distributed transactions are explained.
4.2 Reference Architecture for Distributed Databases: Here we suggest a reference architecture for distributed databases, as shown in Fig. 4.1. The different levels are conceptually helpful to understand the functioning of the whole system. The various levels of the architecture are as follows.
Global Schema: defines all the data which are contained in the distributed database as if it were a centralized database. Here a set of global relations is used.
Fragmentation Schema: each global relation is split into several non-overlapping portions called fragments. The mapping between global relations and fragments is defined in the fragmentation schema. It is a one-to-many mapping, such that several fragments correspond to one global relation but only one global relation corresponds to one fragment. Fragments are indicated as Ri, the i-th fragment of the global relation R.
Allocation Schema: the fragments are logical portions of the global relations which are physically dispersed at different sites of the network. This schema defines at which site(s) a fragment is allocated. Note that, depending upon the requirements, a fragment may be allocated at more than one site; this mapping therefore determines whether the system is a redundant or a non-redundant system.
Local Mapping Schema: we have already described the relationships between the objects at the three top levels of this architecture. These three levels are site independent; therefore, they do not depend on the data model of the local DBMSs. At a lower level, it is necessary to map the physical images to the objects that are manipulated by the local DBMSs. This mapping is called a local mapping schema and depends on the type of the local DBMS; therefore, in a heterogeneous system we have different types of local mappings at different sites.

Fig. 4.1 A reference architecture for distributed databases (the global schema, fragmentation schema and allocation schema are site independent; each site has a local mapping schema, a local DBMS and a local database)

This architecture provides a very general conceptual framework for understanding distributed databases. The three most important objectives that motivate the features of this architecture are the separation of data fragmentation and allocation, the control of redundancy, and the independence from local DBMSs.
Separating the concept of data fragmentation from the concept of data allocation: This separation allows us to distinguish two different levels of distribution transparency, namely fragmentation transparency and location transparency. Fragmentation transparency is the highest degree of transparency and consists of the fact that the user or application programmer works on global relations. Location transparency is a lower degree of transparency and requires the user or application programmer to work on fragments instead of global relations; however, he or she does not know where the fragments are located.
Explicit control of redundancy: The reference architecture provides explicit control of redundancy at the fragment level.
Independence from local DBMSs: This feature, called local mapping transparency, allows us to study several problems of distributed database management without having to take into account the specific data models of the local DBMSs.
Another type of transparency, which is strictly related to location transparency, is replication transparency. Replication transparency means that the user is unaware of the replication of fragments.
4.3 Types of Data Fragmentation: Two different types of fragmentation, horizontal and vertical, can decompose global relations into fragments. We will first consider these two types of fragmentation separately and then consider the more complex fragmentations which can be obtained by applying a composition of both. In all types of fragmentation, a fragment can be defined by an expression in a relational language (we will use relational algebra), which takes global relations as operands and produces the fragment as result. For example, if a global relation contains data about employees, a fragment which contains only the data about the employees who work at department D1 can obviously be defined by a selection operation on the global relation.

Some rules which must be followed when defining fragments:
Completeness condition: all the data of the global relation must be mapped into the fragments; i.e., it must not happen that a data item belonging to a global relation does not belong to any fragment.
Reconstruction condition: it must always be possible to reconstruct each global relation from its fragments. The necessity of this condition is obvious: only fragments are stored in the distributed database, and global relations have to be built through this reconstruction operation when necessary.
Disjointness condition: it is convenient that fragments be disjoint, so that the replication of data can be controlled explicitly at the allocation level.
4.3.1 Horizontal Fragmentation: Horizontal fragmentation consists of partitioning the tuples of a global relation into subsets; this is clearly useful in distributed databases, where each subset can contain data that have common geographical properties. It can be defined by expressing each fragment as a selection operation on the global relation.
Example: let a global relation be
SUPPLIER(SNUM, NAME, CITY)
Then the horizontal fragmentation can be defined in the following way:
SUPPLIER1 = SL CITY="Mysore" SUPPLIER
SUPPLIER2 = SL CITY="Shimoga" SUPPLIER
Now let us verify whether this fragmentation fulfills the conditions stated earlier.
The completeness condition: if "Mysore" and "Shimoga" are the only possible values of the CITY attribute, then this condition is satisfied.
The reconstruction condition: this can be verified easily, because it is always possible to reconstruct the SUPPLIER global relation through the following operation:
SUPPLIER = SUPPLIER1 UN SUPPLIER2
The disjointness condition is clearly verified.
Qualification: the predicate which is used in the selection operation and defines a fragment is called its qualification. For instance, in the above example the qualifications are:
q1: CITY = "Mysore"
q2: CITY = "Shimoga"
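For concreteness, the same two fragments could be written down in SQL as selections over the global relation; this is only a hedged sketch (a real DDBMS records such definitions in the fragmentation schema rather than as ordinary views):

    CREATE VIEW SUPPLIER1 AS
        SELECT SNUM, NAME, CITY FROM SUPPLIER WHERE CITY = 'Mysore';
    CREATE VIEW SUPPLIER2 AS
        SELECT SNUM, NAME, CITY FROM SUPPLIER WHERE CITY = 'Shimoga';
    -- reconstruction corresponds to the union of the two fragments:
    -- SELECT * FROM SUPPLIER1 UNION SELECT * FROM SUPPLIER2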


We can generalize from the above example that, in order to satisfy the completeness condition, the set of qualifications of all fragments must be complete, at least with respect to the set of allowed values. The reconstruction condition is always satisfied through the union operation, and the disjointness condition requires that the qualifications be mutually exclusive.
4.3.2 Derived Horizontal Fragmentation: This is a type of fragmentation which is derived from the horizontal fragmentation of another relation.
Example: Consider a global relation
SUPPLY(SNUM, PNUM, DEPTNUM, QUAN)
where SNUM is a supplier number. If it is required that a fragment contain the tuples for the suppliers which are in a given city, then we have to go for derived fragmentation. A semi-join operation with the fragments SUPPLIER1 and SUPPLIER2 is needed in order to determine the tuples of SUPPLY which correspond to the suppliers in a given city. The derived fragmentation of SUPPLY can therefore be defined as follows:
SUPPLY1 = SUPPLY SJ SNUM=SNUM SUPPLIER1
SUPPLY2 = SUPPLY SJ SNUM=SNUM SUPPLIER2
The reconstruction of the global relation SUPPLY can be performed through the union operation, as was shown for SUPPLIER.
The completeness of the above fragmentation requires that there be no supplier numbers in the SUPPLY relation which are not contained also in the SUPPLIER relation. This is a typical, and reasonable, integrity constraint for this database and is usually called a referential integrity constraint.
The disjointness condition is satisfied if a tuple of the SUPPLY relation does not correspond to two tuples of the SUPPLIER relation that belong to two different fragments. In this case the condition is easily verified, because the supplier numbers are unique keys of the SUPPLIER relation.
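SQL has no explicit semi-join operator, but the derived fragment SUPPLY1 can be sketched with an EXISTS subquery; again, this is only an illustration of the definition above:

    CREATE VIEW SUPPLY1 AS
        SELECT S.SNUM, S.PNUM, S.DEPTNUM, S.QUAN
        FROM   SUPPLY S
        WHERE  EXISTS (SELECT 1 FROM SUPPLIER1 R WHERE R.SNUM = S.SNUM);
    -- SUPPLY2 is defined in the same way, using SUPPLIER2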


4.3.3 Vertical Fragmentation: The vertical fragmentation of a global relation is the subdivision of its attributes into groups; fragments are obtained by projecting the global relation over each group. This can be useful in distributed databases where each group of attributes can contain data that have common geographical properties. The fragmentation is correct if each attribute is mapped into at least one attribute of the fragments; moreover, it must be possible to reconstruct the original relation by joining the fragments together.
Example: Consider a global relation
EMP(EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
A vertical fragmentation of this relation can be defined as:
EMP1 = PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP2 = PJ EMPNUM, SAL, TAX EMP
The reconstruction of relation EMP can be obtained as:
EMP = EMP1 JN EMPNUM=EMPNUM EMP2
This is possible because EMPNUM is a key of EMP.
Let us draw some important points from this example. The purpose of including the key of the global relation in each fragment is to ensure the reconstruction property. An alternative way to provide the reconstruction property is to generate tuple identifiers that are used as system-controlled keys. This can be convenient in order to avoid the replication of large keys; moreover, users cannot modify tuple identifiers.
Let us finally consider the problem of fragment disjointness. We have seen that at least the key should be replicated in all fragments in order to allow reconstruction. If we include the same non-key attribute in two different vertical fragments, then the column which corresponds to this attribute is replicated. For example, consider the following vertical fragmentation of relation EMP:
EMP1 = PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP2 = PJ EMPNUM, NAME, SAL, TAX EMP
The attribute NAME is replicated in both fragments. We can remove this attribute when we reconstruct relation EMP, through an additional projection operation:
EMP = EMP1 JN EMPNUM=EMPNUM PJ EMPNUM, SAL, TAX EMP2
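A hedged SQL sketch of the first vertical fragmentation above and of its reconstruction by a join on the key EMPNUM:

    CREATE VIEW EMP1 AS SELECT EMPNUM, NAME, MGRNUM, DEPTNUM FROM EMP;
    CREATE VIEW EMP2 AS SELECT EMPNUM, SAL, TAX FROM EMP;
    -- reconstruction of EMP
    SELECT E1.EMPNUM, E1.NAME, E2.SAL, E2.TAX, E1.MGRNUM, E1.DEPTNUM
    FROM   EMP1 E1 JOIN EMP2 E2 ON E1.EMPNUM = E2.EMPNUM;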


4.3.4 Mixed Fragmentation: The fragments that are obtained by the above fragmentation operations are relations themselves, so it is possible to apply the fragmentation operations recursively, provided that the correctness conditions are satisfied each time. The reconstruction can be obtained by applying the reconstruction rules in reverse order.
Example: Consider the same global relation
EMP(EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
The following is a mixed fragmentation, obtained by applying the vertical fragmentation of the previous example, followed by a horizontal fragmentation on DEPTNUM:
EMP1 = SL DEPTNUM<=10 PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP2 = SL 10<DEPTNUM<=20 PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP3 = SL DEPTNUM>20 PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP4 = PJ EMPNUM, NAME, SAL, TAX EMP

Fig. 4.3 The fragmentation tree of relation EMP (the root is EMP; the leaves are the fragments EMP1, EMP2, EMP3 and EMP4)


The reconstruction of relation EMP is defined by the following expression:
EMP = UN(EMP1, EMP2, EMP3) JN EMPNUM=EMPNUM PJ EMPNUM, SAL, TAX EMP4
A fragmentation tree (as shown in the above figure) can conveniently represent a mixed fragmentation. In a fragmentation tree, the root corresponds to a global relation, the leaves correspond to the fragments, and the intermediate nodes correspond to the intermediate results of the fragment-defining expressions.

The EXAMPLE_DDB: The following shows the global and fragmentation schemata of EXAMPLE_DDB. Most of the global relations of EXAMPLE_DDB and their fragmentations have already been introduced. A DEPT relation, horizontally fragmented into three fragments on the value of the DEPTNUM attribute, is added.
Global schema
EMP(EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
DEPT(DEPTNUM, NAME, AREA, MGRNUM)
SUPPLIER(SNUM, NAME, CITY)
SUPPLY(SNUM, PNUM, DEPTNUM, QUAN)
Fragmentation schema
EMP1 = SL DEPTNUM<=10 PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP2 = SL 10<DEPTNUM<=20 PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP3 = SL DEPTNUM>20 PJ EMPNUM, NAME, MGRNUM, DEPTNUM EMP
EMP4 = PJ EMPNUM, NAME, SAL, TAX EMP
DEPT1 = SL DEPTNUM<=10 DEPT
DEPT2 = SL 10<DEPTNUM<=20 DEPT
DEPT3 = SL DEPTNUM>20 DEPT
SUPPLIER1 = SL CITY="Mysore" SUPPLIER
SUPPLIER2 = SL CITY="Shimoga" SUPPLIER
SUPPLY1 = SUPPLY SJ SNUM=SNUM SUPPLIER1
SUPPLY2 = SUPPLY SJ SNUM=SNUM SUPPLIER2
4.4 Integrity Constraints in Distributed Databases: When an update performed by a database application violates an integrity constraint, the application is rejected and thus the correctness of the data is preserved. A typical example of an integrity constraint is referential integrity, which requires that all values of a given attribute of a relation exist also in some other relation. This constraint is particularly useful in distributed databases for ensuring the correctness of derived fragmentations. For example, since the SUPPLY relation has a fragmentation which is derived from that of the SUPPLIER relation by means of a semi-join on the SNUM attribute, it is required that all values of SNUM in SUPPLY be present also in SUPPLIER.
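A hedged sketch of how the referential integrity constraint between SUPPLY and SUPPLIER could be declared on the global schema (the exact syntax, and whether the local DBMSs can enforce it across fragments, depends on the system):

    CREATE TABLE SUPPLY (
        SNUM    CHAR(5),
        PNUM    CHAR(5),
        DEPTNUM INTEGER,
        QUAN    INTEGER,
        FOREIGN KEY (SNUM) REFERENCES SUPPLIER(SNUM)
        -- every SNUM appearing in SUPPLY must also exist in SUPPLIER
    );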


Integrity constraints can be enforced automatically by adding to application programs some code for testing whether the constraint is violated. If so, the program execution is suspended and all actions already performed by it are cancelled, if necessary. One of the most serious disadvantages of integrity constraints is the loss in performance due to the execution of the integrity tests; this loss is especially relevant in distributed databases, since applying integrity checking may increase the need to access remote sites. It is therefore necessary to consider integrity checking also in the design of the distribution of the database.
Check Your Progress - 1: Answer the following:
1. Define global schema, fragmentation schema, allocation schema and local mapping schema.
2. What is the concept of redundancy?
3. List the different types of fragmentation techniques available.
4. Define the referential integrity constraint.
5. What are the rules that a fragmentation technique should follow?
Check Your Progress - 2: Answer the following:
1. Explain the reference architecture for distributed databases.
2. Explain the different fragmentation techniques with an example.
3. Discuss the integrity constraints in distributed databases.
Check Your Progress - 3: Solve the following:
Design a global schema for the organization structure of "LIC of India". Write the fragmentation schema for the same.
4.5 Summary: In this unit we have studied a reference architecture for distributed databases. The different types of fragmentation techniques have been discussed, together with some demonstration examples. Some ideas about integrity constraints have also been given.

UNIT - 5  DISTRIBUTED DATABASE DESIGN
Structure
5.0 Objectives
5.1 Introduction
5.2 A Framework for Distributed Database Design
5.2.1 Objectives of the Design of Data Distribution
5.2.2 Top-Down and Bottom-Up Approaches - Classical Design Methodologies
5.3 The Design of Database Fragmentation
5.3.1 Horizontal Fragmentation
5.3.1.1 Primary Fragmentation
5.3.1.2 Derived Horizontal Fragmentation
5.3.2 Vertical Fragmentation
5.3.3 Mixed Fragmentation
5.4 The Allocation of Fragments
5.5 Summary

5.0 Objectives: In this unit you will come to know the different design aspects of distributed databases. At the end of this unit you will be able to describe topics like:
A framework for distributed database design
The objectives of the design of data distribution
Top-Down and Bottom-Up design approaches
The design of database fragmentation
Horizontal Fragmentation
Vertical Fragmentation
Mixed Fragmentation
The allocation of fragments
General criteria for fragment allocation

5.1 Introduction: The concept of data distribution is itself difficult to design and implement because of various technical and organizational issues, so an efficient design methodology is needed. From the technical aspect, the issues are the interconnection of sites and the appropriate distribution of the data and applications to the sites, depending upon the requirements of the applications and on optimizing performance. From the organizational point of view, the issue of decentralization is crucial, and distributing an application has a great effect on the organization. In recent years a lot of research work has taken place in this area, and its major outcomes are:
o Design criteria for effective data distribution
o Mathematical background for the design aids
In section 5.2 you will learn a framework for the design, including the Top-Down and Bottom-Up design approaches. Section 5.3 explains the design of horizontal and vertical fragmentation. In section 5.4 we give principles and concepts of fragment allocation.
5.2 A Framework for Distributed Database Design: The design of a centralized database concentrates on:
Designing the conceptual schema, which describes the complete database
Designing the physical database, which maps the conceptual schema to the storage areas and determines the appropriate access methods
In a distributed database, the above two steps become the design of the global schema and the design of the local databases. The added steps are:
Designing the fragmentation: the actual procedure of dividing the global relations into horizontal, vertical or mixed fragments
Designing the allocation of fragments: allocating the different fragments according to the site requirements
Before designing a distributed database, a thorough knowledge of the applications is a must. In this case we expect the designer to know:
The site of origin: the site from which the application is issued
The frequency of invoking the request at each site
The number, type and statistical distribution of the accesses made by each application to each required data item

In the coming section let us try to understand the actual objectives of the design of data distribution.
5.2.1 Objectives of the Design of Data Distribution: In the design of data distribution the following objectives should be considered.
Processing locality: Reducing remote references, and in turn maximizing local references, is the primary aim of data distribution. This can be achieved by a redundant fragment allocation meeting the site requirements. Complete locality is an extended idea, which simplifies the execution of applications.
Availability and reliability of distributed data: Availability is achieved by having multiple copies of the data for read-only applications. Reliability is likewise achieved by storing multiple copies of the information, as this is helpful in case of system crashes.
Workload distribution: Workload distribution is a major goal in order to obtain a high degree of parallelism.
Storage costs and availability: Cost criteria and the availability of storage areas should be intelligently handled for effective data distribution.
Using all of the above criteria together may increase the design complexity, so the important aspects are taken as objectives, depending upon the requirements, and the others are treated as constraints. In the next section let us look at the classical design approaches.
5.2.2 Top-Down and Bottom-Up Approaches - Classical Design Methodologies: There are two classical approaches as far as distributed database design is concerned. They are:
1. Top-Down Approach: This is quite useful when the system has to be designed from scratch. Here we follow these steps:
Design of the global schema
Design of the fragmentation schema
Design of the allocation schema
Design of the local schemata (design of the "physical databases")

2. Bottom-Up Approach: This can be used for an existing system. The approach is based on the integration of existing schemata into a single global schema and requires that the following problems be solved:
The selection of a common database model for describing the global schema of the database
The translation of each local schema into the common data model
The integration of the local schemata into a common global schema, i.e. the merging of common data definitions and the resolution of conflicts among the different representations given to the same data
The Bottom-Up design requires solving these three problems; the design steps are then just the reverse of those of the previous method.
5.3 The Design of Database Fragmentation: Here we discuss the design of non-overlapping fragments, which are the logical units of allocation. It is important to have an efficient design methodology so that we can overcome the related problems of allocation. In the following we explain the design of horizontal, vertical and mixed fragmentation.
5.3.1 Horizontal Fragmentation: Here we discuss two important methods, called primary and derived fragmentation. Determining a horizontal fragmentation involves knowing:
The logical properties of the data, such as the fragmentation predicates
The statistical properties of the data, such as the number of references of applications to the fragments
5.3.1.1 Primary Fragmentation: The correctness of a primary fragmentation requires that each tuple of the global relation be selected into one and only one fragment. Thus, determining the primary horizontal fragmentation of a global relation requires determining a set of disjoint and complete selection predicates (we define these terms later in this section). The property we expect from each fragment is that its elements must be referenced homogeneously by all applications.
Let R be the global relation for which we want to produce a horizontal primary fragmentation. Let us define some terminology.
A simple predicate is a predicate of the type: Attribute = value

A min-term predicate y for a set P of simple predicates is the conjunction of all predicates appearing in P, either taken in their natural form or negated. Thus:
y = p1* AND p2* AND ... AND pn*,  where pi* = pi or pi* = NOT pi, and y ≠ false
A fragment is the set of all tuples for which a min-term predicate holds.
A simple predicate pi is relevant with respect to a set P of simple predicates if there exist at least two min-term predicates of P whose expressions differ only in the predicate pi itself, such that the corresponding fragments are referenced in a different way by at least one application.
Let us try to understand the above terminology by taking an example. Consider the relations DEPT(DEPTNUM, NAME, AREA) and JOB(JOBID, JOBNAME). Let us assume that only two departments are functioning, i.e. departments 1 and 2. Some examples of simple predicates are:
DEPTNUM = 1 (or DEPTNUM ≠ 2), DEPTNUM = 2 (or DEPTNUM ≠ 1), JOB = "programmer" or JOB ≠ "programmer"
The corresponding min-term predicates are:
DEPTNUM = 1 AND JOB = "programmer"
DEPTNUM = 1 AND JOB ≠ "programmer"
DEPTNUM ≠ 1 AND JOB ≠ "programmer"
DEPTNUM ≠ 1 AND JOB = "programmer"
Now let us look at some more supporting terminology. Let P = {p1, p2, ..., pn} be a set of simple predicates. For a correct and efficient fragmentation, P must be complete and minimal.
We say that a set P of predicates is complete if and only if any two tuples belonging to the same fragment are referenced with the same probability by any application.
The set P is minimal if all its predicates are relevant.

Example: In the above example, P1 = {DEPTNUM = 1} is not complete if an application is also interested specifically in the employees who are "programmers". In that case P2 = {DEPTNUM = 1, JOB = "programmer"} is complete and minimal. The set P3 = {DEPTNUM = 1, JOB = "programmer", SAL > 50} is complete but not minimal, since SAL > 50 is not relevant.
Knowing the minimum characteristics that have to be considered, let us now generalize the method to be followed for producing the fragments of a given global relation R:
I. Consider a predicate p1 that partitions the tuples of the global relation R into two parts which are referenced differently by at least one application. Let P = {p1}.
II. Consider a new simple predicate pi which partitions at least one fragment of P into two parts which are referenced in a different way by at least one application. Eliminate non-relevant predicates from P. Repeat this step until the set of min-term fragments of P is complete.
Example: Let us take two cities of Karnataka: Shimoga and Mysore. The example application considered is the marketing of medical goods. The global schema for this application includes the relations EMPL, DEPT, SUPPLIER and SUPPLY:
EMPL(EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM)
DEPT(DEPTNUM, NAME, AREA, MGRNUM)
SUPPLIER(SNUM, NAME, CITY)
SUPPLY(SNUM, PNUM, DEPTNUM, QUAN)
We design the fragmentation of SUPPLIER and DEPT with a primary fragmentation. Now let us take a query: find the names of suppliers with a given supplier number SNUM. As you have already seen, the query language SQL can be used for representing this query:
Select NAME
From SUPPLIER
Where SNUM = X

This query may be issued at any one of the sites. Let us assume that we have three sites in our purview: site 1 is in Shimoga, site 2 is between Shimoga and Mysore, and site 3 is in Mysore. If the query is issued at site 1, it references suppliers whose CITY is "Shimoga" with about 90% probability; if it is issued at site 2, it references suppliers of "Shimoga" and "Mysore" with equal probability; if it is issued at site 3, it references suppliers whose CITY is "Mysore" with about 90% probability. This reflects the obvious fact that departments around one city tend to use the suppliers which are close to them. We can write the predicates for the above application as:
p1: CITY = "SHIMOGA"
p2: CITY = "MYSORE"

"ince the set VC1, C2W is complete and minimal, the search is terminated. ,et us now consider the global relation .:CT+ DEPT DDEPT#U'G #A'EG AREAG ')R#U'E. "ome example predicates that are suitable for administrati e applications are considered. C1+ .:CT*A0 Y O 1/ C2+ &1/ Y .:CT*A0 Y O 2/' C3+ .:CT*A0 P 2/ C4+ A>:A O I*6>T(J C$+ A>:A O I"6AT(J 7f we assume that in the northern area the departments with .:CT*A0 P 2/ will ne er be there, then A>:A O I*6>T(J implies that .:CT*A0 P 2/ is false. Thus the fragments are reduced to the following four+ R1+ .:CT*A0 Y O 1/ R2+ &1/ Y .:CT*A0 Y O 2/' A*. &A>:A O I*6>T(J ' R3+ &1/ Y .:CT*A0 Y O 2/' A*. &A>:A O I"6AT(J ' R4+ .:CT*A0 P 2/ 7f we now concentrate about the fragment allocation we can easily allocate fragments corresponding to y1 and y4 at sites 1 and 3.;ut depending upon the re)uirement fragments y2 and y3 will be allocated to either sites 1 or 3.


5.3.1.2 Derived Horizontal Fragmentation: This type of fragmentation is not based on the properties of a relation's own attributes; it is derived from the horizontal fragmentation of another relation. It is used to facilitate the join between fragments. A distributed join is a join between horizontally fragmented relations: when you want to join two relations R and S, you have to compare their fragments. Join graphs can efficiently represent this; Fig 5.1 represents the different possible join graphs.
o Total: the join graph is total when it contains all possible edges between the fragments of R and S.
o Reduced: the join graph is reduced when some of the edges between R and S are missing. Here we have two special types:
Partitioned: a reduced graph is partitioned if the graph is composed of two or more subgraphs without edges between them.
Simple: a reduced graph is simple if it is partitioned and each subgraph has just one edge.
Example: Consider the relation SUPPLY(SNUM, PNUM, DEPTNUM, QUAN) and the following cases. Some applications:
o Require information about the supplies of given suppliers; thus they perform a join between SUPPLY and SUPPLIER on the SNUM attribute.
o Require information about the supplies at a given department; thus they perform a join between SUPPLY and DEPT on the DEPTNUM attribute.
Let us assume that the relation DEPT is horizontally fragmented on the attribute DEPTNUM and that SUPPLIER is horizontally fragmented on the attribute SNUM. A derived horizontal fragmentation of SUPPLY can be obtained by performing a semi-join either with SUPPLIER on SNUM or with DEPT on DEPTNUM; both of them are correct.
5.3.2 Vertical Fragmentation: This requires grouping the attributes into sets which are referenced in a similar manner by the applications. The method has been discussed by considering two separate types of problems.
The Vertical Partitioning Problem: Here the sets must be disjoint; of course, one attribute must be common to all of them. For example, assume that a relation S is vertically fragmented in this way into S1 and S2. This can be useful

where an application can be executed using either S1 or S2; otherwise, keeping the complete S at a particular site may be an unnecessary burden.
Two possible design approaches:
1. The split approach: the global relations are progressively split into fragments.
2. The grouping approach: the attributes are progressively aggregated to constitute fragments.
Both are heuristic approaches, as each iteration step looks for the best choice. In both cases, formulas are used to indicate the best possible splitting or grouping.

>1 "1

>1

"1

>1

"1

>2 "2

>2 >2 >3 >3 "1

"2 "3

>3 "3 >4

>4

"4

>4

"2

>$

Figure 5.1 The different possible join graphs (total, partitioned and simple)
The Vertical Clustering Problem: Here the sets can overlap; depending upon the requirements, you may have more than one common attribute in two different fragments of a global relation. This introduces replication within the fragments, as some common attributes are present in several fragments.

Vertical clustering is suitable mainly for read-only applications, because applications which involve frequent updating of the common attributes need to refer to the sites where all these attributes are present. Therefore, vertical clustering is suggested where the overlapping attributes are not heavily updated.
Example: Consider the global relation EMPL(EMPNUM, NAME, SAL, TAX, MGRNUM, DEPTNUM). The following requirements are made:
Administrative applications require NAME, SAL and TAX of employees.
The departments require NAME, MGRNUM and DEPTNUM.
Here vertical clustering is suggested, as the attribute NAME is required in both fragments. So the fragments may be:
EMPL1(EMPNUM, NAME, SAL, TAX)
EMPL2(EMPNUM, NAME, MGRNUM, DEPTNUM)
5.3.3 Mixed Fragmentation: The simple ways of performing this are:
Apply horizontal fragmentation to vertical fragments
Apply vertical fragmentation to horizontal fragments
Both these cases are illustrated in the diagrams of figures 5.2 and 5.3.

Fig. 5.2 Vertical fragmentation followed by horizontal fragmentation

Fig. 5.3 Horizontal fragmentation followed by vertical fragmentation
5.4 The Allocation of Fragments: In this section we explain the different aspects to be considered when allocating a particular fragment to a site.

This section describes some general criteria that can be used for allocating fragments. There are two types of allocation methods which can be followed:
Non-redundant allocation: This is the simpler case. A method known as the "best-fit approach" can be used: a measure is associated with each possible allocation, and the site with the best measure is selected. It avoids placing a fragment at a given site where a fragment related to it is already present.
Redundant allocation: This is a more complex design problem, since:
o The degree of replication is a variable of the problem.
o The modelling of read applications is complicated, as the applications may select any one of several alternative copies.
The following two methods can be used for determining the redundant allocation of fragments:

Determine the set of all sites where the benefit of allocating one copy of the fragment is higher than the cost, and allocate a copy of the fragment to each element of this set; this method selects "all beneficial sites".
Start from a non-replicated version; then progressively introduce replicated copies, starting from the most beneficial; the process is terminated when no additional replication is beneficial.

Both the reliability and the availability of the system increase if there are two or three copies of a fragment, but further copies give a less than proportional increase.
Check Your Progress - 1: Answer the following:
1) List the objectives of distributed databases.
2) What do you mean by the Top-Down and Bottom-Up approaches?
3) Define the following: (i) a simple predicate (ii) a min-term predicate (iii) a complete predicate set (iv) a minimal predicate set.
4) Define a join graph.
5) State the vertical partitioning and vertical clustering problems.
Check Your Progress - 2: Answer the following:
1) Give an overall framework for the design of distributed databases.

2) Describe the Top-Down and Bottom-Up design approaches.
3) Explain the general principles of the allocation of fragments in distributed databases.
4) Discuss the general principles of fragmentation design, considering all types of techniques.
5.5 Summary: In this unit we have discussed the four phases of the design of distributed databases: the global schema, the fragmentation schema, the allocation schema and the local schemata. Some important aspects of the design of the fragmentation and allocation schemata have been described in detail, and some practical examples have been chosen to illustrate the new concepts.

UNIT - 6  OVERVIEW OF QUERY PROCESSING
Structure
6.0 Objectives
6.1 Introduction
6.2 Query Processing Problem
6.3 Objectives of Query Processing
6.4 Characterization of Query Processors
6.5 Layers of Query Processing
6.5.1 Query Decomposition
6.5.2 Data Localization
6.5.3 Global Query Optimization
6.5.4 Local Query Optimization
6.6 Summary

6.0 Objectives: In this unit we present an overview of query processing in Distributed Database Management Systems (DDBMSs). This is explained with the help of relational calculus and relational algebra, because of their generality and wide use in DDBMSs. In this unit we discuss:
Various problems of query processing
The characteristics of an ideal query processor
The concept of layering in query processing
Some related examples of query processing

6.1 Introduction: The increasing success of relational database technology in data processing is due, in part, to the availability of nonprocedural languages, which can significantly improve application development and end-user productivity. By hiding the low-level details about the physical organization of the data, relational database languages allow the expression of complex queries in a concise and simple fashion.

In particular, to construct the answer to a query, the user does not exactly specify the procedure to follow; this procedure is actually devised by a DBMS module called the query processor. This relieves the user from query optimization, a time-consuming task that is better handled by the query processor. The issue is important in both centralized and distributed processing systems. However, the query processing problem is much more difficult in distributed environments than in conventional systems, because the relations involved in distributed queries may be fragmented and/or replicated, thereby inducing communication overhead costs. So, in this unit we discuss the different issues of query processing, the characteristics of an ideal query processor for a distributed environment and, finally, a layered software approach to distributed query processing.
6.2 Query Processing Problem: The main duty of a relational query processor is to transform a high-level query (in relational calculus) into an equivalent lower-level query (in relational algebra). In a distributed database, fragmentation is of major importance for query processing, since the definition of fragments is based on the objective of increasing reference locality, and sometimes parallel execution, for the most important queries. The role of a distributed query processor is to map a high-level query on a distributed database (a set of global relations) into a sequence of database operations (of relational algebra) on relational fragments. Several important functions characterize this mapping:
The calculus query must be decomposed into a sequence of relational operations called an algebraic query.
The data accessed by the query must be localized, so that the operations on relations are translated to bear on local data (fragments).
The algebraic query on fragments must be extended with communication operations and optimized with respect to a cost function to be minimized. This cost function refers to computing resources such as disk I/Os, CPUs and communication networks.
The low-level query actually implements the execution strategy for the query. The transformation must achieve both correctness and efficiency. The well-defined mapping with the above functional characteristics makes the correctness issue easy, but

producing an efficient execution strategy is more complex. A relational calculus query may have many equivalent and correct transformations into relational algebra. Since each equivalent execution strategy can lead to a different consumption of computer resources, the main problem is to select the execution strategy that minimizes resource consumption.
Example: We consider the following subset of the engineering database schema given in Figure 6.0:
E(ENO, ENAME, TITLE)
G(ENO, JNO, RESP, DUR)
and the simple user query: "Find the names of employees who are managing a project".
E (ENO, ENAME, TITLE)
E1  A  Elect. Eng.
E2  B  Syst. Anal.
E3  C  Mech. Eng.
E4  D  Programmer
E5  E  Syst. Anal.
E6  F  Elect. Eng.
E7  G  Mech. Eng.
E8  H  Syst. Anal.

J (JNO, JNAME, BUDGET, LOC)
J1  Instrumentation    150000  Montreal
J2  Database Develop.  135000  New York
J3  CAD/CAM            250000  New York
J4  Maintenance        310000  Paris

G (ENO, JNO, RESP, DUR)
E1  J1  Manager     12
E2  J1  Analyst     24
E2  J2  Analyst      6
E3  J3  Consultant  10
E3  J4  Engineer    48
E4  J2  Programmer  18
E5  J2  Manager     24
E6  J4  Manager     48
E7  J3  Engineer    36
E8  J3  Manager     40

S (TITLE, SAL)
Elect. Eng.  40000
Syst. Anal.  34000
Mech. Eng.   27000
Programmer   24000

Figure 6.0 Example Database

The equivalent relational calculus query using SQL syntax is:
SELECT ENAME
FROM   E, G
WHERE  E.ENO = G.ENO
AND    RESP = "Manager"

Two equivalent relational algebra queries that are correct transformations of the above query are:
PJ ENAME (SL RESP="Manager" AND E.ENO=G.ENO (E CP G))
and
PJ ENAME (E JN ENO (SL RESP="Manager" G))
NOTE: The following observations can be made from the above example:
The second query avoids the Cartesian product (CP) of E and G, consumes much less computing resources than the first, and thus should be retained. That is, we should avoid performing a Cartesian product operation on full relations.
In a centralized environment, the role of the query processor is to choose, for a given query, the best relational algebra query among all equivalent ones.
In a distributed environment, relational algebra is not enough to express execution strategies; it must be supplemented with operations for exchanging data between sites. Besides choosing the ordering of the relational operations, the distributed query processor has to select the best sites to process the data and the way in which the data should be transferred.
Example: This example illustrates the importance of site selection and communication for a chosen relational algebra query against a fragmented database. We consider the following query:
PJ ENAME (E JN ENO (SL RESP="Manager" G))

This query is posed on the relations of the previous example. We assume that the relations E and G are horizontally fragmented as follows:


E1 = SL ENO<="E3" E
E2 = SL ENO>"E3" E
G1 = SL ENO<="E3" G
G2 = SL ENO>"E3" G
Fragments G1, G2, E1 and E2 are stored at sites 1, 2, 3 and 4, respectively, and the result is expected at site 5, as shown in Fig 6.1. For simplicity, we have ignored the project operation here. In the figure, two equivalent strategies for the above query are shown.
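As a purely illustrative aside, the four fragments can also be expressed in SQL (the string comparison on ENO is assumed to follow the ordering E1 < E2 < ... used in the example):

    CREATE VIEW E1 AS SELECT * FROM E WHERE ENO <= 'E3';
    CREATE VIEW E2 AS SELECT * FROM E WHERE ENO >  'E3';
    CREATE VIEW G1 AS SELECT * FROM G WHERE ENO <= 'E3';
    CREATE VIEW G2 AS SELECT * FROM G WHERE ENO >  'E3';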

Some observations on the strategies:
An arrow from site i to site j labelled with R indicates that relation R is transferred from site i to site j.
Strategy A exploits the fact that relations E and G are fragmented in the same way, in order to perform the select and join operations in parallel.
Strategy B centralizes all the operand data at the result site before processing the query.
Resource consumption of these two strategies:

Assumptions made:
1. Relations E and G have 400 and 1000 tuples, respectively.
2. There are 20 managers in relation G.
3. The data is uniformly distributed among the sites.
4. Relations G and E are locally clustered on attributes RESP and ENO, respectively.
5. There is direct access to the tuples of G (respectively, E) based on the value of attribute RESP (respectively, ENO).
6. A tuple access, denoted tupacc, costs 1 unit.
7. A tuple transfer, denoted tuptrans, costs 10 units.

The Cost Analysis:
The cost of strategy A can be derived as follows:
1. Produce G' by selecting G requires 20 * tupacc                  =    20
2. Transfer G' to the sites of E requires 20 * tuptrans            =   200
3. Produce E' by joining G' and E requires (10 * 10) * tupacc * 2  =   200
4. Transfer E' to the result site requires 20 * tuptrans           =   200
The total cost                                                      =   620
The cost of strategy B can be derived as follows:
1. Transfer E to site 5 requires 400 * tuptrans                     =  4000
2. Transfer G to site 5 requires 1000 * tuptrans                    = 10000
3. Produce G' by selecting G requires 1000 * tupacc                 =  1000
4. Join E and G' requires 400 * 20 * tupacc                         =  8000
The total cost                                                      = 23000

Strategy A is better by a factor of about 37, which is quite significant. It also provides a better distribution of work among the sites. The difference would be even higher if we assumed slower communication and/or a higher degree of fragmentation.
Fig. 6.1 Equivalent distributed execution strategies
(a) Strategy A:
Site 1: G1' = SL RESP="Manager" G1;   G1' is sent to site 3
Site 2: G2' = SL RESP="Manager" G2;   G2' is sent to site 4
Site 3: E1' = E1 JN ENO G1';          E1' is sent to site 5
Site 4: E2' = E2 JN ENO G2';          E2' is sent to site 5
Site 5: Result = E1' UN E2'
(b) Strategy B:
E1, E2, G1 and G2 are all sent to site 5
Site 5: Result = (E1 UN E2) JN ENO SL RESP="Manager" (G1 UN G2)

6.3 Objectives of Query Processing:
The main objective of query processing in a distributed environment is to transform a high-level query on a distributed database, which is seen as a single database by the users, into an efficient execution strategy expressed in a low-level language on the local databases.
An important point of query processing is query optimization. Because many execution strategies are correct transformations of the same high-level query, the one that optimizes (minimizes) resource consumption should be retained.
The good measures of resource consumption are:
o The total cost that will be incurred in processing the query. It is the sum of all times incurred in processing the operations of the query at the various sites and in intersite communication.
o The response time of the query. This is the time elapsed in executing the query. Since operations can be executed in parallel at different sites, the response time of a query may be significantly less than its cost.
Obviously the total cost should be minimized.
o In a distributed system, the total cost to be minimized includes CPU, I/O and communication costs. These costs can be minimized by reducing the number of I/O operations through fast access methods to the data and efficient use of main memory. The communication cost is the time needed for exchanging the data between the sites participating in the execution of the query. This cost is incurred in processing the messages and transmitting the data on the communication network. In wide-area distributed systems, the communication cost factor largely dominates the local processing cost, so that the other cost factors are often ignored.
o In centralized systems, only CPU and I/O costs have to be considered.
6.4 Characterization of Query Processors: It is not easy to give characteristics which cleanly differentiate centralized and distributed query processors; still, some of them are listed here. Of these, the first four are common to both and the next four are particular to

distributed query processors.
o Languages: The input language to the query processor can be based on relational calculus or relational algebra. The former requires an additional phase to decompose a query expressed in relational calculus into relational algebra. In the distributed context, the output language is generally some form of relational algebra augmented with communication primitives; the query processor must therefore perform a correct mapping between the input and output languages.
o Types of optimization: Conceptually, query optimization means choosing the point of the solution space that leads to the minimum cost. One approach is an exhaustive search of the solution space; more commonly, heuristic techniques are used. In both centralized and distributed systems a common heuristic is to minimize the size of intermediate relations. This can be done by performing unary operations first and ordering the binary operations by the increasing sizes of their intermediate relations.
o Optimization Timing: A query may be optimized at different times relative to the actual time of query execution. Optimization can be done statically, before executing the query, or dynamically, as the query is executed. The main advantage of the latter method is that the actual sizes of the intermediate relations are available to the query processor, thereby minimizing the probability of a bad choice. Its main drawback is that query optimization, which is an expensive task, must be repeated for each and every query. So, hybrid optimization may be better in some situations.
o Statistics: The effectiveness of query optimization is based on statistics on the database. Dynamic query optimization requires statistics in order to choose the operation that has to be done first. Static query optimization requires statistics

to estimate the sizes of intermediate relations. The accuracy of the statistics can be improved by periodical updating.
o Decision sites: Most systems use a centralized decision approach, in which a single site generates the strategy. However, the decision process could also be distributed among the various sites participating in the elaboration of the best strategy. The centralized approach is simpler but requires knowledge of the complete distributed database, whereas the distributed approach requires only local information. A hybrid approach, where the major decisions are taken at one particular site and the other decisions are taken locally, can be better.
o Exploitation of the Network Topology: The distributed query processor exploits the network topology. With wide area networks, the cost function to be minimized can be restricted to the data communication cost, which is the dominant factor. This simplifies the work of distributed query optimization, which can then be dealt with as two separate problems: selection of the global execution strategy, based on inter-site communication, and selection of each local execution strategy, based on centralized query processing algorithms. With local area networks, communication costs are comparable to I/O costs; therefore, it is reasonable for the distributed query processor to increase parallel execution at the cost of increased communication.
o Exploitation of Replicated Fragments: For reliability purposes it is useful to have fragments replicated at different sites. Query processors have to exploit this information, either statically or dynamically, in order to process the query efficiently.
o Use of Semi-joins: The semi-join operation reduces the size of the data exchanged between sites, so that the communication cost can be reduced.
6.5 Layers of Query Processing: The problem of query processing can itself be decomposed into several subproblems, corresponding to various layers. In figure 6.2 a generic layering scheme for query processing is shown, where each layer solves a well-defined subproblem. The input is a query on distributed data expressed in relational calculus. This distributed query is posed on global (distributed) relations, meaning that data distribution is hidden. Four main layers are involved in mapping the distributed query into an optimized sequence of local

operations, each acting on a local database. These layers perform the functions of query decomposition, data localization, global query optimization and local query optimization. The first three layers are performed by a central site and use global information; the fourth is performed by the local sites.
Calculus query on distributed relations
   QUERY DECOMPOSITION   (uses the GLOBAL SCHEMA)            - control site
Algebraic query on distributed relations
   DATA LOCALIZATION     (uses the FRAGMENT SCHEMA)          - control site
Fragment query
   GLOBAL OPTIMIZATION   (uses STATISTICS ON FRAGMENTS)      - control site
Optimized fragment query with communication operations
   LOCAL OPTIMIZATION    (uses the LOCAL SCHEMAS)            - local sites
Optimized local queries
Figure 6.2 Generic Layering Scheme for Distributed Query Processing

6.5.1 Query Decomposition: The first layer decomposes the distributed calculus query into an algebraic query on global relations. The information needed for this transformation is found in the global conceptual schema describing the global relations. However, the information about data distribution is not used here but in the next layer. Thus the techniques used by this layer are those of a centralized DBMS.
Query decomposition can be viewed as four successive steps:
o The calculus query is rewritten in a normalized form that is suitable for subsequent manipulation. Normalization of a query generally involves the manipulation of the query quantifiers and of the query qualification by applying logical operator priority.
o The normalized query is analyzed semantically so that incorrect queries are detected and rejected as early as possible. Techniques to detect incorrect queries exist only for a subset of relational calculus. Typically, they use some sort of graph that captures the semantics of the query.
o The correct query (still expressed in relational calculus) is simplified. One way to simplify a query is to eliminate redundant predicates.
o The calculus query is restructured as an algebraic query. The quality of an algebraic query is defined in terms of expected performance. The traditional way to do this transformation toward a "better" algebraic specification is to start with an initial algebraic query and transform it in order to find a "good" one. The initial algebraic query is derived immediately from the calculus query by translating the predicates and the target statement into relational operations as they appear in the query. This directly translated algebraic query is then restructured through transformation rules. The algebraic query generated by this layer is good in the sense that the worst executions are avoided.
6.5.2 Data Localization: The input to the second layer is an algebraic query on distributed relations. The main role of the second layer is to localize the query's data using the data distribution information. Relations are fragmented and stored in disjoint subsets, called fragments, each being stored at a different site. This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query.
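As a hedged illustration, reusing the fragments E1, E2, G1 and G2 defined earlier in this unit: localizing the query PJ ENAME (E JN ENO (SL RESP="Manager" G)) replaces each global relation by its reconstruction program, giving the fragment query

    PJ ENAME ((E1 UN E2) JN ENO (SL RESP="Manager" (G1 UN G2)))

Because E1 and G1 (and likewise E2 and G2) are defined on the same ENO ranges, the cross-fragment joins between E1 and G2 and between E2 and G1 are empty, so the fragment query can be simplified to

    PJ ENAME ((E1 JN ENO SL RESP="Manager" G1) UN (E2 JN ENO SL RESP="Manager" G2))

which corresponds to strategy A of the earlier example.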


6.5.2 Data Localization: The input to the second layer is an algebraic query on distributed relations. The main role of the second layer is to localize the query's data using data distribution information. Relations are fragmented and stored in disjoint subsets called fragments, each stored at a different site. This layer determines which fragments are involved in the query and transforms the distributed query into a fragment query. Fragmentation is defined through fragmentation rules that can be expressed as relational operations. A distributed relation can be reconstructed by applying the fragmentation rules and then deriving a program, called a localization program, of relational algebra operations which act on the fragments. Generating a fragment query is done in two steps.
o The distributed query is mapped into a fragment query by substituting each distributed relation by its reconstruction program (also called a materialization program).
o The fragment query is simplified and restructured to produce another "good" query. Simplification and restructuring may be done according to the same rules used in the decomposition layer. As in the decomposition layer, the final fragment query is generally far from optimal because information regarding fragments is not utilized.
6.5.3 Global Query Optimization: The input to the third layer is a fragment query, that is, an algebraic query on fragments. The goal of query optimization is to find an execution strategy for the query which is close to optimal. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites. The previous layers have already optimized the query, for example by eliminating redundant expressions; however, this optimization is independent of fragment characteristics such as cardinalities. In addition, communication operations are not yet specified. By permuting the ordering of operations within one fragment query, many equivalent queries may be found. Query optimization consists of finding the "best" ordering of operations in the fragment query, including communication operations, which minimizes a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost and so on. An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operations is the semi-join operator. The main value of the semi-join in a distributed system is to reduce the size of the join operands and thereby the communication cost. The output of the query optimization layer is an optimized algebraic query with communication operations included on fragments. The sketch below illustrates the semi-join reduction.
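To make the role of the semi-join concrete, the following sketch uses made-up relations (EMP at site 1, ASG at site 2, joined on ENO) and made-up sizes; it is only an illustration of the reduction idea, not a query processor.

emp_site1 = [{"ENO": e, "ENAME": "emp%d" % e} for e in range(1000)]
asg_site2 = [{"ENO": e, "PNO": e % 7} for e in (3, 42, 999)]

# Step 1: ship only the join attribute of ASG from site 2 to site 1 (small).
eno_values = {t["ENO"] for t in asg_site2}

# Step 2: at site 1, keep only EMP tuples that will participate in the join
# (the semi-join of EMP by ASG), and ship this reduced relation to site 2.
emp_reduced = [t for t in emp_site1 if t["ENO"] in eno_values]

# Step 3: perform the join at site 2 on the reduced operand.
result = [dict(e, **a) for e in emp_reduced for a in asg_site2 if e["ENO"] == a["ENO"]]

print(len(emp_site1), "tuples would be shipped without the semi-join,")
print(len(emp_reduced), "tuples are shipped with it.")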


6.5.4 Local Query Optimization: The last layer is performed by all the sites having fragments involved in the query. Each sub-query executing at one site, called a local query, is optimized using the local schema of that site. At this time, the algorithms to perform the relational operations may be chosen. Local optimization uses the algorithms of centralized systems.
Check Your Progress: Answer the following:
a) What is a query processor?
b) State the query processing problem.
c) Explain the different characteristics of query processors.
d) Describe the layer architecture of query processing.
e) Discuss query optimization.
6.6 Summary: In this unit we have provided an overview of query processing in distributed DBMSs. The following points were discussed:
We have introduced the function and objectives of query processing. The goal of query processing is: given a calculus query on a distributed database, find a corresponding execution strategy that minimizes a system cost function which includes I/O, CPU, and communication costs. An execution strategy is specified in terms of relational algebra operations and communication primitives applied to the local databases.
We have described a characterization of query processors based on their implementation choices. This is useful for comparing alternative query processor designs and for understanding the trade-offs between efficiency and complexity.
We have proposed a generic layering scheme for describing distributed query processing. Here four main functions have been isolated: query decomposition, data localization, global query optimization, and local query optimization.

UNIT - 7 TRANSACTION MANAGEMENT AND CONCURRENCY CONTROL
Structure
7.0 Objectives
7.1 Introduction
7.2 Transaction Management
7.2.1 A Framework for Transaction Management
7.2.1.1 Transaction's Properties
7.2.1.2 Transaction Management Goals
7.2.1.3 Distributed Transactions
7.2.2 Atomicity of Distributed Transactions
7.2.2.1 Recovery in Centralized Systems
7.2.2.2 Problems with respect to Communication in Distributed Databases
7.2.2.3 Recovery of Distributed Transactions
7.2.2.4 The Transaction Control
7.3 Concurrency Control for Distributed Transactions
7.3.1 Concurrency Control Based on Locking in Centralized Databases
7.3.2 Concurrency Control Based on Locking in Distributed Databases
7.4 Summary

7.0 Objectives: At the end of this unit you will be able to:
Describe the different problems in managing distributed transactions.
Discuss recovery of distributed transactions.
Explain a popular algorithm called the 2-Phase Commit Protocol.
Discuss the aspects of concurrency control in distributed transactions.
7.1 Introduction: The management of distributed transactions means dealing with interrelated problems such as reliability, concurrency control and the efficient utilization of the resources of the complete system. In this unit we consider the well-known protocols, the 2-Phase commit protocol for recovery and 2-phase locking for concurrency control. All these aspects are discussed under the sections given in the structure above.
7.2 Transaction Management: The following sections deal with the problems of transactions in both centralized and distributed systems. The properties of transactions and the various goals of transaction management are discussed, and the recovery problems in both centralized and distributed transactions are analyzed.
7.2.1 A Framework for Transaction Management: Here we define the properties of transactions, state the goals of distributed transaction management and describe the architecture of a distributed transaction.
7.2.1.1 Transaction's Properties: A transaction is an application, or part of an application, that is characterized by the following properties.
Atomicity: Either all or none of the transaction's operations are performed. This requires that if a transaction is interrupted by a failure, its partial results are not taken into consideration at all and the whole operation has to be repeated. The two types of problems that prevent a transaction from completing are:
o Transaction aborts: An abort may be requested by the transaction itself, because some of its inputs are wrong or because it has been estimated that the results produced would be useless; it may also be forced by the system for its own reasons. The activity of ensuring atomicity in the presence of transaction aborts is called Transaction recovery.
o System crashes: These are catastrophic events that crash the system without any prior warning. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
The successful completion of a transaction is called Commit. The possible execution patterns of a transaction are therefore:
Begin_Transaction ... Commit
Begin_Transaction ... Abort
Begin_Transaction ... System Forces Abort
Durability: Once a transaction is committed, the system must guarantee that the results of its operations will never be lost, independent of subsequent failures. The activity of providing durability of transactions is called Database recovery.
Serializability: If several transactions execute concurrently, the result must be the same as if they were executed serially in some order. The activity of providing serializability of transactions is called Concurrency control.
Isolation: This property states that an incomplete transaction cannot disclose its results to other transactions until it is committed. This property has to be strictly followed to avoid a problem called Cascading Aborts (Domino Effect): if partial results are disclosed, all the transactions that have observed them have to be aborted.
These properties have to be fulfilled for transactions to execute correctly. In the next section we see why transaction management is an important aspect.
7.2.1.2 Transaction Management Goals: After knowing the required properties of transactions, let us see what the real goals of transaction management are. The goal of transaction management in a distributed database is to control the execution of transactions so that:
1. Transactions have the atomicity, durability, serializability, and isolation properties.
2. Their cost in terms of main memory, CPU, and number of transmitted control messages, and their response time, are minimized.
3. The availability of the system is maximized.
The second point concerns the efficiency of transactions. Let us discuss the second and third points in detail, as the first point has already been dealt with in the previous subsection.
CPU and main memory utilization: This is a common aspect of both centralized and distributed databases. In the case of concurrent transactions, the CPU and main memory must be properly scheduled and managed by the operating system; otherwise they become a bottleneck when the number of concurrent transactions is large.
Control messages and their response time: Control messages do not carry any useful data; they are used only to control the execution of transactions. There should therefore be as little exchange of such messages between sites as possible, since otherwise the communication cost increases unnecessarily. Another important aspect is the response time of each individual transaction, which should be as small as possible for good system performance. This is particularly crucial because, in a distributed system, additional time is required for communication between different sites.
Availability: This has to be discussed keeping the failure of sites in mind. The algorithms implemented by the transaction manager must bypass a site which is not operational and provide access to another site so that the request can still be serviced.
After studying the goals of transaction management in detail, in the coming section we suggest an appropriate model for distributed transactions.
7.2.1.3 Distributed Transactions: A transaction is a part of an application. Once an application issues the begin_transaction primitive, all actions performed by the application, until a commit or abort primitive is issued, are considered as one complete transaction. Let us now discuss a model for distributed transactions and some related terminology.
Agents: An agent is a local process which performs several functions on behalf of an application. In order to cooperate in the execution of the global operation required by the application, the agents have to communicate. As they are resident at different sites, the communication between the agents is performed through messages. There are various methods of organizing the agents to build a structure of cooperating processes; in this model we make a simple assumption about the method, which is discussed in detail in the next section.
Root agent: There exists a root agent which starts the whole transaction, so that when the user requests the execution of an application, the root agent is started; the site of the root agent is called the Site of Origin of the transaction.

The root agent has the responsibility of issuing the begin_transaction, commit and abort primitives. Only the root agent can request the creation of a new agent.

Finally, to summarize, the distributed transaction model consists of a root agent that has initiated the transaction and a number of agents, depending upon the application, which work concurrently. All the primitives are executed by the root agent, and they are not local to the site of origin but also affect all the agents of the transaction.
A case study of a distributed transaction: A distributed transaction includes one or more statements that, individually or as a group, update data on two or more distinct nodes of a distributed database. For example, assume the database configuration depicted in fig 7.1.

Fig. 7.1 Distributed System

The following distributed transaction executed by scott updates the local sales database, the remote hq database, and the remote maint database:

UPDATE scott.dept@hq.us.acme.com
  SET loc = 'REDWOOD SHORES'
  WHERE deptno = 10;
UPDATE scott.emp
  SET deptno = 11
  WHERE deptno = 10;
UPDATE scott.bldg@maint.us.acme.com
  SET room = 1225
  WHERE room = 1163;
COMMIT;

Example: Let us consider an example of a distributed transaction: the "fund transfer" operation between two accounts. A global relation FUND_TRAN (ACC_NUM, AMOUNT) is used to manage this application. The application starts by reading from the terminal the amount that has to be transferred, the account number from which the amount must be taken and the one to which it must be credited. Then the application issues a begin_transaction primitive, and the usual operations follow from this point onwards. In a centralized environment, the transaction code in fig 7.2 narrates the whole process. If we assume that the accounts are distributed at different sites of a network, like the branches of a bank, at execution time several cooperating processes perform the transaction. For example, in the transaction code of fig 7.3 two agents are shown, one of which is the root agent. Here we assume that the "from" account is located at the root agent's site and that the "to" account is located at a different site, where AGENT1 is executed. When the root agent wants to perform the transaction, it executes the primitive Create AGENT1 and then sends the parameters to AGENT1. The root agent also issues the begin_transaction, commit and abort primitives. All these operations are carried out preserving the properties of distributed transactions discussed in the previous section.

FUND_TRANSFER:
  Read(terminal, $AMOUNT, $FROM_ACC, $TO_ACC);
  Begin_transaction;
  Select AMOUNT into $FROM_AMOUNT from ACC
    where ACC_NUM = $FROM_ACC;
  if $FROM_AMOUNT - $AMOUNT < 0 then abort
  else begin
    Update ACC
      set AMOUNT = AMOUNT - $AMOUNT
      where ACC_NUM = $FROM_ACC;
    Update ACC
      set AMOUNT = AMOUNT + $AMOUNT
      where ACC_NUM = $TO_ACC;
    Commit
  end

Fig 7.2 The FUND_TRANSFER transaction in the centralized environment

ROOT_AGENT:
  Read(terminal, $AMOUNT, $FROM_ACC, $TO_ACC);
  Begin_transaction;
  Select AMOUNT into $FROM_AMOUNT from ACC
    where ACC_NUM = $FROM_ACC;
  if $FROM_AMOUNT - $AMOUNT < 0 then abort
  else begin
    Update ACC
      set AMOUNT = AMOUNT - $AMOUNT
      where ACC_NUM = $FROM_ACC;
    Create AGENT1;

    Send to AGENT1($AMOUNT, $TO_ACC);
    Commit
  end
AGENT1:
  Receive from ROOT_AGENT($AMOUNT, $TO_ACC);
  Update ACC
    set AMOUNT = AMOUNT + $AMOUNT
    where ACC_NUM = $TO_ACC

Fig 7.3 The FUND_TRANSFER transaction in the distributed environment

7.2.2 Atomicity of Distributed Transactions: In this section we examine the concept of atomicity, which is a required property of distributed transactions. We first discuss recovery techniques in centralized systems, then deal with the possible communication failures in distributed transactions, and then concentrate on the recovery procedures followed for distributed transactions. Finally, the section presents a distributed commitment algorithm called the 2-Phase Commit Protocol, which takes care of all the properties expected of distributed transactions.
7.2.2.1 Recovery in Centralized Systems: The recovery mechanism is important for restoring the system to normal database operation after a failure. Thus, before discussing recovery, let us first analyze the different kinds of failure that can occur in a centralized database.

Failures in Centralized Systems: Failures are classified as follows:
Failures without loss of information: In these failures, all the information stored in memory is available for the recovery. An example of this type of failure is the abort of a transaction because an error condition, such as an overflow, is discovered.
Failures with loss of volatile storage: Here the content of main memory is lost; the information recorded on the disks, of course, is not affected. Examples are system crashes.
Failures with loss of non-volatile storage: Such failures are called media failures; here the contents of the disks are also lost. Typical examples are head crashes. The probability of such failures is much smaller than that of the other two types. However, it is possible to make this probability still smaller by keeping the same information on several disks. This idea is the basis for the concept of Stable Storage. A strategy known as Careful Replacement is used for this purpose: at every update operation, first one copy of the information is updated, then the correctness of the update is verified, and finally the remaining copies are updated.
Failures with loss of stable storage: In this type, some information stored in stable storage is lost because of multiple simultaneous failures of the storage disks. This probability cannot be reduced to zero, but it can be made very small.
Logs: The log is the basic technique for handling transactions in the presence of failures. A log contains information for undoing or redoing all the actions performed by transactions. To undo the actions of a transaction means to cancel the performed operations and restore the state that preceded them. Undoing is necessary for a transaction which fails before commitment: if commitment is not possible, the database must remain the same as if the transaction were not executed at all, hence partial actions must be undone. To redo the actions of a transaction means to perform them again. Redoing is necessary when, for instance, a failure with loss of volatile storage occurs before the updates of already committed transactions have been completely written to stable storage; in this case the actions of committed transactions must be redone to restore the database. A log record contains the required information for redoing or undoing actions. Whenever a transaction is performed on the database, a log record is written in the log file. The log record includes:
The identifier of the transaction
The identifier of the record
The type of action
The old record value
The new record value
Auxiliary information for the recovery procedure, such as a pointer to the previous log record of the same transaction.
Also, when a transaction is started, committed, or aborted, a begin_transaction, commit, or abort record is written in the log.
The Log Write-Ahead Protocol: Writing a database record and writing the corresponding log record are two separate operations. In case of a failure, if the database update were performed before writing the log record, the recovery procedure would be unable to undo the update, as the corresponding log record would not be available. In order to avoid this, the log write-ahead protocol is used. It consists of two rules:
At least the undo portion of the log record must already have been recorded on stable storage before the corresponding database update is performed.
All log records of the transaction must already have been recorded on stable storage before the transaction is committed.
A minimal sketch of these rules is given below.
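The following is a minimal sketch of the write-ahead rules, with an in-memory list standing in for the log on stable storage and a dictionary standing in for the database pages; all names are invented for this illustration.

stable_log = []          # stands in for the log on stable storage
database = {"x": 100}    # stands in for the database pages

def write(tid, item, new_value):
    # Rule 1: record the undo (old value) and redo (new value) information first ...
    stable_log.append({"tid": tid, "item": item,
                       "old": database[item], "new": new_value})
    # ... and only then apply the update to the database.
    database[item] = new_value

def commit(tid):
    # Rule 2: the commit record follows all of the transaction's log records.
    stable_log.append({"tid": tid, "commit": True})

write("T1", "x", 80)
commit("T1")
print(stable_log, database)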

Recovery Procedures: We have already seen the various possibilities of failure in centralized systems and the importance of the log record in the recovery procedure. Now let us see how recovery of the database is done and the steps to be followed. If the failure is due to the loss of volatile storage, the recovery procedure reads the log and performs the following operations:
o Step 1: Determine all non-committed transactions that have to be undone. These can be recognized easily, as they have a begin_transaction record in the log file without a corresponding commit or abort record.
o Step 2: Determine all transactions that have to be redone, i.e. all the transactions that have a commit record in the log file. In order to differentiate the transactions that have to be redone from the ones which are already safely recorded in stable storage, Checkpoints are used. (Checkpoints are operations that are periodically performed in order to simplify Step 1 and Step 2.)
o Step 3: Undo the transactions determined in Step 1 and redo the transactions determined in Step 2.
The checkpoint requires the following operations to be performed:
o All log records and database updates which are still in volatile storage are written to stable storage.
o A checkpoint record is written to stable storage. A checkpoint record in the log contains the indication of the transactions that are active at the time when the checkpoint is taken (an active transaction is one whose begin_transaction record belongs to the log but which has no commit or abort record yet).
The usage of checkpoints modifies Step 1 and Step 2 of the recovery procedure as follows:
o Find and read the last checkpoint record.
o Put all transactions written in the checkpoint record into the undo set, which contains the transactions to be undone. The redo set, which contains the transactions to be redone, is initially empty.
o Read the log file starting from the checkpoint record until the end. If a begin_transaction record is found, put the corresponding transaction into the undo set. If a commit record is found, move the corresponding transaction from the undo set to the redo set.
From the above discussion we can say that only the latest portion of the log must be kept online, whereas the remaining part can be kept in stable storage. A small sketch of this checkpoint-based analysis is given below.
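The following compact sketch builds the undo and redo sets by scanning a log from the last checkpoint; the log format (simple tuples) and the sample records are invented for this illustration.

# Log records are (kind, transaction_id); the checkpoint record carries the
# list of transactions active when it was taken.  The sample log is made up.
log = [("begin", "T1"), ("checkpoint", ["T1"]), ("begin", "T2"),
       ("commit", "T1"), ("begin", "T3")]

def analyze(log):
    last_cp = max(i for i, rec in enumerate(log) if rec[0] == "checkpoint")
    undo = set(log[last_cp][1])        # transactions active at the checkpoint
    redo = set()
    for kind, tid in log[last_cp + 1:]:
        if kind == "begin":
            undo.add(tid)
        elif kind == "commit":
            undo.discard(tid)
            redo.add(tid)
    return undo, redo

print(analyze(log))    # undo = {'T2', 'T3'}, redo = {'T1'}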

So far we have only considered failures of volatile storage. Failures with the loss of stable storage can be studied by considering two possibilities:
o Failures in which database information is lost, but logs are safe: In this case, recovery is done by performing a redo of all committed transactions using the log. Before redoing the transactions, the database is first reset to a Dump, which is an image of a previous state stored on tape storage (of course, this is a lengthy process).
o Failures where the log information itself is lost: This is a catastrophic event after which complete recovery is no longer possible. In this case, the database is re-established to a recent available state by resetting it to the last dump and using the portion of the log that is not damaged.
The principles that we have seen are sufficient for understanding the recovery procedures of distributed databases, which we examine next.
7.2.2.2 Problems with respect to Communication in Distributed Databases: Recovery mechanisms for distributed transactions require knowledge of the possible communication failures between the sites of a distributed system. Let us assume one communication model and, on that basis, estimate the different possibilities of errors. When a message is transferred from site A to site B, we require the following from a communication network:
i. A receives a positive acknowledgement after a delay that is less than some maximum value MAX.
ii. The message is delivered at B in proper sequence with respect to other A-to-B messages.
iii. The message is error free.
These specifications may not be met because of different types of errors, such as missing acknowledgements or late arrival of acknowledgements. These can be eliminated by adding appropriate design features, so that we can assume that:
o Once the message is delivered at B, the message is error free and in sequence with respect to other received messages.
o If A receives an acknowledgement, then the message has been delivered.
The two remaining communication errors are Lost messages and Network partitions. If the acknowledgement for a message has not been received within some predefined interval, called a timeout, then the source has to assume one of the above errors and take the corresponding steps.
Multiple Failures and K-resiliency: Failures do not necessarily occur one at a time. A system which can tolerate K failures is called K-resilient. In distributed databases, this concept is applied to site failures and/or partitions. With respect to site failures, an algorithm is said to be K-resilient if it works properly even if K sites are down. An extreme case of failure is called Total Failure, where all sites are down.
7.2.2.3 Recovery of Distributed Transactions: Now let us consider recovery problems in distributed databases. For this purpose, let us assume that at each site a Local Transaction Manager (LTM) is available. Each agent can issue begin_transaction, commit, and abort primitives to its LTM. After having issued a begin_transaction to its LTM, an agent possesses the properties of a local transaction. We will call an agent that has issued a begin_transaction primitive to its local transaction manager a Sub-transaction. Also, to distinguish the begin_transaction, commit, and abort primitives of the distributed transaction from the local primitives issued by each agent to its LTM, we will call the latter local_begin, local_commit, and local_abort. For building a Distributed Transaction Manager (DTM), the following properties are expected from the LTM:
Ensuring the atomicity of a sub-transaction
Writing some records on stable storage on behalf of the distributed transaction manager
The second requirement is needed because some additional information must also be recorded in such a way that it can be recovered in case of failure. In order to make sure that either all actions of a distributed transaction are performed or none is performed at all, two conditions are necessary:
At each site, either all actions are performed or none is performed
All sites must take the same decision with respect to the commitment or abort of sub-transactions.
Fig. 7.4 shows a reference model of distributed transaction recovery.

[Figure 7.4 depicts the reference model of distributed transaction recovery: the distributed transaction (root agent and agents) communicates with DTM agents, which together form the Distribution Transaction Manager (DTM); the DTM agents in turn use the Local Transaction Managers (LTMs) at sites i and j. Interface 1, between the DTM and the LTMs, carries Local_begin, Local_commit, Local_abort and Local_create; Interface 2, between the application agents and the DTM, carries Begin_transaction, Commit, Abort and Create.]

Figure 7.4 A reference model of distributed transaction recovery

Begin_transaction: When it is issued by the root agent, the DTM has to issue a local_begin primitive to the LTM at the site of origin and at all the sites where there are already active agents of the same application, thus transforming all agents into sub-transactions; from this time on, the activation of a new agent by the same distributed transaction requires that a local_begin be issued to the LTM where the agent is activated, so that the new agent is created as a sub-transaction. The FUND_TRANSFER example (refer to fig 7.2 and fig 7.3) is used to explain this concept; fig 7.5 illustrates it.
Abort: When an abort is issued by the root agent, all existing sub-transactions must be aborted. This is performed by issuing local_aborts to the LTMs at all sites where there is an active sub-transaction.

Commit: The implementation of the commit primitive is the most difficult and expensive. The main difficulty originates from the fact that the correct commitment of a distributed transaction requires that all sub-transactions commit locally even if there are failures. In order to implement this primitive for a distributed transaction, the 2-Phase Commit Protocol has been developed; it is discussed in detail in the next section.
Example: Figure 7.5 shows the primitives and messages issued by the various components of the reference model for the execution of the FUND_TRANSFER application already explained. The numbers indicate the various actions and their order of execution:
1. Issuing of the Begin_transaction primitive to the DTM agent.
2. Issuing of the Local_begin primitive to the LTM.
3. Issuing of the Create AGENT1 primitive to the DTM agent.
4. Sending of Create requests to the other DTM agents.
5. Issuing of Local_create primitives to the LTMs.
6. Depending on the number of local transactions required, that many agents are created in a loop.
7. The local transactions then begin.
8. Finally, the communications required for committing or aborting the transaction take place between the ROOT AGENT and the AGENTs.

[Figure 7.5 shows the root agent and AGENT1 with their DTM agents and LTMs, together with the numbered primitives and messages exchanged during the first part of the FUND_TRANSFER transaction: Begin_transaction and Create AGENT1 issued by the root agent, Local_begin and Local_create issued to the LTMs, the Create request and the Send to AGENT1 / Receive messages exchanged between the DTM agents, and the begin_transaction records written in the local logs.]

Figure 7.5 Actions and messages during the first part of the FUND_TRANSFER transaction

7.2.2.4 The Transaction Control Statements: The following transaction control statements are supported:
COMMIT
ROLLBACK (ABORT)
SAVEPOINT
Session Trees for Distributed Transactions:

As the statements in a distributed transaction are issued, the distributed transaction mechanism defines a session tree of all nodes participating in the transaction. A session tree is a hierarchical model that describes the relationships among sessions and their roles. Fig 7.6 illustrates a session tree. All nodes participating in the session tree of a distributed transaction assume one or more of the following roles:
Client: A node that references information in a database belonging to a different node.
Database server: A node that receives a request for information from another node.
Global coordinator: The node that originates the distributed transaction.
Local coordinator: A node that is forced to reference data on other nodes to complete its part of the transaction.
Commit point site: The node that commits or rolls back the transaction as instructed by the global coordinator.

Fig.7.6 Example of a Session Tree

The role a node plays in a distributed transaction is determined by:
o Whether the transaction is local or remote
o The commit point strength of the node
o Whether all requested data is available at a node, or whether other nodes need to be referenced to complete the transaction
o Whether the node is read-only

Clients: A node acts as a client when it references information from another node's database; the referenced node is a database server. In Fig 7.6, the node sales is a client of the nodes that host the warehouse and finance databases.
Database Servers: A database server is a node that hosts a database from which a client requests data. In Fig 7.6, an application at the sales node initiates a distributed transaction that accesses data from the warehouse and finance nodes. Therefore, sales.acme.com has the role of client node, and warehouse and finance are both database servers. In this example, sales is both a database server and a client, because the application also modifies data in the sales database.
Local Coordinators: A node that must reference data on other nodes to complete its part in the distributed transaction is called a local coordinator. In Fig 7.6, sales is a local coordinator because it coordinates the nodes it directly references: warehouse and finance. The node sales also happens to be the global coordinator, because it coordinates all the nodes involved in the transaction. A local coordinator is responsible for coordinating the transaction among the nodes it communicates directly with by:
o Receiving and relaying transaction status information to and from those nodes
o Passing queries to those nodes
o Receiving queries from those nodes and passing them on to other nodes
o Returning the results of queries to the nodes that initiated them
Global Coordinator: The node where the distributed transaction originates is called the global coordinator. The database application issuing the distributed transaction is directly connected to the node acting as the global coordinator. For example, in Fig 7.6 the transaction issued at the node sales references information from the database servers warehouse and finance; therefore, sales.acme.com is the global coordinator of this distributed transaction. The global coordinator becomes the parent, or root, of the session tree. The global coordinator performs the following operations during a distributed transaction:
o Sends all of the distributed transaction's SQL statements, remote procedure calls, and so forth to the directly referenced nodes, thus forming the session tree
o Instructs all directly referenced nodes other than the commit point site to prepare the transaction
o Instructs the commit point site to initiate the global commit of the transaction if all nodes prepare successfully
o Instructs all nodes to initiate a global abort of the transaction if there is an abort response
Commit Point Site: The job of the commit point site is to initiate a commit or roll back (abort) operation as instructed by the global coordinator. The system administrator always designates one node to be the commit point site in the session tree by assigning a commit point strength to every node. The node selected as commit point site should be the node that stores the most critical data. Fig 7.7 illustrates an example of a distributed system, with sales serving as the commit point site.
The commit point site is distinct from all other nodes involved in a distributed transaction in these ways:

o The commit point site never enters the prepared state. Consequently, if the commit point site stores the most critical data, this data never remains in-doubt, even if a failure occurs. In failure situations, failed nodes remain in a prepared state, holding necessary locks on data until the in-doubt transactions are resolved.
o The commit point site commits before the other nodes involved in the transaction. In effect, the outcome of a distributed transaction at the commit point site determines whether the transaction at all nodes is committed or rolled back: the other nodes follow the lead of the commit point site. The global coordinator ensures that all nodes complete the transaction in the same manner as the commit point site.

Figure 7.7 Commit Point Site

How does a Distributed Transaction Commit? A distributed transaction is considered committed after all non-commit-point sites are prepared and the transaction has actually been committed at the commit point site. The online redo log at the commit point site is updated as soon as the distributed transaction is committed at this node. Because the commit point log contains a record of the commit, the transaction is considered committed even though some participating nodes may still be only in the prepared state and the transaction may not yet actually be committed at these nodes. In the same way, a distributed transaction is considered not committed if the commit has not been logged at the commit point site.
Commit Point Strength: Every database server must be assigned a commit point strength. If a database server is referenced in a distributed transaction, the value of its commit point strength determines which role it plays in the two-phase commit. Specifically, the commit point strength determines whether a given node is the commit point site in the distributed transaction and thus commits before all of the other nodes. This value is specified using the initialization parameter COMMIT_POINT_STRENGTH. The commit point site, which is determined at the beginning of the prepare phase, is selected only from the nodes participating in the transaction. The following sequence of events occurs:

o Of the nodes directly referenced by the global coordinator, the software selects the node with the highest commit point strength as the commit point site.
o The initially selected node determines whether any of the nodes from which it has to obtain information for this transaction has a higher commit point strength.
o Either the node with the highest commit point strength directly referenced in the transaction, or one of its servers with a higher commit point strength, becomes the commit point site.
After the final commit point site has been determined, the global coordinator sends prepare messages to all nodes participating in the transaction. A small sketch of the selection rule is given below.
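The following minimal sketch shows only the first step of the selection, the choice among the directly referenced nodes; the node names and strength values are invented for this illustration and do not come from any real configuration.

# Pick, among the directly referenced nodes, the one with the highest
# commit point strength (values are illustrative only).
commit_point_strength = {"sales": 100, "warehouse": 75, "finance": 50}

def choose_commit_point_site(directly_referenced):
    return max(directly_referenced, key=commit_point_strength.get)

print(choose_commit_point_site(["sales", "warehouse", "finance"]))   # sales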

Fig 7.6 shows, in a sample session tree, the commit point strengths of each node (in parentheses) and the node chosen as the commit point site. The following conditions apply when determining the commit point site:
o A read-only node cannot be the commit point site.
o If multiple nodes directly referenced by the global coordinator have the same commit point strength, then the software designates one of these as the commit point site.
o If a distributed transaction ends with an abort, then the prepare and commit phases are not needed; consequently, the software never determines a commit point site. Instead, the global coordinator sends an ABORT statement to all nodes and ends the processing of the distributed transaction.
As Fig 7.8 illustrates, the commit point site and the global coordinator can be different nodes of the session tree. The commit point strength of each node is communicated to the coordinators when the initial connections are made. The coordinators retain the commit point strengths of each node they are in direct communication with, so that commit point sites can be efficiently selected during two-phase commits. It is therefore not necessary for the commit point strength to be exchanged between a coordinator and a node each time a commit occurs.

Fig 7.8 The commit point site and the global coordinator
Two-Phase Commit Mechanism: Unlike a transaction on a local database, a distributed transaction involves altering data on multiple databases. Consequently, distributed transaction processing is more complicated, because the system must coordinate the committing or rolling back of the changes in a transaction as a self-contained unit; in other words, the entire transaction commits, or the entire transaction rolls back (aborts). The software ensures the integrity of data in a distributed transaction using the two-phase commit mechanism. In the prepare phase, the initiating node in the transaction asks the other participating nodes to promise to commit or roll back the transaction. During the commit phase, the initiating node asks all participating nodes to commit the transaction; if this outcome is not possible, then all nodes are asked to roll back. All participating nodes in a distributed transaction should perform the same action: they should either all commit or all abort the transaction. The software automatically controls and monitors the commit or abort of a distributed transaction and maintains the integrity of the global database (the collection of databases participating in the transaction) using the two-phase commit mechanism. This mechanism is completely transparent, requiring no programming on the part of the user or application developer. The commit mechanism has the following distinct phases, which the software performs automatically whenever a user commits a distributed transaction:
Prepare phase: The initiating node, called the global coordinator, asks the participating nodes other than the commit point site to promise to commit or roll back the transaction, even if there is a failure. If any node cannot prepare, the transaction is rolled back.
Commit phase: If all participants respond to the coordinator that they are prepared, then the coordinator asks the commit point site to commit. After it commits, the coordinator asks all other nodes to commit the transaction.
Forget phase: The global coordinator forgets about the transaction.
Prepare Phase: The first phase in committing a distributed transaction is the prepare phase. In this phase, the system does not actually commit or roll back the transaction. Instead, all nodes referenced in the distributed transaction (except the commit point site) are told to prepare to commit. By preparing, a node:

o Records information in the online redo log so that it can subsequently either commit or roll back the transaction, regardless of intervening failures
o Places a distributed lock on modified tables, which prevents reads
When a node responds to the global coordinator that it is prepared to commit, the prepared node promises to either commit or roll back the transaction later, but does not make a unilateral decision on whether to commit or roll back the transaction. The promise means that if an instance failure occurs at this point, the node can use the redo records in the online log to recover the database back to the prepare phase.

Types of Responses in the Prepare Phase: When a node is told to prepare, it can respond in the following ways:
Prepared: Data on the node has been modified by a statement in the distributed transaction, and the node has successfully prepared.
Read-only: No data on the node has been, or can be, modified (only queried), so no preparation is necessary.
Abort: The node cannot successfully prepare.
Prepared Response: When a node has successfully prepared, it issues a prepared message. The message indicates that the node has records of the changes in the online log, so it is prepared either to commit or to perform an abort. The message also guarantees that locks held for the transaction can survive a failure.
Read-Only Response: When a node is asked to prepare, and the SQL statements affecting the database do not change the node's data, the node responds with a read-only message. The message indicates that the node will not participate in the commit phase. Note that if a distributed transaction is set to read-only, then it does not use abort segments. If many users connect to the database and their transactions are not set to READ ONLY, then they allocate abort space even if they are only performing queries.

Abort Response: When a node cannot successfully prepare, it performs the following actions:
o Releases resources currently held by the transaction and rolls back the local portion of the transaction.
o Responds to the node that referenced it in the distributed transaction with an abort message.
These actions then propagate to the other nodes involved in the distributed transaction so that they can roll back the transaction and guarantee the integrity of the data in the global database. This response enforces the primary rule of a distributed transaction: all nodes involved in the transaction either all commit or all roll back the transaction at the same logical time.
Steps in the Prepare Phase: To complete the prepare phase, each node excluding the commit point site performs the following steps:
o The node requests that its descendants, that is, the nodes subsequently referenced, prepare to commit.
o The node checks whether the transaction changes data on itself or on its descendants. If there is no change to the data, the node skips the remaining steps and returns a read-only response.
o The node allocates the resources it needs to commit the transaction if data is changed.
o The node saves redo records corresponding to changes made by the transaction to its online redo log.
o The node guarantees that locks held for the transaction are able to survive a failure.
o The node responds to the initiating node with a prepared response or, if its attempt or the attempt of one of its descendants to prepare was unsuccessful, with an abort response.
These actions guarantee that the node can subsequently commit or roll back the transaction. The prepared nodes then wait until a COMMIT or ABORT request is received from the global coordinator.
After the nodes are prepared, the distributed transaction is said to be in-doubt. It retains the in-doubt status until all changes are either committed or aborted.
Commit Phase: The second phase in committing a distributed transaction is the commit phase. Before this phase occurs, all nodes other than the commit point site referenced in the distributed transaction have guaranteed that they are prepared, that is, they have the necessary resources to commit the transaction.
Steps in the Commit Phase: The commit phase consists of the following steps:

o The global coordinator instructs the commit point site to commit.
o The commit point site commits.
o The commit point site informs the global coordinator that it has committed.
o The global and local coordinators send a message to all nodes instructing them to commit the transaction.
o At each node, the system commits the local portion of the distributed transaction and releases locks.
o At each node, the system records an additional redo entry in the local redo log, indicating that the transaction has committed.
o The participating nodes notify the global coordinator that they have committed.
A minimal coordinator-side sketch of the prepare and commit phases is given below.
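The following is a minimal coordinator-side sketch of the two phases just described. The Participant class, its methods and the direct function calls are stand-ins invented for this illustration; a real system would exchange messages over the network and write log records (prepare, global_commit/global_abort, complete) at every step.

class Participant:
    def __init__(self, name, can_prepare=True):
        self.name, self.can_prepare = name, can_prepare
    def prepare(self):
        return "prepared" if self.can_prepare else "abort"
    def commit(self):
        return "ack"
    def abort(self):
        return "ack"

def two_phase_commit(commit_point_site, others):
    # Prepare phase: every node except the commit point site must prepare.
    votes = [p.prepare() for p in others]
    if any(v == "abort" for v in votes):
        for p in others:
            p.abort()
        commit_point_site.abort()
        return "aborted"
    # Commit phase: the commit point site commits first, then the other nodes.
    commit_point_site.commit()
    acks = [p.commit() for p in others]
    # Forget phase: the coordinator can now discard its status information.
    return "committed" if all(a == "ack" for a in acks) else "in-doubt"

print(two_phase_commit(Participant("sales"),
                       [Participant("warehouse"), Participant("finance")]))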

Guaranteeing Global Database Consistency: Each committed transaction has an associated system change number (SCN) that uniquely identifies the changes made by the SQL statements within that transaction. The SCN functions as an internal system timestamp that uniquely identifies a committed version of the database. In a distributed system, the SCNs of communicating nodes are coordinated when all of the following actions occur:
o A connection occurs using the path described by one or more database links
o A distributed SQL statement executes
o A distributed transaction commits
During the prepare phase, the system determines the highest SCN at all nodes involved in the transaction. The transaction then commits with this high SCN at the commit point site. The commit SCN is then sent to all prepared nodes together with the commit decision.
Forget Phase: After the participating nodes notify the commit point site that they have committed, the commit point site can forget about the transaction. The following steps occur:
o After receiving notice from the global coordinator that all nodes have committed, the commit point site erases the status information about this transaction.
o The commit point site informs the global coordinator that it has erased the status information.
o The global coordinator erases its own information about the transaction.

Response of the 2-Phase Commit Protocol to Failures: The protocol is resilient to all failures in which no log information is lost. Its behaviour in the presence of failures is now discussed.
1. Site Failures:
o A participant fails before having written the ready record in the log. In this case, the coordinator's timeout expires and it takes the abort decision.
o A participant fails after having written the ready record in the log. In this case, the operational sites correctly terminate the transaction (abort or commit). When the failed site recovers, the restart procedure has to ask the coordinator or some other participant about the outcome of the transaction.
o The coordinator fails after having written the prepare record in the log, but before having written a global_commit or global_abort record in the log. In this case all participants who have already answered with a READY message must wait for the recovery of the coordinator. The restart procedure of the coordinator resumes the commit protocol from the beginning, reading the identities of the participants from the prepare record in the log and sending the PREPARE message to them again. Each ready participant must recognize that the new PREPARE message is a repetition of the previous one.
o The coordinator fails after having written a global_commit or global_abort record in the log, but before having written the complete record in the log. In this case, the coordinator must send the decision again to all participants; participants who have not received the command have to wait until the coordinator recovers.
o The coordinator fails after having written the complete record in the log. In this case, the transaction was already concluded, and no action is required at restart.
2. Lost Messages:
o An answer message (READY or ABORT) from a participant is lost. In this case the coordinator's timeout expires, and the whole transaction is aborted.
o A PREPARE message is lost. In this case, the participant remains in wait. The global result is the same as in the previous case, since the coordinator does not receive an answer.
o A command message (COMMIT or ABORT) is lost. In this case, the destination participant remains uncertain about the decision. This problem can be eliminated by having a timeout in the participant: if no command has been received within the timeout interval after the answer, a request for the repetition of the command is sent.
o An ACK message is lost. In this case, the coordinator remains uncertain about whether the participant has received the command message. This problem can be eliminated by introducing a timeout in the coordinator: if no ACK message is received within the timeout interval after the transmission of the command, the coordinator sends the command again.
3. Network Partitions: Let us suppose that a simple partition occurs, dividing the sites into two groups; the group containing the coordinator is called the coordinator group, the other the participant group. From the viewpoint of the coordinator, the partition is equivalent to the multiple failure of a set of participants, and the solution has already been discussed. From the viewpoint of the participants, the failure is equivalent to a coordinator failure, and the situation is again similar to a case already discussed. It has been observed that the recovery procedure for a site that is involved in processing a distributed transaction is more complex than that for a centralized database. A participant-side sketch of the timeout behaviour is given below.
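The following sketch illustrates only the two timeout rules mentioned above, the participant re-requesting the decision and the coordinator re-sending the command; the message "channels" are just Python lists and the run simulates one lost COMMIT message. All names are invented for this illustration.

to_participant = []      # commands travelling coordinator -> participant
to_coordinator = []      # ACKs / repeat requests travelling back

def coordinator_send_decision(decision, lost=False):
    if not lost:                                # a "lost" message never arrives
        to_participant.append(decision)

def participant_step(timeout_expired):
    if to_participant:
        decision = to_participant.pop(0)
        to_coordinator.append("ACK")
        return decision
    if timeout_expired:                         # no command yet: ask for a repetition
        to_coordinator.append("REQUEST_REPEAT")
    return None

coordinator_send_decision("COMMIT", lost=True)        # the command message is lost
print(participant_step(timeout_expired=True))         # None, and REQUEST_REPEAT is sent
coordinator_send_decision("COMMIT")                   # the coordinator repeats the command
print(participant_step(timeout_expired=False))        # COMMIT, and an ACK is sent back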

7.3 Concurrency Control for Distributed Transactions: In this section we discuss the fundamental problems which are due to the concurrent execution of transactions. We deal with concurrency control based on locking. First, the 2-phase-locking protocol in centralized databases is presented; then, 2-phase-locking is extended to distributed databases.
7.3.1 Concurrency Control Based on Locking in Centralized Databases: The basic idea of locking is that whenever a transaction accesses a data item, it locks it, and that a transaction which wants to lock a data item which is already locked by another transaction must wait until the other transaction has released the lock (unlock). Let us see some important terminology related to this concept:
Lock mode: A transaction locks a data item in one of the following modes:
o Shared mode: The transaction wants only to read the data item.
o Exclusive mode: The transaction wants to write the data item.
Well-formed transactions: A transaction is well-formed if it always locks a data item in shared mode before reading it, and always locks a data item in exclusive mode before writing it.
Compatibility rules between lock modes:
o A transaction can lock a data item in shared mode if it is not locked at all or if it is locked in shared mode by another transaction.
o A transaction can lock a data item in exclusive mode only if it is not locked at all.
Conflicts: Two transactions are in conflict if they want to lock the same data item in two incompatible modes; there are two types of conflict, read-write conflicts and write-write conflicts.
Granularity of locking: This term relates to the size of the objects that are locked with a single lock operation. In general, it is possible to lock at the record level (to lock individual tuples) or at the file level (to lock a whole fragment).
Concurrent transactions execute correctly if the following rules are followed:
o Transactions are well-formed
o Compatibility rules are observed
o Each transaction does not request new locks after it has released a lock.
A locking mechanism known as 2-phase locking, which incorporates the above principles, is normally used. According to this mechanism, each transaction has two separate phases:
Growing phase: A first phase during which new locks are acquired.
Shrinking phase: A second phase during which locks are only released.

We will simply assume that all transactions are performed according to the following scheme:
Begin application;
  Begin transaction;
  Acquire locks before reading or writing;
  Commit;
  Release locks;
End application;
In this way the transactions are well-formed, 2-phase locked and isolated.
Deadlock: A deadlock between two transactions arises if each transaction has locked a data item and is waiting to lock a different data item which has already been locked by the other transaction in a conflicting mode. Both transactions will wait forever in this situation, and system intervention is required to unblock it: the system must first detect the deadlock situation and then force one transaction to release its locks so that the other one can proceed, i.e. one transaction is aborted. This method is called deadlock detection. A minimal sketch of a 2-phase-locked lock manager is given below.
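The following is a minimal, centralized lock-manager sketch built around the compatibility rules above: shared locks are compatible only with shared locks, exclusive locks with nothing, and a transaction releases everything at commit (so it never acquires after releasing). Waiting is only reported, not implemented, and all names are invented for this illustration.

locks = {}    # item -> {"mode": "S" or "X", "holders": set of transaction ids}

def can_grant(item, mode, tid):
    entry = locks.get(item)
    if entry is None or entry["holders"] == {tid}:
        return True
    return mode == "S" and entry["mode"] == "S"

def lock(tid, item, mode):
    if not can_grant(item, mode, tid):
        return "wait"                        # the caller must block (or later abort)
    entry = locks.setdefault(item, {"mode": mode, "holders": set()})
    entry["holders"].add(tid)
    entry["mode"] = "X" if mode == "X" else entry["mode"]
    return "granted"

def commit(tid):
    # Shrinking phase: all locks of tid are released together at commit time.
    for item, entry in list(locks.items()):
        entry["holders"].discard(tid)
        if not entry["holders"]:
            del locks[item]

print(lock("T1", "x", "S"), lock("T2", "x", "S"), lock("T2", "x", "X"))   # granted granted wait
commit("T1")
print(lock("T2", "x", "X"))                                               # granted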


7.3.2 Concurrency Control Based on Locking in Distributed Databases: Let us now concentrate on distributed transaction concurrency control. Here we make the following assumptions: the local transaction managers (LTMs) can lock and unlock local data items, interpreting the local locking primitives local_lock_shared, local_lock_exclusive and local_unlock; the agents of a distributed transaction issue the global primitives lock_shared, lock_exclusive and unlock (see Figure 7.9). The most important result for distributed databases is the following: if distributed transactions are well-formed and 2-phase-locked, then 2-phase locking is a correct locking mechanism for distributed transactions, just as it is for centralized databases. We shall now discuss the important problems that have to be solved by the Distributed Transaction Manager (DTM).
Dealing with multiple copies of the data: In distributed databases, redundancy between data items stored at different sites is often desired, and in this case two transactions which hold conflicting locks on two copies of the same data item stored at different sites could be unaware of each other's existence; locking would then be completely useless. In order to avoid this problem, the DTM agent has to translate the lock primitive issued by an agent on a data item in such a way that it is impossible for a conflicting transaction to be unaware of this lock. The simplest way is to issue local locks to the LTMs at all sites where a local copy of the data item is stored; in this way, the lock primitive is converted into as many local lock primitives as there are copies of the locked item. The usual schemes are briefly explained here:
o Write-locks-all, read-locks-one: In this scheme exclusive locks are acquired on all copies, while shared locks are acquired on only one arbitrary copy. A conflict is always detected, because a shared-exclusive conflict is detected at the site where the shared lock is requested, and exclusive-exclusive conflicts are detected at all sites.
o Majority locking: Both shared and exclusive locks are requested at a majority of the copies of the data item. In this way, if two transactions request conflicting locks on the same data item, there is at least one copy of it where the conflict is discovered.
o Primary copy locking: One copy of each data item is privileged (called the primary copy); all locks must be requested at this copy, so that conflicts are discovered at the site where the primary copy resides.
A small sketch of the first two schemes is given below.
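The following sketch shows how a global lock request can be mapped onto the copies of a replicated item under the first two schemes. The copies dictionary and the lock_at_site helper are invented stand-ins for sending a local lock request to the LTM of one site; here the local request always succeeds.

copies = {"x": ["site1", "site2", "site3"]}     # where the copies of item x live

def lock_at_site(site, item, mode):
    return True                                  # placeholder for a local LTM call

def write_locks_all_read_locks_one(item, mode):
    sites = copies[item] if mode == "X" else copies[item][:1]
    return all(lock_at_site(s, item, mode) for s in sites)

def majority_locking(item, mode):
    sites = copies[item]
    needed = len(sites) // 2 + 1
    granted = sum(lock_at_site(s, item, mode) for s in sites[:needed])
    return granted >= needed

print(write_locks_all_read_locks_one("x", "S"), majority_locking("x", "X"))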


[Figure 7.9 shows the reference model of distributed concurrency control: the root agent and the other agents of the distributed transaction issue the global primitives of interface 2' (lock_shared, lock_exclusive, unlock) to their DTM agents, which together form the Distribution Transaction Manager (DTM); the DTM agents in turn issue the local primitives of interface 1' (Local_lock_shared, Local_lock_exclusive, Local_unlock) to the Local Transaction Managers (LTMs) at sites i and j.]

Figure 7.9 A reference model of distributed concurrency control

Deadlock detection: The second problem faced by the DTM is deadlock detection. A deadlock is a circular waiting situation which can involve many transactions, not just two. The basic characteristic of a deadlock is the existence of a set of transactions such that each transaction waits for another one. This can be represented with a wait-for graph: a directed graph having transactions as nodes, where an edge from transaction T1 to transaction T2 represents the fact that T1 waits for T2, as shown in fig 7.10. The existence of a deadlock corresponds to the existence of a cycle in the wait-for graph. Therefore a system can discover deadlocks by constructing the wait-for graph and analyzing whether there are cycles in it. In fig 7.10 the notation TiAj refers to agent Aj of transaction Ti. Here there are two sites and two transactions T1 and T2, each consisting of two agents. For simplicity, we assume that each transaction has only one agent at each site where it is executed. A directed edge from an agent TiAj to an agent TrAs means that TiAj is blocked and waiting for TrAs. A minimal cycle-detection sketch is given below.
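The following sketch detects a cycle in a wait-for graph with a depth-first search; the edge set reproduces a two-transaction deadlock of the kind illustrated in fig 7.10, and the representation (a plain dictionary of "waits for" edges) is invented for this illustration.

waits_for = {"T1": {"T2"}, "T2": {"T1"}}

def has_cycle(graph):
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:           # back edge: a cycle (deadlock) exists
            return True
        if node in done:
            return False
        visiting.add(node)
        if any(dfs(n) for n in graph.get(node, ())):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(node) for node in graph)

print(has_cycle(waits_for))    # True: T1 and T2 are deadlocked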


[Example of a deadlock situation (Fig 7.10): Site 1 holds agents T1A1 and T2A1, Site 2 holds agents T1A2 and T2A2; the waiting edges between the agents form a cycle that spans both sites.]

Fig 7.10 A distributed wait-for graph showing a distributed deadlock.
Clearly, if the arcs of the wait-for graph cross site boundaries, the deadlock detection problem becomes intrinsically a problem of distributed transaction management. In this case a global wait-for graph has to be built by the DTM, and its construction requires the execution of rather complex algorithms. Most systems do not determine deadlocks in the above way, as it is somewhat complicated; they simply use timeouts for detecting deadlocks. With the timeout method a transaction is aborted after a given time interval has passed since the transaction entered a wait state. This method does not determine a deadlock; it simply observes a "long wait" which could possibly be caused by a deadlock. The main challenge here is to estimate an optimum timeout interval. In a distributed system it is even more difficult to determine a workable timeout interval than in a centralized system, because of the less predictable behavior of the communication network and of remote sites.
Check Your Progress - 1: Answer the following:
A. Define the different properties of a transaction.
B. List the goals of transaction management.
C. What do you mean by a distributed transaction?
D. Define the concept of atomicity.
E. Define the log in the context of transaction management.
F. What is the role of the Distributed Transaction Manager?
G. What are the different transaction control statements available?
H. Define commit point site.


I. Define commit point strength.
J. State the Two-Phase Commit protocol.
K. List the different failures possible in the case of distributed transactions.
L. What do you mean by the problem of concurrency control?
M. Define deadlock in distributed transactions.
Check Your Progress - 2: Answer the following:
1. Explain distributed transactions.
2. Explain the recovery techniques in a centralized database. Also explain how they are extended to distributed databases.
3. Explain the Two-Phase Commit protocol with an example.
4. Describe the Two-Phase Locking mechanism used to solve the concurrency control problem in distributed transactions.
5. Explain deadlock in distributed transactions.
7.4 Summary: In this unit we have learnt about distributed transaction management and related aspects. The following are the key points to be noted:
- Distributed transaction managers (DTMs) provide the atomicity, durability, serializability and isolation properties. In most systems this is obtained by implementing the 2-Phase Commit protocol for reliability, 2-phase locking for concurrency control, and timeouts for deadlock detection.
- The 2-Phase Commit protocol ensures that the sub-transactions of the same transaction will either all commit or all abort, in spite of possible failures; it is resilient to any failure in which no log information is lost.
- The 2-phase locking mechanism requires that all sub-transactions acquire locks in the growing phase and release locks in the shrinking phase. Timeout mechanisms for deadlock detection abort those transactions which are waiting, possibly because of a deadlock.
UNIT - 8


TIME AND SYNCHRONIZATION
Structure
8.0 Objectives
8.1 Introduction
8.2 Clock Synchronization
8.2.1 Synchronizing physical clocks
8.2.1.1 Cristian's method for synchronizing clocks
8.2.1.2 The Berkeley algorithm
8.2.1.3 The Network Time Protocol
8.3 Logical Time and Logical Clocks
8.3.1 Logical clocks
8.3.2 Totally Ordered Logical Clocks
8.4 Distributed Coordination
8.4.1 Distributed Mutual Exclusion
8.4.1.1 The central server algorithm
8.4.1.2 A distributed algorithm using logical clocks
8.4.1.3 A ring-based algorithm
8.5 Elections
8.5.1 The bully algorithm
8.5.2 A ring-based election algorithm
8.6 Summary


8.0 Objectives: In this unit we introduce some topics related to the issue of coordination in distributed systems. To start with, the notion of time is dealt with. We go on to explain the notion of logical clocks, which are a tool for ordering events without knowing precisely when they occurred. The second half examines briefly some algorithms used to achieve distributed coordination. These include algorithms that achieve mutual exclusion among a collection of processes, so as to coordinate their accesses to shared resources. It goes on to examine how a group of processes agree upon a new coordinator of their activities after their previous coordinator has failed or become unreachable. This process is called an election.
8.1 Introduction: This unit introduces some concepts and algorithms related to the timing and coordination of events occurring in distributed systems. In the first half we describe the notion of time in a distributed system. We discuss the problem of how to synchronize clocks in different computers, and so time events occurring at them consistently, and we also discuss the related problem of determining the order in which events occurred. In the latter half of the unit we introduce some algorithms whose goal is to confer a privilege upon some unique member of a collection of processes.
In the next section we explain methods whereby computer clocks can be approximately synchronized, using message passing. However, we shall not be able to obtain sufficient accuracy to determine, in many cases, the relative ordering of events; instead, some events can be ordered by appealing to the flow of data between processes in a distributed system. We therefore go on to introduce logical clocks, which are used to define an order on events without measuring the physical time at which they occurred.
8.2 Clock Synchronization: Time is an important and interesting issue in distributed systems, for several reasons.
o First, time is a quantity we often want to measure accurately. In order to know at what time of day a particular event occurred at a particular computer (for example, for accountancy purposes) it is necessary to synchronize its clock with a trustworthy, external source of time. This is external synchronization. Also, if computer clocks are synchronized with one another to a known degree of accuracy, then we can, within the bounds of this accuracy, measure the interval


between two events occurring at different computers, by appealing to their local clocks. This is internal synchronization. Two or more computers that are internally synchronized are not necessarily externally synchronized, since they may drift collectively from external time.
o Secondly, algorithms that depend upon clock synchronization have been developed for several problems in distributed systems. These include maintaining the consistency of distributed data (the use of timestamps to serialize transactions), checking the authenticity of a request, and eliminating the processing of duplicate updates.
o Thirdly, the notion of physical time is itself a problem in a distributed system. This is not due to the effects of special relativity, which are negligible or non-existent for normal computers, but to a similar limitation concerning our ability to pass information from one computer to another: the clocks belonging to different computers can only be synchronized, at least in the majority of cases, by network communication. Message passing is limited by the speed at which it can transfer information, but this in itself would not be a problem if we knew how long message transmission took. The problem is that sending a message usually takes an unpredictable amount of time.
Now let us see the different aspects of clock synchronization one by one.
8.2.1 Synchronizing physical clocks: Computers each contain their own physical clock. These clocks are electronic devices that count oscillations occurring in a crystal at a definite frequency, and which typically divide this count and store the result in a counter register. Clock devices can be programmed to generate interrupts at regular intervals in order that, for example, time slicing can be implemented; however, we shall not concern ourselves with this aspect of clock operation. The clock output can be read by software and scaled into a suitable time unit. This value can be used to timestamp any event experienced by any process executing at the host computer. By "event" we mean an action that appears to occur indivisibly, all at once, such as sending or receiving a message. Note, however, that successive events will correspond to different timestamps only if the clock resolution (the period between updates of the clock register) is smaller than the interval between successive events.


The rate at which events occur depends on such factors as the length of the processor instruction cycle. In order to compare timestamps generated by clocks of the same physical construction at different computers, one might think that we need only know the relative offset of one clock's counter from that of the other (for example, the value of one counter at the instant when the other was initialized to zero). Unfortunately, this supposition is false: computer clocks in practice are extremely unlikely to tick at the same rate, whether or not they are of the "same" physical construction.
Clock drift: The crystal-based clocks used in computers are, like any other clocks, subject to clock drift (Figure 8.1), which means that they count time at different rates, and so deviate. The underlying oscillators are subject to physical variations, with the consequence that their frequencies of oscillation differ. Moreover, even the same clock's frequency varies with the temperature. Designs exist that attempt to compensate for this variation, but they cannot eliminate it. The difference in the oscillation period

[Fig.8.1 Drift between computer clocks in a distributed system: several computers, each with its own clock, connected by a network.]

between two clocks might be extremely small, but the difference accumulated over many oscillations leads to an observable difference in the counters registered by the two clocks, no matter how accurately they were initialized to the same value. A clock's drift rate is the change in the offset (difference in reading) between the clock and a nominal perfect reference clock, per unit of time measured by the reference clock. For clocks based on a quartz crystal this is about 10^-6, giving a difference of one second every 1,000,000 seconds, or roughly every 11.6 days.
Coordinated universal time: The most accurate physical clocks known use atomic oscillators, whose accuracy is about one part in 10^13. The output of these atomic clocks is used as the standard for elapsed real time, known as International Atomic Time.
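As a quick check of these figures, the few lines below compute how much error a clock with a drift rate of 10^-6 accumulates; the calculation is purely illustrative.

```python
# Drift accumulated by a clock with a given drift rate (seconds of error per second of real time).
drift_rate = 1e-6                 # typical quartz crystal clock
seconds_per_day = 24 * 60 * 60

error_per_day = drift_rate * seconds_per_day
days_to_drift_one_second = 1 / (drift_rate * seconds_per_day)

print(f"error per day: {error_per_day * 1000:.1f} ms")                 # about 86.4 ms
print(f"days to drift by one second: {days_to_drift_one_second:.1f}")  # about 11.6
```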


Coordinated universal time, abbreviated as UTC (sic), is an international standard that is based on atomic time, but a so-called leap second is occasionally inserted or deleted to keep in step with astronomical time. UTC signals are synchronized and broadcast regularly from land-based radio stations and satellites covering many parts of the world. For example, in the US the radio station WWV broadcasts time signals on several short-wave frequencies. Satellite sources include the Geostationary Operational Environmental Satellite (GEOS) and the Global Positioning System (GPS). Receivers are available commercially. Compared with "perfect" UTC, the signals received from land-based stations have an accuracy in the order of 0.1-10 milliseconds, depending on the station used. Signals received from GEOS are accurate to about 0.1 ms, and signals received from GPS are accurate to about one millisecond. Computers with receivers attached can synchronize their clocks with these timing signals.

Compensating for clock drift: If the time provided by a time service, such as Coordinated Universal Time signals, is greater than the time at a computer C, then it may be possible simply to set C's clock to the time service's time. Several clock ticks appear to have been missed, but time continues to advance, as expected. Now consider what should happen if the time service's time is behind that of C. We cannot set C's time back, because this is liable to confuse applications that rely on the assumption that time always advances. The solution is not to set C's clock back, but to cause it to run slow for a period, until it is in accord with the timeserver's time. It is possible to change the rate at which updates are made to the time as given to applications. This can be achieved in software, without changing the rate at which the hardware clock ticks (an operation which is not always supported by hardware clocks).
Let us call the time given to applications (the software clock's reading) S and the time given by the hardware clock H. Let the compensating factor be δ, so that S(t) = H(t) + δ(t).


The simplest form for δ that makes S change continuously is a linear function of the hardware clock: δ(t) = aH(t) + b, where a and b are constants to be found. Substituting for δ in the identity above, we have S(t) = (1 + a)H(t) + b. Let the value of the software clock be Tskew when H = h, and let the actual time at that point be Treal (Tskew may be ahead of or behind Treal). If S is to give the actual time after N further ticks, we must have
Tskew = (1 + a)h + b  and  Treal + N = (1 + a)(h + N) + b.
By solving these equations, we find
a = (Treal - Tskew) / N  and  b = Tskew - (1 + a)h.
8.2.1.1 Cristian's method for synchronizing clocks: One way to achieve synchronization between computers in a distributed system is for a central timeserver process S to supply the time according to its clock upon request, as shown in Fig 8.2. The timeserver computer can be fitted with a suitable receiver so as to be synchronized with UTC. If a process p requests the time in a message mr, and receives the time value t in a message mt, then in principle it could set its clock to the time t + Ttrans, where Ttrans

[Figure: a process p sends a request message mr to the timeserver S and receives the time in a reply message mt.]

Fig.8.2 Clock synchronization using a timeserver.
where Ttrans is the time taken to transmit mt from S to p (t is inserted in mt at the last possible point before transmission from S's computer). Unfortunately, Ttrans is subject to variation. In general, other processes are competing with S and p for resources at each computer involved, and other messages compete with mt for the network. These factors are unpredictable, and not practically avoidable or accurately measurable in most installations. In general, we may say that Ttrans = min + x, where x ≥ 0. The minimum value min is the value that would be obtained if no other processes executed and no other network traffic existed; min can be measured


or conservatively estimated. The value of x is not known in a particular case, although a distribution of values may be measurable for a particular installation.
Cristian suggested the use of such a timeserver, connected to a device that receives signals from a source of UTC, to synchronize computers. Synchronization between the timeserver and its UTC receiver can be achieved by a method that is similar to the following procedure for synchronization between computers.
A process p wishing to learn the time from S can record the total round-trip time Tround with reasonable accuracy if its rate of clock drift is small. For example, the round-trip time should be in the order of 1-10 milliseconds on a LAN, over which time a clock with a drift rate of 10^-6 varies by at most 10^-5 milliseconds. Let the time returned in S's message mt be t. A simple estimate of the time to which p should set its clock is t + Tround/2, which assumes that the elapsed time is split equally before and after S placed t in mt. Let the times taken to transmit mr and mt be min + x and min + y, respectively. If the value of min is known or can be conservatively estimated, then we can determine the accuracy of this estimate as follows. The earliest point at which S could have placed the time in mt was min after p dispatched mr. The latest point at which it could have done this was min before mt arrived at p. The time by S's clock when the reply message arrives is therefore in the range [t + min, t + Tround - min]. The width of this range is Tround - 2min, so the accuracy is ±(Tround/2 - min).
Discussion of Cristian's algorithm: As described, Cristian's method suffers from the problem associated with all services implemented by a single server: the single timeserver might fail and thus render synchronization temporarily impossible. Cristian suggested, for this reason, that time should be provided by a group of synchronized timeservers, each with a receiver for UTC time signals. For example, a client could multicast its request to all servers and use only the first reply obtained. Note that a malfunctioning timeserver that replied with spurious time values, or an impostor timeserver that replied with deliberately incorrect times, could wreak havoc in a computer system.
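The estimate and its error bound are simple to compute. Below is a small illustrative calculation (the numbers are invented) applying the formulas just given: the clock is set to t + Tround/2, with accuracy ±(Tround/2 - min).

```python
# Illustrative application of Cristian's estimate and its accuracy bound.
t = 10.000          # time (seconds) returned by the server in message mt
t_round = 0.008     # measured round-trip time: 8 ms
min_delay = 0.002   # conservative estimate of the minimum one-way delay: 2 ms

estimate = t + t_round / 2            # time to which the client sets its clock
accuracy = t_round / 2 - min_delay    # half-width of the possible error

print(f"set clock to {estimate:.3f} s")            # 10.004 s
print(f"accuracy: +/- {accuracy * 1000:.1f} ms")   # +/- 2.0 ms
```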


8.2.1.2 The Berkeley algorithm: Gusella and Zatti describe an algorithm for internal synchronization which they developed for collections of computers running Berkeley UNIX. In it, a coordinator computer is chosen to act as the master. Unlike Cristian's protocol, this computer periodically polls the other computers whose clocks are to be synchronized, called slaves. The slaves send back their clock values to it. The master estimates their local clock times by observing the round-trip times (similarly to Cristian's technique), and it averages the values obtained (including its own clock's reading). The balance of probabilities is that this average cancels out the individual clocks' tendencies to run fast or slow. The accuracy of the protocol depends upon a nominal maximum round-trip time between the master and the slaves; the master eliminates any occasional readings associated with times larger than this maximum.
Instead of sending the updated current time back to the other computers (which would introduce further uncertainty due to the message transmission time), the master sends the amount by which each individual slave's clock requires adjustment. This can be a positive or negative value.
The algorithm also eliminates readings from clocks that have drifted badly or that have failed and provide spurious readings. Such clocks could have a significant adverse effect if an ordinary average were taken. The master therefore takes a fault-tolerant average: a subset of clocks is chosen that do not differ from one another by more than a specified amount, and the average is taken only of readings from these clocks.
Gusella and Zatti describe an experiment involving 15 computers whose clocks were synchronized to within about 20-25 milliseconds using their protocol. The local clocks' drift rates were measured to be less than 2 x 10^-5, and the maximum round-trip time was taken to be 10 milliseconds.
8.2.1.3 The Network Time Protocol: The Network Time Protocol (NTP) defines an architecture for a time service and a protocol to distribute time information over a wide variety of interconnected networks. It has been adopted as a standard for clock synchronization throughout the Internet. NTP's chief design aims and features are:
To provide a service enabling clients across the Internet to be synchronized accurately to UTC, despite the large and variable message


delays encountered in Internet communication. NTP employs statistical techniques for the filtering of timing data, and it discriminates between the quality of timing data obtained from different servers.
To provide a reliable service that can survive lengthy losses of connectivity: there are redundant servers and redundant paths between the servers. The servers can reconfigure so as still to provide the service if one of them becomes unreachable.
To enable clients to resynchronize sufficiently frequently to offset the rates of drift found in most computers. The service is designed to scale to large numbers of clients and servers.
To provide protection against interference with the time service, whether malicious or accidental. The time service uses authentication techniques to check that timing data originate from the claimed trusted sources. It also validates the return addresses of messages sent to it.


Fig.8.3 An example synchronization subnet in an NTP implementation.
Note: arrows denote synchronization control and numbers denote strata.


A network of servers located across the Internet provides the NTP service. Primary servers are directly connected to a time source such as a radio clock receiving UTC; secondary servers are synchronized, ultimately, to primary servers. The servers are connected in a logical hierarchy called a synchronization subnet (see Fig 8.3), whose levels are called strata. Primary servers occupy stratum 1: they are at the root. Stratum 2 servers are secondary servers that are synchronized directly to the primary servers; stratum 3 servers are synchronized from stratum 2 servers, and so on. The lowest-level (leaf) servers execute in users' workstations.
NTP servers synchronize in one of three modes: multicast, procedure-call and symmetric mode.
Multicast mode is intended for use on a high-speed LAN. One or more servers periodically multicasts the time to the servers running in other computers connected by the LAN, which set their clocks assuming a small delay. This mode can only achieve relatively low accuracies, but ones which nonetheless are considered sufficient for many purposes.
Procedure-call mode is similar to the operation of Cristian's algorithm, described above. In this mode, one server accepts requests from other computers, which it processes by replying with its timestamp (current clock reading). This mode is suitable where higher accuracies are required than can be achieved with multicast, or where multicast is not supported in hardware. For example, file servers on the same or a neighboring LAN, which need to keep accurate timing information for file accesses, could contact a local master server in procedure-call mode.
Symmetric mode is intended for use by the master servers that supply time information in LANs and by the higher levels (lower strata) of the synchronization subnet, where the highest accuracies are to be achieved. A pair of servers operating in symmetric mode exchange messages bearing timing information. Timing data are retained as part of an association between the servers that is maintained in order to improve the accuracy of their synchronization over time.


In all modes, messages are delivered unreliably, using the standard UDP Internet transport protocol. In procedure-call mode and symmetric mode, messages are exchanged in pairs. Each message bears timestamps of recent message events: the local times when the previous NTP message between the pair was sent and received, and the local time when the current message was transmitted. The recipient of the NTP message notes the local time when it receives the message. The four times Ti-3, Ti-2, Ti-1 and Ti are shown in Fig 8.4 for the messages m and m' sent between servers A and B. Note that in symmetric mode, unlike the Cristian algorithm, there can be a non-negligible delay between the arrival of one message and the dispatch of the next. Also, messages may be lost, but the three timestamps carried by each message are nonetheless valid.
For each pair of messages sent between two servers the NTP protocol calculates an offset oi, which is an estimate of the actual offset between the two clocks, and a delay di, which is the total transmission time for the two messages. If the true offset of the clock at B relative to that at A is o, and if the actual transmission times for m and m' are t and t' respectively, then we have:
Ti-2 = Ti-3 + t + o  and  Ti = Ti-1 + t' - o.
Defining a = Ti-2 - Ti-3 and b = Ti-1 - Ti, this leads to:
di = t + t' = a - b;  o = oi + (t' - t)/2, where oi = (a + b)/2.
[Fig.8.4 Messages exchanged between a pair of NTP peers: server A sends m at time Ti-3 and B receives it at Ti-2; B later sends m' at Ti-1 and A receives it at Ti.]
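As an illustration, the offset and delay estimates follow directly from the four timestamps; the values below are invented example readings, not measurements.

```python
# Illustrative NTP offset/delay calculation from four timestamps (seconds; made-up values).
t_i3 = 100.000   # A's clock when it sent m
t_i2 = 100.013   # B's clock when it received m
t_i1 = 100.015   # B's clock when it sent m'
t_i  = 100.009   # A's clock when it received m'

a = t_i2 - t_i3
b = t_i1 - t_i

d_i = a - b            # total transmission time t + t'
o_i = (a + b) / 2      # estimate of B's clock offset relative to A

print(f"delay estimate  d_i = {d_i * 1000:.1f} ms")   # 7.0 ms
print(f"offset estimate o_i = {o_i * 1000:.1f} ms")   # 9.5 ms
# The true offset lies between o_i - d_i/2 and o_i + d_i/2, i.e. 6.0 ms and 13.0 ms here.
```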


Using the fact that t, t' ≥ 0, it can be shown that oi - di/2 ≤ o ≤ oi + di/2. Thus oi is an estimate of the offset, and di is a measure of the accuracy of this estimate.
8.3 Logical time and logical clocks: From the point of view of any single process, events are ordered uniquely by the times shown on its local clock. However, as Lamport pointed out, since we cannot perfectly synchronize clocks across a distributed system, we cannot in general use physical time to find out the order of any arbitrary pair of events occurring within it. The order of events occurring at different processes can be critical in a distributed application. In general, we can use a scheme which is similar to physical causality, but which applies in distributed systems, to order some of the events that occur at different processes. This ordering is based on two simple and natural observations:
- If two events occurred at the same process, then they occurred in the order in which that process observes them.
- Whenever a message is sent between processes, the event of sending the message occurred before the event of receiving the message.
Lamport called the ordering obtained by generalizing these two relationships the happened-before relation. It is also sometimes known as the relation of causal ordering or potential causal ordering. More formally, we say that x precedes y at process p if both events occurred at p and x occurred before y. Using this restricted order we can define the happened-before relation, denoted by →, as follows:
HB1: If x precedes y at some process p, then x → y.
HB2: For any message m, send(m) → rcv(m), where send(m) is the event of sending the message and rcv(m) is the event of receiving it.
HB3: If x, y and z are events such that x → y and y → z, then x → z.
[Figure: events a and b occur at process p1, c and d at p2, and e and f at p3, drawn against physical time; message m1 is sent from p1 to p2 (from event b to event c) and message m2 from p2 to p3 (from event d to event f).]


Fig. 8.5 Events occurring at three processes.
Thus if x → y, then we can find a series of events e1, e2, ..., en occurring at one or more processes such that x = e1 and y = en, and for i = 1, 2, ..., n-1 either HB1 or HB2 applies between ei and ei+1: that is, either they occur in succession at the same process, or there is a message m such that ei = send(m) and ei+1 = rcv(m). The sequence of events e1, e2, ..., en need not be unique.
The relation → is illustrated for the case of the three processes p1, p2 and p3 in Fig 8.5. It can be seen that a → b, since these events occur in that order at process p1, and similarly c → d; b → c, since these are the sending and reception of message m1, and similarly d → f. Combining these relations, we may also say that, for example, a → f. It can also be observed from Fig 8.5 that not all events are related by the relation →. We say that events such as a and e that are not ordered by → are concurrent, and write this a || e.
The relation → captures a flow of data intervening between two events. Note, however, that in real life data can flow in ways other than by message passing. For example, if Smith enters a command to his process to send a message, then telephones Jones, who commands her process to issue another message, then the issuing of the first message clearly happened before that of the second. Unfortunately, since no network messages were sent between the issuing processes, we cannot model this type of relationship in our system.
8.3.1 Logical clocks: Lamport invented a simple mechanism by which the happened-before ordering can be captured numerically, called a logical clock. A logical clock is a monotonically increasing software counter, whose value need bear no particular relationship to any physical clock, in general. Each process p keeps its own logical clock, Cp, which it uses to timestamp events.
[Figure: the events a-f of Fig 8.5 at processes p1, p2 and p3, drawn against physical time and labelled with logical timestamps: a=1 and b=2 at p1, c=3 and d=4 at p2, e=1 and f=5 at p3.]


Fig.8.6 Logical timestamps for the events shown in Fig 8.5.
We denote the timestamp of event a at process p by Cp(a), and by C(b) we denote the timestamp of event b at whatever process it occurred. To capture the happened-before relation →, processes update their logical clocks and transmit the values of their logical clocks in messages as follows:
LC1: Cp is incremented before each event is issued at process p: Cp := Cp + 1.
LC2: (a) When a process p sends a message m, it piggybacks on m the value t = Cp. (b) On receiving (m, t), a process q computes Cq := max(Cq, t) and then applies LC1 before timestamping the event rcv(m).
Although we increment clocks by one, we could have chosen any positive value. It can easily be shown, by induction on the length of any sequence of events relating two events a and b, that a → b implies C(a) < C(b). Note that the converse is not true: if C(a) < C(b), then we cannot infer that a → b.
In Fig 8.6 we illustrate the use of logical clocks for the example given in Fig 8.5. Each of the processes p1, p2 and p3 has its logical clock initialized to 0. The clock values given are those immediately after the event to which they are adjacent. Note that, for example, C(b) > C(e) but b || e.
8.3.2 Totally ordered logical clocks: Logical clocks impose only a partial order on the set of all events, since some pairs of distinct events, generated by different processes, have numerically identical timestamps. However, we can extend this to a total order, that is, one in which all pairs of distinct events are ordered, by taking into account the identifiers of the processes at which events occur. If a is an event occurring at pa with local timestamp Ta, and b is an event occurring at pb with local timestamp Tb, we define the global logical timestamps for these events to be (Ta, pa) and (Tb, pb) respectively, and we define (Ta, pa) < (Tb, pb) if and only if either Ta < Tb, or Ta = Tb and pa < pb.
8.4 Distributed Coordination:


Distributed processes often need to coordinate their activities. For example, if a collection of processes shares a resource or collection of resources managed by a server, then mutual exclusion is often required to prevent interference and ensure consistency when accessing the resources. This is essentially the critical section problem, familiar in the domain of operating systems. In the distributed case, however, neither shared variables nor facilities supplied by a single local kernel can be used to solve it in general. A separate, generic mechanism for distributed mutual exclusion is required in certain cases: one that is independent of the particular resource management scheme in question. Distributed mutual exclusion involves a single process being given a privilege (the right to access shared resources) temporarily, before another process is granted it.
In some other cases, however, the requirement is for a set of processes to choose one of their number to play a privileged, coordinating role over the long term. A method for choosing a unique process to play a particular role is called an election algorithm. This section now examines some algorithms for achieving the goals of distributed mutual exclusion and elections.
8.4.1 Distributed mutual exclusion: Our basic requirements for mutual exclusion concerning some resource (or collection of resources) are as follows:
ME1: (safety) At most one process may execute in the critical section (CS) at a time.
ME2: (liveness) A process requesting entry to the CS is eventually granted it (so long as any process executing in the CS eventually leaves it). ME2 implies that the implementation is deadlock-free, and that starvation does not occur.
A further requirement which may be made is that of causal ordering:
ME3: (ordering) Entry to the CS should be granted in happened-before order.
A process may continue with other processing while waiting to be granted entry to a critical section. During this time it might send a message to another process, which consequently also tries to enter the critical section. ME3 specifies that the first process should be granted access before the second.
We now discuss some algorithms for achieving these requirements.


8.4.1.1 The central server algorithm: The simplest way to achieve mutual exclusion is to employ a server that grants permission to enter the critical section. For the sake of simplicity, we shall assume that there is only one critical section managed by the server. Recall that the protocol for executing a critical section is as follows:
enter()   (enter the critical section, blocking if necessary)
          (access shared resources in the critical section)
exit()    (leave the critical section; other processes may now enter)

Fig 8.7 shows the use of this server. To enter a critical section, a process sends a request message to the server and awaits a reply from it. Conceptually, the reply constitutes a token signifying permission to enter the critical section. If no other process has the token at the time of the request, then the server replies immediately, granting the token. If the token is currently held by another process, then the server does not reply but queues the request.
[Figure: the server holds a queue of waiting requests (from p4, then p2); p2 sends a request, p3 releases the token, and the server grants it to p4.]

Fig.8.7 A server managing a mutual exclusion token for a set of processes.
On exiting the critical section, a message is sent to the server, giving it back the token. If the queue of waiting processes is not empty, then the server chooses the oldest entry in the queue, removes it and replies to the corresponding process. The chosen process then holds the token. In the figure we show a situation in which p2's request has been appended to the queue, which already contained p4's request. p3 exits the critical section, and the server removes p4's entry and grants permission to enter to p4 by replying to it. Process p1 does not currently require entry to the critical section.
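The server's behaviour is easy to capture in a few lines. The following is an illustrative, single-threaded sketch in which method calls stand in for the request, release and grant messages of the protocol; it is not a full implementation.

```python
from collections import deque

# Illustrative sketch of the central-server algorithm; method calls stand in for
# the request, release and grant messages of the protocol.
class TokenServer:
    def __init__(self):
        self.holder = None        # process currently holding the token
        self.queue = deque()      # FIFO queue of waiting processes

    def request(self, pid):
        """A process asks for the token; returns True if it is granted immediately."""
        if self.holder is None:
            self.holder = pid
            return True
        self.queue.append(pid)    # otherwise the request is queued
        return False

    def release(self, pid):
        """The holder returns the token; it is granted to the oldest waiter, if any."""
        assert pid == self.holder
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder        # the process now granted the token (or None)

# The situation of Fig 8.7:
server = TokenServer()
server.request("p3")              # True: p3 holds the token
server.request("p4")              # False: p4 is queued
server.request("p2")              # False: p2 is queued behind p4
print(server.release("p3"))       # p4 is granted the token next
```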


The server is also a critical point of failure. Any process attempting to communicate with it, either to enter or to exit the critical section, will detect the failure of the server. To recover from a server failure, a new server can be created (or one of the processes requiring mutual exclusion could be picked to play a dual role as a server). Since the server must be unique if it is to guarantee mutual exclusion, an election must be called to choose one of the clients to create or act as the server, and to multicast its address to the others. We shall describe some election algorithms below. When the new server has been chosen, it needs to obtain the state of its clients, so that it can process their requests as the previous server would have done. The ordering of entry requests will be different in the new server from that in the failed one unless precautions are taken, but we shall not address this problem here.
8.4.1.2 A distributed algorithm using logical clocks: Ricart and Agrawala developed an algorithm to implement mutual exclusion that is based upon distributed agreement, instead of using a central server. The basic idea is that processes that require entry to a critical section multicast a request message, and can enter it only when all the other processes have replied to this message. The conditions under which a process replies to a request are designed to ensure that conditions ME1 - ME3 are met. Again, permission to enter can be associated with a token, but obtaining the token in this case takes multiple messages.
The assumptions of the algorithm are that the processes p1, ..., pn know one another's addresses, that all messages sent are eventually delivered, and that each process pi keeps a logical clock, updated according to the rules LC1 and LC2 given earlier in this unit. Messages requesting the token are of the form <T, pi>, where T is the sender's timestamp and pi is the sender's identifier. For simplicity's sake, we assume that only one critical section is at issue, and so it does not have to be identified. Each process records its state of having released the token (RELEASED), wanting the token (WANTED) or holding the token (HELD) in a variable. The protocol is given in Fig 8.8.
Fig.8.8 Ricart and Agrawala's algorithm
On initialization:
    state := RELEASED;
To obtain the token:
    state := WANTED;
    Multicast request to all processes;

    [Request processing deferred here]


    T := request's timestamp;
    Wait until (number of replies received = (n - 1));
    state := HELD;
On receipt of a request <Ti, pi> at pj (i ≠ j):
    If (state = HELD) or (state = WANTED and (T, pj) < (Ti, pi))
    then queue the request from pi without replying;
    else reply immediately to pi;
    end if
To release the token:
    state := RELEASED;
    Reply to any queued requests;
If a process requests the token and the state is RELEASED everywhere else (that is, no other process wants it), then all processes will reply immediately to the request and the requester will obtain the token. If the token is HELD at some process, then that process will not reply to requests until it has finished with the token, and so the requester cannot obtain the token in the meantime. If two or more processes request the token at the same time, then whichever process's request bears the lowest timestamp will be the first to collect (n-1) replies, granting it the token next. If the requests bear equal timestamps, the process identifiers are compared to order them. Note that, when a process requests the token, it defers processing requests from other processes until its own request has been sent and the timestamp T is known. This is so that processes make consistent decisions when processing requests.
To illustrate the algorithm, consider a situation involving three processes, p1, p2 and p3, shown in Fig 8.9. Let us assume that p3 is not interested in the token, and that p1 and p2 request it concurrently. The timestamp of p1's request is 41, and that of p2's is 34. When p3 receives their requests, it replies immediately. When p2 receives p1's request, it finds that its own request has the lower timestamp, and so does not reply, holding p1 off. However, p1 finds that p2's request has a lower timestamp than that of its own request, and so replies immediately. On receiving this second reply, p2 possesses the token. When p2 releases the token, it will reply to p1's request, and so grant it the token.
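The heart of the protocol is the decision to reply to or defer an incoming request, using the total order on (timestamp, process identifier) pairs from section 8.3.2. The fragment below is an illustrative sketch of just that rule at a single process, not a full multi-process implementation.

```python
# Illustrative sketch of the Ricart-Agrawala reply rule at one process.
RELEASED, WANTED, HELD = "RELEASED", "WANTED", "HELD"

def on_request(state, own_request, incoming_request):
    """Decide what to do with an incoming request; requests are (timestamp, process id)
    pairs compared in the total order of section 8.3.2."""
    if state == HELD or (state == WANTED and own_request < incoming_request):
        return "defer"            # queue the request without replying
    return "reply"                # reply immediately

# The scenario of Fig 8.9: p1's request is (41, 'p1') and p2's is (34, 'p2').
print(on_request(WANTED, (34, "p2"), (41, "p1")))    # p2 defers p1's request
print(on_request(WANTED, (41, "p1"), (34, "p2")))    # p1 replies to p2 at once
print(on_request(RELEASED, None, (34, "p2")))        # an uninterested process replies
```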


Obtaining the token takes 2(n-1) messages in this algorithm: (n-1) to multicast the request, followed by (n-1) replies. Or, if there is hardware support for multicast, only one message is required for the request; the total is then n messages. It is thus a considerably more expensive algorithm, in general, than the central server algorithm just described. Note also that, while it is a fully distributed algorithm, the failure of any process involved would make progress impossible. And in the distributed algorithm all the processes involved receive and process every request, so no performance gain has been made over the single-server bottleneck, which does just the same. Finally, note that a process that wishes to obtain the token and which was the last to hold it still goes through the protocol as described, even though it could simply decide locally to reallocate it to itself. Ricart and Agrawala refined this protocol so that it requires n messages to obtain the token in the worst (and common) case, without hardware support for multicast.
8.4.1.3 A ring-based algorithm: One of the simplest ways to arrange mutual exclusion between n processes p1, ..., pn is to arrange them in a logical ring. The idea is that exclusion is conferred by obtaining a token in the form of a message passed from process to process in a single direction (clockwise, say) round the ring. The ring topology, which is unrelated to the physical interconnections between the underlying computers, is created by giving each process the address of its neighbor. A process that requires the token waits until it receives it, and then retains it. To exit the critical section, the process sends the token on to its neighbor.

[Figure: multicast synchronization among p1, p2 and p3; p1's request carries timestamp 41 and p2's carries 34; p3 replies to both requests, p1 replies to p2, and p2 defers its reply to p1.]


Fig. 8.9 Multicast synchronization.
The arrangement of processes is shown in Fig 8.10. It is straightforward to verify that the conditions ME1 and ME2 are met by this algorithm, but that the token is not necessarily obtained in happened-before order. It can take from 1 to (n-1) messages to obtain the token, from the point at which the token becomes required. However, messages are sent around the ring even when no process requires the token. If a process fails, then clearly no progress can be made beyond it in transferring the token, until a reconfiguration is applied to extract the failed process from the ring.
[Fig. 8.10 A ring of processes transferring a mutual exclusion token: processes p1, p2, p3, p4, ..., pn arranged in a logical ring, with the token passed from each process to its neighbor.]
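A single-machine sketch of the circulating token follows; it is illustrative only, since in a real system the token travels in messages between separate processes.

```python
# Illustrative token ring: the token circulates round the ring; a process may
# enter its critical section only while it holds the token.
ring = ["p1", "p2", "p3", "p4"]          # logical ring, clockwise order
wants_entry = {"p2", "p4"}               # processes currently wanting the critical section

token_at = 0                             # index of the process holding the token
for _ in range(2 * len(ring)):           # let the token circulate twice round the ring
    holder = ring[token_at]
    if holder in wants_entry:
        print(f"{holder} enters (and later leaves) its critical section")
        wants_entry.discard(holder)
    token_at = (token_at + 1) % len(ring)   # pass the token on to the neighbor
```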
If the process holding the token fails, then an election is required to pick a unique process from the surviving members, which will regenerate the token and transmit it as before. Care has to be taken to ensure that the "failed" process really has failed and does not later unexpectedly inject the old token into the ring, so that there are two tokens. This situation can arise since process failure can only be ascertained by repeated failure of the process to acknowledge messages sent to it.
8.5 Elections: An election is a procedure carried out to choose a process from a group, for example to take over the role of a process that has failed. The main requirement is for


the choice of elected process to be unique, even if several processes call elections concurrently.
8.5.1 The bully algorithm: The bully algorithm can be used when the members of the group know the identities and addresses of the other members. The algorithm selects the surviving member with the largest identifier to function as the coordinator. We assume that communication is reliable, but that processes can fail during an election. The algorithm proceeds as follows.
There are three types of messages in this algorithm:
- An election message is sent to announce an election.
- An answer message is sent in response to an election message.
- A coordinator message is sent to announce the identity of the new coordinator.
A process begins an election when it notices that the coordinator has failed. To begin an election, a process sends an election message to those processes that have a higher identifier. It then awaits an answer message in response. If none arrives within a certain time, the process considers itself the coordinator, and sends a coordinator message to all processes with lower identifiers announcing this fact. Otherwise, the process waits a further limited period for a coordinator message to arrive from the new coordinator. If none arrives, it begins another election. If a process receives a coordinator message, it records the identifier of the coordinator contained within it, and treats that process as the coordinator. If a process receives an election message, it sends back an answer message and begins another election, unless it has begun one already.
When a failed process is restarted, it begins an election. If it has the highest process identifier, then it will decide that it is the coordinator, and announce this to the other processes. Thus it will become the coordinator, even though the current coordinator is functioning. It is for this reason that the algorithm is called the "bully" algorithm.
The operation of the algorithm is shown in Fig 8.11. There are four processes p1-p4, and an election is called when p1 detects the failure of the coordinator, p4, and announces an election (stage 1 in the figure). On receiving an election message from p1, p2 and p3 send answer messages to p1 and begin their own elections; p3 sends an answer message to p2, but p3 receives no answer message from the failed process p4 (stage 2). It therefore decides


that it is the coordinator. But before it can send out the coordinator message, it too fails (stage 3). When p1's timeout period expires (which we assume occurs before p2's timeout expires), it notices the absence of a coordinator message and begins another election. Eventually, p2 is elected coordinator (stage 4).
8.5.2 A ring-based election algorithm: We give the algorithm of Chang and Roberts, suitable for a collection of processes that are arranged in a logical ring (refer to Fig 8.12). We assume that the processes do not know the identities of the others a priori, and that each process knows only how to communicate with its neighbor in, say, the clockwise direction. The goal of this algorithm is to elect a single coordinator, which is the process with the largest identifier. The algorithm assumes that all the processes remain functional and reachable during its operation.
Initially, every process is marked as a non-participant in an election. Any process can begin an election. It proceeds by marking itself as a participant, placing its identifier in an election message and sending it to its neighbor. When a process receives an election message, it compares the identifier in the message with its own. If the arrived identifier is the greater, then it forwards the message to its neighbor. If the arrived identifier is smaller and the receiver is not a participant, then it substitutes its own identifier in the message and forwards it; but it does not forward the message if it is already a participant. On forwarding an election message in any case, the process marks itself as a participant. If, however, the received identifier is that of the receiver itself, then this process's identifier must be the greatest, and it becomes the coordinator. The coordinator marks itself as a non-participant once more and sends an elected message, announcing its election, to its neighbor.
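The forwarding rule just described can be sketched as a small simulation. The fragment below is illustrative only: it assumes a single process starts the election, that no failures occur, and uses identifiers loosely following Fig 8.12.

```python
# Illustrative simulation of the Chang-Roberts ring election: a single process starts
# the election and no process fails.
ring = [3, 17, 4, 24, 9, 1, 15, 28]          # process identifiers in clockwise order
participant = [False] * len(ring)

def run_election(start):
    """Start an election at ring[start]; return the identifier of the elected coordinator."""
    msg = ring[start]                        # the election message carries an identifier
    participant[start] = True
    pos = (start + 1) % len(ring)
    while True:
        own = ring[pos]
        if msg == own:                       # own identifier came back: this process wins
            participant[pos] = False         # it would now circulate an 'elected' message
            return own
        if msg > own:
            forward = msg                    # forward the greater identifier unchanged
        elif not participant[pos]:
            forward = own                    # substitute its own, larger identifier
        else:
            return None                      # a participant swallows the smaller identifier
        participant[pos] = True              # forwarding makes the process a participant
        msg = forward
        pos = (pos + 1) % len(ring)

print(run_election(ring.index(17)))          # 28, the largest identifier in the ring
```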


[Fig. 8.11 The bully algorithm: the election of coordinator p2, after the failure of p4 and then p3. Stage 1: p1 announces an election and p2 and p3 answer. Stage 2: p2 and p3 hold their own elections, and p3 receives no answer from the failed p4. Stage 3: p3 fails before announcing itself, and p1 times out. Stage 4: eventually p2 is elected and sends its coordinator message.]
[Fig.8.12 A ring-based election in progress: processes with identifiers including 3, 17, 4, 24, 9, 1, 15 and 28 arranged in a ring; participant processes are shown darkened.]


NOTE: The election was started by process 17. The highest process identifier encountered so far is 24. Participant processes are shown darkened.
The point of marking processes as participant or non-participant is so that messages arising when another process starts an election at the same time are extinguished as soon as possible, and always before the "winning" election result has been announced. If only a single process starts an election, then the worst case is when its anti-clockwise neighbor has the highest identifier. A total of n-1 messages are then required to reach this neighbor, which will not announce its election until its identifier has completed another circuit, taking a further n messages. The elected message is then sent n times, making 3n-1 messages in all. An example of a ring-based election in progress is shown in Figure 8.12. The election message currently contains 24, but process 28 will replace this with its identifier when the message reaches it.
Check Your Progress - 1: Answer the following:
1. Define physical time and logical time.
2. Define clock drift.
3. Define coordinated universal time.
4. List the design features of the Network Time Protocol.
5. State the happened-before relation.
6. Define the distributed mutual exclusion principle.
7. Name a distributed algorithm that uses the concept of logical clocks.
8. What is the ring-based algorithm used for?
9. What are the three types of messages used in the bully algorithm?
10. What do you mean by distributed coordination?
Check Your Progress - 2: Answer the following:


1. Explain the necessity of clock synchronization.
2. Describe Cristian's method for synchronizing clocks.
3. Explain the Berkeley algorithm.
4. What is the need for the concept of a logical clock? Explain the happened-before principle.
5. Discuss and compare the different algorithms used for achieving mutual exclusion in distributed systems.
6. Explain election algorithms.
8.6 Summary: In this unit we first discussed the importance of accurate timekeeping for distributed systems. We then described algorithms for synchronizing clocks despite the drift between them and the variability of message delays between computers. The degree of synchronization accuracy that is practically obtainable fulfils many requirements, but is nonetheless not sufficient to determine the ordering of an arbitrary pair of events occurring at different computers.
The happened-before relationship is a partial order on events, which reflects a flow of information (within a process, or via messages between processes) between them. Some algorithms require events to be ordered in happened-before order, for example, successive updates made at separate copies of data. Logical clocks are counters that are updated so as to reflect the happened-before relationship between events.
The unit then described the need for processes to access shared resources under conditions of mutual exclusion. Resource servers do not implement locks in all cases, and a separate distributed mutual exclusion service is then required. Three algorithms were considered which achieve mutual exclusion: the central server, a distributed algorithm using logical clocks, and a ring-based algorithm. These are heavyweight mechanisms that cannot withstand failure, although they can be modified to be fault-tolerant. On the whole, it seems advisable to integrate locking with resource management.
Finally, the unit considered the bully algorithm and a ring-based algorithm whose common aim is to elect a process uniquely from a given set, even if several elections take place concurrently. These algorithms could be used, for example, to elect a new master timeserver, or a new lock server, when the previous one fails.


REFERENCES
1. Stefano Ceri, Giuseppe Pelagatti, "Distributed Databases: Principles & Systems", McGraw-Hill International Editions, 1985.
2. Pradeep K. Sinha, "Distributed Operating Systems: Concepts and Design", PHI, 1998.
3. George Coulouris, Jean Dollimore, Tim Kindberg, "Distributed Systems: Concepts & Design", Second Edition, Addison-Wesley, 2000.
4. Andrew S. Tanenbaum, "Computer Networks", Third Edition, PHI, 1999.
