
MCT702: Distributed Computing
Course Objectives
The differences among concurrent, networked, distributed, and mobile computing.
Resource allocation, and deadlock detection and avoidance techniques.
Remote procedure calls.
IPC mechanisms in distributed systems.

Course Outcomes
Develop, test and debug RPC-based client-server programs in Unix.
Design and build application programs on distributed systems.
Improve the performance and reliability of distributed programs.
Design and build new distributed file systems for any OS.

Syllabus

UNIT I
Introduction: Examples of Distributed Systems - Resource Sharing and the Web - Challenges - Case Study on the World Wide Web. System Models: Introduction - Architectural Models - Fundamental Models.
Distributed Objects and Components: Introduction - Distributed Objects - From Objects to Components - Case Study: Enterprise JavaBeans and Fractal.
Remote Invocation: Remote Procedure Call - Events and Notifications.

UNIT II
Distributed Operating Systems: Introduction - Issues - Communication Primitives - Inherent Limitations - Lamport's Logical Clock - Vector Clock - Causal Ordering - Global State - Cuts - Termination Detection. Distributed Mutual Exclusion: Non-Token-Based Algorithms - Lamport's Algorithm - Token-Based Algorithms - Suzuki-Kasami's Broadcast Algorithm - Consensus and Related Problems. Distributed Deadlock Detection: Issues - Centralized Deadlock-Detection Algorithms - Distributed Deadlock-Detection Algorithms.

UNIT III
Distributed Resource Management: Distributed File Systems - Architecture - Mechanisms - Design Issues - Case Study: Sun Network File System. Distributed Shared Memory: Architecture - Algorithms - Protocols - Design Issues. Distributed Scheduling: Issues - Components - Algorithms - Load-Distributing Algorithms - Load-Sharing Algorithms.

UNIT IV
Transaction and Concurrency: Introduction - Transactions - Nested Transactions - Locks - Optimistic Concurrency Control - Timestamp Ordering.

UNIT V
Resource Security and Protection: Access and Flow Control - Introduction - The Access Matrix Model - Implementation of the Access Matrix Model - Safety in the Access Matrix Model - Advanced Models of Protection. Data Security: Introduction - Modern Cryptography - Private-Key Cryptography - Public-Key Cryptography.

UNIT VI
Distributed Multimedia Systems: Introduction - Characteristics - Quality of Service Management - Resource Management - Stream Adaptation - Case Study.
Designing Distributed Systems: Google Case Study - Introducing the Case Study: Google - Overall Architecture and Design Paradigm - Communication Paradigm - Data Storage and Coordination Services - Distributed Computation Services.

Text Books:
1. Distributed Systems: Concepts and Design, George Coulouris, Jean Dollimore and Tim Kindberg, Pearson Education, 5th Edition.
2. Advanced Concepts in Operating Systems, Mukesh Singhal and N. G. Shivaratri, McGraw-Hill.
3. Distributed Operating Systems, Pradeep K. Sinha, PHI, 2005.

References:
1. Distributed Computing: Principles, Algorithms, and Systems, Ajay D. Kshemkalyani and Mukesh Singhal, Cambridge University Press.

Unit No. I

Chapter 1: Introduction to DS
Examples of Distributed Systems
Resource Sharing
Web Challenges
Case Study on the World Wide Web
Chapter 2: System Models
Introduction
Architectural Models
Fundamental Models
Chapter 3: Distributed Objects and Components
Introduction
Distributed Objects
From Objects to Components, Essence of Components
Case Study: Enterprise JavaBeans and Fractal
Chapter 4: Remote Invocation
Introduction
Remote Procedure Call
Events and Notifications

Chapter 1: Introduction to DS
A distributed system is a collection of independent entities that cooperate to solve a problem that cannot be solved individually.

Definition:
A distributed system is one in which hardware or software components located at networked computers communicate and coordinate their actions only by passing messages.
Note: computers that are connected by a network may be spatially separated by any distance. They may be on separate continents, in the same building or in the same room.
Message passing is the key feature of a distributed system.

Major Consequences
1. Concurrency
Concurrent program execution; sharing of resources; increasing the capacity of the system.

2. No global clock
There is no shared memory (distributed shared memory can only provide the abstraction of a common address space), so there is no single shared notion of the time at which a program's actions occur; programs coordinate by exchanging messages.

3. Independent failures
Faults in the network result in the isolation of the computers that are connected to it. A computer may fail, or a program may terminate unexpectedly somewhere in the system (a crash). It is the responsibility of system designers to plan for the consequences of possible failures.

Examples of Distributed Systems
1. Application Domains
Web search
Massively multiplayer online games
Financial trading
2. Recent Trends
Pervasive networking and the modern Internet
Mobile and ubiquitous computing
Distributed multimedia systems
Distributed computing as a utility

1. Application Domains

Sr. No. | Application Domain | Applications
01. | Finance & commerce | E-commerce (companies like Amazon and eBay); online payments, trading and banking
02. | Information societies | The WWW; search engines (Google, Yahoo); user-generated content (YouTube, Wikipedia, Flickr); social networking (Facebook, MySpace)
03. | Creative industries and entertainment | Online gaming; multimedia download sites; YouTube
04. | Healthcare | Online electronic patient records; telemedicine in supporting remote diagnosis
Application Domains (continued)

Sr. No. | Application Domain | Applications
05. | Education | E-learning; virtual learning environments; distance learning; community-based learning
06. | Transport & logistics | Location technologies such as GPS; web-based map services (MapQuest, Google Maps, Google Earth)
07. | Science | E-science: Grid technology has enabled worldwide collaboration between groups of scientists
08. | Environmental management | Sensor technology, helping to avert natural disasters and to understand complex natural phenomena such as climate change

2. Massively Multiplayer Online Games (MMOGs)

The objective is to provide:
fast response times, to preserve the user experience of the game;
real-time propagation of events to the many players;
a consistent view of the shared world.
Solutions:
1. Client-server architecture (EVE Online)
2. Distributed architecture (EverQuest)
3. Peer-to-peer technology (a purely decentralized approach)

3. Financial Trading
The industry employs automated monitoring and trading applications, built on distributed event-based systems.

2. Recent Trends
1. Pervasive networking and the modern Internet

[Figure: A typical portion of the Internet — intranets connected via ISPs, high-capacity backbones and satellite links; the key distinguishes desktop computers, servers and network links]

The Internet is itself a very large, open distributed system; users make use of services such as the World Wide Web, email and file transfer.
Programs running on the computers connected to it interact by passing messages, employing a common means of communication.
The figure shows a collection of intranets — subnetworks operated by companies and other organizations, typically protected by firewalls.
Internet Service Providers (ISPs) are companies that provide broadband links and other types of connection to individual users and small organizations, enabling them to access services anywhere in the Internet as well as providing local services such as email and web hosting.
A backbone is a network link with a high transmission capacity, employing satellite connections, fibre-optic cables and other high-bandwidth circuits.
Two open problems remain: (i) the isolation of some systems, e.g., those carrying confidential police material, and (ii) improving firewalls with finer-grained mechanisms and policies.

2. Mobile and ubiquitous computing

Driven by advances in device miniaturization and wireless networking:
small portable devices such as laptops, smartphones, mobile phones, GPS units, PDAs, wearable devices such as smart watches, digital cameras and video cameras;
embedded appliances such as washing machines, hi-fi systems, cars and refrigerators.

Example of mobile and ubiquitous computing

[Figure: Portable and handheld devices in a distributed system — a mobile phone, printer, camera and laptop connected via a wireless LAN, WAP gateway, home intranet, host intranet and the Internet]

Key requirements: spontaneous interoperation and service discovery.

Mobile computing
Mobile computing is the performance of computing tasks while the user is on the move, or is visiting places other than their usual environment, by providing access to resources via the devices they carry with them.
Users can continue to access the Internet and resources in their home intranet, and there is increasing provision to utilize resources such as printers, or even sales points, that are conveniently nearby as they move around. The latter is also known as location-aware or context-aware computing.
Mobility introduces a number of challenges for distributed systems, including the need to deal with variable connectivity, and indeed disconnection, and the need to maintain operation in the face of device mobility.

Ubiquitous computing
This envisages many small, cheap computational devices present in users' physical environments, including the home, office and even natural settings.
For example, it may be convenient for users to control their washing machine or their entertainment system from their phone or a universal remote control device in the home. Equally, the washing machine could notify the user via a smart badge or phone when the washing is done.

3. Distributed multimedia systems
The main objectives are the storage, transmission and presentation of:
discrete media types, such as pictures or text messages;
continuous media types, such as audio and video.
Webcasting is an application of distributed multimedia technology: broadcasting continuous media, typically audio or video, over the Internet.
Such systems must handle a range of encoding and encryption formats and provide:
the desired quality of service;
resource management strategies;
scheduling policies;
adaptation strategies in open systems.

4. Distributed computing as a utility
A number of companies are promoting the view of distributed resources as a commodity or utility:
physical resources such as storage and processing (via operating system virtualization);
software services across the global Internet (e.g., Google Apps).

Cloud computing
Clouds are generally implemented on cluster computers: sets of interconnected computers that cooperate closely to provide a single, integrated high-performance computing capability.
Blade servers are minimal computational elements containing, for example, processing and (main-memory) storage capabilities.
Grid computing provides support for scientific applications.

Resource Sharing
We routinely share resources:

Resource | Examples
Hardware resources | Printers, disks
Data resources | Files, databases, web pages
Functionally specific resources | Search engines

Service: an entity that manages a collection of related resources and presents their functionality to users and applications. Examples:

Service | Purpose
File services | Access to files
Printing services | Documents sent to printers
Electronic payment | Buying of goods

A service can be accessed only via the set of operations that it exports; e.g., a file service provides read, write and delete operations on files.
Resources in a distributed system are physically encapsulated within computers and can only be accessed from other computers by means of communication.
Distributed systems commonly use the approach known as client-server computing:

Feature | Client | Server
Operation | Invokes an operation | Executes the remote invocation
Nature | Active | Passive
Lifetime | Lasts as long as the application runs | Runs continuously
Objects | Makes invocations on objects | Encapsulates and contains the objects

Challenges

Sr. No. | Challenge | Remarks
01. | Heterogeneity | Variety and differences in networks, hardware, OS and languages
02. | Openness | Extension of resource-sharing services
03. | Security | Protection of shared resources
04. | Scalability | Growth in the number of users versus fixed resources
05. | Failure handling | Computer or network failures
06. | Concurrency | Multiple users requesting the same resource
07. | Transparency | Concealment of operation, location, etc. from users
08. | Quality of service | Performance, security and reliability; adaptability

1. Heterogeneity

Resource | Examples of heterogeneity
Network | Internet protocols run over Ethernet and many other network types
Computer hardware | Data types such as integers are represented differently, yet messages must be exchanged between programs running on different hardware
Operating systems | The calls for exchanging messages differ between UNIX and Windows
Programming languages | Different languages represent characters and data structures differently, yet programs must be able to communicate with each other

Approaches to masking heterogeneity:
1. Middleware: provides a programming abstraction as well as masking the heterogeneity of the underlying platforms, so that applications by different developers can interoperate. E.g., the Common Object Request Broker Architecture (CORBA) and Java Remote Method Invocation (RMI).
2. Mobile code: program code transferred from one computer to another, e.g., Java applets.
3. Virtual machines: provide a way of making code executable on a variety of host computers; e.g., the Java compiler produces code for the Java virtual machine.
A minimal sketch of the data-representation issue follows.
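The hardware-heterogeneity problem in the table above is usually solved by agreeing on an external data representation before the exchange. Below is a small, runnable Java sketch (illustrative, not from the course notes): DataOutputStream always writes integers in big-endian network order, so a receiver decodes the same value regardless of its native hardware.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ExternalDataDemo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new DataOutputStream(bytes).writeInt(42);   // sender encodes in the agreed form

            DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
            System.out.println(in.readInt());           // receiver decodes: prints 42
        }
    }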

2. Openness
Openness is the ability to extend and reimplement resource-sharing services, which can then be made available for use by a variety of client programs.
Open systems are characterized by the fact that their key interfaces are published; e.g., the Internet protocols are published as Requests For Comments (RFCs).
Open distributed systems are based on the provision of a uniform communication mechanism and published interfaces for access to shared resources; e.g., the World Wide Web Consortium (W3C) publishes the standards on which the Web works.
Open distributed systems can be constructed from heterogeneous hardware and software, possibly from different vendors, but the conformance of each component to the published standard must be carefully tested and verified if the system is to work correctly.

3. Security
Security for information resources has three components:
Confidentiality: protection against disclosure to unauthorized individuals.
Integrity: protection against alteration or corruption.
Availability: protection against interference with the means to access the resources.

Two further security challenges:
Denial-of-service attacks: disrupting a service for some reason.
Security of mobile code: receiving an executable program as an email attachment (its behaviour when run is unpredictable).

4. Scalability
Distributed systems operate effectively and efficiently at many different scales, ranging from a small intranet to the Internet.
A system is described as scalable if it remains effective when there is a significant increase in the number of resources and the number of users.
Major scalability challenges in distributed systems:
1. Controlling the cost of physical resources (the number of file servers should be proportionate to the number of file users).
2. Controlling the performance loss (data sets held in hierarchic structures scale better than linear structures).
3. Preventing software resources from running out (e.g., adoption of a new version of the Internet protocol with 128-bit addresses).
4. Avoiding performance bottlenecks (partitioning of name spaces, or replication of web pages).

5. Failure Handling
Detecting failures: some failures can be detected; others can only be suspected.
Masking failures: hiding failures or making them less severe.
Tolerating failures: reporting a failure directly to the user (a direct alert) rather than waiting indefinitely.
Recovering from failures: rolling back processes to a consistent state.
Redundancy: tolerating failures through replication.

6. Concurrency
Problems arise when two or more users access the same resource at the same time.
Each resource is encapsulated as an object, and invocations are executed in concurrent threads. Consistency can be maintained by the use of semaphores and other mutual exclusion mechanisms; a runnable sketch follows.
Note:
Thread: the smallest sequence of programmed instructions that can be managed independently by an operating system scheduler.
Semaphore: a variable or abstract data type that provides a simple but useful abstraction for controlling access by multiple processes to a common resource in a parallel-programming or multi-user environment.
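To make the note above concrete, here is a minimal, runnable Java sketch (illustrative, not from the course notes): a binary semaphore serializes two threads' access to a shared counter, so the final value is deterministic.

    import java.util.concurrent.Semaphore;

    public class SemaphoreDemo {
        private static final Semaphore mutex = new Semaphore(1); // one permit = mutual exclusion
        private static int sharedCounter = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable task = () -> {
                for (int i = 0; i < 10_000; i++) {
                    try {
                        mutex.acquire();      // enter critical section
                        sharedCounter++;      // access the shared resource
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        mutex.release();      // leave critical section
                    }
                }
            };
            Thread t1 = new Thread(task), t2 = new Thread(task);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println("Counter = " + sharedCounter); // always 20000
        }
    }

Without the acquire/release pair, the two threads would interleave their increments and the result would vary from run to run.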

7. Transparency
Transparency is the concealment of the separation of components from users:
Access transparency: local and remote resources can be accessed using identical operations (e.g., FTP services).
Location transparency: resources can be accessed without knowledge of their whereabouts (e.g., URLs).
Concurrency transparency: processes can operate concurrently using shared resources without interference (threads and semaphores).
Failure transparency: faults can be concealed from users/applications (e.g., retransmission of mail).
Mobility transparency: resources/users can move within a system without affecting their operations (e.g., mobile phone networks such as Airtel and Idea).

Transparency (continued)
Replication transparency: enables multiple instances of resources to be used to increase reliability and performance without knowledge of the replicas by users or application programmers.
Performance transparency: the system can be reconfigured to improve performance.
Scaling transparency: the system can be expanded in scale without change to the applications.

Transparency Examples
A distributed file system allows access transparency and location transparency.
URLs are location-transparent, but not mobility-transparent.
Message retransmission governed by TCP is a mechanism providing failure transparency.
A mobile phone is an example of mobility transparency.
Note: access transparency and location transparency together are called network transparency.

8. Quality of Service
Responsiveness and computational throughput.
The ability to meet timeliness guarantees depends on the available computing and communication resources.
QoS applies to operating systems as well as networks.
There must be resource managers that provide guarantees; reservation requests that cannot be met should be rejected.

Case Study: World Wide Web
Introduction
Short History
Concept
Major Components
HTML
URL
HTTP
Related Terms

Short History
The Web began life at the European centre for nuclear research (CERN), Switzerland, in 1989 as a vehicle for exchanging documents between a community of physicists connected by the Internet.

Concept
The Web provides:
a hypertext structure among the documents that it stores;
hyperlinks, i.e., references to other documents and resources that are also stored in the Web.
The Web is an open system: it can be extended and implemented in new ways without disturbing its existing functionality.
Its operation is based on communication standards and document or content standards that are freely published and widely implemented.

Concept (continued)
There are many types of browser, implemented on several platforms, and many implementations of web servers.
Users have access to browsers on the majority of the devices that they use, from mobile phones to desktop computers.
The Web is open with respect to the types of resource that can be published and shared on it: if somebody invents, say, a new image-storage format, then images in this format can immediately be published on the Web.
Note: the Web has moved beyond these simple data resources to encompass services, such as electronic purchasing of goods. It has evolved without changing its basic architecture.

[Figure: Web servers and web browsers — browsers issue requests such as http://www.google.com/search?q=kindberg, http://www.cdk3.net/ and http://www.w3c.org/Protocols/Activity.html over the Internet to the web servers www.google.com, www.cdk3.net and www.w3c.org; the path Protocols/Activity.html is resolved within the file system of www.w3c.org]

Major Components
The Web is based on three main standard technological components:
the HyperText Markup Language (HTML), a language for specifying the contents and layout of pages as they are displayed by web browsers;
Uniform Resource Locators (URLs), also known as Uniform Resource Identifiers (URIs), which identify documents and other resources stored as part of the Web;
a client-server system architecture, with standard rules for interaction (the HyperText Transfer Protocol, HTTP) by which browsers and other clients fetch documents and other resources from web servers.

1. HTML
The HyperText Markup Language is used to specify the text and images that make up the contents of a web page, and to specify how they are laid out and formatted for presentation to the user.
A web page contains structured items such as headings, paragraphs, tables and images.
HTML is also used to specify links and which resources are associated with them.
Users may produce HTML by hand using a standard text editor, but more commonly they use an HTML-aware WYSIWYG editor that generates HTML from a layout created graphically.

Example
Consider a piece of code stored in an HTML file, say earth.html:

    <IMG SRC = "http://www.cdk5.net/WebExample/Images/earth.jpg">
    <P>
    Welcome to Earth! Visitors may also be interested in taking a look at the
    <A HREF = "http://www.cdk5.net/WebExample/moon.html">Moon</A>.
    </P>

Output: the image, followed by the text "Welcome to Earth! Visitors may also be interested in taking a look at the Moon.", where "Moon" is a hyperlink.

2. URL
The purpose of a Uniform Resource Locator is to identify a resource. URLs may use the File Transfer Protocol (FTP) or the HyperText Transfer Protocol (HTTP), e.g.:

    ftp://ftp.downloadIt.com/software/aProg.exe

An HTTP URL has two main jobs: to identify which web server maintains the resource, and to identify which of the resources at that server is required.

In general, HTTP URLs are of the following form (a decomposition sketch follows):

    http://servername[:port][/pathName][?query][#fragment]

Publishing a resource: a resource identified by the URL http://S/P is published by placing a file at path P where the web server S can access it.
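The general form above can be illustrated with a short, runnable Java sketch (the URL is made up for illustration): java.net.URI breaks an HTTP URL into exactly these parts.

    import java.net.URI;

    public class UrlParts {
        public static void main(String[] args) {
            URI u = URI.create("http://www.cdk5.net:8888/WebExample/moon.html?q=phase#top");
            System.out.println("server name: " + u.getHost());     // www.cdk5.net
            System.out.println("port:        " + u.getPort());     // 8888
            System.out.println("path name:   " + u.getPath());     // /WebExample/moon.html
            System.out.println("query:       " + u.getQuery());    // q=phase
            System.out.println("fragment:    " + u.getFragment()); // top
        }
    }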

3. HTTP
The HyperText Transfer Protocol defines the ways in which browsers and other types of client interact with web servers. Its main features are:
Request-reply interactions: operations include GET, to retrieve data from a resource, and POST, to provide data to a resource.
Content types: if the type is text/html, a browser interprets the text as HTML and displays it; image/GIF is rendered as an image in GIF format; application/zip is data compressed in zip format.
One resource per request: each request names one resource; a browser makes several requests concurrently to reduce the overall delay to the user.
Simple access control: by default, any user with network connectivity to a web server can access its published resources; the server can restrict access to particular resources.
A sketch of a single request-reply interaction follows.
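Here is a minimal Java sketch of one HTTP request-reply interaction (Java 11+; it needs network access, and the target URL is merely illustrative).

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HttpGetDemo {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://www.w3.org/"))
                    .GET()                           // retrieve data from the resource
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            // The reply carries a status code and a content type chosen by the server
            System.out.println("Status: " + response.statusCode());
            System.out.println("Content-Type: " +
                    response.headers().firstValue("Content-Type").orElse("unknown"));
        }
    }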

Related Terms
Dynamic pages
Downloaded code: Common Gateway Interface programs on the server; JavaScript; Asynchronous JavaScript And XML (AJAX); applets
Web services: programmatic access to web resources; web resources provide service-specific operations (GET, POST, PUT, DELETE)
Web discussion: the Web faces problems of scale, addressed by the use of proxy servers and clusters of computers

Chapter 2: System Models
Introduction
Architectural Models
  Client-server model
  Peer-to-peer model
  Variations
Fundamental Models
  Interaction model
  Failure model
  Security model

Introduction
Difficulties and threats for distributed systems:
Widely varying modes of use: the component parts of systems are subject to wide variations in workload (some web pages are accessed several million times a day). Some parts of a system may be disconnected, or poorly connected, some of the time (e.g., when mobile computers are included). Some applications have special requirements for high communication bandwidth and low latency (e.g., multimedia applications).
Wide range of system environments: a distributed system must accommodate heterogeneous hardware, operating systems and networks. Networks may differ widely in performance: wireless networks operate at a fraction of the speed of local networks. Systems of widely differing scales, ranging from tens of computers to millions of computers, must be supported.
Internal problems: non-synchronized clocks, conflicting data updates, and many modes of hardware and software failure involving the individual system components.
External threats: attacks on data integrity and secrecy, and denial of service.
Introduction (continued)
The properties and design issues of distributed systems can be captured and discussed through the use of descriptive models. Each type of model is intended to provide an abstract, simplified but consistent description of a relevant aspect of distributed system design.
The basic models under consideration are:
the architectural model;
the fundamental model;
and one more model: the physical model.

1. The Architectural Models
Objectives (approaches):
Looking at the core underlying architectural elements that underpin modern distributed systems, highlighting the diversity of approaches that now exist.
Examining composite architectural patterns that can be used in isolation or, more commonly, in combination, in developing more sophisticated distributed systems solutions.
Considering the middleware platforms that are available to support the various styles of programming that emerge from the above architectural styles.

I. Architectural Elements
Key questions:
What are the entities that are communicating in the distributed system?
How do they communicate; more specifically, what communication paradigm is used?
What (potentially changing) roles and responsibilities do they have in the overall architecture?
How are they mapped onto the physical distributed infrastructure (what is their placement)?

1. Communicating Entities

System-oriented entities:
Nodes (primitive view, based on the OS layer)
Processes (distributed environment, possibly with threads)

Problem-oriented entities:
Objects: a decomposition for the given problem domain; an interface definition language (IDL); methods defined on an object.
Components: accessed through interfaces, making all dependencies explicit; supports third-party development by removing hidden dependencies.
Web services: defined by web-based technologies; a software application identified by a URI; message exchanges via Internet-based protocols.

2. Communication Paradigms

Inter-process communication: low-level support for communication between processes in a distributed system.
Message passing (client-server architectures)
Socket programming: use of IP (TCP/UDP)
Multicast communication: one message to many receivers

Remote invocation: a two-way exchange between entities in terms of remote operations, procedures or methods.
Request-reply protocols: client-server communication, with messages encoded as arrays of bytes
Remote procedure calls: procedures in processes on remote computers can be called
Remote method invocation: a calling object can invoke a method in a remote object

Indirect communication:
Group communication: one-to-many
Publish-subscribe systems: event-based systems
Message queues: point-to-point service
Tuple spaces: parallel programming
Distributed shared memory: for computers that do not share physical memory

A minimal sketch of the lowest-level paradigm, message passing over sockets, follows.
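Below is a small, self-contained Java sketch of message passing over UDP sockets (illustrative, not from the course notes; both endpoints run in one process on the loopback address, and the port number 6789 is arbitrary).

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;

    public class UdpMessageDemo {
        public static void main(String[] args) throws Exception {
            try (DatagramSocket receiver = new DatagramSocket(6789);
                 DatagramSocket sender = new DatagramSocket()) {
                byte[] msg = "hello".getBytes();
                // One process passes a message to another (here, via loopback)
                sender.send(new DatagramPacket(
                        msg, msg.length, InetAddress.getLoopbackAddress(), 6789));

                byte[] buf = new byte[1000];
                DatagramPacket incoming = new DatagramPacket(buf, buf.length);
                receiver.receive(incoming);    // blocks until the message arrives
                System.out.println(new String(incoming.getData(), 0, incoming.getLength()));
            }
        }
    }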

3. Roles and Responsibilities
Roles are fundamental in establishing the overall architecture to be adopted; they reflect the responsibilities of each component in the distributed system.
Basic architectural models:
Client-server model
Peer-to-peer model

A. Client-Server Model
The most important and most widely employed distributed system architecture.
Client and server roles are assigned and changeable; servers may in turn be clients of other servers.
Services may be implemented as several interacting processes in different host computers to provide a service to client processes: servers partition the set of objects on which the service is based and distribute them among themselves (e.g., web data and web servers).

[Figure: Clients invoke individual servers — client processes send invocations to server processes and receive results; key: process, computer]

Example: a search engine acts both as a server (responding to client queries) and as a client (running programs called web crawlers that invoke other web servers).

B. Peer-to-Peer Model
Peer processes:
All processes play similar roles, without distinction between clients and servers.
They interact cooperatively to perform a distributed activity.
Communication patterns depend on application requirements.

In system architectures and networks, peer-to-peer is an architecture in which computer resources and services are exchanged directly between computer systems. These resources and services include the exchange of information, processing cycles, cache storage, and disk storage for files.
In such an architecture, computers that have traditionally been used solely as clients communicate directly among themselves and can act as both clients and servers, assuming whatever role is most efficient for the network.

[Figure: A distributed application based on peer processes — peers 1 to N each run the application and coordinate access to sharable objects]

The application's database and the storage, processing and communication loads for access to objects are distributed across many computers, exploiting the resources (both data and hardware) of all participants.

4. Placement
Variations in the models:
i. Mapping of services to multiple servers
ii. Caching, using web proxy servers
iii. Web applets, as a form of mobile code
iv. Mobile agents
v. Thin clients

i. Multiple servers

[Figure: A service provided by multiple servers — clients invoke a service implemented by several cooperating server processes]

ii. Web proxy server

[Figure: Web proxy server — clients reach web servers either directly or through a shared proxy server]

Cache:
A store of recently used data objects that is closer to the client process than the remote objects themselves.
When an object is needed by a client process, the caching service checks the cache and supplies the object from there if an up-to-date copy is available; a sketch of this check follows.

Proxy server:
Provides a shared cache of web resources for client machines at a site or across several sites.
Increases availability and performance of a service by reducing load on the WAN and on web servers.
May also be used to access remote web servers through a firewall.
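Below is a minimal Java sketch of the freshness check just described: serve a stored copy if it is still recent enough, otherwise fetch and store. All names, the age-based freshness rule and the placeholder fetch are illustrative assumptions, not the behaviour of any particular proxy.

    import java.util.HashMap;
    import java.util.Map;

    public class WebCache {
        private static class Entry {
            final String data;
            final long fetchedAt;
            Entry(String data, long fetchedAt) { this.data = data; this.fetchedAt = fetchedAt; }
        }

        private final Map<String, Entry> cache = new HashMap<>();
        private final long maxAgeMillis;

        public WebCache(long maxAgeMillis) { this.maxAgeMillis = maxAgeMillis; }

        public String get(String url) {
            Entry e = cache.get(url);
            if (e != null && System.currentTimeMillis() - e.fetchedAt < maxAgeMillis) {
                return e.data;                       // up-to-date copy: serve from the cache
            }
            String fresh = fetchFromOrigin(url);     // otherwise go to the web server
            cache.put(url, new Entry(fresh, System.currentTimeMillis()));
            return fresh;
        }

        private String fetchFromOrigin(String url) {
            return "<contents of " + url + ">";      // placeholder for a real HTTP fetch
        }
    }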

iii. Web applets

[Figure: Web applets — (a) a client request results in the downloading of applet code from a web server; (b) the client then interacts with the applet locally]

Example: Java applets.
The user running a browser selects a link to an applet whose code is stored on a web server; the code is downloaded to the browser and runs there.
Advantage: good interactive response, since the interaction does not suffer from the delays or variability of bandwidth associated with network communication.
Disadvantage: a security threat to the local resources in the destination computer.
iv. Mobile Agents
A running program (including both code and data) that travels from one computer to another in a network, carrying out a task on someone's behalf.
It can make many invocations to local resources at each visited site.
Visited sites must decide which local resources the agent is allowed to use, based on the identity of the user owning the agent.
Advantage: reduces communication cost and time by replacing remote invocations with local ones.
Disadvantages: limited applicability; a security threat to the visited site's resources.

v. Thin Clients
A thin client is a software layer that supports a window-based user interface that is local to the user, while application programs execute on a remote computer.

[Figure: Thin clients and compute servers — thin-client devices connect over the network to a compute server that runs the applications]

Thin Clients (continued)
Similar to the network computer scheme, but instead of downloading application code onto the user's computer, the applications run on a server machine: a compute server.
A compute server is a powerful computer with the capacity to run large numbers of applications simultaneously.
Disadvantage: increased delays in highly interactive graphical applications.
This concept has led to the emergence of virtual network computing (VNC), which has superseded network computers.
Since all the application data and code is stored by a file server, users may migrate from one network computer to another.

II. Architectural Patterns
Layering
Tiered architecture

1. Layering
In the layered view of a system, each layer offers its services to the level above and builds its own service on the services of the layer below.
Software architecture is the structuring of software in terms of layers (modules) or services that can be requested locally or remotely.

Layer stack (top to bottom):
Applications, services
Middleware
Operating system
Computer and network hardware
(The operating system together with the hardware forms the platform.)

Platform:
The lowest-level layers, which provide services to the higher layers; they bring a system programming interface for communication and coordination between processes.
Examples: Pentium processor / Windows NT; SPARC processor / Solaris.

Middleware:
A layer of software that masks heterogeneity and provides a unified distributed programming interface to application programmers.
Provides infrastructure services for use by application programs.
Examples:
Object Management Group's Common Object Request Broker Architecture (CORBA)
Java Remote Method Invocation (RMI)
Microsoft's Distributed Common Object Model (DCOM)
Limitation: some tasks still require application-level involvement.

2. Tiered Architecture

III. Middleware Platforms

Architectural Design Requirements
1. Performance issues, considered under the following factors:
Responsiveness:
Fast and consistent response times are important for users of interactive applications. Response speed is determined by the load and performance of the server and the network, and by the delay in all the involved software components. To achieve good response times, a system must be composed of relatively few software layers and transfer small quantities of data.
Throughput:
The rate at which work is done for all users in a distributed system.
Load balancing:
Enables applications and service processes to proceed concurrently without competing for the same resources; exploits the available processing resources.

Architectural Design Requirements (continued)
2. Quality of service. The main system properties that affect service quality are:
Reliability: related to the failure fundamental model (discussed later).
Performance: the ability to meet timeliness guarantees.
Security: related to the security fundamental model (discussed later).
Adaptability: the ability to meet changing resource availability and system configurations.

3. Dependability issues:
A requirement in most application domains, achieved by:
Fault tolerance: continuing to function in the presence of failures.
Security: locating sensitive data only in secure computers.
Correctness of distributed concurrent programs: a research topic.

2. Fundamental Models
Models of systems share some fundamental properties; the fundamental models are more specific about systems' characteristics and about the failures and security risks they might exhibit.
The interaction model is concerned with the performance of processes and communication channels and the absence of a global clock.
The failure model classifies the failures of processes and basic communication channels in a distributed system.
The security model identifies the possible threats to processes and communication channels in an open distributed system.

I. Interaction Model
A distributed system consists of multiple interacting processes, each with its own private set of data that it can access.
The behaviour of distributed processes is described by distributed algorithms, which define the steps to be taken by each process in the system, including the transmission of messages between them.
Transmitted messages transfer information between these processes and coordinate their ordering and synchronization activities.

Two Significant Factors
1. Performance of communication channels, characterized by:
Latency: the delay between the sending and the receipt of a message, including:
network access time (e.g., Ethernet transmission depends on the traffic load);
the time for the first bit transmitted through the network to reach its destination (e.g., a satellite link using radio signals);
processing time within the sending and receiving processes (which depends on the current load on the operating systems).
Throughput: the number of units (e.g., packets) delivered per time unit.
Bandwidth: the total amount of information transmitted per time unit (communication channels using the same network share the available bandwidth).
Jitter: the variation in the time taken to deliver a series of messages.

2. Computer clocks and event ordering:
Each computer in a distributed system has its own internal clock to supply the value of the current time to local processes. Therefore, two processes running on different computers that read their clocks at the same time may obtain different time values.
Clock drift rate refers to the relative amount by which a computer clock differs from a perfect reference clock.
Several approaches to correcting the times on computer clocks have been proposed, e.g.:
radio receivers that get time readings from the Global Positioning System (GPS), with an accuracy of about 1 microsecond;
clock corrections made by sending messages from a computer that has an accurate time to other computers (these will still be affected by network delays).

Two Variations of the Interaction Model
Setting time limits for process execution or message delivery in a distributed system is hard. Two opposing extreme positions provide a pair of simple interaction models:
1. Synchronous distributed systems:
A system in which the following bounds are defined:
the time to execute each step of a process has known lower and upper bounds;
each message transmitted over a channel is received within a known bounded time;
each process has a local clock whose drift rate from perfect time has a known bound.
Easier to handle, but determining realistic bounds can be hard or impossible. A synchronous model is required for ...
2. Asynchronous distributed systems:
A system in which there are no bounds on:
process execution times;
message delivery times;
clock drift rates.
It allows no assumptions about the time intervals involved in any execution. This exactly models the Internet; browsers, for instance, are designed to allow users to do other things while they are waiting.
It is more abstract and general: a distributed algorithm executing on one asynchronous system is likely to also work on another one.

Interaction Model: Event Ordering
Event ordering matters when we need to know whether an event at one process (sending or receiving a message) occurred before, after, or concurrently with another event at another process.
It is impossible for any process in a distributed system to have a view of the current global state of the system.
Nevertheless, the execution of a system can be described in terms of events and their ordering, despite the lack of accurate clocks.
Logical clocks define an event order based on causality; logical time can be used to provide an ordering among events at different computers in a distributed system (since physical clocks cannot be perfectly synchronized). A small logical-clock sketch follows the example below.
Example:
1. User X sends a message with the subject "Meeting".
2. Users Y and Z reply by sending messages with the subject "Re: Meeting".
In real time, X's message is sent first; Y reads it and replies; Z then reads both messages before sending a further reply.

[Figure: Real-time ordering of events — messages m1, m2 and m3 flowing between processes X, Y, Z and A against a physical-time axis with receive times t1, t2, t3]
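The logical-clock idea can be made concrete with a minimal Java sketch (illustrative, not part of the course notes): each process keeps a counter that advances on every local event and merges with the timestamp carried by each incoming message, yielding an ordering consistent with causality.

    public class LamportClock {
        private long time = 0;

        public synchronized long tick() {           // a local event occurs
            return ++time;
        }

        public synchronized long send() {           // timestamp to attach to an outgoing message
            return ++time;
        }

        public synchronized long receive(long msgTime) {  // on message receipt
            time = Math.max(time, msgTime) + 1;     // merge the sender's clock, then advance
            return time;
        }
    }

In the email example, Z's clock would take the maximum of the timestamps on m1 and m2 before stamping m3, so m3 is ordered after both causes.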

II. Failure Model
The failure model defines the ways in which failure may occur, in order to provide an understanding of its effects. A taxonomy distinguishes between the failures of processes and of communication channels:
Omission failures: a process or channel fails to do something it is supposed to do. The chief omission failure of a process is to crash; a crash is called fail-stop if other processes can detect with certainty that the process has crashed. Send and receive omissions relate to the communication primitives.
Arbitrary failures: any type of error can occur in processes or channels (the worst case).
Timing failures: applicable only to synchronous distributed systems, where time limits may not be met.

1. Omission Failures

[Figure: Processes and channels — process p sends message m to process q through a communication channel, via an outgoing message buffer at p and an incoming message buffer at q]

Loss of messages between the sending process and the outgoing message buffer: send-omission failures.
Loss of messages between the incoming message buffer and the receiving process: receive-omission failures.

2. Arbitrary Failures
The term arbitrary or Byzantine failure describes the worst possible failure semantics, in which any type of error may occur. For example, a process may set wrong values in its data items, or it may return a wrong value in response to an invocation.
Arbitrary failures in processes cannot be detected by seeing whether the process responds to invocations, because it might arbitrarily omit to reply.
Communication channels can also suffer arbitrary failures; e.g., message contents may be corrupted, nonexistent messages may be delivered, or real messages may be delivered more than once.
Arbitrary failures of communication channels are rare, because the communication software is able to recognize them and reject the faulty messages: checksums are used to detect corrupted messages (a sketch follows), and message sequence numbers can be used to detect nonexistent and duplicated messages.
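Here is a minimal, runnable Java sketch of the checksum mechanism mentioned above (CRC32 is one common checksum; the message content and the simulated corruption are illustrative).

    import java.util.zip.CRC32;

    public class ChecksumDemo {
        static long checksum(byte[] data) {
            CRC32 crc = new CRC32();
            crc.update(data);
            return crc.getValue();
        }

        public static void main(String[] args) {
            byte[] message = "transfer 100".getBytes();
            long sent = checksum(message);      // sender attaches this to the message

            message[9] = '9';                   // the channel corrupts one byte in transit

            long received = checksum(message);  // receiver recomputes and compares
            System.out.println(sent == received ? "accept message" : "reject faulty message");
        }
    }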

Failure Model: Omission and Arbitrary Failures

[Table: classes of omission and arbitrary failure, the entity (process or channel) they affect, and a description of each]

3. Timing Failures
Timing failures are applicable in synchronous distributed systems, where time limits are set on process execution time, message delivery time and clock drift rate.
Real-time operating systems are designed with a view to providing timing guarantees.
The typical timing failures are:

Class of Failure | Affects | Description
Clock | Process | The process's local clock exceeds the bounds on its rate of drift from real time.
Performance | Process | The process exceeds the bounds on the interval between two steps.
Performance | Channel | A message's transmission takes longer than the stated bound.

III. Security Model
Secure processes and channels; protect encapsulated objects against unauthorized access.
Protecting access to objects:
access rights;
in client-server systems: authentication of clients.
Protecting processes and interactions:
Threats to processes: the problem of unauthenticated requests and replies, e.g., a "man in the middle".
Threats to communication channels: an enemy may copy, alter or inject messages as they travel across the network.

1. Protecting access to objects

[Figure: Objects and principals — a client, acting for a principal (user), sends an invocation over the network to a server, acting for a principal (server), which holds objects governed by access rights; the result is returned to the client. Examples: a user's private data, such as a mailbox; a server's shared data, such as web pages]

2. Protecting processes and interactions

[Figure: The enemy — process p sends message m to process q over a communication channel; an enemy may intercept the channel and keep a copy of m]

Analysis of security threats:
Threats to processes: on the server side or the client side.
Threats to communication channels: to the privacy and integrity of information as it travels across the network.

Defeating security threats

[Figure: Secure channels — processes p and q, acting for principals A and B, communicate over a secure channel]

Secure channels are achieved with cryptography and shared secrets.

Other types of threat
Denial of service: e.g., pings to selected web sites; generating a debilitating network or server load so that network services become de facto unavailable.
Mobile code: requires execution privileges on the target machine; the code may be malicious (e.g., mail worms).

Chapter 3: Distributed Objects & Components
Introduction
Distributed Objects
Need from Distributed Objects to Components
Components
Case Study: Enterprise JavaBeans, Fractal

Overview of the Architectural Model
3 objectives: architectural elements, architectural patterns, and available middleware platforms.
4 architectural elements: entities, communication paradigms, roles & responsibilities, and placement.
2 kinds of entities: system-oriented and problem-oriented entities.
3 problem-oriented entities: objects, components and web services.
3 communication paradigms: inter-process communication, remote invocation (request-reply protocols, RMI, RPC), and indirect communication (e.g., event-based).

Introduction
This chapter discusses complete middleware solutions, presenting distributed objects and components as two of the most important styles of middleware in use today.
Software that allows a level of programming beyond processes and message passing is called middleware. Middleware layers are based on protocols and application programming interfaces.

Middleware layers (top to bottom):
Applications
RMI, RPC and events
Request-reply protocol
External data representation
Operating system

Programming Models
Remote procedure calls: client programs call procedures in server programs.
Remote method invocation: objects invoke methods of remote objects on distributed hosts.
Event-based programming model: objects receive notice of events in other objects in which they have an interest.

Interface
Current programming languages allow programs to be developed as a set of modules that communicate with each other; the permitted interactions between modules are defined by interfaces.
A specified interface can be implemented by different modules without any need to modify other modules using the interface.
In a distributed system, a remote interface defines the methods of a remote object on a server, specifying for each method the input and output arguments available to clients.
Remote objects can return objects as arguments back to the client, and can return references to remote objects to the client.
Interfaces do not have constructors.

Benefits of Middleware
Location transparency: remote objects seem as if they are on the same machine as the client.
Communication protocols: the client/server does not need to know whether the underlying protocol used by the middleware is UDP or TCP.
Computer hardware / operating system: hides differences in data representation caused by different computer hardware or operating systems.
Programming languages: allows the client and server programs to be written in different languages.

The major tasks of middleware:
To provide a higher-level programming abstraction for the development of distributed systems.
Through layering, to abstract over heterogeneity in the underlying infrastructure and so promote interoperability and portability.
The types of middleware in use today:
Distributed object middleware
Component-based middleware

1. Distributed Object Middleware
Adopts an object-oriented programming model, in which the communicating entities are represented by objects.
The inherent encapsulation and data abstraction provide more dynamic and extensible solutions.
A range of middleware solutions based on distributed objects is available, including Java RMI and CORBA.

Benefits of Distributed Object Middleware
The encapsulation inherent in object-based solutions is well suited to distributed programming.
The related property of data abstraction provides a clean separation between the specification of an object and its implementation, allowing programmers to deal solely in terms of interfaces and not be concerned with implementation details such as the programming language and operating system used.
This approach also lends itself to more dynamic and extensible solutions, for example by enabling the introduction of new objects or the replacement of one object with another (compatible) object.

Limitations
Implicit dependencies
Programming complexity
Lack of separation of distribution concerns
No support for deployment

2. Component-Based Middleware
Component-based middleware addresses the limitations of distributed object middleware, and also adds significant support for distributed systems development and deployment.
Software components are like distributed objects in that they are encapsulated units of composition.
A given component specifies both the interfaces it provides to the outside world and its explicit dependencies on other components in the distributed environment.

Distributed Objects
Middleware based on distributed objects is designed to provide a programming model based on object-oriented principles, bringing the benefits of the object-oriented approach to distributed programming.
The term distributed objects (or remote objects) usually refers to software modules that are designed to work together but reside either on multiple computers connected via a network or in different processes on a single computer. One object sends a message to an object on a remote machine or in another process in order to invoke one of its methods.
Distributed Objects (continued)
In a system of distributed objects, the unit of distribution is the object.
Objects that can receive remote requests for services are called remote objects.
Remote objects must be accessible through a remote reference.
To invoke a method, its signature and parameters must be defined in a remote interface.
Together, these technologies are called remote method invocation (RMI).

Local object references:

    Rectangle r1 = new Rectangle();
    Rectangle r2 = r1;    // r2 refers to the same local object as r1

Interface for a remote object:

    public interface Hello extends java.rmi.Remote {
        String sayHello() throws java.rmi.RemoteException;
    }

[Figure: A remote object and remote/local invocations — a remote object holds data and an implementation of methods m1-m6; its remote interface (specified in an IDL) exposes m1, m2 and m3. Objects A to F interact: A makes a remote invocation of B; B, C, D and E interact through local invocations; E makes a remote invocation of F]
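To make the Hello interface concrete, here is a minimal, self-contained Java RMI sketch (illustrative, not from the course notes): a server implementation is exported and registered, and a client looks it up and makes a remote invocation. The registry name "HelloService" and the single-process setup are assumptions made for brevity.

    import java.rmi.Remote;
    import java.rmi.RemoteException;
    import java.rmi.registry.LocateRegistry;
    import java.rmi.registry.Registry;
    import java.rmi.server.UnicastRemoteObject;

    interface Hello extends Remote {
        String sayHello() throws RemoteException;
    }

    class HelloImpl implements Hello {
        public String sayHello() { return "Hello, world!"; }
    }

    public class HelloServer {
        public static void main(String[] args) throws Exception {
            // Export the remote object and publish it under a name
            Hello stub = (Hello) UnicastRemoteObject.exportObject(new HelloImpl(), 0);
            Registry registry = LocateRegistry.createRegistry(1099);
            registry.rebind("HelloService", stub);

            // A client (possibly on another machine) then does:
            Hello remote = (Hello) LocateRegistry.getRegistry("localhost", 1099)
                                                 .lookup("HelloService");
            System.out.println(remote.sayHello());   // remote invocation
        }
    }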

RMI should be able to raise distributed exceptions, such as timeouts due to distribution, as well as those raised during the execution of the method invoked.
Distributed garbage collection is generally achieved by cooperation between the existing local garbage collector and an added module that carries out a form of distributed garbage collection.
Because of the level of heterogeneity that may exist in a distributed system, the concepts of class and inheritance are either avoided or adapted.

The Added Complexities
1. Inter-object communication:
A distributed object middleware framework must offer one or more mechanisms for objects to communicate in the distributed environment.
2. Lifecycle management:
Lifecycle management is concerned with the creation, migration and deletion of objects, with each step having to deal with the distributed nature of the underlying environment.
3. Activation and deactivation:
Activation is the process of making an object active in the distributed environment by providing the necessary resources for it to process incoming invocations effectively: locating the object in virtual memory and giving it the necessary threads to execute. Deactivation is the opposite process, rendering an object temporarily unable to process invocations.

The Added Complexities (continued)
4. Persistence:
Objects typically have state, and it is important to maintain this state across possible cycles of activation and deactivation, and indeed across system failures. Distributed object middleware must therefore offer persistence management for stateful objects.
5. Additional services:
A comprehensive distributed object middleware framework must also provide support for the range of distributed system services, viz. naming, security and transaction services.

Examples of Distributed Object Middleware
Examples include Java RMI, CORBA and DCOM (listed among the middleware examples earlier).

Need from Distributed Objects to Components
1. Implicit dependencies:
A distributed object offers a contract (interface) that represents a binding agreement between the provider of the object and its users in terms of its expected behaviour (methods).
Implicit dependencies make it hard to replace one object with another, and hence also for third-party developers to implement one particular element in a distributed configuration.
Requirement: there is a clear requirement to specify not only the interfaces offered by an object but also the dependencies that object has on other objects in the distributed configuration.

2. Interaction with the middleware:
Despite the goals of transparency, programmers are exposed to many relatively low-level details of the middleware architecture, which need further simplification.
Requirement: there is a clear need to simplify the programming of distributed applications and to present a clean separation of concerns between code related to operation in a middleware framework and code associated with the application.

3. Lack of separation of distribution concerns:
Programmers using distributed object middleware also have to deal explicitly with non-functional concerns related to issues such as security, transactions, coordination and replication.
Requirement: the separation of concerns related to the above issues should be extended by providing the full range of distributed system services, with the complexities of dealing with them hidden wherever possible from the programmer.

4. No support for deployment:
Technologies such as Java RMI and CORBA do not support the deployment of the arbitrary distributed configurations that are developed.
Requirement: middleware platforms should provide intrinsic support for deployment, so that distributed software can be installed and deployed in the same way as software for a single machine, with the complexities of deployment hidden from the user.

Components
A component can be thought of as a collection of objects that provide a set of services to other systems.
The set of services might include code providing graphing facilities, network communication services, browsing services related to database tables, etc.
The Object Linking and Embedding (OLE) architecture is one of the first component-based frameworks; Microsoft Excel spreadsheets are designed on it.

Rationale for Components
Highlights:
Improved productivity / reduced complexity
Emphasis on reuse
Programming by assembly (manufacturing) rather than development (engineering)
Reduced skills requirement
The key benefit is in server-side development (for example, EJB).

Essence of a Component
A component is specified in terms of a contract, which includes:
a set of provided interfaces, that is, interfaces that the component offers as services to other components;
a set of required interfaces, that is, the dependencies that this component has in terms of other components that must be present and connected to this component for it to function correctly.
Note: interfaces in component-based middleware include interfaces supporting RMI, as in CORBA and Java RMI, and interfaces supporting distributed events, as in indirect communication.
A minimal sketch of a provided/required contract follows.
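Here is a minimal Java sketch of such a contract (all names are illustrative): the component offers one provided interface and states one required interface explicitly, so a third party can wire the dependency at composition time.

    // Provided interface: the service this component offers.
    interface OrderService {
        void placeOrder(String item);
    }

    // Required interface: the dependency the component declares.
    interface PaymentService {
        void charge(String item);
    }

    class OrderComponent implements OrderService {
        private final PaymentService payments;      // explicit, not hidden

        OrderComponent(PaymentService payments) {   // supplied at composition time
            this.payments = payments;
        }

        public void placeOrder(String item) {
            payments.charge(item);                  // delegate to the connected component
        }
    }

A deployer composes the system by constructing, e.g., new OrderComponent(somePaymentImpl); replacing the payment component requires no change to OrderComponent.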


Component-Based Development
Programming in component-based systems is concerned with the development of components and their composition.
Goal: support a style of software development that parallels hardware development, using off-the-shelf components and composing them together to develop more sophisticated services.
It supports third-party development of software components and also makes it easier to adapt system configurations at runtime, by replacing one component with another.
Note: components are encapsulated in containers.

Containers
Containers support a common pattern often encountered in distributed systems development, consisting of:
a front-end (often web-based) client;
a container holding one or more components that implement the application or business logic;
system services that manage the associated data in persistent storage.
Tasks of a container:
provides a managed server-side hosting environment for components;
provides the necessary separation of concerns: the components deal with the application concerns, while the container deals with the distributed systems and middleware issues.

Containers (continued)
The container implements middleware services such as:
authenticating users;
making an application remotely accessible;
transaction handling;
other services: activation and passivation, persistence, lifecycle management, container metadata (introspection), and packaging and deployment.
The container invokes such services at the appropriate times during the execution of the business logic, in a transparent way.
A container is capable of modularizing services, which can be encapsulated and decoupled to tailor them to a specific application's needs.

Structure of a Container
A number of components are encapsulated within a container. The container does not provide direct access to the components; rather, it intercepts incoming invocations and then takes appropriate actions to ensure the desired properties of the distributed application.

Example of a container: EJB.

Application Servers
Middleware that supports the container pattern and the separation of concerns implied by this pattern is known as an application server. A wide range of application servers is now available.
Note: the Enterprise JavaBeans specification, considered below, is one example.

Component-Based Deployment
Component-based middleware provides support for the deployment of component configurations. Deployment descriptors fully describe how the configurations should be deployed in a distributed environment.
Deployment descriptors are typically written in XML and include sufficient information to ensure that:
components are correctly connected using appropriate protocols and associated middleware support;
the underlying middleware and platform are configured to provide the right level of support to the component configuration;
the associated distributed system services are set up to provide the right level of security, transaction support and so on.

Case Study 1: Enterprise JavaBeans (EJB)
What is EJB?
A server-side component architecture for Java.
Based on the concept of a container.
Offers implicit distributed systems management.
Formalizes the interface between a managed bean (EJB) and its container:
the event (callback) interface;
the services expected in the container.
Deployment uses JAR files.

Enterprise Beans
The Enterprise JavaBeans architecture is a component architecture for the development and deployment of component-based distributed business applications.
Example: in an inventory control application, the enterprise beans might implement the business logic in methods called checkInventoryLevel and orderProduct.
Benefits of enterprise beans:
the EJB container provides system-level services to enterprise beans, so the bean developer can concentrate on solving business problems;
the client developer can focus on the presentation of the client;
the application assembler can build new applications from existing beans.

When to use enterprise beans
The application must be scalable. To accommodate a growing number of users, there is a need to distribute an application's components across multiple machines. Not only can the enterprise beans of an application run on different machines, but also their location will remain transparent to the clients.
Transactions must ensure data integrity. Enterprise beans support transactions, the mechanisms that manage the concurrent access of shared objects.
The application will have a variety of clients. With only a few lines of code, remote clients can easily locate enterprise beans. These clients can be thin, various, and numerous.

Programming in EJB
The task of programming in EJB has been simplified significantly through the use of POJOs (plain old Java objects) together with Java annotations.
A bean is a POJO supplemented by annotations.
Annotations were introduced in Java 1.5 as a mechanism for associating metadata with packages, classes, methods, parameters and variables.
The following are examples of annotated bean definitions:
@Stateful public class eShop implements Orders {...}
@Stateless public class CalculatorBean implements Calculator {...}
@MessageDriven public class SharePrice implements MessageListener {...}
The following example introduces the Orders interface as a remote interface and the Calculator interface from the CalculatorBean as a local interface only:
@Remote public interface Orders {...}
@Local public interface Calculator {...}
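
To make the annotations concrete, here is a minimal sketch (assuming the javax.ejb API of EJB 3.0; the add method is a hypothetical business method) completing the CalculatorBean from the slide:

import javax.ejb.Local;
import javax.ejb.Stateless;

@Local
interface Calculator {
    int add(int a, int b);
}

// The container intercepts calls to this bean and supplies transactions,
// security and instance pooling; the class itself holds only business logic.
@Stateless
public class CalculatorBean implements Calculator {
    public int add(int a, int b) { return a + b; }
}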

Types of EJBs
1. Session Bean: an EJB used for implementing high-level business logic and processes.
Session beans handle complex tasks that require interaction with other components (entities, web services, messaging, etc.).
A session bean is used to represent the state of a single interactive communication session between a client and the business tier of the server.
Session beans are transient:
when a session is completed, the associated bean is discarded;
in case of any failure, session beans are lost, as they are not stored in stable storage.

There are two categories of session beans:
Stateful session bean: holds the conversational state for a currently open session.
Stateless session bean: holds no state between calls; the inputs come from the client tier, so instances may be pooled and reused.
2. Message-Driven Bean
An EJB used to integrate with external services via asynchronous messages using the Java Message Service (JMS).
Usually, such an EJB delegates the business logic to session beans (e.g. via RMI).
On the server tier it uses non-blocking primitives.
Note: There are also entity beans, which provide an in-memory copy of long-term data.

EJB containers
EJB container
A runtime environment that provides services such as transaction management, concurrency control, pooling, and security authorization.
Historically, application servers have added other features such as clustering, load balancing, and failover.
Some JEE Application Servers
GlassFish (Sun/Oracle, open source edition)
WebSphere (IBM)
WebLogic (Oracle)
JBoss (Red Hat)
WebObjects (Apple)

Need for Fractal
Lack of tailorability in the EJB container:
There is no mechanism to configure the EJB container.
There is no mechanism to configure infrastructure services.
It is not possible to add new services to the EJB container.
It prevents non-functional aspects such as levels of control facilities for components.
It lacks trade-off aspects such as degree of configurability vs performance and space consumption.
It lacks usability of frameworks and languages in different environments, e.g. embedded systems.
(Note also: no support for the deployment of arbitrary distributed configurations.)

Case Study 2: Fractal
Goals: motivate the main features of the Fractal model:
composite components (to have a uniform view of applications at various abstraction levels),
shared components (to model resources),
introspection capabilities (to monitor a running system),
configuration and reconfiguration capabilities (to deploy and dynamically reconfigure an application).

Essence of Fractal
Fractal is a lightweight component model that can be used with various programming languages to design, implement, deploy and reconfigure various systems and applications, from OSes to middleware to GUIs.
The Fractal component model uses three separation-of-concerns design principles.
The Fractal model is also referred to as an open component model, in the sense that it also defines factory components, i.e. components that can create new components.

I. Various programming languages
Programming platforms
Julia and AOKell (Java-based)
Cecilia and Think (C-based)
FracNet (.NET-based)
FracTalk (Smalltalk-based)
Julio (Python-based)
Julia and Cecilia are treated as the reference implementations of Fractal.
Middleware platforms
Think (a configurable operating system kernel)
DREAM (supporting various forms of indirect communication)
GOTM (offering flexible transaction management)
ProActive (Grid computing)
Jasmine (monitoring and management of SOA platforms)

II. Separation of concerns principles
1. Separation of interface and implementation
bridge pattern; separation of the design and implementation concerns
guarantees the replacement of one component with another, without worrying about class-inheritance problems
deals with the core component model
2. Component-oriented programming
separation of implementation concerns
deals with well-separated entities called components
3. Inversion of control
separation of the functional and configuration concerns
levels of control
deals with the configuration and deployment of external entities

Core component model
Defined on the binding and structure of Fractal components, and based on interfaces.
Two types of interfaces are available:
server interfaces, which support incoming operation invocations (equivalent to provided interfaces)
client interfaces, which support outgoing invocations (equivalent to required interfaces)
Note:
Communication between Fractal components is only possible if their interfaces are bound.
This leads to the composition of Fractal components.

Binding in Fractal
To enable composition, Fractal supports bindings between interfaces.
Two styles of binding are available:
Primitive bindings
Composite bindings

Primitive bindings
A direct mapping between one client interface and one server interface within the same address space.
An operation invocation emitted by the client interface should be accepted by the specified server interface.
It can readily be implemented by using pointers or by direct language references (e.g. object references in Java).

Composite bindings
Built out of a set of primitive bindings and binding components such as stubs, skeletons, adapters, etc.
Implemented in terms of a communication path between a number of component interfaces, potentially on different machines.
Composite bindings are themselves components in Fractal.
Interconnection (remote invocation or indirect, point-to-point or multiparty)
Can be reconfigured at runtime (security and scalability)

Structure of Fractal components
A Fractal component is a runtime entity that:
is encapsulated,
has a distinct identity, and
supports one or more interfaces.
Architecture based on:
Membrane & controllers (non-functional concern)
supports the interfaces to introspect and reconfigure the component's internal features
defines the control capabilities
Content (functional concern)
consists of a finite set of other components (sub/nested or shared components)

Figure: the structure of a Fractal component: external interfaces, internal interfaces and interceptors, with (1) activity controllers, (2) thread controllers and (3) scheduling controllers.

Purpose of Controllers
1. Implementation of lifecycle management:
Activation or deactivation of a process
Even allows replacement of a server by another, enhanced server
2. Offers introspection capabilities:
Interfaces associated with similar components
Replacement of a client call by another client
3. Offers interception capabilities:
Implement an access control policy
Transparent invocation between client and server

Purpose of Membrane
To provide different levels of control:
simple encapsulation of components
support for non-functional issues like transactions and security, as in application servers

Levels of control
Low-level controls (run-time entities, base components, e.g. objects in Java)
Middle-level controls (introspection level, provides component interfaces, e.g. external interfaces in client-server, COM services)
High-level controls (configuration level, exploits internal elements)
Additional levels of control (a framework for the instantiation of components)

Content
The content of a component is composed of (a finite number of) other components, called sub-components, which are under the control of the controller of the enclosing component.
The Fractal model is thus recursive and allows components to be nested (i.e. to appear in the content of enclosing components) at an arbitrary level.
A component that exposes its content is called a composite component.
A component that does not expose its content, but has at least one control interface, is called a primitive component.
A component without any control interface is called a base component.

Content
Sub-components:
Hierarchy of components
Components as run-time entities (computational units)
Caller and callee interfaces in client-server
Sharing of components:
Software architectures with resources
Menu and toolbar components
Sharing of an Undo button

III. Factory components
Components that can create new components.
Generic component factories:
create several kinds of components
provide the GenericFactory interface
Standard factories:
create only one kind of component
use templates and sub-templates

Benefits of the Fractal Component Model
1. It enforces the definition of good modular designs in terms of the binding and structure of components.
2. It enforces the separation of interfaces and implementation, which ensures a minimum level of flexibility.
3. It enforces the separation between the functional, configuration and deployment concerns, which allows the application architecture to be described separately from the code.
Note: All these features increase productivity.
(Further reading: http://fractal.objectweb.org)

Chapter 4: Remote Invocation
1. Introduction
2. Remote Procedure Call
3. Events & Notifications

1. Introduction
Middleware layers are based on protocols and application programming interfaces.

L1: Request-reply Protocol
Figure: the request-reply protocol. The client calls doOperation and waits; the Request message travels to the server, where getRequest acquires it, the object is selected and the method executed, and sendReply returns the Reply message, at which point the client's continuation resumes.

Operations of the request-reply protocol
public byte[] doOperation (RemoteObjectRef o, int methodId, byte[] arguments)
sends a request message to the remote object and returns the reply. The arguments specify the remote object, the method to be invoked and the arguments of that method.
public byte[] getRequest ();
acquires a client request via the server port.
public void sendReply (byte[] reply, InetAddress clientHost, int clientPort);
sends the reply message reply to the client at its Internet address and port.
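
Below is a minimal sketch (not from the text) of how doOperation could be realized over UDP. Marshalling is reduced to a raw byte array, there is no retry, timeout or duplicate filtering, and serverHost/serverPort stand in for the information carried by RemoteObjectRef:

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;

public class RequestReplyClient {
    private static final int MAX_REPLY = 8192;

    public byte[] doOperation(InetAddress serverHost, int serverPort,
                              byte[] request) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            // send the Request message
            socket.send(new DatagramPacket(request, request.length,
                                           serverHost, serverPort));
            // block until the Reply message arrives (no timeout/retry here)
            byte[] buffer = new byte[MAX_REPLY];
            DatagramPacket reply = new DatagramPacket(buffer, buffer.length);
            socket.receive(reply);
            byte[] result = new byte[reply.getLength()];
            System.arraycopy(buffer, 0, result, 0, reply.getLength());
            return result;
        }
    }
}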

Request-reply message structure
messageType      int (0 = Request, 1 = Reply)
requestId        int
objectReference  RemoteObjectRef
methodId         int or Method
arguments        array of bytes

Three modes of doOperation
1. Retry request message
Whether to retransmit the request message until either a reply arrives or the server appears to have failed.
2. Duplicate filtering
When retransmissions are used, whether to filter out duplicate requests at the server.
3. Retransmission of results
Whether to keep a history of results, to avoid re-executing server operations.
Note: Combinations of these choices lead to a variety of invocation semantics.

Invocation Semantics
Maybe invocation semantics
At-least-once invocation semantics
At-most-once invocation semantics
Note: Also known as call semantics.

1. Maybe invocation
Remote method
may execute once or not at all; the invoker cannot tell
useful only if occasional failures are acceptable
Invocation message lost...
method not executed
Result not received...
was the method executed or not?
Server crash...
before or after the method executed?
if a timeout is used, the result could be received after the timeout

2. At-least-once invocation
Remote method
the invoker receives a result (method executed at least once) or an exception (no result; executed once or not at all)
achieved by retransmission of request messages
Invocation message retransmitted...
the method may be executed more than once
arbitrary failure (a wrong result is possible)
the method must be idempotent (repeated execution has the same effect as a single execution)
Server crash...
dealt with by timeouts, exceptions

3. At-most-once invocation
Remote method
the invoker receives a result (method executed exactly once) or an exception (no result was received)
achieved by retransmission of reply & request messages plus duplicate filtering
Best fault tolerance...
arbitrary failures are prevented if the method is called at most once
Used by CORBA and Java RMI


L2: Programming Models
Remote procedure call (RPC)
call a procedure in a separate process
client programs call procedures in server programs
Remote method invocation (RMI)
extension of local method invocation in the OO model
invoke the methods of an object of another process
Event-based model
register interest in events of other objects
receive notifications of the events at other objects

Remote Procedure Call (RPC)
Introduction
Design issues
Implementation
Case study: Sun RPC

Introduction
Remote Procedure Call (RPC) is a high-level model for client-server communication.
It provides programmers with a familiar mechanism for building distributed systems.
Examples: file service, authentication service.

Introduction
Why do we need Remote Procedure Call (RPC)?
The client needs an easy way to call the procedures of the server to get some services.
RPC enables clients to communicate with servers by calling procedures in a similar way to the conventional use of procedure calls in high-level languages.
RPC is modelled on the local procedure call, but the called procedure is executed in a different process and usually on a different computer.

Introduction
How does RPC operate?
When a process on machine A calls a procedure on machine B, the calling process on A is suspended, and the execution of the called procedure takes place on B.
Information can be transported from the caller to the callee in the parameters and can come back in the procedure result.
No message passing or I/O at all is visible to the programmer.

Introduction: The RPC model
Figure: the RPC model. The client calls the procedure and waits for the reply (blocking state); the request is sent to the server, which receives it and starts procedure execution (executing state); the server sends the reply and waits for the next request, and the client resumes execution.

Characteristics
The called procedure is in another process, which may reside on another machine.
The processes do not share an address space.
Passing parameters by reference and passing pointer values are not allowed; parameters are passed by value.
The called remote procedure executes within the environment of the server process.
The called procedure does not have access to the calling procedure's environment.

Features
Simple call syntax
Familiar semantics
Well-defined interface
Ease of use
Efficiency
Can communicate between processes on the same machine or on different machines

Limitations
Parameters are passed by value only; pointer values are not allowed.
Speed: remote procedure call (and return) time, i.e. the overhead, can be significantly (1 to 3 orders of magnitude) slower than a local procedure call.
This may affect real-time designs, and the programmer should be aware of its impact.
Failure: RPC is more vulnerable to failure (since it involves a communication system, another machine and another process).
The programmer should be aware of the call semantics, i.e. programs that make use of RPC must have the capability of handling errors that cannot occur in local procedure calls.

Design Issues
Exception handling
Necessary because of the possibility of network and node failures.
RPC uses return values to indicate errors.
Transparency
Syntactic: achievable, exactly the same syntax as a local procedure call.
Semantic: impossible because of RPC's limitation: failure modes are similar but not exactly the same.

Design Issues
Delivery guarantees
Retry request message: whether to retransmit the request message until either a reply is received or the server is assumed to have failed.
Duplicate filtering: when retransmissions are used, whether to filter out duplicates at the server.
Retransmission of replies: whether to keep a history of reply messages to enable lost replies to be retransmitted without re-executing the server operations.

Call Semantics
Maybe call semantics
After an RPC time-out (or a client crash and restart), the client cannot tell whether the remote procedure (RP) was called or not.
This is the case when no fault tolerance is built into the RPC mechanism.
Clearly, maybe semantics is not desirable.

Call Semantics
At-least-once call semantics
With this call semantics, the client can assume that the RP has been executed at least once (on return from the RP).
Can be implemented by retransmission of the (call) request message on time-out.
Acceptable only if the server's operations are idempotent, that is, f(f(x)) = f(x).
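
A one-line illustration (hypothetical operations, not from the text): under at-least-once semantics a retransmitted "set" is harmless, while a retransmitted "increment" is not.

class Account {
    private long balance;

    // Idempotent: repeated execution has the same effect as a single one,
    // so it is safe under at-least-once retransmission.
    void setBalance(long amount) { balance = amount; }

    // Not idempotent: a retransmitted duplicate adds the amount twice.
    void addToBalance(long amount) { balance += amount; }
}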

Call Semantics
At-most-once call semantics
When an RPC returns, it can be assumed that the remote procedure (RP) has been called exactly once or not at all.
Implemented by the server filtering duplicate requests (which are caused by retransmissions due to IPC failure, or a slow or crashed server) and caching replies (in a reply history; refer to the RRA protocol).

Call Semantics
This ensures the RP is called exactly once if the server does not crash during execution of the RP.
When the server crashes during the RP's execution, the partial execution may lead to erroneous results.
In this case, we want the effect that the RP has not been executed at all.

RPC Mechanism
Figure: the RPC mechanism. In the client process, the client program calls the client stub procedure, which passes the Request through the communication module; in the server process, the communication module and dispatcher deliver it to the server stub procedure, which calls the service procedure. The Reply follows the reverse path.

RPC Mechanism: client and server computers
Figure: on the client computer, a local call enters the client stub, which marshals the arguments and sends the request; the communication modules deliver it to the server computer, where the request is received, the procedure is selected, the arguments are unmarshalled and the service procedure is executed. The results are marshalled, the reply is sent back, and the client stub unmarshals the results and performs the local return.

RPC Mechanism:
1. The client provides the arguments and calls the client stub in the normal way.
2. The client stub builds (marshals) a message (call request) and traps to the OS & network kernel.
3. The kernel sends the message to the remote kernel.
4. The remote kernel receives the message and gives it to the server dispatcher.
5. The dispatcher selects the appropriate server stub.
6. The server stub unpacks (unmarshals) the parameters and calls the corresponding server procedure.

RPC Mechanism
7. The server procedure does the work and returns the result to the server stub.
8. The server stub packs (marshals) it in a message (call return) and traps to the OS & network kernel.
9. The remote (server) kernel sends the message to the client kernel.
10. The client kernel gives the message to the client stub.
11. The client stub unpacks (unmarshals) the result and returns it to the client.

A pair of Stubs
Client-side stub
Looks like the local server function
Same interface as the local function
Bundles arguments into a message, sends it to the server-side stub
Waits for the reply, un-bundles the results, returns
Server-side stub
Looks like a local client function to the server
Listens on a socket for messages from the client stub
Un-bundles arguments into local variables
Makes a local function call to the server
Bundles the result into a reply message to the client stub

RPC Implementation
Three main tasks:
Interface processing: integrate the RPC mechanism with client and server programs in conventional programming languages.
Communication handling: transmitting and receiving request and reply messages.
Binding: locating an appropriate server for a particular service.

Case Study: SUN RPC
Designed for client-server communication, as in the SUN NFS.
Also called ONC (Open Network Computing) RPC.
Supplied as part of the SunOS product and available with Unix System V, Linux, BSD, OS X, and with every NFS installation.
Uses at-least-once call semantics.
Interfaces are defined in an Interface Definition Language (IDL).

Interface definition language: XDR
Initially, XDR was used for data representation: a standard way of encoding data in a portable fashion between different systems.
Interface compiler: rpcgen
Used with the C programming language.
A compiler that takes the definition of a remote procedure interface and generates the client stubs and the server stubs.
Communication handling: TCP or UDP
With UDP, the length of request and reply messages is restricted (at most 64 kilobytes; in practice often 8 or 9 kilobytes).
Binding service: port mapper

RPC IDL
Program numbers instead of interface names (unique)
Procedure numbers instead of procedure names (versions change)
A single input procedure parameter (structs)
Figure: procedure definitions (e.g. a WRITE file procedure in version 1, a READ file procedure in version 2), grouped under a version number and a program number.

Files interface in Sun XDR (sample code)

const MAX = 1000;
typedef int FileIdentifier;
typedef int FilePointer;
typedef int Length;

struct Data {
    int length;
    char buffer[MAX];
};

struct writeargs {
    FileIdentifier f;
    FilePointer position;
    Data data;
};

struct readargs {
    FileIdentifier f;
    FilePointer position;
    Length length;
};

program FILEREADWRITE {
    version VERSION {
        void WRITE(writeargs) = 1;
        Data READ(readargs) = 2;
    } = 2;
} = 9999;

Compiler: rpcgen
rpcgen name.x produces:
name.h        header
name_svc.c    server stub
name_clnt.c   client stub
[name_xdr.c]  XDR conversion routines

What goes on in the system: server
Start the server.
The server stub creates a socket and binds any available local port to it.
It calls a function in the RPC library, svc_register, to register the program # and port #.
It contacts the portmapper (rpcbind on SVR4):
a name server that keeps track of {program#, version#, protocol} to port# bindings.
The server then listens and waits to accept connections.

What goes on in the system: client
The client calls clnt_create with:
Name of the server
Program #
Version #
Protocol #
clnt_create contacts the port mapper on that server to get the port for that interface.
Early binding: done once, not per procedure call.

Authentication
SUN RPC request and reply messages have additional fields for authentication information to be passed between client and server.
SUN RPC supports the following authentication protocols:
UNIX style, using the uid and gid of the user
A shared key established for signing the RPC messages
The well-known Kerberos style of authentication

Advantages
No need to worry about getting a unique transport address (port)
But with SUN RPC you need a unique program number per server
Greater portability
Transport independence
The protocol can be selected at run-time
The application does not have to deal with maintaining message boundaries, fragmentation and reassembly
Applications need to know only one transport address
The port mapper
A function-call model can be used instead of send/receive

Event-Notification model
Idea
One object reacts to a change occurring in another object.
Events cause changes in the objects that maintain the state of the application.
Objects that represent events are called notifications.
Event examples
Modification of a document
Entering text in a text box using the keyboard
Clicking a button using the mouse
Publish/subscribe paradigm
Event generators publish the types of events they produce.
Event receivers subscribe to the types of events that are of interest to them.
When an event occurs, the receivers are notified.
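
A minimal sketch (illustrative names, not from the text) of the publish/subscribe idea: subscribers register interest, and the publisher notifies all of them when an event occurs.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

public final class EventService<E> {
    private final List<Consumer<E>> subscribers = new CopyOnWriteArrayList<>();

    // An event receiver subscribes to this type of event.
    public void subscribe(Consumer<E> subscriber) { subscribers.add(subscriber); }

    // The event generator publishes: every subscriber gets a notification.
    public void publish(E notification) {
        for (Consumer<E> s : subscribers) s.accept(notification);
    }
}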

Distributed event-based systems: two characteristics
1. Heterogeneous:
A way to standardize communication in heterogeneous systems not designed to communicate directly.
Components in a DS that were not designed to interoperate can be made to work together.
For example, heterogeneous components can be used in applications that describe users' locations and activities.
2. Asynchronous:
Decoupling of publishers and subscribers.
Prevents publishers needing to synchronize with subscribers.

Example: dealing room system
Requirements
Allow dealers to see the latest market prices of the stocks they deal in.
The market price for a single named stock is represented by an object with several instance variables.
System components
Information provider process
receives new trading information
publishes stock price events
sends stock price update notifications
Dealer process
subscribes to stock price events

Dealing room system
Figure: external sources feed two information provider processes, which send notifications of stock price events across the network to dealer processes running on the dealers' computers.

The participants in distributed event notification
The object of interest
its changes of state might be of interest to other objects
Event
an event occurs at an object of interest as the completion of a method execution
Notification
an object that contains information about an event
Subscriber
an object that has subscribed to some type of events in another object
Observer objects
the main purpose is to decouple an object of interest from its subscribers
avoids over-complicating the object of interest
Publisher
an object that declares that it will generate notifications of particular types of event; it may be an object of interest or an observer

Architecture for distributed event notification
Event service: maintains a database of published events and of subscribers' interests; decouples the publishers from the subscribers.
Figure: three arrangements of objects of interest, observers and subscribers around the event service: (1) an object of interest sends notifications directly to a subscriber; (2) an object of interest sends notifications via an observer; (3) an observer queries an object of interest and sends notifications to the subscribers.

Three cases
Inside object without an observer: sends notifications directly to the subscribers.
Inside object with an observer: sends notifications via the observer to the subscribers.
Outside object (with an observer):
1. an observer queries the object of interest in order to discover when events occur
2. the observer sends notifications to the subscribers

Notification Delivery
Delivery semantics
Unreliable, e.g. delivering the latest state of a player in an Internet game
Reliable, e.g. the dealing room
Real-time, e.g. a nuclear power station or a hospital patient monitor
Roles for observer processes
Forwarding
sends notifications to subscribers on behalf of one or more objects of interest
Filtering of notifications
according to some predicate; reduces the number of notifications
Patterns of events
describe the relationship between several events
Notification mailboxes
a notification may be delayed until the subscriber is ready to receive it

Jini distributed event specification
Allows a potential subscriber in one Java Virtual Machine (JVM) to subscribe to and receive notifications of events in an object of interest in another JVM.
Main objects
event generators (publishers)
remote event listeners (subscribers)
remote events (events)
third-party agents (observers)
An object subscribes to events by informing the event generator about the type of event and specifying a remote event listener as the target for notification.

Jini interfaces
EventGenerator interface
Provides the register method.
The event generator implements it.
A subscriber invokes it to subscribe to the events of interest.
RemoteEventListener interface
Provides the notify method.
The subscriber implements it.
Notifications are received when the notify method is invoked.
RemoteEvent
a notification that is passed as an argument to the notify method.
Third-party agents
interpose between an object of interest and a subscriber.
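
A minimal sketch of a remote event listener, assuming the Jini core event API (net.jini.core.event); PriceListener and its printed output are illustrative, and exporting the listener as an RMI remote object is omitted:

import java.rmi.RemoteException;
import net.jini.core.event.RemoteEvent;
import net.jini.core.event.RemoteEventListener;
import net.jini.core.event.UnknownEventException;

public class PriceListener implements RemoteEventListener {
    // Invoked by the event generator (or a third-party agent)
    // whenever a subscribed event occurs.
    public void notify(RemoteEvent event)
            throws UnknownEventException, RemoteException {
        System.out.println("event id " + event.getID()
                + ", sequence number " + event.getSequenceNumber());
    }
}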

MCT702: UNIT II
Distributed Operating Systems
Chapter 1: Architectures of Distributed Systems
Introduction
Issues in distributed operating systems (9)
Communication primitives (2)
Chapter 2: Theoretical Foundations
Inherent limitations of a distributed system
Lamport's logical clock and its limitations
Vector clock
Causal ordering of messages
Global state and Chandy-Lamport's recording algorithm
Cuts of a distributed computation
Termination detection

Continued
Chapter 3: Distributed Mutual Exclusion
Non-token-based algorithms
Lamport's algorithm
Token-based algorithms
Suzuki-Kasami's broadcast algorithm
Consensus and related problems
Comparative performance analysis (4)
Chapter 4: Distributed Deadlock Detection
Issues
Centralized deadlock-detection algorithms (2)
Distributed deadlock-detection algorithms (4)
Reference: Chapters 4, 5, 6 and 7,
Advanced Concepts in Operating Systems
by Mukesh Singhal and N. G. Shivaratri (Tata McGraw-Hill)

Chp 1: Architectures of Dist. Sys.
Introduction:
"Distributed System" describes a system with the following characteristics:
It consists of several computers that do not share a memory or a clock;
The computers communicate with each other by exchanging messages over a communication network; and
Each computer has its own memory and runs its own operating system.

Architecture of Distributed OS

System Architecture Types
Minicomputer model: the distributed system consists of several minicomputers, where each computer supports multiple users and provides access to remote resources. E.g. VAX processors.
(no. of processors / no. of users) < 1
Workstation-server model: consists of several workstations, where each user is provided with a workstation consisting of a powerful processor, memory and display. With the help of a DFS, users can access data regardless of its location. E.g. Athena and Andrew.
(no. of processors / no. of users) = 1
Processor-pool model: allocates one or more processors according to users' needs. Once the processors complete their jobs, they return to the pool and await a new assignment. E.g. Amoeba.
(no. of processors / no. of users) > 1

Evolution of Modern Operating Systems

Definition:
Distributed operating system:
The integration of system services presenting a transparent view of a multiple computer system with distributed resources and control.
It consists of concurrent processes accessing distributed, shared or replicated resources through message passing in a network environment.

Sharing of resources and coordination of distributed activities in networked environments are the main goals in the design of a distributed operating system.
The key distinction between a network OS and a distributed OS is the concept of transparency:
concurrency transparency (also in centralized OS)
location transparency
parallelism and performance transparency
migration transparency
replication transparency
Distributed operating systems consist of three major components:
coordination of distributed processes
management of distributed resources
implementation of distributed algorithms

Issues in Designing Distributed Operating Systems
1. Global Knowledge
2. Naming
3. Scalability
4. Compatibility
5. Process Synchronization
6. Resource Management
7. Security
8. Structuring
9. Client-Server Computing Models

1. Global Knowledge
Complete and accurate knowledge of all processes and resources is not easily available.
Difficulties arise due to:
absence of global shared memory
absence of a global clock
unpredictable message delays
Challenges:
decentralized system-wide control
total temporal ordering of system events
process synchronization (deadlocks, starvation)

2. Naming
Names are used to refer to objects, which include computers, printers, services, files and users.
Objects are encapsulated in servers, and the only visible entities in the system are servers. To contact a server, the server must be identifiable.
Three identification methods:
1. Identification by name (name server)
2. Identification by physical or logical address (network server)
3. Identification by the service that servers provide (components)
Object models and their naming must be addressed early in the system design, as many things depend on the naming scheme, e.g.:
Structure of the system
Management of the namespace
Name resolution
Access methods

3. Scalability
Systems generally grow with time.
The design should be such that growth does not result in system unavailability or degraded performance.
E.g. broadcast-based protocols work well for small systems but not for large systems:
in a Distributed File System, on a larger scale, the increase in broadcast queries for file locations affects the performance of every computer.

4. Compatibility
Refers to the interoperability among the resources in a system.
There are three levels of compatibility in a DS.
Binary level: all processes execute the same instruction set, even though the processors may differ in performance and in input-output.
E.g. the Emerald distributed system
Program development is easy.
The DS cannot include computers with different architectures.
Rarely supported in large distributed systems.

Compatibility
Execution level: the same source code can be compiled and executed properly on any computer in the system.
E.g. the Andrew and Athena systems support execution-level compatibility.
Protocol level: the least restrictive form of compatibility.
Requires all system components to support a common set of protocols.
Individual computers can run different operating systems.
A distributed system supporting protocol-level compatibility employs common protocols for essential system services such as the file system.

5. Process Synchronization
Process synchronization is difficult because of the unavailability of shared memory.
A DOS has to synchronize processes running on different computers when they try to concurrently access shared resources.
This is the mutual exclusion problem.
Requests must be serialized to secure the integrity of the shared resources.
In a DS, processes can request resources (local or remote) and release resources in any order.
If the sequence of resource allocation is not controlled, deadlock may occur, which can lead to a decrease in system performance.

6. Resource Management
Concerned with making both local and remote resources available to users in an effective manner.
Users should be able to access remote resources as easily as they can access local resources.
The specific location of resources should be hidden from users in the following ways:
Data migration
Computation migration
Distributed scheduling

6.1: Data Migration
Data can be either a file or the contents of physical memory.
In the process of data migration, data is brought by the DOS to the location of the computation that needs access to it.
If the computation updates a set of data, the original location may have to be updated.
In the case of files, the DFS is involved as a component of the DOS that implements a common file system available to the autonomous computers in the system.
Its primary goal is to provide the same functional capability to access files regardless of their location.
If the data accessed is in the physical memory of another system, a computation's data request is handled by distributed shared memory.
It provides a virtual address space that is shared among all the computers in a DS; the main issues are consistency and delays.

6.2: Computation Migration
In computation migration, the computation migrates to another location.
It may be efficient when information is needed concerning a remote file directory:
it is more efficient to send the message and receive the information back, instead of transferring the whole directory.
Remote procedure call has been commonly used for computation migration.
Normally, only a part of the computation of a process is carried out on a different machine.

6.3: Distributed Scheduling
Processes can be transferred from one computer to another by the DOS.
A process may be executed at a computer different from the one where it originated.
This is required when the computer is overloaded or does not have the necessary resources.
Distributed scheduling is responsible for judiciously and transparently distributing processes amongst computers such that overall performance is maximized.

7: Security
The OS is responsible for the security of the computer system.
Two issues must be considered:
Authentication: the process of guaranteeing that an entity is what it claims to be.
Authorization: the process of deciding what privileges an entity has and making only these privileges available.

8: Structuring
1. Monolithic Kernel:
The kernel contains all the services provided by the operating system.
A copy of the huge kernel runs on all the machines of the system.
The limitation of this approach is that most machines do not require most of the services, yet the kernel still provides them.
Note: one size fits all (diskless workstations, multiprocessors, and file servers).

2. Collective Kernel Approach:
The operating system is designed as a collection of independent processes.
Each process represents some service, such as distributed scheduling, the distributed file system, etc.
The kernel consists of a nucleus of the operating system, called a microkernel, which is installed on all the machines and provides basic functionality.
The microkernel also provides interaction between services running on different machines.
e.g. Galaxy, V-Kernel.
3. Object-Oriented Kernel:
All services of the operating system are implemented in the form of objects.
Each object encapsulates a data structure and also a set of operations for that data structure.
e.g. Amoeba, CLOUDS.

9. Client-Server Computing Model
Processes are categorized as servers (provide services) and clients (need services).
Servers merely respond to the requests of the clients and do not typically initiate conversations with clients.
In the case of multiple servers, the location and the conversation are transparent to the clients.
Clients generally make use of a cache to minimize the frequency of sending data requests to the server.
A system structured on the client-server model can easily adapt the collective kernel structuring technique.

Communication Primitives
A communication primitive is a means to send raw bit streams of data in a distributed environment.
There are two models that are widely accepted for developing distributed operating systems (DOS):
1. Message Passing
2. Remote Procedure Call (RPC)
Note: In DS, recall the communication paradigms in the architectural model:
Inter-process communication (message passing, socket programming, multicast)
Remote invocation (RRP, RPC, RMI)
Indirect communication (group communication, event-based, etc.)

1. Message Passing Model
The message passing model provides two basic communication primitives: SEND & RECEIVE.
The SEND primitive has two parameters: a message and its destination.
The RECEIVE primitive also has two parameters: the source of a message and a buffer for storing the message.
An application of these primitives can be found in the client-server computation model.
The semantics of the SEND & RECEIVE primitives are decided by the design issues, namely:
Non-blocking vs blocking primitives
Synchronous vs asynchronous primitives

Non-blocking vs Blocking Primitives
In the standard message passing model, messages are copied three times:
from the user buffer to the kernel buffer,
from the kernel buffer on the sending computer to the kernel buffer on the receiving computer, and
from the kernel buffer on the receiving computer to the user buffer.
This is known as the buffered option.
In the unbuffered option, data is copied from one user buffer to another user buffer directly.
Figure: user a executes send m; the message passes from buffer a through the kernel buffer of the sending computer, over the communication channel, to the kernel buffer of the receiving computer and into buffer b, where user b executes receive.

Non-blocking Primitives:
With non-blocking primitives, the SEND primitive returns control to the user process immediately.
The RECEIVE primitive provides a buffer into which the message is to be copied, and the process learns of the message's arrival by signalling.
The primary advantage is that programs have maximum flexibility to perform computation and communication in any order.
A significant disadvantage of non-blocking primitives is that programming becomes difficult.
A natural use of non-blocking communication occurs in the producer (SEND) - consumer (RECEIVE) relationship.

Blocking Primitives:
The SEND primitive does not return control to the user program
until the message has been sent (an unreliable blocking primitive), or
until an acknowledgment has been received (a reliable blocking primitive).
In both cases the user buffer can be reused.
The RECEIVE primitive does not return control until the message is copied to the buffer.
With a reliable RECEIVE primitive, an acknowledgement is sent automatically.
With an unreliable RECEIVE primitive, no acknowledgement is sent.
The advantage is that the behavior of the program is predictable, which makes programming relatively easy.
The disadvantage is the absence of concurrency between computation and communication.

Synchronous vs Asynchronous Primitives
The distinction is based on whether buffering is used.
Both can be combined with blocking or non-blocking primitives.
Synchronous primitive:
A SEND primitive is blocked until a corresponding RECEIVE primitive is executed at the receiving computer.
This strategy is referred to as a blocking synchronous primitive, or rendezvous.
With an unblocking synchronous primitive, the message is copied to a buffer at the sending side, allowing the process to perform other computation activity, except another SEND.

Asynchronous primitive:
The messages are buffered.
A SEND primitive is not blocked even if there is no corresponding execution of a RECEIVE primitive.
The RECEIVE primitive can be either a blocking or a non-blocking primitive.
The main disadvantage is that using buffers increases complexity in terms of creating, managing and destroying the buffers.

2. Remote Procedure Call
A more natural way to communicate is through procedure calls:
every language supports them
the semantics are well defined and understood
natural for programmers to use
A programmer using the bare message-passing model must handle the following details:
pairing of responses with request messages
data representation
knowing the address of the remote machine or the server
taking care of communication and system failures

Basic RPC Operation
The RPC mechanism is based on the observation that a procedure call is a well-known mechanism for the transfer of control and data within a program running on a single machine.
On invoking a remote procedure, the calling process is suspended.
Any parameters are passed to the remote machine, where the procedure executes.
On completion, the results are passed back from the server to the client, which resumes execution as if it had called a local procedure.

Design Issues in RPC
The RPC mechanism is based on the concept of stub procedures.
The server writer writes the server and links it with the server-side stubs; the client writes the respective program and links it with the client-side stub.
The stubs are responsible for managing all details of the remote communication between client and server.


Structure
When a program (client) makes a remote procedure call, say p(x,y), it actually makes a local call on a dummy procedure, the client-side stub procedure p.
The client-side stub procedure constructs a message containing the identity of the remote procedure and the parameters, and then sends it to the remote machine.
A stub procedure on the server side receives the message and makes a local call to the procedure specified in the message.
After execution, control returns to the server-side stub procedure, which returns the results to the client-side stub.
The stub procedures can be generated at compile time or can be linked in at run time.
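
Run-time stub generation can be illustrated with Java's java.lang.reflect.Proxy. This is a minimal sketch with a hypothetical Calculator interface; the transport is only traced, not implemented:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.Arrays;

public final class StubFactory {
    public interface Calculator { int add(int a, int b); }  // hypothetical remote interface

    public static Calculator clientStub() {
        InvocationHandler handler = (proxy, method, args) -> {
            // A real stub would marshal the procedure id and parameters here,
            // send the request message, and unmarshal the reply.
            System.out.println("would send request: " + method.getName()
                    + Arrays.toString(args));
            return 0;  // placeholder result
        };
        return (Calculator) Proxy.newProxyInstance(
                Calculator.class.getClassLoader(),
                new Class<?>[] { Calculator.class }, handler);
    }
}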

Binding
Binding is the process that determines the remote procedure, and the machine on which it will be executed.
It may also check the compatibility of the parameters passed and the procedure type called.
A binding server essentially stores the server machines along with the services they provide.
Another approach used for binding is where the client specifies the machine and the service required, and the binding server returns the port number for communication.

Parameter and Result

Error handling, Semantics and Correctness
An RPC can fail for at least two reasons:
Computer failures
Communication failures
The semantics of RPCs are classified as follows (number of executions on success and on failure, and whether partial executions can occur; the success entry of 1 for the last two rows is the standard classification, reconstructed here from context):

Semantics       Success   Failure     Partial
At-least-once   >= 1      0 or more   possible
Exactly once    1         0 or 1      possible
At-most-once    1         0 or 1      none

Correctness condition: C1 → C2 implies W1 → W2, where Ci denote RPC calls, Wi the work done on shared data, and → denotes a happened-before relation.

RPC: other issues
Implementation issues for the RPC mechanism:
low-latency RPC calls (use UDP)
high-throughput RPC calls (use TCP)
Increasing the concurrency of RPC calls via:
Multi-RPC (invokes only one procedure on many servers, but avoids different parallel procedures)
Parallel RPC (invoking a parallel procedure call executes a procedure in n different address spaces in parallel)
asynchronous calls (avoid blocking, but programming becomes difficult)
Shortcomings of the RPC mechanism:
does not allow for returning incremental results
limits protocol flexibility: remote procedures are not first-class objects (e.g. they cannot be used everywhere local procedures/variables can be used)

Chp 2: Theoretical Foundation
Inherent limitations of a distributed system
Lamport's logical clock and its limitations
Vector clock
Causal ordering of messages
Global state and Chandy-Lamport's recording algorithm
Cuts of a distributed computation
Termination detection

Inherent Limitations of a Dist. System
A DS is a collection of computers that are spatially separated and do not share a common memory.
Processes communicate by exchanging messages over communication channels.
A DS suffers from some inherent limitations because of:
Lack of a system-wide common clock, i.e. the absence of a global clock
Lack of common memory, i.e. the absence of shared memory

1. Absence of a Global Clock
There is no system-wide common clock in a DS.
Solutions could be:
either having a global clock common to all the computers, or
having synchronized clocks, one at each computer.
Both of the above solutions are impractical, for the following reasons.
If one global clock is provided in the distributed system:
two processes will observe a global clock value at different instants due to unpredictable delays,
so two processes will falsely perceive two different instants in physical time to be a single instant in physical time.

Continued..
If the clocks of different systems are synchronized:
these clocks can drift from physical time, and the drift rate may vary from clock to clock due to technological limitations.
This may also end up with the same result.
We cannot have a system of perfectly synchronized clocks.

Impact of the absence of global time
Temporal ordering of events is integral to the design and development of a DS.
E.g. an OS is responsible for scheduling processes; a basic criterion used in scheduling is the temporal order in which requests to execute processes arrive.
Due to the absence of global time, it is difficult to reason about the temporal order of events in a DS.
Hence, algorithms for DS are more difficult to design and debug.
Also, the up-to-date state of the system is harder to collect.

2. Absence of Shared Memory
Due to the lack of shared memory, an up-to-date state of the entire system is not available to any individual process.
Such a state is necessary for:
reasoning about the system's behavior,
debugging, and
recovery.
Information exchange is subject to arbitrary network delays.
One process in a DS can get either:
a coherent but partial view, or
an incoherent but complete (global) view of the system.

Coherent means:
all processes make their observations at the same time.
Note: incoherent means that the processes do not make their observations at the same time.
Complete (or global) includes:
all local views of the state, plus
any messages that are in transit.
It is very difficult for every process to get a complete and coherent view of the global state.
Example: one person has two bank accounts and is in the process of transferring $50 between the accounts.

Example: coherent but partial
Figure: the local state of A ($500) and the local state of B ($200), connected by a communication channel, with observation points S1: A and S2: B.
Note:
Coherence requires sequential consistency; in its absence, blocking may occur.
The communication channel cannot record its state by itself; the processes have to keep the record of the communication channels.

Example: incoherent but complete state
Figure: the local states of A and B observed at different instants: (a) A = $500, B = $200; (b) A = $450, B = $200; (c) A = $500, B = $250.

Lamport's Logical Clock: Basic Concepts
Lamport proposed the following scheme to order events in a distributed system using logical clocks.
The execution of processes is characterized by a sequence of events.
Depending on the application,
the execution of a procedure could be one event, or
the execution of an instruction could be one event.
When processes exchange messages,
sending a message constitutes one event,
and receiving a message constitutes one event.

Logical Clock: Basic Concept
Time
is one-dimensional,
cannot move backward, and
cannot stop.
It is derived from the concept of the order in which events occur.
The concepts "before" and "after" need to be reconsidered in a distributed system.

Lamport's Logical Clock:
Due to the absence of perfectly synchronized clocks and global time in distributed systems, the order in which two events occur at two different computers cannot be determined based on the local time at which they occur.
However, under certain conditions, it is possible to ascertain the order in which two events occur based solely on the behavior exhibited by the underlying computation.
The happened-before relation (→) captures the causal dependencies between events, i.e. whether two events are causally related or not. The relation is defined on the following slide.

Happened-before relationship
a → b, if a and b are events in the same process and a occurred before b (one process).
a → b, if a is the event of sending a message m in a process and b is the event of receipt of the same message m by another process (two processes).
If a → b and b → c, then a → c, i.e. "→" is transitive (more than two processes).
In distributed systems, processes interact with each other and affect the outcome of events of processes.
Being able to ascertain the order between events is very important for designing, debugging, and understanding the sequence of execution in a distributed computation.

Lamport's Logical Clock: Events
In general, an event changes the system state, which in turn influences the occurrence and outcome of future events.
Past events influence future events, and this influence among causally related events (those events that can be ordered by →) is referred to as a causal effect.

Lamport's Logical Clock: Events
CAUSALLY RELATED EVENTS: Event a causally affects event b if a → b.
CONCURRENT EVENTS: Two distinct events a and b are said to be concurrent (denoted by a || b) if neither a → b nor b → a.
For any two events a and b in a system, either a → b, b → a, or a || b.

Lamport's Logical Clock: Events
Figure: a space-time diagram with process P1 (events e11, e12, e13, e14) and process P2 (events e21, e22, e23, e24) along a global time axis, illustrating:
1. Processes and events
2. Paths and message arrows
3. Causally related events: e22 and e14
4. Concurrent events: e11 || e21

System of Logical Clocks
In order to realize the relation →, Lamport introduced the following system of logical clocks.
There is a clock Ci at each process Pi in the system.
The clock Ci can be thought of as a function that assigns a number Ci(a) to any event a at Pi, called the timestamp of event a.
The numbers assigned by the system of clocks have no relation to physical time, hence the name logical clocks.
The logical clocks take monotonically increasing values. These clocks can be implemented by counters. Typically, the timestamp of an event is the value of the clock when it occurs.

Lamport's Logical Clock: Conditions
For any events a and b: if a → b, then C(a) < C(b).
The happened-before relation → can now be realized using the logical clocks if the following two conditions are met:
[C1] For any two events a and b in a process Pi, if a occurs before b, then Ci(a) < Ci(b).
[C2] If a is the event of sending a message m in process Pi and b is the event of receiving the same message m at process Pj, then Ci(a) < Cj(b).
The following implementation rules (IR) for the clocks guarantee that the clocks satisfy the correctness conditions C1 and C2:

Lamport's Logical Clock: Implementation Rules
[IR1] Clock Ci is incremented between any two successive events in process Pi:
Ci := Ci + d   (d > 0)
If a and b are two successive events in Pi and a → b, then Ci(b) = Ci(a) + d.
[IR2] If event a is the sending of message m by process Pi, then message m is assigned a timestamp tm = Ci(a) (note that the value of Ci(a) is obtained after applying rule IR1).
On receiving the same message m, process Pj sets Cj to a value greater than or equal to its present value and greater than tm:
Cj := max(Cj, tm + d)   (d > 0)
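
A minimal sketch (illustrative names, with d = 1) of rules IR1 and IR2 as a thread-safe counter:

public final class LamportClock {
    private long c = 0;

    // IR1: increment between successive local events.
    public synchronized long tick() { return ++c; }

    // IR2 (send): tick, then use the returned value as the message timestamp tm.
    public synchronized long timestampForSend() { return ++c; }

    // IR2 (receive): advance past the sender's timestamp tm.
    public synchronized long onReceive(long tm) {
        c = Math.max(c, tm + 1);
        return c;
    }
}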

Lamport's Logical Clock: How does it advance?
Figure: a space-time diagram with P1 (events e11..e17) and P2 (events e21..e25), with clock values in parentheses. P1's events carry values (1) (2) (3) (4) (5) (6) (7); P2's carry (1) (2) (3) (4) (7). On receipt of a message, the clock jumps by the max rule, e.g. max(2+1, 2+1) = 3 at e23 and max(4+1, 6+1) = 7 at e25.

Lamport's Logical Clock: total ordering
Lamport's happened-before relation (→) defines an irreflexive partial order among the events: causal events are sequenced.
The set of all the events in a distributed computation can be totally ordered (the ordering relation is denoted by =>) using the above system of clocks as follows:
If a is any event at process Pi and b is any event at process Pj, then a => b if and only if either
Ci(a) < Cj(b), or
Ci(a) = Cj(b) and Pi < Pj, where < is any arbitrary relation that totally orders the processes to break ties.
A simple way to implement < is to assign unique identification numbers to each process; then Pi < Pj if i < j.
In the total ordering, all events are sequenced.
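
A minimal sketch (illustrative names) of the total order =>: compare Lamport timestamps first, then break ties with process ids.

import java.util.Comparator;

public final class TotalOrder {
    // An event is identified by its Lamport timestamp and its process id.
    public record Event(long c, int pid) {}

    // a => b iff C(a) < C(b), or C(a) = C(b) and pid(a) < pid(b).
    public static final Comparator<Event> TOTAL =
            Comparator.comparingLong(Event::c).thenComparingInt(Event::pid);
}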

Partial Order Example
a → b : C0(a) = 1 < 2 = C0(b)
f → i : C1(f) = 4 < 5 = C2(i)
a → e : C0(a) = 1 < 3 = C2(e)
etc.

Getting a Total Order
If a total order is required, break ties using ids.
In the example, C0(a) = (1,0) and C1(c) = (1,1), etc.
In the example, C0(a) < C1(c).

Drawback of Logical Clocks
a → b implies C(a) < C(b), but C(a) < C(b) does not necessarily imply a → b.
In the previous example, C(g) = 1 and C(b) = 2, but g does not happen before b.
The reason is that "happens before" is a partial order, while logical clock values are integers, which are totally ordered.

Lamport's Logical Clock: Limitations
In Lamport's system of logical clocks, if a → b then C(a) < C(b).
But the reverse is not necessarily true if the events have occurred in different processes.
i.e. if a and b are events in different processes and C(a) < C(b), then a → b is not necessarily true; events a and b may or may not be causally related.
So Lamport's system of clocks is not powerful enough to capture such situations.

Lamport's Logical Clock: Limitations
1. Clearly C(e11) < C(e22) and C(e11) < C(e32).
2. Causal relations can be judged only on the basis of message paths.
Figure: three processes; P1 has e11 (1), e12 (2); P2 has e21 (1), e22 (3); P3 has e31 (1), e32 (2), e33 (3), drawn against global time.
Note: if a and b are events in different processes and C(a) < C(b), then a → b is not necessarily true; events a and b may or may not be causally related.

Vector Clocks
Generalize logical clocks to provide non-causality information as well as causality information.
Implemented with values drawn from a partially ordered set instead of a totally ordered set.
Assign a value V(e) to each computation event e in an execution such that a → b if and only if V(a) < V(b).

Vector Timestamps: Algorithm

Each pi keeps an n-vector Vi, initially all 0's.
Entry j in Vi is pi's estimate of how many steps pj has taken.
Every message pi sends is timestamped with the current value of Vi.
At every step, increment Vi[i] by 1.
When receiving a message with vector timestamp T, update Vi's components j ≠ i so that Vi[j] = max(T[j], Vi[j]).
If a is an event at pi, then assign V(a) to be the value of Vi at the end of a.
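A minimal sketch of these rules (class and method names are illustrative):

class VectorClock:
    def __init__(self, n, i):
        self.v = [0] * n            # n-vector, initially all 0's
        self.i = i                  # this process's index

    def step(self):                 # local step: increment own component
        self.v[self.i] += 1
        return list(self.v)

    def send(self):                 # sending is a step; timestamp is current Vi
        return self.step()

    def receive(self, t):           # Vi[j] = max(T[j], Vi[j]) for j != i, then step
        for j in range(len(self.v)):
            if j != self.i:
                self.v[j] = max(self.v[j], t[j])
        self.v[self.i] += 1         # receiving is itself a step
        return list(self.v)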

Manipulating Vector Timestamps

Let V and W be two n-vectors of integers.
Equality: V = W iff V[i] = W[i] for all i. Example: (3,2,4) = (3,2,4)
Less than or equal: V ≤ W iff V[i] ≤ W[i] for all i. Example: (2,2,3) ≤ (3,2,4) and (3,2,4) ≤ (3,2,4)
Less than: V < W iff V ≤ W but V ≠ W. Example: (2,2,3) < (3,2,4)
Incomparable: V || W iff !(V ≤ W) and !(W ≤ V). Example: (3,2,4) || (4,1,4)
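The four comparisons as code, for reference (a minimal sketch):

def vle(v, w):            # V <= W
    return all(a <= b for a, b in zip(v, w))

def vlt(v, w):            # V < W : V <= W but V != W
    return vle(v, w) and v != w

def incomparable(v, w):   # V || W
    return not vle(v, w) and not vle(w, v)

print(vlt([2, 2, 3], [3, 2, 4]))           # True
print(incomparable([3, 2, 4], [4, 1, 4]))  # True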

Vector Clock Example

[Space-time diagram: P1 with e11 (1,0,0), e12 (2,0,0), e13 (3,4,1); P2 with e21 (0,1,0), e22 (2,2,0), e23 (2,3,1), e24 (2,4,1); P3 with e31 (0,0,1), e32 (0,0,2).]

V(e31) = (0,0,1) and V(e12) = (2,0,0), which are incomparable.
Compare with logical clocks: C(e31) = 1 and C(e21) = 2.
Vector timestamps implement vector clocks: a → b implies V(a) < V(b).

Causal Ordering of Messages

If M1 is sent before M2, then every recipient of both messages must get M1 before M2.
This is not guaranteed by the communication network, since M1 may be from P1 to P2 and M2 may be from P3 to P4.
Consider a replicated database system: updates to the entries should be received in order!
Basic idea for message ordering: deliver a message only if the preceding one has already been delivered; otherwise, buffer it, i.e. buffer a later message until the earlier ones arrive.

Violation of Causal Ordering of Messages

[Space-time diagram: P1 sends M1 and P2 then sends M2; at P3 the messages arrive in the wrong order, violating the causal ordering.]

Causal Ordering of Messages

e.g.: send(M1) → send(M2) => receive(M1) → receive(M2)

[Space-time diagram: M1, timestamped (0,0,1), is sent before M2, timestamped (0,1,1). M2 reaches P1 first and is buffered; once M1 (0,0,1) is delivered, M2 (0,1,1) is delivered from the buffer.]

Note: the diagram shows, without vector clock increments, that M1 (0,0,1) must be delivered before M2 (0,1,1) at P1.

PROTOCOLS
1. Birman-Schiper-Stephenson Protocol
2. Schiper-Eggli-Sandoz Protocol

1. Birman-Schiper-Stephenson (BSS) Protocol

Broadcast based: a message sent is received by all other processes.
Deliver a message to a process only if the message immediately preceding it has been delivered to the process.
Otherwise, buffer the message.
Accomplished by using a vector accompanying the message.

Birman-Schiper-Stephenson Protocol

Pi stamps each message m it sends with a vector time VTm.
Pj, upon receiving message m from Pi with timestamp VTm, buffers it until:
  VTpj[i] = VTm[i] − 1
  for all k ≠ i, VTpj[k] >= VTm[k]
When Pj delivers message m, it updates VTpj.

BSS Algorithm ...

1. Process Pi increments the vector time VTpi[i], timestamps, and broadcasts the message m. VTpi[i] − 1 denotes the number of messages preceding m.
2. Pj ≠ Pi receives m. m is delivered when:
   a. VTpj[i] == VTm[i] − 1  [Pj has received all messages from Pi before m]
   b. VTpj[k] >= VTm[k] for all k in {1,2,..,n} − {i}, where n is the total number of processes. Delayed messages are queued in sorted order.  [Pj has received all those messages received by Pi before m]
   c. Concurrent messages are ordered by time of receipt.
3. When m is delivered at Pj, VTpj is updated according to Rule 2 of vector clocks.
2(a): Pj has received all of Pi's messages preceding m.
2(b): Pj has received all other messages received by Pi before it sent m.
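A sketch of the BSS delivery test and buffering loop, assuming messages arrive as (sender, vector, payload) triples (all names illustrative):

def can_deliver(vt_pj, vt_m, i):
    # (a) Pj has seen all of Pi's earlier messages;
    # (b) Pj has seen everything Pi had seen when it sent m.
    return (vt_pj[i] == vt_m[i] - 1 and
            all(vt_pj[k] >= vt_m[k] for k in range(len(vt_pj)) if k != i))

def on_receive(vt_pj, buffer, msg):
    buffer.append(msg)                          # msg = (sender i, VTm, payload)
    delivered = True
    while delivered:                            # keep flushing deliverable messages
        delivered = False
        for i, vt_m, payload in list(buffer):
            if can_deliver(vt_pj, vt_m, i):
                buffer.remove((i, vt_m, payload))
                for k in range(len(vt_pj)):     # update VTpj on delivery
                    vt_pj[k] = max(vt_pj[k], vt_m[k])
                delivered = True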

BSS Algorithm: e.g. 1

[Space-time diagram: P3 broadcasts M1 with timestamp (0,0,1) and P2 broadcasts M2; vector times shown are (1,0,1) and (2,2,1) at P1, (0,1,1) and (0,2,1) at P2, and (0,2,2) at P3.]

Implementing causal order using vector clocks in BSS: e.g. 2

[Space-time diagram: application processes P1, P2, P3 over a message service; P1 broadcasts with timestamp (1,0,0), P2 then broadcasts with timestamp (1,2,0), and P3, still at (0,0,0), receives P2's message first.]

P3's vector is at (0,0,0) and a message with timestamp (1,2,0) arrives from P2, i.e. P2 has received a message from P1 that P3 hasn't seen.

More detail of P3's message service:

receiver vector | sender | sender vector | decision | new receiver vector
(0,0,0) | P2 | (1,2,0) | buffer  | (0,0,0)   — P3 is missing a message from P1 that sender P2 has already received
(0,0,0) | P1 | (1,0,0) | deliver | (1,0,1)
(1,0,1) | P2 | (1,2,0) | deliver | (1,2,2)

In each case: do the sender and receiver agree on the state of all other processes? If the sender has a higher state value for any of these others, the message must be buffered.

2. SES Protocol

SES: Schiper-Eggli-Sandoz Algorithm.
No need for broadcast messages.
Each process maintains a vector V_P of size N − 1, where N is the number of processes in the system.
V_P is a vector of tuples (P, t): P is a destination process id and t a vector timestamp.
E.g. V_P : (P2, <1,1,0>)
Initially, V_P is empty (at the start point).
Tm: logical time of sending message m.
Tpi: present logical time at Pi.

SES Algorithm

Sending a message:
  Send message M, timestamped tm, along with V_P1 to P2.
  Insert (P2, tm) into V_P1, overwriting the previous value of (P2, t), if any.
  (P2, tm) itself is not sent. Any future message carrying (P2, tm) in its vector cannot be delivered to P2 until tm < tP2.

Delivering a message:
  If V_M (the vector sent with the message) does not contain any pair (P2, t), the message can be delivered.
  /* (P2, t) exists */ If t > Tp2, buffer the message (don't deliver).
  Else (t < Tp2), deliver it.
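A minimal sketch of the SES delivery test, assuming V_M arrives as a list of (destination, timestamp) pairs (names illustrative):

def vlt(t, u):                         # vector t < u, as defined earlier
    return all(a <= b for a, b in zip(t, u)) and t != u

def ses_can_deliver(v_m, my_id, t_p):
    # v_m: list of (dest_id, t) pairs carried with the message;
    # t_p: the receiver's present logical (vector) time.
    # Deliver only if every pair addressed to us satisfies t < Tp.
    return all(vlt(t, t_p) for dest, t in v_m if dest == my_id)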

SES Buffering Example

[Space-time diagram: P2 sends M1 to P1 at Tp2 = (0,1,0) with V_P2 empty, then sends M2 to P3 at (0,2,0) with V_P2 = (P1, <0,1,0>); P3, at Tp3 = (0,2,1), sends M3 to P1 at (0,2,2) with V_P3 = (P1, <0,1,0>); P1 ends at Tp1 = (2,2,2).]

SES Buffering Example ...

M1 from P2 to P1: M1 + Tm (= <0,1,0>) + empty V_P2
M2 from P2 to P3: M2 + Tm (<0,2,0>) + (P1, <0,1,0>)
M3 from P3 to P1: M3 + <0,2,2> + (P1, <0,1,0>)
M3 gets buffered because Tp1 is <0,0,0> and t in (P1, t) is <0,1,0>, so Tp1 < t.
When M1 is received by P1, Tp1 becomes <1,1,0>, by rules 1 and 2 of vector clocks.
After updating Tp1, P1 checks the buffered M3. Now Tp1 > t [in (P1, <0,1,0>)], so M3 is delivered.

SES Algorithm ...

On delivering the message:
  Merge V_M (in the message) with V_P2 as follows:
    If (P, t) is not in V_P2, add it.
    If (P, t) is present in V_P2, update t to max(t in V_M, t in V_P2).
  (A message cannot be delivered until t in V_M is greater than t in V_P2.)
Update site P2's local logical clock.
Check the buffered messages after the local logical clock update.

Global State

The global state of a distributed system consists of:
  the local state of each process, and
  the messages sent but not received (the state of the channels/queues).
Many applications need to know the state of the system, e.g. failure recovery and distributed deadlock detection.
Problem: how to figure out the state of a distributed system, given that each process is independent and there is no global clock or synchronization?

Global State

Due to the absence of a global clock, states are recorded at different times.
For global consistency, the state of a communication channel should be the sequence of messages sent before the sender's state was recorded, excluding the messages received before the receiver's state was recorded.
Local states are defined in the context of an application: a send is part of the local state if it happened before the state was recorded.

A message causes an inconsistency if its receive is recorded but its send is not.
A collection of local states forms a global state. This global state is consistent iff there are no pairwise inconsistencies between local states.
A message is in transit when it has been sent but not yet received.
The global state is transitless iff there are no local state pairs with messages in transit.
Transitless + Consistent → Strongly Consistent State

Global State: Example

1. GS = {LS1, LS2, LS3}
2. {LS11, LS22, LS32} is an inconsistent GS, as the receive is recorded but the send is not recorded for M2.
3. {LS12, LS23, LS33} is a consistent GS, as inconsistency is avoided (no. of messages sent vs. no. of messages received).
4. {LS11, LS21, LS31} is a strongly consistent GS.

[Space-time diagram: sites S1, S2, S3 with local states LS11, LS12 (at S1); LS21, LS22, LS23 (at S2); LS31, LS32, LS33 (at S3); and messages M1, M2, M3 exchanged between them.]

Chandy-Lamport Global State Recording (GSR) Algorithm

The idea behind this algorithm is that we can record a consistent state of the global system if we know that all messages that have been sent by one process have been received by another.
This is accomplished by the use of a Marker, which traverses the distributed system across all channels.
The Marker, in turn, causes each process to record a snapshot of itself and, eventually, of the entire system.
As long as the Marker can traverse the entire network in finite time, the algorithm works.
The primary benefit of the global state recording algorithm is the ability to detect a stable property of the distributed system. Such a property could be deadlock, or termination.

Assumptions of the C-L GSRA

There are a finite number of processes and communication channels.
Communication channels have infinite buffers and are error free.
Messages on a channel are received in the same order as they are sent.
Processes in the distributed system do not share memory or clocks.

C-L GSR Algorithm

Sender (process p):
  1.1] Record the state of p.
  1.2] For each outgoing channel c incident to p, send a marker before sending ANY other messages.

Receiver (process q receives a marker on channel c1):
  2.1] If q has not yet recorded its state:
    Record the state of q.
    Record the state of c1 as empty.
    For each outgoing channel c incident to q, send a marker before sending ANY other messages.
  2.2] If q has already recorded its state:
    Record the state of c1 as all messages received on it since the last time the state of q was recorded.

The algorithm terminates when every process has received a marker from every other process.
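A single-process sketch of the marker rules, assuming a send(channel, message) callback (all names illustrative):

class Snapshotter:
    def __init__(self, out_channels, in_channels, local_state):
        self.out, self.inc = out_channels, in_channels
        self.local_state = local_state
        self.recorded_state = None
        self.channel_state = {c: [] for c in in_channels}   # in-transit messages
        self.marker_seen = set()

    def record(self, send):                  # rules 1.1/1.2 (also used by 2.1)
        self.recorded_state = self.local_state
        for c in self.out:
            send(c, "MARKER")                # marker before any other message

    def on_message(self, channel, msg, send):
        if msg == "MARKER":
            if self.recorded_state is None:
                self.record(send)            # 2.1: state of this channel = empty
            self.marker_seen.add(channel)
            return self.marker_seen == set(self.inc)   # True when recording done
        if self.recorded_state is not None and channel not in self.marker_seen:
            self.channel_state[channel].append(msg)     # 2.2: record in-transit msg
        return False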

Example: pictured below is a system with three nodes.
All three processes begin with $500 and the channels are empty.
Therefore the stable property — the total number of dollars in the system — is $1500.

Technical requirement:

[Figure: an initiator process p and processes q, r connected by channels c1–c4; markers and per-process checkpoints are shown.]

Step 1:
Process p sends $10 to process q and then decides to initiate the global state recording algorithm:
p records its current state ($490) and sends out a marker along channel c1 (sender rule, process p).
Meanwhile, process q has sent $20 to p along channel c2, and q has also sent $10 to r along channel c3.

Step 2:
Process q receives the $10 transfer (increasing its value to $480) and then receives the marker on channel c1 (receiver rule, point 1).
Because it received a marker, process q records its state as $480 and then sends markers along each of its outgoing channels, c2 and c3.
Meanwhile, process r has sent $25 along c4. Note that it does not matter whether r sent this message before or after q received the marker.

Step 3:
Process r receives the $10 transfer and the marker from channel c3.
Therefore, r updates its state to $485 and records this state (receiver rule, point 1).
Process r also sends a marker on its outgoing channel, c4.
Meanwhile, process p has sent another $20 to process q along channel c1.

Step 4:
Process q receives the $20 transfer on channel c1 and updates its value to $500.
Notice that process q does not change its recorded state value (no marker was involved on that channel yet).
Also, process p receives the $20 transfer on channel c2.
Process p records the $20 transfer as part of its recorded channel state (because it arrived after the state recording had begun and the marker on that channel had not yet been received) (receiver rule, point 2).
Process p then receives the marker on channel c2 and can stop recording any further messages on that channel.
At p: 470 + 20 = 490.

Step 5:
Process p receives the $25 on channel c4.
p adds this to its recorded channel state and also changes its current value from $490 to $515.
When process p receives the marker on channel c4, the state recording is complete, because all processes have received markers on all of their input channels.
The final recorded state is shown in the table below.

[Table of recorded states; at p: 490 + 25 = 515.]

Snapshot Example

[Space-time diagram: P1, P2, P3 exchanging messages, with events e1i, e2i, e3i, markers M, and a consistent cut drawn across the execution.]

1- P1 initiates the snapshot: records its state (S1); sends Markers to P2 & P3; turns on recording for channels C21 and C31.
2- P2 receives a Marker over C12, records its state (S2), sets state(C12) = {}; sends Markers to P1 & P3; turns on recording for channel C32.
3- P1 receives a Marker over C21, sets state(C21) = {a}.
4- P3 receives a Marker over C13, records its state (S3), sets state(C13) = {}; sends Markers to P1 & P2; turns on recording for channel C23.
5- P2 receives a Marker over C32, sets state(C32) = {b}.
6- P3 receives a Marker over C23, sets state(C23) = {}.
7- P1 receives a Marker over C31, sets state(C31) = {}.

Consistent Cut = a time-cut across processors and channels such that no event after the cut happens-before an event before the cut.

Notable points of the C-L GSRA

The recorded global state may not be the same as any actual state, but it is equivalent (reachable) and consistent.
If Sinit and Sfinal are the global states when the Chandy-Lamport algorithm started and finished respectively, and S* is the state recorded by the algorithm, then:
  S* is reachable from Sinit
  Sfinal is reachable from S*
Specifically, one can show that there exists a computation seq', where seq' is a permutation of seq, such that Sinit, S* and Sfinal all occur as global states in seq':
  Sinit occurs earlier than S*: executing a prefix of seq' leads from Sinit to S*.
  S* occurs earlier than Sfinal in seq': executing the rest of the actions leads from S* to Sfinal.

Stability Detection

The reachability property of the snapshot algorithm is useful for detecting stable properties.
If a stable predicate is true in the state Ssnap, then we may conclude that the predicate is true in the state Sfinal.
Similarly, if the predicate evaluates to False for Ssnap, then it must also be False for Sinit.

Cuts

A cut is a graphical representation of a global state.
Cut C = {c1, c2, .., cn}, where ci is the cut event at site Si.
Theorem: A cut C is consistent iff no two cut events are causally related.

[Space-time diagram: sites S1, S2, S3 with cut events c1, c2, c3 and messages M1, M2, M3.]

Time of a Cut

Let C = {c1, c2, .., cn} be a cut, where ci is the cut event at site Si with vector timestamp VTci.
Vector time of the cut: VTc = sup(VTc1, VTc2, .., VTcn).
sup is the component-wise maximum, i.e. VTc[i] = max(VTc1[i], VTc2[i], .., VTcn[i]).
For consistency: no message is sent after the cut event that is received before the same cut event.
Theorem: A cut is consistent iff VTc = (VTc1[1], VTc2[2], .., VTcn[n]).
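A small sketch of this consistency test (vts[i] is the vector timestamp of cut event c_{i+1}; names illustrative):

def cut_is_consistent(vts):
    n = len(vts)
    sup = [max(vt[i] for vt in vts) for i in range(n)]    # component-wise max
    return sup == [vts[i][i] for i in range(n)]           # must equal the diagonal

print(cut_is_consistent([[1, 0], [1, 2]]))   # True: sup = (1,2) = diagonal
print(cut_is_consistent([[1, 3], [1, 2]]))   # False: a message crosses the cut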

Cut: consists of an event from each process.

[Space-time diagram: processes p0, p1, p2 with a cut line drawn across them; the vector time of the cut is shown.]

Consistent cut: no messages cross the cut.

[Space-time diagram: a cut of p0, p1, p2 with its vector time; no message crosses the cut line.]

Consistent cut: messages can cross from left to right of the cut.

[Space-time diagram: a message sent before the cut is received after it, which is allowed; the vector time of the cut is shown.]

Inconsistent cut: messages cross from right to left of the cut.

[Space-time diagram: a message sent after the cut is received before it, violating consistency.]

Inconsistent cut: messages cross from right to left of the cut.

[Another space-time diagram showing a message sent after the cut and received before it.]

Termination Detection

Termination: completion of the sequence of algorithm steps (e.g. leader election, deadlock detection, deadlock resolution).
System model:
  Processes can be active or idle (passive).
  Only active processes send messages.
  An idle process can become active on receiving a computation message.
  An active process can become idle at any time.
Termination: all processes are idle and no computation message is in transit.
A global snapshot can also be used to detect termination.

Termination Detection

Use a controlling agent or a monitor process.
Initially, all processes are idle. The weight of the controlling agent is 1 (0 for the others).
Start of computation: a message from the controller to a process; the weight is split into half (0.5 each).
Repeat this: any time a process sends a computation message to another process, it splits its weight between the two processes (e.g. 0.25 each the next time).
End of computation: a process sends its weight to the controller, which adds this weight to its own (the sending process's weight becomes 0).
Rule: the sum of all weights W is always 1.
Termination: when the weight of the controller becomes 1 again.

Huang's Algorithm

B(DW): computation message, where DW is the weight it carries.
C(DW): control / end-of-computation message.
Rule 1: Before sending B, compute W1, W2 (such that W1 + W2 = W of the process); send B(W2) to Pi and set W = W1.
Rule 2: On receiving B(DW), a process having weight W sets W = W + DW and becomes active.
Rule 3: Active to idle -> send C(DW) with DW = W, then set W = 0.
Rule 4: On receiving C(DW), the controlling agent sets W = W + DW; if W == 1, the computation has terminated.
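A minimal sketch of these four rules, assuming a controller that starts the computation and exact rational weights (all names illustrative):

from fractions import Fraction

class Proc:
    def __init__(self, weight=Fraction(0)):
        self.w, self.active = weight, False

    def send_B(self):                       # Rule 1: split weight, send half
        half = self.w / 2
        self.w -= half
        return half                         # DW carried by B

    def recv_B(self, dw):                   # Rule 2: absorb weight, become active
        self.w += dw
        self.active = True

    def go_idle(self):                      # Rule 3: return whole weight via C
        dw, self.w, self.active = self.w, Fraction(0), False
        return dw

controller = Proc(Fraction(1))
p = Proc()
p.recv_B(controller.send_B())               # controller starts the computation
controller.w += p.go_idle()                 # Rule 4: controller absorbs C(DW)
print(controller.w == 1)                    # True -> termination detected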

Huang's Algorithm: Example

Suppose P1 (weight 1) sends to P2 and then to P3; P3 in turn sends to P4 and then to P5.

[Diagram: the weight is halved at each send — P1 keeps 1/2, then 1/4; P3 receives 1/4 and passes 1/8 to P4 and 1/16 to P5; as processes turn idle the weights flow back until the controller's weight is 1 again.]

Unit II : Chapter 3
Distributed Mutual Exclusion

Introduction: Mutual Exclusion
Non-Token based Algorithms - Lamport's Algorithm
Token based Algorithms - Suzuki-Kasami's Broadcast Algorithm
Consensus and related problems

Distributed Mutual Exclusion: Introduction

In the problem of mutual exclusion, concurrent access to a shared resource by several uncoordinated user requests is serialized to secure the integrity of the shared resource.
It requires that the actions performed by a user on a shared resource be atomic (one at a time).
For correctness, it is necessary that the shared resource be accessed by a single site (or process) at a time.
Mutual exclusion is a fundamental issue in the design of distributed systems.
An efficient and robust technique for mutual exclusion is essential to the viable design of distributed systems.

Single-Computer vs. Distributed Systems

In single-computer systems, the status of a shared resource and the status of users are readily available in shared memory, and solutions to the mutual exclusion problem can easily be implemented using shared variables (e.g. semaphores).
However, in distributed systems, both the shared resources and the users may be distributed, and shared memory does not exist.
Consequently, approaches based on shared variables are not applicable to distributed systems; approaches based on message passing must be used.

DME Algorithms: Classification

Mutual exclusion algorithms can be grouped into 2 classes:
1. The algorithms in the first class are non-token-based.
2. The algorithms in the second class are token-based.
The execution of DME algorithms is mainly focused on the existence of a critical section (CS).
A critical section is the code segment in a process in which a shared resource is accessed.

1. The algorithms in the first class are non-token-based:

These algorithms require two (e.g. REQUEST and REPLY) or more (e.g. RELEASE) successive rounds of message exchanges among the sites.
These algorithms are assertion (criteria) based, because a site can enter its critical section (CS) when an assertion defined on its local variables becomes true.
Mutual exclusion is enforced because the assertion becomes true at only one site at any given time.

2. The algorithms in the second class are token-based:

In these algorithms, a unique token (also known as the PRIVILEGE message) is shared among the sites.
A site is allowed to enter its CS if it possesses the token, and it continues to hold the token until the execution of the CS is over.
These algorithms essentially differ in the way a site carries out the search for the token.

System Model

At any instant, a site may have several requests for the CS.
A site queues up these requests and serves them one at a time.
A site can be in one of the following three states:
  requesting the CS: in this state the site is blocked and cannot make further requests for the CS;
  executing the CS: performing the defined task;
  neither requesting nor executing the CS (i.e. idle): the site is executing outside its CS.
In the token-based algorithms, a site can also be in a state where it holds the token while executing outside the CS.

DME: 5 Requirements

1. Mutual exclusion: guarantee that only one request accesses the CS at a time.
2. Freedom from deadlocks: two or more sites should not endlessly wait for messages that will never arrive.
3. Freedom from starvation: a site should not be forced to wait indefinitely to execute the CS while other sites are repeatedly executing the CS; that is, every requesting site should get a chance to execute the CS in finite time.

Requirements (contd.)

4. Fairness: fairness dictates that requests must be executed in the order they are made (or the order in which they arrive in the system). Since a physical global clock does not exist, time is determined by logical clocks. Note that fairness implies freedom from starvation, but not vice versa.
5. Fault tolerance: a mutual exclusion algorithm is fault-tolerant if, in the wake of a failure, it can reorganize itself so that it continues to function without any (prolonged) disruptions.

DME: 4 Performance Metrics

The performance of mutual exclusion algorithms is generally measured by the following four metrics:
1. The number of messages necessary per CS invocation.
2. The synchronization delay, which is the time required after a site leaves the CS and before the next site enters the CS.

[Timeline: last site exits CS — synchronization delay — next site enters CS.]

3. The response time, which is the time interval a request waits for its CS execution to be over after its request messages have been sent.

[Timeline: CS request arrives and its request messages are sent — the site enters the CS — CS execution time — the site exits the CS; the response time spans from the request to the exit.]

4. The system throughput, which is the rate at which the system executes requests for the CS:
   system throughput = 1 / (sd + E)
where sd is the synchronization delay and E is the average critical section execution time.

Performance Measuring Parameters

1. Low and high LOAD performance:
   Low load condition: there is seldom more than one simultaneous request in the system.
   High load condition: there is always a pending request at a site.
2. Best and worst CASE performance:
   Best case: reflects the best possible value of the response time.
   Worst case: often coincides with the best-case value in DME algorithms.
Note: in case of fluctuating performance values we consider the average case.

DME: A Simple Solution

In a simple solution to distributed mutual exclusion, one site, called the control site, is assigned the task of granting permission for CS execution.
To request the CS, a site sends a REQUEST message to the control site.
The control site queues up the requests for the CS and grants them permission one by one.
This method of achieving mutual exclusion in distributed systems requires only three messages per CS execution.

Non-token-based algorithms

A site communicates with a set of other sites to arbitrate who should execute the CS next.
For a site Si, the request set Ri contains the ids of all those sites from which Si must acquire permission to enter the CS.
These algorithms use timestamps to order requests for the CS, which also helps in resolving conflicts.
Generally, a smaller-timestamp request has priority over larger-timestamp requests.
Sites maintain logical clocks and update them by Lamport's scheme.
Depending upon the way a site carries out its assertion, there are numerous non-token-based algorithms:
  Lamport's algorithm
  Ricart-Agrawala algorithm
  Maekawa's algorithm

Lamport's Algorithm

Lamport proposed a DME algorithm based on his clock synchronization scheme.
In Lamport's algorithm:
  Every site Si keeps a queue, request_queue_i, which contains mutual exclusion requests ordered by their timestamps.
  The algorithm requires messages to be delivered in FIFO order between every pair of sites.

DME: Lamport's Algorithm

Requesting the CS:
1. When a site Si wants to enter the CS, it sends a REQUEST(tsi, i) message to all the sites in its request set Ri and places the request on request_queue_i.
   Note: (tsi, i) is the timestamp of the request.

[Diagram: sites S1 and S2 make REQUESTs for the CS with timestamps (2,1) and (1,2); S3 receives both.]


DME: Lamport's Algorithm

Requesting the CS:
2. When a site Sj receives the REQUEST(tsi, i) message from site Si, it returns a timestamped REPLY message to Si and places site Si's request on request_queue_j.
   Note: (i) the lower-timestamp request has priority for the CS in the queue;
         (ii) with respect to time, the queue is ordered by timestamp.

[Diagram: S1, S2, S3 exchange REPLY messages; each site's request queue now holds the requests (2,1) and (1,2).]

DME: Lamport's Algorithm

Executing the CS:
Site Si enters the CS when the two following conditions hold:
  L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
  L2: Si's request is at the top of request_queue_i.

[Diagram: with the queues holding (2,1) and (1,2) at all sites, S2 — whose request (1,2) has the smaller timestamp — enters the CS.]

DME: Lamport's Algorithm

Releasing the CS:
Site Si, upon exiting the CS, removes its request from the top of its request queue and sends a timestamped RELEASE message to all the sites in its request set.
When a site Sj receives a RELEASE message from site Si, it removes Si's request from its request queue.

[Diagram: S2 exits the CS and sends RELEASE messages; the queues at S1, S2, S3 drop (1,2), leaving (2,1), and S1 then enters the CS.]
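A compact sketch of the three rules above for one site, assuming reliable FIFO channels, a send(dest, msg) primitive, and one outstanding request per site; L1 is simplified to "a REPLY from every other site" (all names illustrative):

import heapq

class LamportMutex:
    def __init__(self, my_id, others, send):
        self.id, self.others, self.send = my_id, others, send
        self.clock, self.queue, self.replies = 0, [], set()
        self.ts = None

    def request_cs(self):                        # Rule 1: timestamp, enqueue, broadcast
        self.clock += 1
        self.ts = (self.clock, self.id)
        heapq.heappush(self.queue, self.ts)
        for s in self.others:
            self.send(s, ("REQUEST", self.ts))

    def on_message(self, sender, kind, ts):
        self.clock = max(self.clock, ts[0]) + 1  # Lamport clock update
        if kind == "REQUEST":                    # Rule 2: enqueue and REPLY
            heapq.heappush(self.queue, ts)
            self.send(sender, ("REPLY", (self.clock, self.id)))
        elif kind == "REPLY":
            self.replies.add(sender)
        elif kind == "RELEASE":                  # drop sender's request
            self.queue = [t for t in self.queue if t[1] != sender]
            heapq.heapify(self.queue)

    def can_enter(self):                         # L1 (simplified) and L2
        return self.replies == set(self.others) and self.queue[0] == self.ts

    def release_cs(self):                        # Rule 3: dequeue own request, RELEASE
        self.queue = [t for t in self.queue if t[1] != self.id]
        heapq.heapify(self.queue)
        self.replies.clear()
        self.clock += 1
        for s in self.others:
            self.send(s, ("RELEASE", (self.clock, self.id)))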

Another Example

[Diagram of another run of Lamport's algorithm.]
Note: given the queue ordering, request (1,B) has the smallest timestamp and so is executed first.

DME: Lamport's Algorithm

Correctness:
  Lamport's algorithm achieves mutual exclusion.
Performance:
  Requires 3(N−1) messages per CS invocation: (N−1) REQUEST, (N−1) REPLY, and (N−1) RELEASE messages. The synchronization delay is T.
Optimization:
  Can be optimized to require between 3(N−1) and 2(N−1) messages per CS execution by suppressing REPLY messages in certain cases.
  E.g. suppose site Sj receives a REQUEST message from site Si after it has sent its own REQUEST message with a timestamp higher than that of site Si's request. In this case, site Sj need not send a REPLY message to site Si.

DME: Token-based Algorithms

In token-based algorithms, a unique token is shared among all sites.
A site is allowed to enter its CS if it possesses the token.
Token-based algorithms use sequence numbers instead of timestamps.
Every request for the token contains a sequence number, and the sequence numbers of sites advance independently.
A site increments its sequence number counter every time it makes a request for the token.
A primary function of the sequence numbers is to distinguish between old and current requests.

DME: Token-based Algorithms

Depending upon the way a site carries out its search for the token, there are numerous token-based algorithms:
  Suzuki-Kasami's broadcast algorithm
  Singhal's heuristic algorithm
  Raymond's tree-based algorithm

Suzuki-Kasami's Broadcast Algorithm

The main idea:
1. Completely connected network of processes.
2. There is one token in the network. The holder of the token has permission to enter the CS.
3. Any other process trying to enter the CS must acquire that token, by broadcasting a request to enter the CS.
4. The token moves from one process to another based on demand.

SK Algorithm: Requirements

Process j broadcasts REQUEST(j, num), where num is the sequence number of the request.
Each process maintains:
  an array req: RN[i] denotes the sequence number num of the latest request from process i.
Additionally, the holder of the token maintains:
  an array last: LN[i] denotes the sequence number of the latest visit to the CS by process i;
  a queue Q of waiting processes.

req: array[0..n-1] of integer
last: array[0..n-1] of integer

Algorithm:

If a node wants the TOKEN, it broadcasts a REQUEST message to all other nodes.

Requesting the CS:
1. Node j requesting its n-th CS invocation (n = 1, 2, 3, ...) broadcasts REQUEST(j, n), where n is the sequence number.
2. Node i receives REQUEST(j, n) and updates RNi[j] = max(RNi[j], n), where RNi[j] is the largest sequence number received so far from node j.

Executing the CS:
3. Node i executes the CS when it has received the TOKEN, where TOKEN = (Q, LN):
   Q — queue of requesting nodes;
   LN — array of size N such that LN[j] = the sequence number of the request of node j granted most recently.

Releasing the CS:
When node i finishes executing the CS, it does the following:
4. Set LN[i] = RNi[i] to indicate that the current request of node i has been granted (executed).
5. Every node k such that RNi[k] > LN[k] (i.e. node k is requesting) is appended to Q if it is not already there.
6. When these updates are complete, if Q is not empty, the front node is dequeued and the TOKEN is sent to it (first-come-first-served).
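A sketch of the token handling at one node, assuming send/broadcast primitives; the "idle holder passes the token on" check is a simplification of the full algorithm (all names illustrative):

class SKNode:
    def __init__(self, i, n):
        self.i, self.rn = i, [0] * n
        self.token = None                    # holder gets {"ln": [...], "q": [...]}

    def request(self, broadcast):            # step 1
        self.rn[self.i] += 1
        broadcast(("REQUEST", self.i, self.rn[self.i]))

    def on_request(self, j, num, send):      # step 2
        self.rn[j] = max(self.rn[j], num)
        # simplified: an idle token holder forwards the token for a current request
        if self.token and self.rn[j] == self.token["ln"][j] + 1:
            t, self.token = self.token, None
            send(j, ("TOKEN", t))

    def release(self, send):                 # steps 4-6, after leaving the CS
        t = self.token
        t["ln"][self.i] = self.rn[self.i]
        for k in range(len(self.rn)):
            if self.rn[k] > t["ln"][k] and k not in t["q"]:
                t["q"].append(k)
        if t["q"]:
            nxt = t["q"].pop(0)              # first-come-first-served
            self.token = None
            send(nxt, ("TOKEN", t))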

Example

There are three processes: p1, p2, and p3. p1 and p3 seek mutually exclusive access to a shared resource.
Initially: the token is at p2, and the token's state is LN = [0, 0, 0] with Q empty;
p1's state: n1 (seq #) = 0, RN1 = [0, 0, 0];
p2's state: n2 = 0, RN2 = [0, 0, 0];
p3's state: n3 = 0, RN3 = [0, 0, 0].

p1 sends REQUEST(1, 1) to p2 and p3; p1: n1 = 1, RN1 = [1, 0, 0].

Meanwhile, p3 sends REQUEST(3, 1) to p1 and p2; p3: n3 = 1, RN3 = [0, 0, 1].
But p2 receives REQUEST(1, 1) from p1 first; p2: RN2 = [1, 0, 0], holding the token.
p2 sends the token to p1.
p1 receives REQUEST(3, 1) from p3: n1 = 1, RN1 = [1, 0, 1].
p2 receives REQUEST(3, 1) from p3: RN2 = [1, 0, 1].
p3 receives REQUEST(1, 1) from p1; p3: n3 = 1, RN3 = [1, 0, 1].
p1 receives the token from p2.
p1 enters the critical section.
p1 exits the critical section and sets the token's state to LN = [1, 0, 0] and Q = (3).

p1 sends the token to p3; p1: n1 = 1, RN1 = [1, 0, 1], no longer holding the token; the token's state is LN = [1, 0, 0] and Q is now empty.
p3 receives the token from p1; p3: n3 = 1, RN3 = [1, 0, 1], holding the token.
p3 enters the critical section.
p3 exits the critical section and sets the token's state to LN = [1, 0, 1], Q empty.
Note: the algorithm can terminate here, since there are no more requests.

Correctness:

Mutual exclusion is guaranteed because there is only one token in the system and a site holds the token during CS execution.
Theorem: A requesting site enters the CS in finite time.
Proof:
Token request messages of a site Si reach the other sites in finite time.
Since one of these sites will have the token in finite time, site Si's request will be placed in the token queue in finite time.
Since there can be at most N − 1 requests in front of this request in the token queue, site Si will get the token and execute the CS in finite time.

Performance:

No message is needed and the synchronization delay is zero if a site holds the idle token at the time of its request.
It requires at most N message exchanges per CS execution ((N−1) REQUEST messages + 1 TOKEN message).
The synchronization delay in this algorithm is 0 or T.
Deadlock free (because of the token requirement).
No starvation (i.e. a requesting site enters the CS in finite time).

Consequences & Related Problems

Comparison of the Lamport and Suzuki-Kasami algorithms:
The essential difference is in who keeps the queue.
In Lamport's algorithm, every site keeps its own local copy of the queue.
In the Suzuki-Kasami broadcast algorithm, the queue is passed around within the token.

Chap. 4: Distributed Deadlock Detection

Introduction
Issues
Centralized Deadlock-Detection Algorithms
  1. The Completely Centralized Algorithm
  2. The Ho-Ramamoorthy Algorithms
Distributed Deadlock-Detection Algorithms
  1. A Path-Pushing Algorithm
  2. An Edge-Chasing Algorithm
  3. A Diffusion-Computation-Based Algorithm
  4. A Global-State-Detection-Based Algorithm

Deadlocks: An Introduction

What are DEADLOCKS?
A blocked process which can never be resolved unless there is some outside intervention.
For example: resource R1 is requested by process P1 but is held by process P2.

Conditions for deadlock

Mutual exclusion: the resource can be used by only one process at a time.
No preemption: resources are released voluntarily; neither another process nor the OS can force a process to release a resource.
Hold and wait: a process holds a resource while waiting for other resources.
Circular wait: a closed cycle of processes is formed, where each process holds one or more resources needed by the next process in the cycle.

Illustrating a Deadlock: Wait-For Graph (WFG)

Nodes: processes in the system.
Directed edges: the wait-for blocking relation.

[Diagram: two processes, each holding one resource and waiting for the resource held by the other.]

Wait-for graph (WFG): P1 -> P2 implies P1 is waiting for a resource held by P2.
In the figure, P1 is blocked and waiting for P2 to release Resource 2.
A cycle represents a deadlock.
There are two basic deadlock models: the AND model and the OR model.

AND Model

Deadlock criterion: presence of a cycle.

[WFG diagram with processes P1–P5 containing a cycle.]

OR Model

Deadlock criterion: presence of a knot.
Knot: a subset of a graph such that, starting from any node in the subset, it is impossible to leave the knot by following the edges of the graph.

[WFG diagram with processes P1–P6 containing a knot.]

Cycle vs. Knot

[Left diagram: P1–P5 with a cycle but no knot — deadlock in the AND model, but no deadlock in the OR model.]
[Right diagram: P1–P5 with both a cycle and a knot — deadlock in both the AND and OR models.]

Distributed Deadlock Detection

Assumptions:
1. The system has only reusable resources (CPU, main memory, I/O devices, etc.).
2. Access to resources is exclusive only (only one copy of each resource is present).
3. States of a process: running or blocked.
   Running state: the process has all the resources it needs.
   Blocked state: the process is waiting on one or more resources.
Types of distributed deadlocks:
1. Resource deadlocks
2. Communication deadlocks

Resource vs. Communication Deadlocks

Resource deadlocks: a set of deadlocked processes, where each process waits for a resource held by another process (e.g. a data object in a database, an I/O resource on a server).
Communication deadlocks: a set of deadlocked processes, where each process waits to receive messages (communication) from other processes in the set.

Basic Issues

Deadlock detection and resolution entails addressing two basic issues: first, detection of existing deadlocks, and second, resolution of detected deadlocks.
The detection of deadlocks involves two issues: maintenance of the wait-for graph (WFG), and searching the WFG for the presence of cycles (or knots).
In distributed systems, a cycle may involve several sites, so the search for cycles greatly depends upon how the WFG of the system is represented across the system.
Depending upon the manner in which WFG information is maintained and the search for cycles is carried out, there are centralized, distributed, and hierarchical algorithms for deadlock detection in distributed systems.

Basic Issues (contd.)

A correct deadlock detection algorithm must satisfy the following two conditions:
1. Progress (no undetected deadlock):
   The algorithm should be able to detect all existing deadlocks in finite time.
   It must continuously be able to detect deadlocks, including any newly formed deadlock.
2. Safety (no false deadlock):
   The algorithm should not report deadlocks that do not exist.
   It should take care of the global state: e.g. segments of a cycle may exist at different instants of time even though the complete cycle never exists.

Basic Issues (contd.)

Deadlock resolution involves breaking existing wait-for dependencies in the system WFG so as to resolve the deadlock.
It involves rolling back one or more of the deadlocked processes and assigning their resources to blocked processes in the deadlock so that they can resume execution.
It also involves the timely cleanup of deadlock information from the system, as stale information may lead to the detection of false deadlocks.

Control Organisation

Centralized control:
  A control site constructs wait-for graphs (WFGs) and checks for directed cycles.
  The WFG can be maintained continuously, or built on demand by requesting WFGs from individual sites.
Distributed control:
  The WFG is spread over different sites. Any site can initiate the deadlock detection process.
Hierarchical control:
  Sites are arranged in a hierarchy; a site checks for cycles only among its descendants.

Centralized Deadlock Detection

The Completely Centralized Algorithm
The Ho-Ramamoorthy Algorithms

The Completely Centralized Algorithm

A designated site, called the control site (coordinator), maintains the WFG of the entire system.
It checks the WFG for the existence of deadlock cycles whenever a request edge is added to the WFG.
Sites request or release resources through REQUEST and RELEASE messages for all resources, whether local or remote.
However, this is highly inefficient due to the concentration of all messages at one site: it imposes large delays, large communication overhead, and congestion of communication links.
Moreover, reliability is poor due to the single point of failure.

Example of the Completely Centralized Algorithm

[Figure: sites send REQUEST/RELEASE messages to the control site, which maintains the global WFG.]

Ho-Ramamoorthy 1-Phase Algorithm

Each site maintains 2 status tables: a resource status table and a process status table.
Resource table: keeps track of transactions that have locked or are waiting for resources.
Process table: keeps track of resources locked by or waited on by transactions.
One of the sites becomes the central control site.
The central control site periodically collects these 2 tables from each site and constructs a WFG from the transactions common to both tables.
No cycle, no deadlocks.

Shortcomings:
  Occurrence of phantom deadlocks.
  High storage & communication costs.

Example of a Phantom Deadlock:

[Diagram: processes P0, P1, P2 across systems A and B, with resource S.]

P1 releases resource S and asks for resource T.
2 messages are sent to the control site: 1. "releasing S", 2. "waiting for T".
Message 2 arrives at the control site first.
The control site builds a WFG with a cycle, detecting a phantom deadlock.

Ho-Ramamoorthy 2-Phase Algorithm

Each site maintains a status table of all processes initiated at that site, including all resources locked and all resources being waited on.
The controller periodically requests the status table from each site.
The controller then constructs the WFG from these tables and searches it for cycles.
If there are no cycles, there are no deadlocks.
Otherwise (a cycle exists): request the status tables again.
Construct the WFG based only on the transactions common to the 2 consecutive tables.
If the same cycle is detected again, the system is in deadlock.
Note: it was later proved that cycles in 2 consecutive reports need not result in a deadlock. Hence, this algorithm can detect false deadlocks.

Distributed Deadlock Detection Algorithms

A Path-Pushing Algorithm
An Edge-Chasing Algorithm
A Diffusion-Computation-Based Algorithm
A Global-State-Detection-Based Algorithm

An Overview

All sites collectively cooperate to detect a cycle in the state graph, which is likely to be distributed over several sites of the system.
The algorithm can be initiated whenever a process is forced to wait, either by the local site of the process or by the site where the process waits.

These algorithms can be divided into four classes:

1. Path-pushing:
   Path information is transmitted and accumulated.
   Distributed deadlocks are detected by maintaining an explicit global WFG (constructed locally and pushed to neighbors).
2. Edge-chasing (single-resource model, AND model):
   The presence of a cycle in a distributed graph structure is verified by propagating special messages, called probes, along the edges of the graph.
   The formation of a cycle is detected by a site if it receives the matching probe it sent previously.
3. Diffusion computation (OR model, AND-OR model):
   The deadlock detection computation is diffused through the WFG of the system.
4. Global state detection (unrestricted, P-out-of-Q model):
   Take a snapshot of the system and examine it for the condition of a deadlock.

Obermarck's Path-Pushing Algorithm

Individual sites maintain local WFGs.
A virtual node x exists at each site; node x represents the external processes.
Detection process:
  Case 1: If site Sn finds a cycle not involving x -> a deadlock exists.
  Case 2: If site Sn finds a cycle involving x -> a deadlock is possible.
(contd.)

If Case 2:
Site Sn sends a message containing its detected cycles to the other sites. All sites receive the message, update their WFGs, and re-evaluate the graph.
Consider a site Sj that receives the message:
  Site Sj checks for local cycles. If a cycle is found not involving the x of Sj -> a deadlock exists.
  If site Sj finds a cycle involving x, it forwards the updated message to the other sites.
The process continues until a possible deadlock is found.
If a process sees its own label come back, then it is part of a cycle and a deadlock is finally detected.
Note: the algorithm can detect false deadlocks, due to asynchronous snapshots at different sites.

Path-Pushing Algorithm: Performance

O(n(n−1)/2) message complexity to detect a deadlock, where n is the number of sites.
O(n) message size.
O(n) delay to detect a deadlock.

Obermarck's Algorithm: Example

[Figures: the initial state of the local WFGs at sites S1–S4, followed by iterations 1–4 in which detected paths are pushed between the sites.]
Note: if a process sees its own label come back, then it is part of a cycle, and the deadlock is finally detected.

2. Edge-Chasing Algorithm

Designed by Chandy-Misra-Haas for the AND request model.
A blocked process sends a probe message to the resource-holding process.
The probe message is a triplet (i, j, k), where:
  i : detection initiated by Pi,
  j : the message is sent by the site of Pj, and
  k : the message is sent to the site of Pk.
When a probe is received by a blocked process, it forwards it to the processes holding the requested resources.
A deadlock is finally detected when a probe returns to its initiator.

ALGORITHM:

Let Pi be the initiator.
If Pi is locally dependent on itself, then
  declare a deadlock
else
  send probe (i, j, k) to the home site of Pk for each j, k such that all of the following hold:
    Pi is locally dependent on Pj,
    Pj is waiting on Pk, and
    Pj and Pk are on different sites.

On receipt of probe (i, j, k), check the following conditions:
  Pk is blocked,
  dependent_k(i) = false, and
  Pk has not replied to all requests of Pj.
If these are all true, do the following:
  set dependent_k(i) = true
  if k = i, declare that Pi is deadlocked
  else
    send probe (i, m, n) to the home site of Pn for every m and n such that the following all hold:
      Pk is locally dependent on Pm,
      Pm is waiting on Pn, and
      Pm and Pn are on different sites.

Note: if k = i, declare that Pi is deadlocked. Put another way: a deadlock is finally detected when a probe returns to its initiator.

Analysis:
  Message complexity: m(n−1)/2 messages for m processes at n sites.
  Message length: fixed, 3 integer words (very small).
  Message delay: O(n) delay to detect a deadlock.
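A sketch of the probe-handling rule, assuming local tables wait_for (who each process waits on) and site_of, plus a send primitive; the "Pk has not replied to all requests of Pj" check is omitted for brevity (all names illustrative):

def on_probe(i, j, k, blocked, dependent, wait_for, site_of, send):
    if not blocked(k) or dependent[k].get(i, False):
        return                              # discard: Pk not blocked, or probe seen
    dependent[k][i] = True
    if k == i:
        print(f"P{i} is deadlocked")        # probe returned to its initiator
        return
    for m in wait_for.get(k, set()):        # Pk locally dependent on Pm ...
        for n in wait_for.get(m, set()):
            if site_of[m] != site_of[n]:    # ... and Pm waits on a remote Pn
                send(site_of[n], ("PROBE", i, m, n))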

A Book Example

[Figure: processes P1–P10 spread over sites S1, S2, S3; probes (1,3,4), (1,6,8) and (1,7,10) travel along the wait-for edges between sites, and probe (1,9,1) returns to the initiator P1, detecting the deadlock.]

3. Diffusion-Computation-Based Algorithm

Designed by Chandy (et al.) for the OR request model.
Processes are active or blocked.
A blocked process may start a diffusion.
If a deadlock is not detected, the process will eventually unblock and terminate the algorithm.
Message: query(i, j, k), where i = initiator of the check, j = immediate sender, k = immediate recipient.
Reply: reply(i, k, j).
num_i(k) = number of query messages sent by i to k.

ALGORITHM

Process Pi initiates by sending query(i, i, j) to all Pj on which Pi depends.
On receipt of query(i, j, k) by k:
  if k is not blocked, then discard the query;
  if k is blocked:
    if this is an engaging query, propagate query(i, k, m) to each m in k's dependent set;
    else
      if k has not been continuously blocked since engagement, discard the query;
      else send reply(i, k, j) to j.

On receipt of reply(i, j, k) by k:
  if this is not the last awaited reply, just decrement the awaited-reply count:
    num_k(i) = num_k(i) − 1;
  if this is the last reply, then
    if i = k, report a deadlock;
    else send reply(i, k, m) to the engaging process m.
At this point, a knot has been detected.
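A sketch of the engagement and reply bookkeeping at a blocked process, assuming the process stays blocked throughout (all names illustrative; the "not blocked" and "continuously blocked" checks are left out):

def on_query(state, i, j, k, dependents, send):
    st = state.setdefault((i, k), {"engaged": False, "awaited": 0, "engager": None})
    if not st["engaged"]:                       # engaging query: diffuse further
        st.update(engaged=True, engager=j, awaited=len(dependents))
        for m in dependents:
            send(m, ("QUERY", i, k, m))
    else:                                       # already engaged: echo a reply
        send(j, ("REPLY", i, k, j))

def on_reply(state, i, j, k, send):
    st = state[(i, k)]
    st["awaited"] -= 1                          # num_k(i) = num_k(i) - 1
    if st["awaited"] == 0:                      # last awaited reply
        if i == k:
            print(f"P{k}: deadlock (knot) detected")
        else:
            send(st["engager"], ("REPLY", i, k, st["engager"]))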

4. Global-State-Detection-Based Algorithm

Take a snapshot of the distributed WFG.
Global-state-detection-based deadlock detection algorithms exploit the following facts:
  A consistent snapshot of a distributed system can be obtained without freezing the underlying computation, and
  if a stable property holds in the system before the snapshot collection is initiated, this property will still hold in the snapshot.
Use graph reduction to check for deadlock:
  while there is an unblocked process, remove the process and all (resource-holding) edges to it;
  there is a deadlock if the remaining graph is non-null.

[Figures: example WFG snapshots and their reduction.]
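A minimal sketch of the graph-reduction test on a snapshot of the WFG (wfg maps each process to the set of processes it waits on; names illustrative):

def deadlocked(wfg):
    g = {p: set(q) for p, q in wfg.items()}
    changed = True
    while changed:
        changed = False
        for p in [p for p, waits in g.items() if not waits]:
            del g[p]                        # remove an unblocked process ...
            for waits in g.values():
                waits.discard(p)            # ... and all edges pointing to it
            changed = True
    return bool(g)                          # non-null remainder -> deadlock

print(deadlocked({1: {2}, 2: {3}, 3: {1}}))   # True: irreducible cycle
print(deadlocked({1: {2}, 2: set()}))         # False: fully reducible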

Unit III : Distributed Resource Management

Chapter 1: Distributed File Systems
  - Architecture
  - Mechanisms
  - Design Issues
  - Case Study: Sun Network File System
Chapter 2: Distributed Shared Memory
  - Architecture
  - DSM Algorithms (4)
  - Memory Coherence: Protocols
  - Design Issues

Chapter 3: Distributed Scheduling
  - Issues in Load Distributing
  - Components of a Load Distributing Algorithm
  - Load Distributing Algorithms (4)
  - Load Sharing Algorithms

Chapter 1: Distributed File Systems

A DFS is a resource management component of a distributed operating system.
It implements a common file system that can be shared by all the autonomous computers in the system.
Two important goals of distributed file systems:
Network transparency:
  To provide the same functional capabilities to access files distributed over a network.
  Users do not have to be aware of the location of files to access them.
High availability:
  Users should have the same easy access to files irrespective of their physical location.
  System failures or regularly scheduled activities such as backups or maintenance should not result in the unavailability of files.

Architecture

Files can be stored at any machine and computation can be performed at any machine.
A machine can access a file stored on a remote machine, where the file access operations (like a read) are performed and the data is returned.
Alternatively, file servers can be provided, dedicated to storing files and performing storage and retrieval operations.
The two most important services in a DFS are:
  Name server: a process that maps names specified by clients to stored objects, e.g. files and directories.
  Cache manager: a process that implements file caching, i.e. copying a remote file to the client's machine when referenced by the client.

Architecture of a DFS

A cache manager can be present at both the client and the file server.
The cache manager at the client reduces the access delay due to network latency.
The cache manager at the server caches files in main memory to reduce the delay due to disk latency.

Data Access Actions in a DFS

[Figure: the sequence of data access actions between client, caches, and file server.]

Mechanisms for Building a DFS

1. Mounting: binding together of different filename spaces.
2. Caching: reduces delays in the accessing of data.
3. Hints: an alternative to caching which helps in recovery.
4. Bulk data transfer: reduces the high cost of communication protocols.
5. Encryption: provides the security aspect of the DFS.

1. Mounting

Allows the binding together of different filename spaces to form a single, hierarchically structured name space.
A name space (or a collection of files) can be bound to, or mounted at, an internal node or a leaf node of the name space tree.
A node onto which a name space is mounted is known as a mount point.
The kernel maintains a structure called the mount table, which maps mount points to the appropriate storage devices.

In the case of a DFS, file systems maintained by remote servers are mounted by the clients.
Two approaches to maintaining the mount information are:
1. At the client, where each client individually mounts every required file system (e.g. Sun Network File System).
   Here, clients need not see an identical filename space.
   Each client needs to update its own mount table.
2. At the server, where every client sees an identical filename space (e.g. Sprite file system).
   Mount information is updated at the servers.

2. Caching

Reduces delays in the accessing of data by exploiting the temporal locality of reference exhibited by programs.
Data can be cached either in the main memory or on the local disk of the client.
Data is also cached in main memory at the server (the server cache) to reduce disk access latency.
Caching increases system performance and reduces the frequency of access to the file server and the communication network.
It also improves the scalability of the file system.

3. Hints

An alternative to cached data that helps overcome the inconsistency problem when multiple clients access shared data.
Hints help in recovery when invalid cached data are discovered.
For example, after the name of a file or directory is mapped to the physical object, the address of the object can be stored as a hint in the cache. If the cached address later fails to map to the object, it is discarded and the name is resolved again.

4. Bulk Data Transfer

Used to overcome the high cost of executing communication protocols, i.e. assembly/disassembly of packets, copying of buffers between layers, etc.
Transferring data in bulk reduces the protocol processing overhead at both the server and the client.
Multiple consecutive data blocks are transferred from server to client instead of just the block referenced by the client.

5. Encryption

Used to enforce security in distributed systems: two entities wishing to communicate establish a key for the conversation.
It is important to note that the conversation key is determined by the authentication server, which never sends it in plain text to either of the entities.

Design Issues

1. Naming and Name Resolution
2. Caches on Disk or Main Memory
3. Writing Policy
4. Cache Consistency
5. Availability
6. Scalability
7. Semantics

1. Naming and Name Resolution

A name in a file system is associated with an object (e.g. a file or a directory).
Name resolution refers to the process of mapping a name to an object or, in the case of replication, to multiple objects.
A name space is a collection of names which may or may not share an identical resolution mechanism.
Three approaches to naming files in a distributed environment:
  Concatenating the host name to the (unique) names of the files stored on that server.
  Mounting remote directories onto local directories (Sun NFS).
  Maintaining a single global directory structure (Sprite and Apollo).

The Concept of Contexts

The notion of a context is used to partition a name space based on:
  geographical boundaries,
  organizational boundaries,
  a specific host,
  file system type, etc.
A context identifies the name space in which to resolve a given name:
  In the x-Kernel Logical File System, a user defines his own file space hierarchy in which internal nodes correspond to contexts.
  In the Tilde naming scheme, the name space is partitioned into sets of logically independent directory trees called tilde trees.

Name Server

Resolves names in distributed systems.
A name server is a process that maps names specified by clients to stored objects such as files and directories.
Clients can send their queries to a single name server, which maps each name to the object.
Drawbacks: single point of failure, performance bottleneck.
The alternative is to have several name servers, e.g. Domain Name Servers, where replication of tables can achieve fault tolerance and high performance.

2. Caches on Disk or Main Memory

Cache in main memory:
  Diskless workstations can also take advantage of caching.
  Accessing a cache in main memory is much faster than accessing a cache on the local disk.
  The server cache is in main memory, hence a single cache design can serve both clients and servers.
  Disadvantages:
    It competes with the virtual memory system for physical memory space.
    It requires a more complex cache manager and memory management system.
    Large files cannot be cached completely in memory.
Cache on local disk:
  Large files can be cached without affecting performance.
  Virtual memory management remains simple.

3. Writing Policy

The writing policy decides when a modified cache block at a client should be transferred to the server.
Write-through policy: all writes requested by applications at clients are also carried out at the server immediately.
Delayed writing policy: modifications due to a write are reflected at the server after some delay.
Write-on-close policy: the updating of the file at the server is not done until the file is closed.

4. Cache Consistency

Two approaches guarantee that the data returned to the client is valid:
Server-initiated approach:
  The server informs cache managers whenever the data in client caches becomes stale.
  Cache managers at clients can then retrieve the new data or invalidate the blocks containing the old data.
Client-initiated approach:
  It is the responsibility of the cache managers at the clients to validate data with the server before returning it to the client.
Both are expensive, since the communication cost is high.

Alternative approach:
Concurrent-write sharing: a file is open at multiple clients and at least one has it open for writing.
  When this occurs for a file, the file server informs all the clients to flush their cached data items belonging to that file.
Major issue:
Sequential-write sharing causes cache inconsistency when:
  a client opens a file while it still has outdated blocks in its cache, or
  a client opens a file whose current data blocks are still in another client's cache, waiting to be flushed (as can happen with the delayed writing policy).

5. Availability

Availability measures immunity to the failure of servers or the communication network.
Issue: what level of availability do files in a distributed file system have?
Resolution: use replication to increase availability, i.e. many copies (replicas) of files are maintained at different sites/servers.
Replication is expensive because of:
  the extra storage space required, and
  the overhead incurred in maintaining all the replicas up to date.
Replication issues:
  How to keep replicas consistent?
  How to detect inconsistency among replicas?

Causes of inconsistency:
  A replica is not updated due to the failure of its server.
  Not all file servers are reachable from all clients due to network partition.
  The replicas of a file in different partitions are updated differently.

Unit of replication:
  A file.
  A group of files:
    a) Volume: the group of all files of a user or group, or all files in a server.
       Advantage: ease of implementation.
       Disadvantage: wasteful, since a user may need only a subset replicated.
    b) Primary pack vs. pack:
       Primary pack: all files of a user.
       Pack: a subset of the primary pack; each pack can receive a different degree of replication.

Replica Management

Deals with the maintenance of replicas and with making use of them to provide increased availability; it is concerned with the consistency among replicas.
A weighted voting scheme (e.g. Roe File System):
  the latest updates are identified using read/write quorums based on timestamps.
Designated agents scheme (e.g. Locus):
  one or more processes/sites (also called current synchronization sites) are designated as agents controlling access to the replicas of files.
Backup servers scheme (e.g. Harp File System):
  one designated site acts as the primary, and the other replicas serve as backups.

6. Scalability

Scalability is the suitability of the design of a system to cater to the demands of a growing system.
As the system grows larger, both the size of the server state and the load due to invalidations increase.
The structure of the server process also plays a major role in deciding how many clients a server can support:
  If the server is designed with a single process, then many clients have to wait for a long time whenever a disk I/O is initiated.
  These waits can be avoided if a separate process is assigned to each client.
  An alternative is to use lightweight processes (threads).

7. Semantics

The semantics of a file system characterize the effects of accesses on files.
Expected semantics: a read will return the data stored by the latest write.
To guarantee the above semantics, the possible options are:
  All reads and writes from the various clients go through the server.
    Disadvantage: communication overhead.
  Use of a lock mechanism: sharing is disallowed either by the server or by the use of locks by client applications.
    Disadvantage: the file is not always available.

Case Study: The Sun Network File System (NFS)

Developed by Sun Microsystems to provide a distributed file system independent of the hardware and operating system.
The goal is to share a file system in a transparent way.
Uses the client-server model.
NFS is stateless:
  The server does not maintain any record of past requests.
  All client requests must be self-contained, carrying all of their information.
This enables fast crash recovery, which is the major reason behind the stateless design.

Basic Design

Three important parts:
  The protocol
  The client side
  The server side

1. Protocol

Uses the Sun RPC mechanism and the Sun eXternal Data Representation (XDR) standard.
Defined as a set of remote procedures.
The protocol is stateless:
  Each procedure call contains all the information necessary to complete the call.
  The server maintains no state between calls.

2. Client Side

Provides a transparent interface to NFS.
Mapping between remote file names and remote file addresses is done through remote mounts:
  an extension of UNIX mounts, specified in a mount table.
A new virtual file system (VFS) interface supports:
  VFS calls, which operate on whole file systems, and
  VNODE calls, which operate on individual files.
It treats all files in the same fashion.
Note — vnode (virtual node):
  There is a network-wide vnode for every object in the file system (file or directory) — the equivalent of a UNIX inode.
  The VFS keeps a mount table, allowing any node to be a mount node.

3. Server side
The server implements a write-through policy (required by statelessness):
  Any blocks modified by a write request must be written back to disk before the call completes.

NFS Architecture
1. System call interface layer
   a) Presents sanitized, validated requests in a uniform way to the VFS.
2. Virtual file system (VFS) layer
   b) Gives a clean layer between the user and the file system.
   c) Acts as a deflection point by using global vnodes.
   d) Understands the difference between local and remote names.
   e) Keeps in-memory information about what should be deflected (mounted directories) and how to get to these remote directories.

4. NFS client code:
Creates an r-node (remote i-node) in its internal tables to hold the file handle. The v-node points to the r-node. Each v-node in the VFS layer ultimately contains either a pointer to an r-node in the NFS client code or a pointer to an i-node in the local operating system. Thus from the v-node it is possible to see whether a file or directory is local or remote and, if it is remote, to find its file handle.

5. Caching to improve performance:
Transfers between client and server are done in large chunks, normally 8 Kbytes, even if fewer bytes are requested. This is known as read-ahead.
The same holds for writes: if a write system call writes fewer than 8 Kbytes, the data are just accumulated locally. Only when the entire 8K chunk is full is it sent to the server. However, when a file is closed, all of its data are sent to the server immediately.

NFS (cont.): Naming and location

Workstations are designated as clients or file servers.
A client defines its own private file system by mounting a subdirectory of a remote file system on its local file system.
Each client maintains a table which maps the remote file directories to servers.
Mapping a filename to an object is done the first time a client references the file. Example:
  Filename: /A/B/C
  Assume A corresponds to vnode1.
  Lookup on vnode1/B returns vnode2 for B, where vnode2 indicates that the object is on server X.
  The client asks server X to look up vnode2/C.
  A file handle is returned to the client by the server storing that file.
  The client uses the file handle for all subsequent operations on that file.

NFS (cont.): Caching

Caching is done in the main memory of clients, for: file blocks, translations of filenames to vnodes, and attributes of files and directories.

(1) Caching of file blocks
  Cached on demand, with the timestamp of when the file was last modified on the server.
  An entire file is cached, if under a certain size, with the timestamp of when it was last modified.
  After a certain age, blocks have to be validated with the server.
  Delayed-writing policy: modified blocks are flushed to the server after a certain delay.

(2) Caching of filenames to vnodes for remote directory names
  Speeds up the lookup procedure.

(3) Caching of file and directory attributes
  Updated when new attributes are received from the server; discarded after a certain time.

Stateless server:
  File access requests from clients contain all needed information (pointer position, etc.).
  Servers have no record of past requests.
  Simple recovery from crashes.

Chapter 2: Distributed Shared Memory

- Distributed shared memory (DSM) implements the shared-memory model in distributed systems, which have no physical shared memory.
- The shared-memory model provides a virtual address space shared between all nodes.
- To overcome the high cost of communication in distributed systems, DSM systems move data to the nodes that access it.

DSM Architecture
[Figure: Node 1, Node 2, ... Node n, each with its own memory and a mapping manager, connected by a communication network; the mapping managers together implement the shared memory (virtual address space).]

Architecture of DSM
- Programs access data in the shared address space just as they access data in traditional virtual memory.
- Data moves between main memory and secondary memory (within a node) and between the main memories of different nodes.
- Each data object is owned by a node:
  - The initial owner is the node that created the object.
  - Ownership can change as the object moves from node to node.
- When a process accesses data in the shared address space, the mapping manager maps the shared memory address to physical memory (local or remote).
- Mapping manager: a layer of software, perhaps bundled with the OS or provided as a runtime library.

Advantages of distributed shared memory (DSM)
1. Data sharing is implicit, hiding the data movement (as opposed to Send/Receive in the message-passing model).
2. Passing data structures containing pointers is easier (in the message-passing model, data moves between different address spaces).
3. Moving an entire block/page to the user takes advantage of locality of reference: the block/page containing the referenced data moves as a unit, which makes referencing the associated data easier.

Advantages of distributed shared memory (DSM)
4. Less expensive to build than a tightly coupled multiprocessor system: off-the-shelf hardware, no expensive interface to shared physical memory.
5. Very large total physical memory across all nodes: large programs can run more efficiently.
6. No serial access to a common bus for shared physical memory, unlike tightly coupled multiprocessor systems, which access main memory via a common bus.
7. Programs written for shared-memory multiprocessors can be run on DSM systems with minimal changes.

Algorithms for implementing DSM

Issues:
- How to keep track of the location of remote data
- How to minimize communication overhead when accessing remote data
- How to access remote data concurrently at several nodes

Types of algorithms:
1. Central-server
2. Data migration
3. Read-replication
4. Full-replication

1. The Central-Server Algorithm
- A central server maintains all shared data:
    Read request: returns the data item.
    Write request: updates the data and returns an acknowledgement message.
- Implementation (see the sketch below):
    A timeout is used to resend a request if the acknowledgement fails.
    Associated sequence numbers can be used to detect duplicate write requests.
    If an application's request to access shared data fails repeatedly, a failure condition is sent to the application.
- Issues: performance and reliability; the central server can become a bottleneck.
- Possible solutions:
    Partition shared data between several servers.
    Use a mapping function to distribute/locate data.
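
A minimal Java sketch of the central-server idea above; all class and method names here are invented for illustration. The per-client sequence numbers implement the duplicate-write detection just mentioned: a client that times out re-sends the same write with the same sequence number, and the server acknowledges but does not re-apply it.

  import java.util.HashMap;
  import java.util.Map;

  class CentralServer {
      private final Map<String, Integer> data = new HashMap<>();
      private final Map<String, Long> lastSeqSeen = new HashMap<>(); // per-client

      synchronized int read(String key) {
          return data.getOrDefault(key, 0);
      }

      // Returns the acknowledgement once the write is applied; a duplicate
      // of an already-applied write is acknowledged but otherwise ignored.
      synchronized boolean write(String clientId, long seqNo, String key, int value) {
          long seen = lastSeqSeen.getOrDefault(clientId, -1L);
          if (seqNo <= seen) return true;          // duplicate request: ack only
          data.put(key, value);
          lastSeqSeen.put(clientId, seqNo);
          return true;
      }
  }

The client side (not shown) would start a timer per request and re-send with the same sequence number until the acknowledgement arrives, reporting failure to the application after repeated timeouts.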

2. The Migration Algorithm
- Operation:
    Ship (migrate) the entire data object (page, block) to the requesting location.
    Allow only one node to access a shared datum at a time.
- Advantages:
    Takes advantage of locality of reference.
    DSM can be integrated with the VM system at each node.
- To locate a remote data object:
    Use a location server, maintain hints at each node, or broadcast a query.
- Issues:
    Only one node can access a data object at a time.
    Thrashing can occur; to minimize it, set a minimum time the data object must reside at a node.
    Thrashing: if two nodes compete for write access to a single data item, it may be transferred back and forth at such a high rate that no real work can get done (a ping-pong effect).

3. The Read-Replication Algorithm
Extends the migration algorithm: data is replicated at multiple nodes for read access.
Write operation:
  One node has write access (multiple-readers, one-writer protocol).
  After a write, all copies of the shared data at other nodes are invalidated (or updated with the modified value).
[Figure: a data access request at Node i triggers data replication to Node i and an invalidate message to the copy at Node j.]
The DSM must keep track of the location of all the copies of shared data.
Read cost is low; write cost is higher.

4. The Full-Replication Algorithm
Extends the read-replication algorithm: multiple nodes can read and multiple nodes can write (multiple-readers, multiple-writers protocol).
Issue: consistency of data for multiple writers.
Solution: use of a gap-free sequencer (see the sketch below):
  All writes are sent to the sequencer.
  The sequencer assigns a sequence number and sends the write request to all sites that have copies.
  Each node performs writes according to the sequence numbers.
  A gap in the sequence numbers indicates a missing write request: the node asks for retransmission of the missing write requests.
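
A minimal Java sketch of the sequencer scheme above, with invented names. The sequencer stamps every write with the next sequence number; each replica applies writes strictly in sequence order and asks for retransmission when it observes a gap.

  import java.util.SortedMap;
  import java.util.TreeMap;

  class Sequencer {
      private long next = 0;
      synchronized long assign() { return next++; }   // gap-free numbering
  }

  class Replica {
      private long expected = 0;                       // next number to apply
      private final SortedMap<Long, Runnable> pending = new TreeMap<>();

      synchronized void deliver(long seq, Runnable write) {
          pending.put(seq, write);
          while (pending.containsKey(expected)) {      // apply the in-order prefix
              pending.remove(expected).run();
              expected++;
          }
          if (!pending.isEmpty() && pending.firstKey() > expected) {
              requestRetransmission(expected);         // gap detected
          }
      }

      private void requestRetransmission(long missing) {
          System.out.println("missing write #" + missing + "; requesting re-send");
      }
  }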

Memory Coherence
Memory is said to be coherent when the value returned by a read operation is the value the programmer expected (e.g., the value of the most recent write).
In DSM, memory coherence is maintained when the coherence protocol is chosen in accordance with a consistency model.
A mechanism that controls/synchronizes accesses is needed to maintain memory coherence, based on one of the following models:
1. Strict consistency: requires total ordering of requests; a read returns the most recent write.
2. Sequential consistency: a system is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.
3. General consistency: all copies of a memory location (replicas) eventually contain the same data when all writes issued by every processor have been completed.
4. Processor consistency: operations issued by a processor are performed in the same order they are issued.
5. Weak consistency: synchronization operations are guaranteed to be sequentially consistent.
6. Release consistency: provides acquire and release synchronization operations; shared memory is made consistent when a release is performed.

Coherence Protocols
Issues:
- How do we ensure that all replicas have the same information?
- How do we ensure that nodes do not access stale (old) data?
1. Write-invalidate protocol (see the sketch below)
- Invalidate (nullify) all copies except the one being modified before the write can proceed.
- Once invalidated, data copies cannot be used.
- Advantage: good performance for
    many updates between reads,
    per-node locality of reference.
- Disadvantage:
    Invalidations are sent to all nodes that have copies.
    Inefficient if many nodes access the same object.
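
A minimal Java sketch of write-invalidate for a single shared item; the directory class and its methods are invented for illustration. Before the write proceeds, every other copy holder is dropped, so stale copies can no longer be read.

  import java.util.HashSet;
  import java.util.Set;

  class WriteInvalidateDirectory {
      private final Set<Integer> copyHolders = new HashSet<>();
      private int value;

      synchronized int read(int nodeId) {
          copyHolders.add(nodeId);                   // node now caches a valid copy
          return value;
      }

      synchronized void write(int writerId, int newValue) {
          copyHolders.removeIf(n -> n != writerId);  // invalidate all other copies
          value = newValue;
          copyHolders.add(writerId);
      }
  }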

2. Write-update protocol
- Causes all copies of shared data to be updated.
- More difficult to implement; guaranteeing consistency may be harder, as reads may happen in between write-updates.
Examples of implementations of memory coherence:
1. Cache coherence in the PLUS system
2. Type-specific memory coherence in the Munin system (based on process synchronization)
3. Unifying synchronization and data transfer in Clouds

Cache Coherence in the PLUS System
Based on the write-update protocol; supports general consistency.
A Memory Coherence Manager (MCM) running at each node is responsible for maintaining consistency.
Unit of replication: a page (4 Kbytes).
Unit of memory access and coherence maintenance: one 32-bit word.
A virtual page corresponds to a list of replicas of a page; one of the replicas is designated the master copy.
A distributed linked list (the copy-list) identifies the replicas of a page. The copy-list has two pointers:
  the master pointer, and
  the next-copy pointer.

PLUS: Read/Write Operations
Read operation:
  On a read fault, if the address points to local memory, read it. Otherwise, the local MCM sends a read request to its counterpart at the specified remote node.
  Data returned by the remote MCM is passed back to the requesting processor.
Write operation:
  To maintain consistency, writes are always performed first on the master copy and then propagated to the copies linked by the copy-list.
  On a write fault, an update request is sent to the remote node pointed to by the MCM.
  If the remote node does not have the master copy, the update request is sent on to the node with the master copy for further propagation.

PLUS Write-Update Protocol (distributed copy-list)
[Figure: nodes 1-4 hold replicas of X on page p; node 1 holds the master copy, and the next-copy pointers chain node 1 -> node 2 -> node 3 (nil); node 4's page table maps X to node 2.]
Steps for a write to X issued at node 4:
1. Node 4's MCM sends the write request to node 2.
2. Node 2's MCM sends an update message to the master node.
3. The master node's MCM updates X.
4. An update message is sent to the next copy.
5. That node's MCM updates X.
6. An update message is sent to the next copy.
7. X is updated there.
8. The MCM sends an acknowledgement: update complete.

PLUS: Protocol
The node issuing a write is not blocked on the write operation.
However, a read on that location (being written into) is blocked until the whole update is completed (i.e., pending writes are remembered).
Strong ordering holds within a single processor independent of replication (in the absence of concurrent writes by other processors), but not with respect to another processor.
Write-fence operation: strong ordering with synchronization among processors; the MCM waits for previous writes to complete.

Type-Specific Memory Coherence in the Munin System

Uses application-specific semantic information to classify shared objects, with class-specific handlers.
Shared object classes based on access pattern:
1. Write-once objects: written at the start, read many times after that. Replicated on demand, accessed locally at each site. For large objects, portions can be replicated instead of the whole object.
2. Private objects: accessed by a single thread. Not managed by the coherence manager unless accessed by a remote thread.
3. Write-many objects: modified by multiple threads between synchronization points. Munin employs delayed updates: updates are propagated only when a thread synchronizes. Weak consistency.
4. Result objects: the assumption is that concurrent updates to different parts of a result object will not conflict, and the object is not read until all parts are updated, so delayed update can be efficient.

5. Synchronization objects: e.g., distributed locks giving exclusive access to data objects.
6. Migratory objects: accessed in phases, where each phase is a series of accesses by a single thread: lock + movement, i.e., the object migrates to the node requesting the lock.
7. Producer-consumer objects: written by one thread, read by another. Strategy: move the object to the reading thread in advance.
8. Read-mostly objects: writes are infrequent. Use broadcasts to update cached objects.
9. General read-write objects: do not fall into any of the above categories. Use the Berkeley ownership protocol, which supports strict consistency. Objects can be in states such as:
   Invalid: no useful data.
   Unowned: has valid data; other nodes have copies, and the object cannot be updated without first acquiring ownership.
   Owned exclusively: can be updated locally; can be replicated on demand.
   Owned non-exclusively: cannot be updated before invalidating other copies.

Design Issues
1. Granularity: the size of the shared memory unit.
   For better integration of DSM and local memory management, the DSM page size can be a multiple of the local page size.
   Integration with local memory management provides built-in protection mechanisms to detect faults, and to prevent and recover from inappropriate references.
   Larger page size:
     More locality of reference.
     Less overhead for page transfers.
     Disadvantage: more contention for page accesses.
   Smaller page size:
     Less contention.
     Reduces false sharing, which occurs when two data items are not actually shared by two processors but contention arises because they sit on the same page.

2. Page Replacement
   Needed because physical/main memory is limited.
   Data may be used in many modes: shared, private, read-only, writable, etc.
   A Least Recently Used (LRU) replacement policy cannot be used directly in DSMs supporting data movement; modified policies are more effective:
     Private pages may be removed ahead of shared ones, as shared pages would have to be moved across the network.
     Read-only pages can simply be deleted, as their owners keep a copy.
   A page being replaced should not be lost forever:
     Swap it onto the local disk.
     Send it to the owner.
     Use reserved memory in each node for swapping.

Unit III: Chapter 3

Distributed Scheduling
- Issues in Load Distributing
- Components of a Load Distributing Algorithm
- Load Distributing Algorithms (4)
- Selection of Load Sharing Algorithms

Introduction
Good resource allocation schemes are needed to fully utilize the computing capacity of a distributed system.
The distributed scheduler is a resource management component of a distributed operating system.
It focuses on judiciously and transparently redistributing the load of the system among the computers.
The target is to maximize the overall performance of the system.
Most suitable for distributed systems based on LANs.

Issues in Load Distribution

1. Load
   Resource queue lengths, particularly the CPU queue length, are good indicators of load.
   Measuring CPU queue length is fairly simple and carries little overhead.
   CPU queue length does not always reflect the true situation, as jobs may differ in type.
   Another load-measuring criterion is processor utilization:
     Requires a background process that monitors CPU utilization continuously, imposing more overhead.
     Used in most load-balancing algorithms.

2. Classification of Load Distributing Algorithms
   The basic function is to transfer load from heavily loaded systems to idle or lightly loaded systems.
   These algorithms can be classified as:
   (1) Static (load assigned before the application runs)
       Does not consider system state.
       Uses static information about average behavior.
       Load distribution decisions are hard-wired into the algorithm using a priori knowledge of the system.
       Little run-time overhead.

(2) Dynamic (load assigned as applications run)
    Takes the current system state into account when making load-distributing decisions.
    Further categorized as:
      Centralized (tasks assigned by the master or root process)
      De-centralized (tasks reassigned among slaves)
    Has some overhead for state monitoring.
(3) Adaptive
    A special case of dynamic algorithms: they modify the algorithm itself based on system state parameters.
    For example, stop collecting information (go static) if all nodes are busy, so as not to impose extra overhead.

3. Load Balancing vs. Load Sharing
   Load-balancing approach:
     Tries to equalize the load at all processors.
     Moves tasks more often than load sharing, transferring tasks at a higher rate and incurring much more overhead.
     Load balancing is an NP-complete problem.
     Requires background processes to measure processor utilization.
   Load-sharing approach:
     Tries to reduce the load on the heavily loaded processors only.
     Probably a better solution; much less overhead.
     If the transfer rate of load sharing rises, its behavior approaches load balancing.

4. Preemptive vs. Non-Preemptive Transfer
   Can a task be transferred to another processor once it starts executing?
   Non-preemptive transfer (task placement):
     Can only transfer tasks that have not yet begun execution.
     Has to transfer environment information such as the program code and data, environment variables, working directory, inherited privileges, etc.
     It is simple.
   Preemptive transfer:
     Can transfer a task that has partially executed.
     Has to transfer the entire state of the task: virtual memory image, process control block, unread I/O buffers and messages, file pointers, timers that have been set, etc.
     It is expensive.

Components of a Load Distributing Algorithm

1. Transfer policy
   Determines whether a processor is in a suitable state to participate in a task transfer.
2. Selection policy
   Selects a task for transfer, once the transfer policy decides that the processor is a sender.
3. Location policy
   Finds suitable processors (senders or receivers) to share load.
4. Information policy
   Decides:
     When information about the state of other processors should be collected.
     Where it should be collected from.
     What information should be collected.

1. Transfer Policy
   Determines whether a processor is a sender or a receiver:
     Sender: overloaded processor
     Receiver: underloaded processor
   Threshold-based transfer:
     Establish a threshold, expressed in units of load.
     When a new task originates on a processor, if the load on that processor exceeds the threshold, the transfer policy decides that the processor is a sender.
     When the load at a processor falls below the threshold, the transfer policy decides that the processor can be a receiver.

2. Selection Policy
   Selects which task to transfer; preferred candidates are:
     Newly originated tasks (simple: the task has just started).
     Long tasks (the response-time improvement compensates for the transfer overhead).
     Tasks of small size.
     Tasks with a minimum of location-dependent system calls (residual bandwidth minimized).
     Tasks of lowest priority.
   Priority assignment policies:
     Selfish: local processes are given priority.
     Altruistic: remote processes are given priority.
     Intermediate: priority is based on the ratio of local to remote processes in the system.

3. Location Policy
   Once the transfer policy designates a processor as a sender, it finds a receiver (or, once the transfer policy designates a processor as a receiver, it finds a sender).
   Polling: one processor polls another to find out if it is a suitable partner for load distribution, selecting the processor to poll either:
     randomly,
     based on information collected in previous polls, or
     on a nearest-neighbor basis.
   Processors can be polled either serially or in parallel (e.g., multicast).
   There is usually some limit on the number of polls; if that limit is exceeded, the load distribution is not done.

4. Information Policy
   Decides when information about the state of other processors should be collected, where it should be collected from, and what information should be collected.
   Demand-driven:
     A processor collects the state of the other processors only when it becomes either a sender or a receiver (based on the transfer and selection policies).
     Dynamic: driven by system state.
       Sender-initiated: senders look for receivers to transfer load onto.
       Receiver-initiated: receivers solicit load from senders.
       Symmetrically-initiated: a combination in which both senders and receivers search for partners.

Periodic:
- Processors exchange load information at periodic intervals.
- Based on the information collected, the transfer policy on a processor may decide to transfer tasks.
- Does not adapt to system state: it collects the same information (overhead) at high system load as at low system load.
State-change-driven:
- Processors propagate state information whenever their state changes by a certain degree.
- Differs from demand-driven in that a processor propagates information about its own state, rather than collecting information about the state of other processors.
- May send to a central collection point or to peers.

Stability
The two views of stability are:
The queuing-theoretic perspective:
  A system is termed unstable if the CPU queues grow without bound, i.e., when the long-term arrival rate of work to the system is greater than the rate at which the system can perform work.
The algorithmic perspective:
  If an algorithm can perform fruitless actions indefinitely with finite probability, the algorithm is said to be unstable.

Load Distributing Algorithms

Sender-Initiated Algorithms
Receiver-Initiated Algorithms
Symmetrically Initiated Algorithms
Adaptive Algorithms


1. Sender-Initiated Algorithms
   Activity is initiated by an overloaded node (the sender).
   A task is sent to an underloaded node (the receiver).
   A CPU queue threshold T is decided for all nodes.
   Transfer policy:
     A node is identified as a sender if a new task originating at the node makes the queue length exceed the threshold T.
   Selection policy:
     Only newly arrived tasks are considered for transfer.

Location policy:
  Random: a dynamic location policy (select any node to transfer the task to, at random).
    The selected node X may itself be overloaded.
    If the transferred task is treated as a new arrival, X may transfer the task again.
    No prior information exchange.
    Effective under light-load conditions.
  Threshold: poll nodes until a receiver is found.
    Up to PollLimit nodes are polled.
    If none is a receiver, the sender commits to processing the task locally.
  Shortest: among the polled nodes that were found to be receivers, select the one with the shortest queue.
Information policy:
  Demand-driven.
Stability:
  The location policies adopted cause system instability at high loads.

[Flowchart: a task arrives and increments QueueLength; if QueueLength + 1 > T, the node repeatedly selects a random node i not yet in its poll set and polls it; if i's QueueLength is below T, the task is transferred to i; once the number of polls reaches PollLimit without finding a receiver, the task is queued locally. A sketch of this policy follows.]
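
A minimal Java sketch of the sender-initiated threshold policy in the flowchart above. The helpers queueLengthOf(), enqueueLocally() and transferTo() are hypothetical stubs standing in for the real load probe and transfer mechanism.

  import java.util.Random;

  class SenderInitiated {
      static final int T = 4, POLL_LIMIT = 5;   // threshold and poll limit (assumed values)
      final Random rnd = new Random();

      void onTaskArrival(int self, int nodeCount, Runnable task) {
          if (queueLengthOf(self) + 1 <= T) { enqueueLocally(task); return; }
          for (int polls = 0; polls < POLL_LIMIT; polls++) {
              int candidate = rnd.nextInt(nodeCount);
              if (candidate == self) continue;
              if (queueLengthOf(candidate) < T) {   // receiver found
                  transferTo(candidate, task);
                  return;
              }
          }
          enqueueLocally(task);                      // no receiver: keep the task
      }

      int queueLengthOf(int node) { return 0; }      // stub: CPU queue length probe
      void enqueueLocally(Runnable t) { }            // stub: run the task here
      void transferTo(int node, Runnable t) { }      // stub: ship the task to node
  }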

2. Receiver-Initiated Algorithms
Initiated by an underloaded node (receiver), to obtain a task from an overloaded node (sender).

Transfer policy:
  Triggered when a task departs; the node compares its CPU queue length with T, and if it is smaller, the node is a receiver.
Selection policy:
  Any approach; preference is given to non-preemptive transfers.
Location policy:
  Randomly poll nodes until a sender is found, and transfer a task from it. If no sender is found, wait for a period or until a task completes, and repeat.
Information policy:
  Demand-driven.
Stability:
  At high loads, a receiver will find a sender with high probability using a small number of polls. At low loads, most polls fail, but this is not a problem, since CPU cycles are available.
  Most transfers are preemptive and therefore expensive.

[Flowchart: when a task departs at node j and the local QueueLength drops below T, node j repeatedly selects a random node i not yet in its poll set and polls it; if node i's QueueLength exceeds T, a task is transferred from i to j; if PollLimit polls fail, node j waits for a predetermined period before trying again. A companion sketch follows.]
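
A companion Java sketch of the receiver-initiated policy, under the same assumptions as before (queueLengthOf(), pullTaskFrom() and scheduleRetryLater() are hypothetical stubs).

  import java.util.Random;

  class ReceiverInitiated {
      static final int T = 2, POLL_LIMIT = 5;   // assumed values
      final Random rnd = new Random();

      void onTaskDeparture(int self, int nodeCount) {
          if (queueLengthOf(self) >= T) return;          // not a receiver
          for (int polls = 0; polls < POLL_LIMIT; polls++) {
              int candidate = rnd.nextInt(nodeCount);
              if (candidate != self && queueLengthOf(candidate) > T) {
                  pullTaskFrom(candidate);               // sender found
                  return;
              }
          }
          scheduleRetryLater();   // wait for a period, then poll again
      }

      int queueLengthOf(int node) { return 0; }          // stub: load probe
      void pullTaskFrom(int node) { }                    // stub: task transfer
      void scheduleRetryLater() { }                      // stub: timer
  }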

3. Symmetrically Initiated Algorithms

Both senders and receivers search for receivers and senders, respectively, for task transfer.
They combine sender-initiated and receiver-initiated components to get a hybrid algorithm with the advantages of both.
Care must be taken, since otherwise the hybrid algorithm may inherit the disadvantages of both the sender- and receiver-initiated algorithms.
The most popular algorithm: the above-average algorithm by Krueger and Finkel.

The Above-Average Algorithm of Krueger and Finkel
Maintains each node's load within an acceptable range of the system average load.
Transfer policy:
  Two thresholds are used, equidistant from the node's estimate of the average load across all nodes.
  Nodes are classified as senders, receivers, or OK.
Location policy:
  Has a sender-initiated and a receiver-initiated component.
Selection policy: same as before.
Information policy:
  The average system load is determined individually at each node.
  The thresholds can adapt to the system state to control responsiveness.

Location Policy of Krueger and Finkel's Algorithm
Sender-initiated part:
  A sender sends a TooHigh msg, sets a TooHigh timeout, and listens for Accept msgs.
  A receiver that gets a TooHigh msg cancels its TooLow timeout, sends an Accept msg, increases its load value, and sets an AwaitingTask timeout. If the AwaitingTask timeout expires, the load value is decreased.
  A sender receiving an Accept msg transfers a task and cancels the timeout.
  If a sender receives a TooLow msg from a receiver while waiting for an Accept, it sends a TooHigh msg to that receiver.
  A sender whose TooHigh timeout expires broadcasts a ChangeAverage msg to all nodes, to increase the average load estimate at the other nodes.

Location Policy of Krueger and Finkel's Algorithm (cont.)
Receiver-initiated part:
  A receiver sends a TooLow msg, sets a TooLow timeout, and starts listening for TooHigh msgs.
  A receiver getting a TooHigh msg sends an Accept msg, increases its load value, and sets an AwaitingTask timeout. If the timeout expires, the load value is decreased.
  A receiver whose TooLow timeout expires sends a ChangeAverage msg, to decrease the average load estimate at the other nodes.

4. Adaptive Algorithms
1. A Stable Symmetrically Initiated Algorithm
   Utilizes the information gathered during polling to classify the nodes in the system as Sender, Receiver, or OK.
   The knowledge concerning the state of the nodes is maintained by a data structure at each node, comprising a senders list, a receivers list, and an OK list.
   Initially, each node assumes that every other node is a receiver.
   Transfer policy:
     Triggered when a new task originates or when a task departs.
     Makes use of two threshold values: lower (LT) and upper (UT).
   Location policy:
     Sender-initiated component: polls the node at the head of the receivers list.
     Receiver-initiated component: polls in three orders - head-to-tail (senders list), tail-to-head (OK list), tail-to-head (receivers list).
   Selection policy: newly arrived tasks (sender-initiated component); other approaches (receiver-initiated component).
   Information policy: demand-driven.

2. A Stable Sender-Initiated Algorithm
   Uses the sender-initiated component of the previous stable symmetric algorithm, augmented as follows:
     The information at each node J is augmented with a state vector.
     V(i) = sender, receiver, or OK, depending on whether node J knows that it is on node i's senders, receivers, or OK list.
     Each node thus keeps track of which list it belongs to at every other node.
     The state vector is kept up to date during polling.
   The receiver-initiated component is as follows:
     Whenever a node becomes a receiver, it notifies all misinformed nodes, using its state vector.

Selecting a Suitable Load-Sharing Algorithm

Based on the performance trends of load-sharing algorithms, one can select an algorithm appropriate to the system under consideration as follows:
1. If the system under consideration never attains high load, sender-initiated algorithms will give an improved average response time over no load sharing at all.
2. Stable scheduling algorithms are recommended for systems that can reach high load. These algorithms perform better than non-adaptive algorithms for the following reasons:

a. Under sender-initiated algorithms, an overloaded processor must send inquiry messages, delaying the existing tasks. If an inquiry fails, two overloaded processors are adversely affected because of unnecessary message handling. The performance impact of an inquiry is therefore quite severe at high system loads, where most inquiries fail.
b. Receiver-initiated algorithms remain effective at high loads but require the use of preemptive task transfers. Note that preemptive task transfers are expensive compared to non-preemptive task transfers, because they involve saving and communicating a far more complicated task state.
3. For a system that experiences a wide range of load fluctuations, the stable symmetrically initiated scheduling algorithm is recommended, because it provides improved performance and stability over the entire spectrum of system loads.

4. For a system that experiences wide fluctuations in load and has a high cost for the migration of partly executed tasks, stable sender-initiated algorithms are recommended: they perform better than unstable sender-initiated algorithms at all loads, perform better than receiver-initiated algorithms over most system loads, and are stable at high loads.
5. For a system that experiences heterogeneous work arrival, adaptive stable algorithms are preferable, as they provide substantial performance improvement over non-adaptive algorithms.

Question Bank
1. What are the central issues in load distributing?
2. What are the components of a load distributing algorithm?
3. Differentiate between load balancing and load sharing.
4. Discuss the above-average load sharing algorithm.
5. How will you select a suitable load sharing algorithm?
6. Write a short note on (any one expected):
   Sender-Initiated Algorithms
   Receiver-Initiated Algorithms
   Symmetrically Initiated Algorithms
   Adaptive Algorithms

Unit IV
Chapter 1: Transaction and Concurrency:
  Introduction
  Transactions
  Nested Transactions
  Methods of Concurrency Control:
    Locks
    Optimistic concurrency control
    Time Stamp Ordering
    Comparison of concurrency control methods

Chapter 2: Distributed Transactions:
  Introduction
  Flat and Nested Distributed Transactions
  Atomic Commit Protocols
  Concurrency Control in Distributed Transactions

Unit IV: Chapter 1

Introduction: Transaction Concept
Supports the daily operations of an organization.
A collection of database operations, reliably and efficiently processed as one unit of work:
  no lost data, even in the presence of interference among multiple users or of failures.

Airline Transaction Example


START TRANSACTION
Display greeting
Get reservation preferences from user
SELECT departure and return flight records
If reservation is acceptable then
UPDATE seats remaining of departure flight
record
UPDATE seats remaining of return flight record
INSERT reservation record
Print ticket if requested
End If
On Error: ROLLBACK
COMMIT

Transaction concept
Transaction: specified by a client as a set of operations on objects, to be performed as an indivisible unit by the servers that manage those objects.
Goal of a transaction: ensure that all the objects managed by a server remain in a consistent state when accessed by multiple transactions (client side) and in the presence of server crashes.
  Objects that can be recovered after a server crash are called recoverable objects.
  Objects on a server are stored in volatile memory (RAM) or in persistent memory (disk).
To enhance reliability:
  recovery from failures;
  recording in permanent storage.

Introduction to this chapter

The focus is on single-server transactions.
A transaction defines a sequence of server operations that is guaranteed to be atomic in the presence of multiple clients and server crashes.
Topics:
  Nested transactions
  Methods of concurrency control
All concurrency control protocols are based on serial equivalence and are derived from the rules for conflicting operations.

Operations of the Account interface
  deposit(amount): deposit amount in the account
  withdraw(amount): withdraw amount from the account
  getBalance() -> amount: return the balance of the account
  setBalance(amount): set the balance of the account to amount

Operations of the Branch interface
  create(name) -> account: create a new account with a given name
  lookUp(name) -> account: return a reference to the account with the given name
  branchTotal() -> amount: return the total of all the balances at the branch

Note: each account is represented by a remote object whose interface Account provides the operations above; each branch of the bank is represented by a remote object whose interface Branch provides the operations above.

The banking example: the client works on behalf of users, who look up accounts and then invoke operations through the Account interface.
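
The two interfaces above, rendered as a minimal Java sketch; the operation names come from the slide, while the use of int amounts and String names is an assumption for illustration.

  interface Account {
      void deposit(int amount);     // deposit amount in the account
      void withdraw(int amount);    // withdraw amount from the account
      int getBalance();             // return the balance of the account
      void setBalance(int amount);  // set the balance of the account
  }

  interface Branch {
      Account create(String name);  // create a new account with a given name
      Account lookUp(String name);  // return a reference to the named account
      int branchTotal();            // total of all balances at the branch
  }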

Simple Synchronization (without Transactions)

Multi-threaded banking server:
Main issue: unless a server is carefully designed, operations performed on behalf of different clients may sometimes interfere with one another. Such interference may result in incorrect values in the objects.
Client operations can be synchronized without recourse to transactions by:
(i) Atomic operations at the server.
(ii) Enhancing client cooperation by signaling (synchronization of server operations).

(i) Atomic operations at the server:
The use of multiple threads is beneficial to performance, but multiple threads may access the same objects.
For example, for the deposit and withdraw methods, the actions of two concurrent executions could be interleaved arbitrarily and have strange effects on the instance variables of the account object.
The synchronized keyword can be applied to a method in Java, so that only one thread at a time can access the object.
E.g., in the Account implementation we can declare the method as synchronized:
  public synchronized void deposit(int amount) { ... }

If one thread invokes a synchronized method on an object, then that object is locked; another thread that invokes one of its synchronized methods will be blocked until the lock is released.
Operations that are free from interference by concurrent operations being performed in other threads are called atomic operations.
The use of synchronized methods in Java is one way of achieving atomic operations; they can be achieved with any mutual exclusion mechanism. A sketch follows.
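
A minimal Java sketch of such a thread-safe account (the class name is invented). Marking every method synchronized makes each one atomic with respect to the others, so two concurrent deposits can no longer interleave their read-modify-write steps.

  class SynchronizedAccount {
      private int balance;

      public synchronized void deposit(int amount)    { balance += amount; }
      public synchronized void withdraw(int amount)   { balance -= amount; }
      public synchronized int getBalance()            { return balance; }
      public synchronized void setBalance(int amount) { balance = amount; }
  }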

(ii) Enhancing client cooperation by signaling:
Clients may use a server as a means of sharing some resources: e.g., some clients update the server's objects and other clients access them.
In some applications, however, threads need to communicate and coordinate their actions, as in the producer-consumer problem, using wait and notify actions (see the sketch below).
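
A minimal Java sketch of producer-consumer coordination through a server object using wait/notify (class and method names invented): a consumer thread blocks until a producer has deposited a value.

  class SharedSlot {
      private Integer value = null;

      synchronized void put(int v) {
          value = v;
          notifyAll();                  // wake up waiting consumers
      }

      synchronized int take() throws InterruptedException {
          while (value == null) wait(); // release the lock and sleep until notified
          int v = value;
          value = null;
          return v;
      }
  }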

Failure model for transactions [Lamport 1981]

Writes to permanent storage may fail:
  write nothing, or write a wrong value
  file storage may decay
  reading bad data can be detected (by checksum)
Servers may crash occasionally:
  memory recovers to the last updated state
  recovery continues using information in permanent storage
  no arbitrary failures
Messages may suffer an arbitrary delay; a message may be lost or duplicated.
Transactions
The transaction concept comes originally from database management systems.
Clients require a sequence of separate requests to a server to be atomic in the sense that:
  they are free from interference by operations being performed on behalf of other concurrent clients; and
  either all of the operations are completed successfully, or they have no effect at all in the presence of server crashes.

E.g.: a client's banking transaction

A client performs a sequence of operations on particular accounts on behalf of a user.
Consider accounts with names A, B and C. The client looks them up and stores references to them in variables a, b and c of type Account.

Transaction T:
  a.withdraw(100);
  b.deposit(100);
  c.withdraw(200);
  b.deposit(200);

This is called an atomic transaction.

Two Aspects of Atomicity

All or nothing: a transaction either completes successfully, in which case the effects of all of its operations are recorded in the objects, or it has no effect at all. This covers:
  Failure atomicity: the effects are atomic even when the server crashes.
  Durability: after a transaction has completed successfully, all its effects are saved in permanent storage, for later recovery.
Isolation: each transaction must be performed without interference from other transactions; the intermediate effects of a transaction must not be visible to other transactions.

ACID properties of transactions

Atomicity: a transaction must be all or nothing.
Consistency: a transaction takes the system from one consistent state to another consistent state; the state during a transaction is invisible to other transactions.
Isolation: transactions are serially equivalent (serializable).
Durability: the effects of successful transactions are saved and are recoverable.

Using a transaction
Transaction coordinator:
  Each transaction is created and managed by a coordinator.
Result of a transaction:
  Success, or
  Aborted: initiated by the client, or initiated by the server.

Operations in the Coordinator interface

openTransaction() -> trans;
  starts a new transaction and delivers a unique TID trans. This identifier is used in the other operations of the transaction.
closeTransaction(trans) -> (commit, abort);
  ends a transaction: a commit return value indicates that the transaction has committed; an abort return value indicates that it has aborted.
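
A minimal Java rendering of the coordinator operations above; the long TID type is an assumption, and abortTransaction is included because it appears in the transaction life histories that follow.

  interface Coordinator {
      long openTransaction();               // returns a new unique TID
      boolean closeTransaction(long trans); // true = committed, false = aborted
      void abortTransaction(long trans);    // client-initiated abort
  }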

Transaction life histories

Successful:
  openTransaction
  operation
  operation
  ...
  closeTransaction

Aborted by client:
  openTransaction
  operation
  operation
  ...
  abortTransaction

Aborted by server:
  openTransaction
  operation
  operation (server aborts the transaction)
  operation (ERROR reported to client)

If a transaction aborts for any reason (self-abort or server abort), it must be guaranteed that future transactions will not see its effects, either in the objects or in their copies.

Major Issues of Transactions

1. Concurrency control
2. Recoverability from aborts

1. Concurrency control
Problems of concurrent transactions:
  The lost update problem
  Inconsistent retrievals
  Conflicts in operations

The lost update problem

Accounts a, b and c initially have balances 100, 200 and 300. T transfers an amount from a to b; U transfers an amount from c to b.

Transaction T:                      Transaction U:
balance = b.getBalance();           balance = b.getBalance();
b.setBalance(balance*1.1);          b.setBalance(balance*1.1);
a.withdraw(balance/10);             c.withdraw(balance/10);

Interleaving:
T: balance = b.getBalance();    $200
U: balance = b.getBalance();    $200
T: b.setBalance(balance*1.1);   $220
U: b.setBalance(balance*1.1);   $220
T: a.withdraw(balance/10);      $80
U: c.withdraw(balance/10);      $280

The final balance of b should be $242 rather than $220: one update is lost.

The inconsistent retrievals problem

Accounts a and b both start with $200.

Transaction V:                 Transaction W:
a.withdraw(100);               aBranch.branchTotal()
b.deposit(100);

Interleaving:
V: a.withdraw(100);                  $100
W: total = a.getBalance();           $100
W: total = total + b.getBalance();   $300
W: total = total + c.getBalance();   ...
V: b.deposit(100);                   $300

W's retrievals are inconsistent: V had performed only its withdraw part at the time the sum was calculated, so the total over a and b is 300 instead of 400.

How to overcome these problems

If these transactions are done one at a time in some order, then the final result will be correct.
If we do not want to sacrifice concurrency, an interleaving of the operations of the transactions may still lead to the same effect as if the transactions had been performed one at a time in some order.
We then say it is a serially equivalent interleaving.

Serial equivalence
What is serial equivalence?
  An interleaving of the operations of transactions in which the combined effect is the same as if the transactions had been performed one at a time in some order.
Significance:
  It is the criterion for correct concurrent execution.
  It avoids lost updates and inconsistent retrievals.

A serially equivalent interleaving of T and U

Transaction T:                      Transaction U:
balance = b.getBalance()            balance = b.getBalance()
b.setBalance(balance*1.1)           b.setBalance(balance*1.1)
a.withdraw(balance/10)              c.withdraw(balance/10)

T: balance = b.getBalance()     $200
T: b.setBalance(balance*1.1)    $220
U: balance = b.getBalance()     $220
U: b.setBalance(balance*1.1)    $242
T: a.withdraw(balance/10)       $80
U: c.withdraw(balance/10)       $278

A serially equivalent interleaving of V and W

Transaction V:                 Transaction W:
a.withdraw(100);               aBranch.branchTotal()
b.deposit(100);

V: a.withdraw(100);                  $100
V: b.deposit(100);                   $300
W: total = a.getBalance();           $100
W: total = total + b.getBalance();   $400
W: total = total + c.getBalance();   ...

Conflicting operations
When we say a pair of operations conflicts, we mean that their combined effect depends on the order in which they are executed, e.g., a read and a write.
Serial equivalence of two transactions requires that all pairs of conflicting operations of the two transactions be executed in the same order at all of the objects they both access.

Read and write operation conflict rules

Operations of different transactions | Conflict | Reason
read, read    | No  | The effect of a pair of read operations does not depend on the order in which they are executed.
read, write   | Yes | The effect of a read and a write operation depends on the order of their execution.
write, write  | Yes | The effect of a pair of write operations depends on the order of their execution.

A non-serially equivalent interleaving of operations of transactions T and U

Transaction T:        Transaction U:
x = read(i)
write(i, 10)
                      y = read(j)
                      write(j, 30)
write(j, 20)
                      z = read(i)

The ordering is not serially equivalent, because the pairs of conflicting operations are not done in the same order at both objects.
A serially equivalent ordering requires one of the following two conditions:
1. T accesses i before U, and T accesses j before U.
2. U accesses i before T, and U accesses j before T.

Recoverability from aborts

The two problems here are:
  Dirty reads
  Premature writes

Dirty Reads
The isolation property of transactions requires that a transaction does not see the uncommitted state of another transaction.
The dirty read problem is caused by the interaction between a read operation in one transaction and an earlier write operation on the same object in another transaction.

A dirty read when transaction T aborts (account a starts at $100)

Transaction T:                          Transaction U:
balance = a.getBalance();      $100
a.setBalance(balance+10);      $110
                                        balance = a.getBalance();   $110
                                        a.setBalance(balance+20);   $130
                                        commit transaction;
abort transaction;

U has committed a result based on T's uncommitted (and later aborted) write: a dirty read.
- Recoverability of transactions: delay commits until after the commitment of any other transaction whose uncommitted state has been observed.
- Cascading aborts: the aborting of one transaction may cause further transactions to be aborted. To avoid cascading aborts, transactions are only allowed to read objects that were written by committed transactions.

Premature writes
This problem is related to the interaction between write operations on the same object belonging to different transactions.
It uses the concept of the "before image" of a write operation.

Overwriting uncommitted values (account a starts at $100)

Transaction T:             Transaction U:
a.setBalance(105);  $105
                           a.setBalance(110);  $110

If T then aborts, restoring T's "before image" ($100) would wrongly undo U's write as well.
- Strict executions of transactions: the service delays both read and write operations on an object until all transactions that previously wrote that object have either committed or aborted.
- Tentative versions: update operations performed during a transaction are done in tentative versions of objects, in volatile memory.

Nested Transactions
Several transactions may be started from within a transaction, allowing transactions to be regarded as modules that can be composed.

T: top-level transaction (commit)
  T1 = openSubTransaction            -> T1: provisional commit
    T11 = openSubTransaction         -> T11: provisional commit
    T12 = openSubTransaction         -> T12: provisional commit
  T2 = openSubTransaction            -> T2: abort
    T21 = openSubTransaction         -> T21: provisional commit
      T211 = openSubTransaction      -> T211: provisional commit

A subtransaction appears atomic to its parent with respect to transaction failures and to concurrent access.
Subtransactions at the same level (say T1 and T2) can run concurrently.

The advantages of nested transactions

Additional concurrency:
  Subtransactions at one level may run concurrently with other subtransactions at the same level.
  E.g., concurrent getBalance calls in the branchTotal operation.
More robust:
  Subtransactions can commit or abort independently.

The rules for commitment of nested transactions

A transaction may commit or abort only after its child transactions have completed.
When a subtransaction completes, it makes an independent decision either to commit provisionally or to abort. Its decision to abort is final.
When a parent aborts, all of its subtransactions are aborted.

The rules for commitment of nested transactions (cont.)

When a subtransaction aborts, the parent can decide whether to abort or not.
If the top-level transaction commits, then all of the subtransactions that have provisionally committed can commit too, provided that none of their ancestors has aborted.

Methods for Concurrency Control

1. Locks
2. Optimistic concurrency control
3. Timestamp ordering

1. Locks
A simple example of a serializing mechanism is the use of exclusive locks.
The server can lock any object that is about to be used by a client.
If another client wants to access the same object, it has to wait until the object is unlocked.

Simple exclusive locks

Lock any object that is about to be used by any operation of a client's transaction.
Any other request on the same locked object is suspended until the object is unlocked.

Transaction T:                      Transaction U:
bal = b.getBalance()                bal = b.getBalance()
b.setBalance(bal*1.1)               b.setBalance(bal*1.1)
a.withdraw(bal/10)                  c.withdraw(bal/10)

Operations of T           Locks           Operations of U          Locks
openTransaction
bal = b.getBalance()      lock B
                                          openTransaction
b.setBalance(bal*1.1)                     bal = b.getBalance()     waits for T's lock on B
a.withdraw(bal/10)        lock A
closeTransaction          unlock A, B
                                          ...                      lock B
                                          b.setBalance(bal*1.1)
                                          c.withdraw(bal/10)       lock C
                                          closeTransaction         unlock B, C

Two-phase locking

To ensure serial equivalence of any two transactions, a transaction is not allowed to acquire any new locks after it has released a lock:
  Growing phase: acquire locks.
  Shrinking phase: release locks.
Strict two-phase locking:
  Any locks applied during the progress of a transaction are held until the transaction commits or aborts.
  (In fact, an exclusive lock between two reads is unnecessary; read locks can be shared.)

Two-Phase Locking

Basic 2PL:
  When a transaction releases a lock, it may not request another lock.
[Figure: the number of locks held rises during phase 1 (growing phase, obtain locks) up to the lock point, then falls during phase 2 (shrinking phase, release locks), between BEGIN and END.]

Conservative 2PL (static 2PL):
  A transaction locks all the items it accesses before it begins execution, by pre-declaring its read and write sets.

Strict Two-Phase Locking

Strict 2PL: a transaction does not release any of its locks until after it commits or aborts.
This leads to a strict schedule, which simplifies recovery.
[Figure: locks are obtained during the period of data item use and released only at transaction end, so lock holding spans the whole transaction duration.]
A minimal sketch of such a lock manager follows.
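
A minimal Java sketch of a strict-2PL lock manager with exclusive locks only (class and method names invented). A transaction acquires locks as it touches objects and releases nothing until commit or abort, when the shrinking phase happens all at once.

  import java.util.*;

  class LockManager {
      private final Map<Object, Long> owner = new HashMap<>();     // object -> TID
      private final Map<Long, Set<Object>> held = new HashMap<>(); // TID -> objects

      synchronized void lock(Object o, long tid) throws InterruptedException {
          while (owner.containsKey(o) && owner.get(o) != tid) wait(); // block until free
          owner.put(o, tid);
          held.computeIfAbsent(tid, k -> new HashSet<>()).add(o);
      }

      // Called exactly once, at commit or abort: release everything at once.
      synchronized void releaseAll(long tid) {
          for (Object o : held.getOrDefault(tid, Set.of())) owner.remove(o);
          held.remove(tid);
          notifyAll();                                               // wake up waiters
      }
  }

Note that this sketch does not detect deadlocks; the deadlock discussion below applies to it directly.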

Lock Rules
Lock granularity: as small as possible, to enhance concurrency.
Read locks / write locks:
  Before accessing an object, acquire the appropriate lock first.
Lock compatibility:
  If a transaction T has already performed a read operation on an object, then a concurrent transaction U must not write that object until T commits or aborts.
  If a transaction T has already performed a write operation on an object, then a concurrent transaction U must not read or write that object until T commits or aborts.
Lock rules (continued)

These rules prevent lost updates and inconsistent retrievals.
Promotion of a lock:
  From a read lock to a write lock.
  Promotion cannot be carried out if the read lock is shared by another transaction.

Locking rules for nested transactions

Locks acquired by a successful subtransaction are inherited by its parent and ancestors when it completes; the locks are held until the top-level transaction commits or aborts.
Parent transactions are not allowed to run concurrently with their child transactions.
Subtransactions at the same level are allowed to run concurrently.

Deadlocks

Definition:
  A state in which each member of a group of transactions is waiting for some other member to release a lock.
Prevention:
  Lock all the objects used by a transaction when it starts - not a good approach.
  Request locks on objects in a predefined order - premature locking and a reduction in concurrency.
Detection:
  Find a cycle in a wait-for graph, then select a transaction to abort to break the cycle. (The choice of the transaction to be aborted is not simple.)
Timeouts:
  Each lock is given a limited period in which it is invulnerable (safe); after the timeout the lock can be broken.
  A transaction is sometimes aborted although there is actually no deadlock.
  Choosing an appropriate timeout length is difficult.

Two schemes for increasing concurrency in locking

Two-version locking:
  The setting of exclusive locks is delayed until a transaction commits.
Hierarchic locks:
  Mixed-granularity locks are used (e.g., a Branch at one level, with its Accounts below it).

Lock compatibility for two-version locking

For one object:                   Lock to be set
Lock already set     read     write    commit
none                 OK       OK       OK
read                 OK       OK       wait
write                OK       wait     ---
commit               wait     wait     ---

Two-version locking allows one transaction to write tentative versions of objects while other transactions read from the committed versions of the same objects.
- Read operations are delayed only while transactions are being committed, rather than during their entire execution.
- Read operations can, however, delay the committing of other transactions.

Lock compatibility for hierarchic locks

For one object:                     Lock to be set
Lock already set     read     write    I-read    I-write
none                 OK       OK       OK        OK
read                 OK       wait     OK        wait
write                wait     wait     wait      wait
I-read               OK       wait     OK        OK
I-write              wait     wait     OK        OK

Drawbacks of locking:
Lock maintenance represents an overhead that is not present in systems that do not support concurrent access to shared data.
Deadlock: deadlock prevention reduces concurrency, while deadlock detection or timeouts are not wholly satisfactory for use in interactive programs.
To avoid cascading aborts, locks cannot be released until the end of the transaction, which reduces potential concurrency.

2. Optimistic Concurrency Control

Observation:
  In most applications, the probability of two transactions accessing the same object is low.
Scheme:
  No checking while the transaction is executing.
  Check for conflicts after the transaction.
  Checks are all made at once, so transaction execution overhead is low.
  Relies on there being little interference between transactions.
  Updates are not applied until closeTransaction; updates are applied to local copies in a transaction's own space.
Basic idea:
  Transactions are allowed to proceed as though there were no possibility of conflict with other transactions, until the client completes its task and issues a closeTransaction request.

Three Phases of a Transaction

Working phase:
  Each transaction has a tentative version of each of the objects that it updates.
    Initially, this is a copy of the most recently committed version.
    Reads are performed on the tentative version; written values are recorded as tentative versions.
  A read set and a write set are kept per transaction.
Validation phase:
  Check for conflicts with overlapping transactions when closeTransaction is issued.
    Success: commit.
    Failure: abort.
Update phase:
  Updates in the tentative versions are made permanent.

Validation of Transactions

Transaction numbers:
  Each transaction is assigned a transaction number (in ascending sequence) when it enters the validation phase.
  Transactions enter the validation phase according to their transaction numbers, and commit according to their transaction numbers.
  Since the validation and update phases are short, only one transaction is in them at a time.
Conflict rules:
  Validation uses the read-write conflict rules to ensure that the scheduling of a particular transaction is serially equivalent with respect to all other overlapping transactions.
  E.g., for Tv to be serializable with respect to an overlapping transaction Ti, their operations must conform to the following rules.

The validation test on transaction Tv is based on conflicts between operations in pairs of transactions Ti and Tv.

Tv       Ti       Rule
write    read     1. Ti must not read objects written by Tv.
read     write    2. Tv must not read objects written by Ti.
write    write    3. Ti must not write objects written by Tv, and Tv must not write objects written by Ti.

(Serializability of transaction Tv with respect to transaction Ti.)

Note: the validation of a transaction must ensure that rules 1 and 2 are obeyed, by testing for overlaps between the objects of each pair of transactions Tv and Ti.

Forms of Validation
1. Backward validation: checks the transaction undergoing validation against the preceding overlapping transactions - those that entered the validation phase before it.
2. Forward validation: checks the transaction undergoing validation against later transactions that are still active (i.e., lagging behind in their validation phases).

Validation forms of transactions

[Figure: transaction Tv's working, validation and update phases on a time axis; earlier committed transactions T1, T2, T3 above, later active transactions (active 1, active 2) below. Backward validation compares Tv against the earlier committed transactions; forward validation compares Tv against the later active transactions.]

1. The figure indicates the overlapping transactions considered in the validation of a transaction Tv.
2. Time increases from left to right.
3. The earlier committed transactions are T1, T2 and T3.
4. T1 committed before Tv started; T2 and T3 committed before Tv finished its working phase.
5. There are two later active transactions, which have transaction identifiers but not yet transaction numbers.

i]. Backward validation

Tests against the previously overlapping transactions:
  Rule 1 is automatically satisfied, as the read operations of earlier transactions are not affected by the write operations of the current transaction Tv.
  Rule 2 requires that the read set of Tv be compared with the write sets of T2 and T3; if there is any overlap, the validation fails.
To resolve any conflict: abort the transaction undergoing validation.
Transactions that have no read operations (only write operations) need not be checked.

Backward validation algorithm

startTn: the biggest transaction number assigned to some other committed transaction at the time transaction Tv started its working phase.
finishTn: the biggest transaction number assigned to some other committed transaction at the time Tv entered the validation phase.

  boolean valid = true;
  for (int Ti = startTn + 1; Ti <= finishTn; Ti++) {
      if (read set of Tv intersects write set of Ti)
          valid = false;
  }

Serial equivalence of all committed transactions: since backward validation ensures that Tv commits after all previously committed transactions, all transactions are committed in a serially equivalent order. A Java sketch follows.
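
A minimal Java rendering of the loop above; writeSetOf is a hypothetical lookup from a transaction number to the write set recorded when that transaction committed, and objects are identified by String names for illustration.

  import java.util.Set;
  import java.util.function.LongFunction;

  class BackwardValidation {
      static boolean validate(Set<String> readSetOfTv, long startTn, long finishTn,
                              LongFunction<Set<String>> writeSetOf) {
          for (long ti = startTn + 1; ti <= finishTn; ti++) {
              for (String object : writeSetOf.apply(ti)) {
                  if (readSetOfTv.contains(object)) {
                      return false;   // overlap found: abort Tv
                  }
              }
          }
          return true;                // no conflicts: Tv may commit
      }
  }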

ii]. Forward validation

Tests against the still-active (lagging behind) later overlapping transactions:
  Rule 1: the write set of the transaction being validated is compared with the read sets of the other overlapping active transactions (still in their working phases).
  Rule 2: automatically fulfilled, because the active transactions do not write until after Tv has completed.

Forward validation algorithm:

  boolean valid = true;
  for (int Tid = active1; Tid <= activeN; Tid++) {
      if (write set of Tv intersects read set of Tid)
          valid = false;
  }

Ways to resolve a conflict in forward validation

Suspend the validation until a later time, when the conflicting transactions have finished (some of them may abort in the meantime).
Abort all the conflicting active transactions and commit the transaction being validated.
Abort the transaction being validated (the future conflicting transactions may themselves abort, in which case this abort turns out to have been unnecessary).

Comparison of forward and backward validation

Backward validation
Overhead of comparison:
Read sets are typically bigger than write sets, so the
comparison in backward validation is heavier than that in
forward validation.
Overhead of storage:
Old write sets must be stored until they are no longer needed.

Forward validation
Overhead of time:
To validate a transaction, one must wait until all active
transactions have finished.

3: Timestamp Ordering
Basic idea:
Each transaction has a timestamp (TS) associated with it.
The TS is not necessarily real time; it can be a logical
counter.
The TS is unique for each transaction.
A new transaction has a larger TS than an older transaction.
Larger-TS transactions wait for smaller-TS transactions, and
smaller-TS transactions die and restart when confronting
larger-TS transactions.
No deadlock can occur.

Basic timestamp ordering

Rule:
A transaction's request to write an object is valid only if that
object was last read and written by earlier transactions.
A transaction's request to read an object is valid only if that
object was last written by an earlier transaction.

Operation conflicts for timestamp ordering

Rule  Tc     Ti     Condition
1.    write  read   Tc must not write an object that has been
                    read by any Ti where Ti > Tc; this requires
                    that Tc >= the maximum read timestamp of the
                    object.
2.    write  write  Tc must not write an object that has been
                    written by any Ti where Ti > Tc; this
                    requires that Tc > the write timestamp of the
                    committed object.
3.    read   write  Tc must not read an object that has been
                    written by any Ti where Ti > Tc; this
                    requires that Tc > the write timestamp of the
                    committed object.

Timestamp ordering write rule:

Let D be an object and Tc a transaction requesting a write
operation. Based on rules 1 and 2:

if (Tc >= maximum read timestamp on D &&
    Tc > write timestamp on committed version of D)
    perform write operation on tentative version of D with
    write timestamp Tc
else
    /* write is too late */
    abort transaction Tc

(A combined code sketch of the write and read rules is given
after the read rule below.)

Write operations and timestamps

Figure: four cases (a)-(d) of a T3 write against committed and
tentative versions with write timestamps T1 < T2 < T3 < T4. In
cases (a)-(c) the write is performed on a tentative version with
write timestamp T3; in case (d), where a committed version with
a later timestamp (T4) already exists, transaction T3 aborts.

Timestamp ordering read rule:

A decision is made to accept, to wait, or to reject a read
operation requested by Tc on the object D. Based on rule 3:

if (Tc > write timestamp on committed version of D) {
    let Dselected be the version of D with the maximum
    write timestamp <= Tc
    if (Dselected is committed)
        perform read operation on the version Dselected
    else
        wait until the transaction that made version Dselected
        commits or aborts, then reapply the read rule
} else
    abort transaction Tc
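A compact Java sketch of the two rules, under the simplifying
assumptions that each object holds one committed version plus a
map of tentative versions keyed by timestamp, and that waiting
is elided (the TsObject class is illustrative, not from the
slides):

    class TsObject {
        int maxReadTs = 0;                  // largest read timestamp seen
        int committedWriteTs = 0;           // timestamp of committed version
        String committedValue = "";         // value of committed version
        java.util.TreeMap<Integer, String> tentative = new java.util.TreeMap<>();

        /** Write rule: returns false if the write is too late (abort Tc). */
        boolean write(int tc, String value) {
            if (tc >= maxReadTs && tc > committedWriteTs) {
                tentative.put(tc, value);   // write a tentative version
                return true;
            }
            return false;                   // too late: abort Tc
        }

        /** Read rule: returns the selected value, or null meaning "abort Tc".
            A full server would make Tc wait when the selected version is
            tentative; this sketch just returns it. */
        String read(int tc) {
            if (tc <= committedWriteTs) return null;           // abort Tc
            java.util.SortedMap<Integer, String> upTo = tentative.headMap(tc + 1);
            maxReadTs = Math.max(maxReadTs, tc);
            if (upTo.isEmpty()) return committedValue;         // committed selected
            return upTo.get(upTo.lastKey());                   // tentative selected
        }
    }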

Read operations and timestamps

Figure: four cases (a)-(d) of a T3 read against committed and
tentative versions with write timestamps T1 < T2 < T3 < T4. In
(a) and (b) the read proceeds on the selected version; in (c)
the read waits for the selected tentative version to commit or
abort; in (d) the read arrives too late and the transaction
aborts.

Multiversion Timestamp Ordering

Basic idea:
A list of old committed versions, as well as tentative versions,
is kept for each object.
Read operations that arrive too late need not be rejected.
The server directs the read operation to the most recent
suitable version of the object.

Comparative study of the 3 methods

2PL is the most popular choice because it is simple.
2PL is a pessimistic protocol: it achieves serializability by
restricting the operations in a transaction.
Timestamp ordering is less pessimistic and allows operations to
execute more freely.
Optimistic concurrency control ignores conflicts during
execution, but requires a fairly elaborate validation.

Timestamp ordering vs. two-phase locking

Timestamp ordering
Decides the serialization order statically.
Better than locking for read-dominated transactions.

Two-phase locking
Decides the serialization order dynamically.
Better than timestamp ordering for update-dominated
transactions.

Both are pessimistic methods.

Pessimistic methods vs. optimistic methods

Optimistic methods
Efficient when there are few conflicts.
A substantial amount of work may have to be repeated when a
transaction is aborted.

Pessimistic methods
Less concurrency, but simpler relative to optimistic methods.

Question Bank 4
Explain the concepts of concurrency control and recoverability
from abort used in transactions.
Discuss the locking mechanism for concurrency control.
Write short notes on:
(i) Nested transactions
(ii) Timestamp ordering
What is the purpose of validation of transactions? Explain the
various forms of validation with suitable examples.
Illustrate a comparative study of the three methods of
concurrency control with a suitable example.

Unit IV: Chapter 2

Introduction

In the previous chapter, we discussed transactions that accessed
objects at a single server. In the general case, a transaction
will access objects located on different computers,
communicating with the remote objects at each server.
A distributed transaction accesses objects managed by multiple
servers.
The atomicity property requires that either all of the servers
involved in the same transaction commit the transaction or all
of them abort it. Agreement among the servers is necessary.
Transaction recovery ensures that all objects are recoverable:
the values of the objects reflect all changes made by committed
transactions and none of those made by aborted ones.

Transactions May Need More than One Server

Begin transaction BookTrip
    book a plane from Nagpur
    book a hotel in Shimla
    book a rental car in New Delhi
End transaction BookTrip

The two-phase commit protocol is a classic solution.

Focus of this Chapter

Distributed transaction:
A flat or nested transaction that accesses objects managed by
multiple servers.

Atomicity of transactions:
All or nothing for all involved servers.
Two-phase commit.

Concurrency control:
Serialize locally + serialize globally.
The 3 concurrency control methods applied to distributed
transactions.

Distributed transactions

(1) Flat transaction

A flat transaction sends out requests to different servers, and
each request is completed before the client goes on to the next
one (the BookTrip example).
In the figure, transaction T is a flat transaction that invokes
operations on objects in servers X, Y and Z.
A flat transaction accesses the servers' objects sequentially.
When servers use locking, a transaction can only be waiting for
one object at a time.

Distributed transactions

(2) Nested transaction

Here the top-level transaction can open sub-transactions, and
each sub-transaction can open further sub-transactions down to
any depth of nesting (a parent-child relationship).
Each child starts after its parent and finishes before it.
A nested transaction allows sub-transactions at the same level
to execute concurrently.
In the figure, T1 and T2 are concurrent, as they invoke objects
in different servers; likewise the four sub-transactions T11,
T12, T21 and T22 can run in parallel.

Nested Banking Transaction

Figure: the client's transaction T opens four sub-transactions
on accounts a and b (at servers X and Y) and c and d (at server
Z):

T = openTransaction
    openSubTransaction
        a.withdraw(10);
    openSubTransaction
        b.withdraw(20);
    openSubTransaction
        c.deposit(10);
    openSubTransaction
        d.deposit(20);
closeTransaction

Note: if this transaction is structured as a set of four nested
transactions, the four requests (two deposits and two
withdrawals) can run in parallel, and the overall effect can be
achieved with better performance than in a simple transaction
where the four operations are invoked sequentially.

The architecture of distributed transactions

The coordinator (in any one server):
Accepts the client's requests.
Coordinates behaviour on the different servers.
Sends the result to the client.
Records a list of references to the participants.

The participant (in every server):
Manages the objects accessed by a transaction.
Keeps track of all recoverable objects at its server.
Cooperates with the coordinator.

Coordination in Distributed Transactions

Each server has a special participant process. The coordinator
process (the leader) resides in one of the servers and talks to
the transaction and to the participants.

Figure: the coordinator and the participants at servers X, Y and
C, each participant issuing join to the coordinator.
The coordination process: (1) the client calls openTransaction
and receives a TID from the coordinator (closeTransaction and
abortTransaction are also directed to it); (2) the client
invokes a.method(TID) on a participant; (3) the participant
calls join(TID, ref) on the coordinator.

A distributed (flat) banking transaction

T = openTransaction
    a.withdraw(4);
    c.deposit(4);
    b.withdraw(3);
    d.deposit(3);
closeTransaction

Figure: the client's transaction T spans participant A at
BranchX, participant B at BranchY and participant C at BranchZ;
each participant joins the coordinator, which resides in one of
the servers, e.g. BranchX, where openTransaction and
closeTransaction are issued.

Note: when the client invokes an operation such as
b.withdraw(3), B will inform the participant at BranchY to join
the coordinator.

Working of the Coordinator
Servers for a distributed transaction need to coordinate their
actions.
A client starts a transaction by sending an openTransaction
request to a coordinator. The coordinator returns the TID to the
client. The TID must be unique (e.g. the server's IP address
plus a number unique to that server).
The coordinator is responsible for committing or aborting the
transaction.
Each other server in the transaction is a participant.
Participants are responsible for cooperating with the
coordinator in carrying out the commit protocol, and each keeps
track of all recoverable objects managed by it.
Each coordinator has a set of references to the participants.
Each participant records a reference to the coordinator.

Interface for the Coordinator

openTransaction() -> trans;
Starts a new transaction and delivers a unique TID trans. The
TID contains two parts: the identifier (say the IP address) of
the server which created it, and a number unique to that server.
The identifier is used in the other operations of the
transaction.

join(trans, reference to participant);   /* additional method */
Informs the coordinator that a new participant has joined the
transaction trans.

closeTransaction(trans) -> (commit, abort);
Ends a transaction: a commit return value indicates that the
transaction has committed; an abort return value indicates that
it has aborted.

abortTransaction(trans);
Aborts the transaction.
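These operations map naturally onto a remote interface. Below is
a minimal, hypothetical Java sketch (the names TransID and
Participant and the use of RMI are illustrative assumptions, not
the textbook's API):

    import java.rmi.Remote;
    import java.rmi.RemoteException;

    // Hypothetical transaction identifier: server id + server-unique number.
    final class TransID implements java.io.Serializable {
        final String serverId; final long seq;
        TransID(String serverId, long seq) { this.serverId = serverId; this.seq = seq; }
    }

    interface Participant extends Remote { /* participant operations */ }

    // Sketch of the coordinator's interface from this slide.
    interface Coordinator extends Remote {
        TransID openTransaction() throws RemoteException;
        void join(TransID trans, Participant p) throws RemoteException;
        boolean closeTransaction(TransID trans) throws RemoteException; // true = commit
        void abortTransaction(TransID trans) throws RemoteException;
    }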

Atomic Commit Protocols

Atomic commitment:
When a distributed transaction comes to an end, either all or
none of its operations are carried out.
Due to atomicity, if one part of a transaction is aborted, then
the whole transaction must also be aborted.
The coordinator has the responsibility to either commit or abort
the transaction.
One-phase atomic commit protocol.
Two-phase atomic commit protocol.

One-phase atomic commit protocol

The protocol:
The client requests to end a transaction.
The coordinator communicates the commit or abort request to all
of the participants and keeps on repeating the request until all
of them have acknowledged that they have carried it out.

The problem:
Some servers may commit while others abort.
How to deal with the situation where some servers decide to
abort?

Go for the two-phase atomic commit protocol.

Introduction to the two-phase commit protocol

Allows any participant to abort.
First phase:
Each participant votes to commit or abort.

Second phase:
All participants reach the same decision.
If any one participant votes to abort, then all abort.
If all participants vote to commit, then all commit.

The challenge:
Work correctly when failures happen (the failure model).

The two-phase commit protocol (working)

When the client requests to abort:
The coordinator informs all participants to abort.

When the client requests to commit:
First phase:
The coordinator asks all participants whether they are prepared
to commit.
If a participant is prepared to commit, it saves to permanent
storage all of the objects that it has altered in the
transaction and replies Yes; otherwise, it replies No.

Second phase:
The coordinator tells all participants to commit (or abort).

Operations for the two-phase commit protocol

Three participant interface methods:
canCommit?(trans) -> Yes / No
Call from coordinator to participant to ask whether it can
commit a transaction. The participant replies with its vote.
doCommit(trans)
Call from coordinator to participant to tell the participant to
commit its part of a transaction.
doAbort(trans)
Call from coordinator to participant to tell the participant to
abort its part of a transaction.

Two coordinator interface methods:
haveCommitted(trans, participant)
Call from participant to coordinator to confirm that it has
committed the transaction.
getDecision(trans) -> Yes / No
Call from participant to coordinator to ask for the decision on
a transaction after it has voted Yes but has still had no reply
after some delay. Used to recover from server crashes or delayed
messages.
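Again as a hedged sketch, the five operations could be written
as two Java interfaces (reusing the hypothetical TransID type
from the earlier sketch; all names are illustrative):

    // Votes exchanged in the first phase.
    enum Vote { YES, NO }

    // Participant side of 2PC.
    interface TwoPCParticipant {
        Vote canCommit(TransID trans);   // phase 1: cast the vote
        void doCommit(TransID trans);    // phase 2: commit this part
        void doAbort(TransID trans);     // phase 2: abort this part
    }

    // Coordinator side of 2PC.
    interface TwoPCCoordinator {
        void haveCommitted(TransID trans, TwoPCParticipant p); // confirmation
        Vote getDecision(TransID trans); // recovery for uncertain participants
    }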

The two-phase commit protocol

Phase 1 (voting phase):
1. The coordinator sends a canCommit? request to each of the
participants in the transaction.
2. When a participant receives a canCommit? request, it replies
with its vote (Yes or No) to the coordinator. Before voting Yes,
it prepares to commit by saving objects in permanent storage. If
the vote is No, the participant aborts immediately.

The two-phase commit protocol

Phase 2 (completion according to the outcome of the vote):
3. The coordinator collects the votes (including its own).
(a) If there are no failures and all the votes are Yes, the
coordinator decides to commit the transaction and sends a
doCommit request to each of the participants.
(b) Otherwise, the coordinator decides to abort the transaction
and sends doAbort requests to all participants that voted Yes.
4. Participants that voted Yes wait for a doCommit or doAbort
request from the coordinator.
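Putting the two phases together, a simplified failure-free
coordinator loop might look as follows (a sketch building on the
hypothetical interfaces above; real implementations add logging,
timeouts and crash recovery):

    import java.util.List;

    class SimpleCoordinator {
        /** Runs both 2PC phases; returns true if the transaction commits. */
        boolean complete(TransID trans, List<TwoPCParticipant> participants) {
            // Phase 1: collect the votes.
            boolean allYes = true;
            for (TwoPCParticipant p : participants) {
                if (p.canCommit(trans) == Vote.NO) { allYes = false; break; }
            }
            // Phase 2: broadcast the decision.
            // (Strictly, doAbort need only be sent to participants that voted Yes.)
            for (TwoPCParticipant p : participants) {
                if (allYes) p.doCommit(trans); else p.doAbort(trans);
            }
            return allYes;
        }
    }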

Communication in the two-phase commit protocol

Coordinator                               Participant
step 1: prepared to commit
        (waiting for votes)
            -- canCommit? -->
                                          step 2: prepared to commit
                                                  (uncertain)
            <-- Yes --
step 3: committed
            -- doCommit -->
                                          step 4: committed
            <-- haveCommitted --
done

Timeout actions in the two-phase commit protocol

New processes mask crash failures:
Crashed coordinator and participant processes are replaced by
new processes.
Timeouts for the participant:
Timeout while waiting for canCommit?: abort.
Timeout while waiting for doCommit (uncertain status): keep the
updates in permanent storage and send a getDecision request to
the coordinator.
Timeouts for the coordinator:
Timeout while waiting for the vote results: abort.
Timeout while waiting for haveCommitted: do nothing; the
protocol works correctly without the confirmation.

Performance of the two-phase commit protocol

Provided that all servers and communication channels do not
fail, with N participants:
N canCommit? messages and N replies, followed by N doCommit
messages.
The cost in messages is proportional to 3N.
The cost in time is three rounds of messages.
The haveCommitted messages are not counted; the protocol
functions correctly without them, as their role is only to let
servers delete stale coordinator information.

Failure of the Coordinator
When a participant has voted Yes and is waiting for the
coordinator to report on the outcome of the vote, the
participant is in an uncertain stage. If the coordinator has
failed, the participant will not be able to get the decision
until the coordinator is replaced, which can result in extensive
delays for participants in the uncertain state.
One alternative strategy is to allow the participants to obtain
the decision from other participants instead of contacting the
coordinator. However, if all participants are in the uncertain
state, they will not get a decision.

Two-phase commit protocol for nested transactions

Structure:
A top-level transaction; sub-transactions at any depth; parent
and child relationships.

Nested transaction semantics:
A sub-transaction completes and makes an independent decision:
either to commit provisionally (a local decision) or to abort.

Parent transaction:
Abort: all its sub-transactions abort.
Commit: exclude the aborting sub-transactions.

A two-phase commit protocol is needed for nested transactions:
it allows servers of provisionally committed transactions that
have crashed to abort them when they recover.

Distributed nested transactions

Each (sub)transaction has a coordinator, which offers an
interface with two operations:

openSubTransaction(trans) -> subTrans
Opens a sub-transaction whose parent is trans and returns a
unique sub-transaction identifier (an extension of its parent's
TID).
getStatus(trans) -> committed, aborted, provisional
Asks the coordinator to report on the status of the transaction
trans. The return value is one of: committed, aborted,
provisional.

Each sub-transaction starts after its parent starts and finishes
before its parent finishes.
When a sub-transaction completes, its provisionally committed
updates are not saved in permanent storage.

Working of 2PC for nested transactions

The two operations above form the interface of the coordinator
of a sub-transaction: it allows further sub-transactions to be
opened, and it allows its sub-transactions to enquire about its
status.
The client starts by using openTransaction to open a top-level
transaction. This returns a TID for the top-level transaction.
The TID can be used to open a sub-transaction; the
sub-transaction automatically joins its parent, and a TID for it
is returned.
The client finishes a set of nested transactions by calling
closeTransaction or abortTransaction on the top-level
transaction.

Example: 2PC in Nested Transactions

Figure: a nested distributed transaction in which the top-level
transaction T has children T1 and T2; T1 has children T11
(aborted) and T12 (provisionally committed); T2 (aborted) has
children T21 and T22 (both provisionally committed). The commit
decision propagates bottom-up in 2PC: provisionally committed
sub-transactions vote Yes, aborted branches vote No.

An Example of a Nested Transaction

Server status:
T1  : provisional commit (at X)
T11 : abort (at M)
T12 : provisional commit (at N)
T2  : aborted (at Y)
T21 : provisional commit (at N)
T22 : provisional commit (at P)

Client status:
Transaction T decides whether to commit.

Information Held by Coordinators of Nested Transactions

Coordinator of  Child            Participant          Provisional    Abort list
transaction     subtransactions                       commit list
T               T1, T2           yes                  T1, T12        T11, T2
T1              T11, T12         yes                  T1, T12        T11
T2              T21, T22         no (aborted)                        T2
T11                              no (aborted)                        T11
T12, T21                         T12 but not T21      T21, T12
T22                              no (parent aborted)  T22

When each sub-transaction was created, it joined its parent
sub-transaction.
The coordinator of each parent sub-transaction has a list of its
child sub-transactions.
When a nested transaction provisionally commits, it reports its
status and the status of its descendants to its parent.
When a nested transaction aborts, it reports the abort without
giving any information about its descendants.
The top-level transaction receives a list of all
sub-transactions, together with their statuses.

Execution of the Two Phases

The 2PC steps (coordinator: step 1 prepared to commit, waiting
for votes; step 3 committed; participant: step 2 prepared to
commit, uncertain; step 4 committed) are conducted on the
participants of T, T1 and T12 in the example.

Note:
Phase I (steps 1 and 2) performs canCommit? in either a
hierarchic manner or a flat manner.
The second phase of the two-phase commit protocol is the same as
for the non-nested case, i.e.:
The coordinator collects the votes and issues doCommit or
doAbort (step 3).
Participants make the haveCommitted call in case of commit
(step 4).

Hierarchic Two-Phase Commit Protocol for Nested Transactions

canCommit?(trans, subTrans) -> Yes / No
Call from a coordinator to the coordinator of a child
sub-transaction, asking whether it can commit the
sub-transaction subTrans. The first argument, trans, is the
transaction identifier of the top-level transaction. The
participant replies with its vote Yes / No.

The coordinator of the top-level transaction sends canCommit? to
the coordinators of its immediate child sub-transactions. The
latter, in turn, pass it on to the coordinators of their child
sub-transactions.
Each participant collects the replies from its descendants
before replying to its parent.
In the example, T sends canCommit? messages to T1 (but not to
T2, which has aborted); T1 sends canCommit? messages to T12 (but
not to T11).
If a coordinator finds no sub-transaction matching the second
parameter, then it must have crashed, so it replies No.

Flat Two-Phase Commit Protocol for Nested Transactions

canCommit?(trans, abortList) -> Yes / No
Call from the coordinator to a participant, asking whether it
can commit a transaction. The participant replies with its vote
Yes / No.

The coordinator of the top-level transaction sends canCommit?
messages to the coordinators of all sub-transactions in the
provisional commit list (e.g., T1 and T12).
If the participant has any provisionally committed
sub-transactions that are descendants of the transaction with
TID trans:
Check that they do not have any aborted ancestors in the
abortList, then prepare to commit.
Those with aborted ancestors are aborted.
Send a Yes vote to the coordinator, naming the good
sub-transactions.
If the participant does not have a provisionally committed
descendant, it must have failed after it performed a provisional
commit. It sends a No vote to the coordinator.

Time-out actions in nested 2PC

With nested transactions, delays can occur in the same three
places as before:
when a participant is prepared to commit;
when a participant has finished but has not yet received
canCommit?;
when a coordinator is waiting for votes.

A fourth place:
Provisionally committed sub-transactions of aborted
sub-transactions, e.g. T22, whose parent T2 has aborted:
use getStatus on the parent, whose coordinator should remain
active for a while;
if the parent does not reply, then abort.

Concurrency Control in Distributed Transactions

Concurrency control for distributed transactions: each server
applies local concurrency control to its own objects, which
ensures that transactions are serializable locally.
However, the members of a collection of servers of a distributed
transaction are jointly responsible for ensuring that the
transactions are performed in a serially equivalent manner.
Thus global serializability is required.

Methods of concurrency control for distributed transactions
1. Locking
2. Timestamp ordering
3. Optimistic concurrency control

1. Locking
Each participant sets locks on its objects locally, using the
strict two-phase locking scheme.
The lock manager at each server decides whether to grant a lock
or make the requesting transaction wait.

Atomic commit protocol:
A server cannot release any locks until it knows that the
transaction has been committed or aborted at all the servers.
Note: lock managers in different servers set their locks
independently of one another. It is possible that different
servers may impose different orderings on transactions.

Locking
T                                   U
Write(A) at X   locks A
                                    Write(B) at Y   locks B
Read(B) at Y    waits for U
                                    Read(A) at X    waits for T
***************************************************
T is before U at server X, and U is before T at server Y. These
different orderings can lead to cyclic dependencies between
transactions, and a distributed deadlock situation arises.

2. Timestamp ordering concurrency control

A globally unique transaction timestamp is issued to the client
by the first coordinator accessed by the transaction.
The transaction timestamp is passed to the coordinator at each
server.
Each server accesses shared objects according to the timestamp.
Resolution of a conflict: an operation that arrives too late for
the timestamp ordering causes its transaction to be aborted, as
in the single-server case.

3. Optimistic concurrency control

The validation takes place during the first phase of the
two-phase commit protocol.
Commitment deadlock example:

T                          U
Read(A)  at X              Read(B)  at Y
Write(A)                   Write(B)
Read(B)  at Y              Read(A)  at X
Write(B)                   Write(A)

Optimistic concurrency control

Parallel validation (Kung & Robinson) is suitable for
distributed transactions:
Write-write conflicts must be checked, as well as write-read
conflicts, for backward validation.
The validation order may differ between servers:
Measure 1: perform a global validation check after the
individual servers' validations, confirming that the combined
ordering is serializable.
Measure 2: each server validates according to a globally unique
transaction ordering.

Question Bank 5
Describe flat and nested distributed transactions. How are they
utilized in a distributed banking transaction?
How can a transaction be completed in an atomic manner? Explain
in detail the working of the two-phase commit protocol.
Discuss in detail the concept of the two-phase commit protocol
for nested transactions.
How can you achieve concurrency control in distributed
transactions?

Unit V: Resource Security and Protection

Chapter 1: Access and Flow Control
Introduction
The Access Matrix Model
Implementation of the Access Matrix Model (3 methods)
Safety in the Access Matrix Model
Advanced Models of Protection (3 models)

Chapter 2: Data Security
Introduction
Modern Cryptography:
Private Key Cryptography
Public Key Cryptography

Chapter 1: Access and Flow Control

Introduction

Deals with the control of unauthorized use of software and
hardware.
Business applications such as banking require high security and
protection during any transaction.
Security techniques should not only prevent the misuse of secret
information but also its destruction.

Basics
Potential security violations [by Anderson]:
1. Unauthorized information release: an unauthorized person is
able to read information, or makes unauthorized use of a
computer program.
2. Unauthorized information modification: an unauthorized person
is able to modify information, e.g. changing the grade of a
university student, or changing account balances in bank
databases.
3. Unauthorized denial of service: an unauthorized person should
not succeed in preventing an authorized person from accessing
the information.

External vs Internal Security

1. External security:
Also called physical security.
Deals with regulating access to the location of computer systems
(e.g. hardware, disks, tapes).
Can be enforced by placing a guard at the door, or by giving a
secret key to authorized persons.
The issues to be dealt with are administrative.

2. Internal security:
Deals with the use of computer hardware and software, and the
information stored in computer systems.
Requires authentication (e.g. logins).

Policies and Mechanisms

Policy - what should be done?
1. A policy gives the assignment of access rights over various
resources to users.
2. Policies decide which user has access to what resources.
3. Policies can change with time and application.

Mechanism - how should it be done?
1. A protection mechanism provides a set of tools that can be
used to design or specify a wide array of protection policies.
2. The protection mechanism in an OS controls user access to
system resources.
3. The protection scheme must be amenable to a wide variety of
policies.
4. Protection is a mechanism; security is a policy.

Separation of policies and mechanisms enhances design
flexibility.

Protection Domain of a Process
Specifies the resources that a process can access and the types
of operation that the process can perform on those resources.
Required for enforcing security.
Allows the process to use only those resources that it requires.
Every process executes in its protection domain, and the
protection domain is switched appropriately whenever control
jumps from process to process.
Advantage:
Eliminates the possibility of a process violating security
maliciously or unintentionally, and increases accountability.

Design Principles for a Secure System
[by Saltzer & Schroeder]
1. Economy: the protection mechanism should be economical to
develop and use; it should not add high extra costs to the
system.
2. Complete mediation: requires that every request to access an
object be checked for the authority to do so.
3. Open design: a protection mechanism should work even if its
underlying principles are known to the attacker.
4. Separation of privileges: the protection mechanism requires
two keys to unlock a lock.

Design Principles (cont.)

5. Least privilege: a subject should be given the bare minimum
rights needed for completion of its task.
6. Least common mechanism: the portion common to more than one
user should be minimized. (Coupling among users represents a
potential information path between users and hence a potential
threat to their security.)
7. Acceptability: the protection mechanism must be simple to
use.
8. Fail-safe defaults: the default case should mean lack of
access.

Access Matrix Model

Model proposed by Lampson; enhanced and refined further by
Graham, Denning and Harrison.
A protection system consists of mechanisms to control user
access to various resources, or to control information flow.
Basic concepts:
Objects: the protected entities, O.
Subjects: the active entities acting on the objects, S.
Rights: the controlled operations subjects can perform on
objects, R.

Access Matrix Model

3 components:
1. Current objects: a finite set (O) of entities to which access
is to be controlled (e.g. files).
2. Current subjects: a finite set (S) of entities that access
current objects, e.g. processes or users. Subjects themselves
can be treated as objects and can be accessed like an object by
other subjects.
3. Generic rights: a finite set of generic rights
R = {r1, r2, ..., rm} gives the various access rights that
subjects can have to objects, e.g. read, write, execute, own,
delete.

Access Matrix Model (cont.)

Protection state of a system: represented by a triplet
(S, O, P), where S is the set of current subjects, O the set of
current objects, and P the access matrix.
Note: the access matrix has a row for every current subject and
a column for every current object.

Access Matrix Model (cont.)

The matrix entry P[s, o] is a subset of the generic rights R; it
denotes the access rights which subject s has to object o.

Access Matrix Representing a Protection State

      O1           O2           O3 (S1)    O4 (S2)        O5 (S3)
S1    read, write  own, delete  own        sendmail       recmail
S2    execute      copy         recmail    own            block, wakeup
S3    own          read, write  sendmail   block, wakeup  own

Access: A Schematic View

A user requests access operations on objects/resources.
The reference monitor checks the validity of the request and
either grants or denies access.

Access Request -> Reference Monitor -> Grant / Deny

Access Matrix Model (cont.)

Enforcing a security policy:
1. A security policy is enforced by validating every user access
for the appropriate access rights.
2. Every object has a monitor that validates all accesses to
that object in the following manner:
(i) A subject s requests an access α to object o.
(ii) The protection system presents the triplet (s, α, o) to the
monitor of o.
(iii) The monitor looks into the access rights of s to o. If α
belongs to the subset P[s, o], then the access is permitted;
else it is denied.

Implementation of the Access Matrix Model
Three implementations of the access matrix model:
1. Capability-based
2. Access control list
3. Lock-key method

Capabilities

The capability-based method corresponds to the row-wise
decomposition of the access matrix.
Each subject s is assigned a list of tuples (o, P[s, o]) for all
objects o that it is allowed to access. These tuples are known
as capabilities.
Typical view of a capability:

Object Descriptor | Access Rights (read, write, execute, etc.)

A capability has two fields: the object descriptor, an
identifier for the object, and the allowed access rights for
that object.

Capability Lists

For an access matrix in which s1 has right r1 on O1 and r2 on
O3, s2 has r3 on O2 and r4 on O3, and s3 has r5 on O1, grouping
the rights by subject gives the capability lists:

s1: (r1, O1), (r2, O3)
s2: (r3, O2), (r4, O3)
s3: (r5, O1)

Capabilities (cont.)
Possession of a capability is treated as evidence that the user
has the authority to access the object in the ways specified in
the capability.
At any point of time, a subject is authorized to access only
those objects for which it has capabilities.

Capability-Based Addressing

1. Capabilities can be used as an addressing mechanism by the
system, using the object descriptor.
2. The main advantage of using capabilities as an addressing
mechanism is that they provide an address that is context
independent (an absolute address).
3. However, the system must allow the embedding of capabilities
in user programs and data structures.

Capability-Based Addressing (cont.)

Figure: a user request to access a word within an object. An
address in a program consists of a capability id (which object
is to be accessed in main memory) and an offset (the relative
location of the word within the object). The capability id is
used to locate the capability in the user's capability list,
which yields the access rights and an object descriptor; the
object descriptor indexes the object table entry holding the
object's base address and length, to which the offset is
applied.

Capability-Based Addressing (cont.)

A user program issues a request to access a word within an
object.
The address contains the capability id of the object and an
offset within the object.
The system uses the capability id to search the capability list
of the user to locate the capability, which contains the allowed
access rights and an object descriptor.
The system checks the access rights.
The object descriptor is used to search the object table to
locate the entry for the object.
The object entry contains the base address of the object in main
memory.
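To make the lookup sequence concrete, here is a small Java
sketch of the translation path; all class names and structures
(Capability, ObjectEntry, the two maps) are invented for
illustration:

    import java.util.*;

    class Capability {
        final int objectDescriptor;          // identifies the object
        final Set<String> rights;            // allowed access rights
        Capability(int d, Set<String> r) { objectDescriptor = d; rights = r; }
    }

    class ObjectEntry { final long base; final long length;
        ObjectEntry(long b, long l) { base = b; length = l; } }

    class CapabilityAddressing {
        Map<Integer, Capability> capabilityList = new HashMap<>(); // per user
        Map<Integer, ObjectEntry> objectTable = new HashMap<>();   // system-wide

        /** Translates (capabilityId, offset) to a physical address,
            checking the requested right; returns -1 on denial. */
        long translate(int capabilityId, long offset, String right) {
            Capability c = capabilityList.get(capabilityId);
            if (c == null || !c.rights.contains(right)) return -1; // access denied
            ObjectEntry e = objectTable.get(c.objectDescriptor);
            if (e == null || offset >= e.length) return -1;        // bad offset
            return e.base + offset;                                // physical address
        }
    }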

Capability-Based Addressing (cont.)

Two salient features:
1. Relocatability: an object can be relocated anywhere within
main memory without changing the capability.
2. Sharing: several programs can share the same object, with
different names for the same object.
Implementation considerations:
1. To maintain forgery-free capabilities, a user should not be
able to access (read, modify or construct) a capability.
2. Two ways of implementation:
(i) the tagged approach, (ii) the partitioned approach.

1. Tagged approach

One or more bits are attached to each memory location and every
processor register.
The tag indicates whether a memory word or register contains a
capability.
If the tag is ON, the information is a capability; otherwise it
is ordinary data.
When the tag is ON, the user cannot manipulate the word.
Examples: the Burroughs B6700 and the Rice Research Computer.

2. Partitioned approach

Capabilities and ordinary data are partitioned (stored
separately).
Every object has two segments: one for data, the other for
capabilities.
The processor has two sets of registers: one for data, the other
for capabilities.
Users cannot manipulate the segments and registers storing
capabilities.
Examples: the Chicago Magic Number Machine and the Plessey
System.

Advantages and Drawbacks of Capabilities
Advantages:
1. Efficient: validity can be easily tested.
2. Simple: due to the natural correspondence between the
structural properties of capabilities and the semantic
properties of addressing variables.
3. Flexible: the user can decide which of his addresses contain
capabilities.

Disadvantages:
1. Control of propagation:
A copy of a capability can be passed from one subject to another
subject without the knowledge of the first subject.
2. Review:
Determining all the subjects accessing one object is difficult.
3. Revocation of access rights:
Requires destroying the object, which prevents all the undesired
subjects from accessing it.
4. Garbage collection:
When all capabilities for an object disappear from the system,
the object is left inaccessible to users and becomes garbage.

II. Access Control List Method

Column-wise decomposition of the access matrix.
Each object o is assigned a list of pairs (s, P[s, o]) for all
subjects s that are allowed to access the object. P[s, o]
denotes the access rights that subject s has to o.
When a subject s requests access α to object o, the request is
executed in the following manner:
1. The system searches the access control list of o to find out
whether an entry (s, φ) exists for subject s.
2. If it exists, the system checks whether the access is
permitted (α belongs to φ).
3. If yes, the access is granted; otherwise an exception is
raised.
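A minimal Java sketch of this check; representing the ACL as a
map from subject to rights set is an assumption made for
illustration:

    import java.util.*;

    class AclProtectedObject {
        // Access control list: subject name -> set of permitted rights.
        private final Map<String, Set<String>> acl = new HashMap<>();

        void addEntry(String subject, Set<String> rights) {
            acl.put(subject, rights);
        }

        /** Step 1: find the entry for s; step 2: test the requested right. */
        boolean requestAccess(String subject, String right) {
            Set<String> phi = acl.get(subject);
            if (phi == null) return false;       // no entry: deny / raise exception
            return phi.contains(right);          // granted only if right is in phi
        }
    }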

Access Control Lists

For the same access matrix as before, grouping the rights by
object gives the access control lists:

O1: (s1, r1), (s3, r5)
O2: (s2, r3)
O3: (s1, r2), (s2, r4)

Schematic of an Access Control List

Subjects   Access Rights
Smith      read, write, execute
Jones      read
Lee        write
Grant      execute

The execution efficiency of the access control list method is
poor, because an access control list must be searched for every
access to a protected object.

Access Control List Method (cont.)

Main features:
1. Easy revocation: revocation of access rights is simple, fast
and efficient. It can be achieved simply by removing the
subject's entry from the object's access control list.
2. Easy review of access: it can easily be determined which
subjects have access rights to an object.
Implementation considerations:
1. Efficiency of execution: since the access control list needs
to be searched for every access to a protected object, it can be
very slow. (Can be mitigated using shadow registers.)
2. Efficiency of storage: the lists may require a huge amount of
storage. (Can be mitigated using protection groups.)

Lock and Key Method

Subjects possess a set of keys: capability entries of the form
(O, k).
Objects are associated with a set of locks: lock entries of the
form (l, y), e.g. (k, {r1, r2, ...}).

Lock-Key Method

A hybrid of the capability-based method and the access control
list method.
Every subject has a capability list that contains tuples of the
form (O, k), indicating that the subject can access object O
using key k.
Every object has an access control list that contains tuples of
the form (l, y), called lock entries. A lock entry indicates
that any subject which holds key l can access this object in the
modes contained in y.
When a subject makes a request to access object o in mode α, the
request is executed in the following manner:
1. The system locates the tuple (o, k) in the capability list of
the subject. If no such tuple is found, access is not permitted.
2. Otherwise, access is permitted only if there exists a lock
entry (l, y) in the access control list of object o such that
k = l and α belongs to y.
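Combining the two earlier sketches, a hedged Java illustration
of the lock-key check (all structures are again invented for the
example):

    import java.util.*;

    class LockKeySystem {
        // Per-subject capability list: object name -> key.
        Map<String, Map<String, Long>> capabilities = new HashMap<>();
        // Per-object lock list: lock value -> permitted modes.
        Map<String, Map<Long, Set<String>>> locks = new HashMap<>();

        /** Access object o in mode alpha on behalf of subject s. */
        boolean access(String s, String o, String alpha) {
            Map<String, Long> capList = capabilities.getOrDefault(s, Map.of());
            Long k = capList.get(o);                  // step 1: find (o, k)
            if (k == null) return false;              // no capability: deny
            Map<Long, Set<String>> lockList = locks.getOrDefault(o, Map.of());
            Set<String> y = lockList.get(k);          // step 2: lock entry with l == k
            return y != null && y.contains(alpha);    // permitted iff alpha in y
        }
    }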

Comparison of Methods

              Capability list   Access Control List   Locks & Keys
propagation   Good              Bad                   Good
review        Bad               Good                  Good
revocation    Bad               Good                  Good
reclamation   Bad               Good                  Good

1. need copy bit/count for control
2. need reference count
3. need user/hierarchical control
4. need to know subject-key mapping

Changing the Protection State

The access matrix is itself a protected object.
Commands for changing the protection state:
A set of commands C for changing the protection state is defined
in terms of the following primitive operations:
enter r into P[s, o]
delete r from P[s, o]
create subject s
create object o
destroy subject s
destroy object o

Primitive operations define the changes to be made to the access
matrix P.
Example: the primitive operation "delete r from P[s, o]" deletes
access right r from the position P[s, o] in the access matrix,
i.e., access right r of subject s to object o is withdrawn.

Changing the Protection State (cont.)

Before an operation is performed (e.g., the delete in the
previous example), a verification should be made that the
process has the right to perform this operation on the access
matrix.
Command syntax:
command <command id> (<formal parameters>)
  if <conditions>
  then
    <list of primitive operations>
end.

Command execution:
All checks in the condition part are evaluated. The <conditions>
part has checks of the form "r in P[s, o]".
If all checks pass, the primitive operations in <list of
primitive operations> are executed.

Changing the Protection State (cont.)

All accesses are validated by a mechanism called a reference
monitor: the reference monitor can reject an access not allowed
by the access matrix.
Each object has an owner.
If s is the owner of o, then own is in P[s, o].
The owner of an object can give a right to the object to another
subject.

Example: a command to create a file and assign own and read
rights to it:
command create-read (process, file)
  create object file
  enter own into P[process, file]
  enter read into P[process, file]
end.

Changing the Protection State (cont.)

Example: a command by which the owner of a file gives write
access rights to another process:
command confer-write (owner, process, file)
  if own in P[owner, file]
  then
    enter write into P[process, file]
end.

Safety in the Access Matrix Model

The AMM is safe if a subject cannot acquire an access right to
an object without the consent of the object's owner.
A command may leak right r from a state Q = (S, O, P) if it
enters r into a cell of P that did not have r.
The AMM is safe if a subject can determine whether its actions
can result in the leakage of a right to unauthorized subjects.
A state Q is unsafe for r if there exists a command that leaks r
from Q; otherwise we say Q is safe for r.
Safety is undecidable for general protection systems.
Safety can be decided for mono-operational systems.

Mono-Operational Commands
A single primitive operation in a command.
Example: make process p the owner of file g:
command makeowner(p, g)
  enter own into A[p, g];
end

Note: mono-operational commands can also be conditional or
biconditional.

Advanced Models of Protection
1. Take-Grant model
2. Bell-LaPadula model
3. Lattice model

1. Take-Grant Model

Principles:
Uses directed graphs to model access control.
The protection state of the system is represented by a directed
graph.
More efficient than the (sparsely populated) access matrix.

1. Take-Grant Model

Model:
Graph nodes: subjects and objects.
An edge from node x to node y indicates that subject x has an
access right to the object y; the edge is tagged with the
corresponding access rights.
Access rights:
Read (r), write (w), execute (e).
Special access rights for propagating access rights to other
nodes:
Take: if node x has access right take to node y, then subject x
can take any access right that y has to another node.
Grant: if node x has access right grant to node y, then node y
can be granted any of the access rights that node x has.

Example: take operation

Node x has take access to node y.
Node y has read and write access to node z.
Node x can take the access right read from y and have this
access right for object z: a directed edge labeled r is added
from node x to node z.

Example: grant operation

Node x has grant access to node y, and also has read and write
access to node z.
Node x can grant read access for z to node y: a directed edge
labeled r from y to z is added in the graph.

State and state transitions:

The protection state of the system is represented by the
directed graph.
The system changes state (a state transition) when the directed
graph changes.
The directed graph changes with the following operations:
Take.
Grant.
Create: a new node is added to the graph. When node x creates a
new node y, a directed edge is added from x to y.
Remove: a node deletes some of its access rights to another
node.

2. Bell-LaPadula Model
Used to control information flow.
Model components:
Subjects, objects, and an access matrix.
Several ordered security levels.
Each subject has a (maximum) clearance and a current clearance
level.
Each object has a classification (i.e., belongs to a security
level).

Subjects can have the following access rights to objects:
Read-only.
Append: the subject can only write the object (no read
permitted).
Execute: no read or write.
Read-write: both read and write are permitted.

The subject that creates an object has the control attribute to
that object and is the controller of the object.
The controller can pass any of the four access rights of the
controlled object to another subject.

Properties for a state to be secure:
The simple security property (restricts reading up).
The star property (prohibits writing down).

Tranquility principle:
No operation may change the classification of an active object.

Bell-LaPadula Model (cont.)

Restrictions on information flow and access control (the
reading-down and writing-up properties):
1. The simple security property:
A subject cannot have read access to an object with a
classification higher than the clearance level of the subject.
2. The *-property (star property):
A subject has append (i.e., write-only) access only to objects
whose classification (i.e., security level) is higher than or
equal to the current security clearance level of the subject.
A subject has read access only to objects whose classification
is lower than or equal to the current security clearance level
of the subject.
A subject has read-write access only to objects whose
classification is equal to the current security clearance level
of the subject.
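The two properties reduce to simple comparisons on ordered
levels. A hedged Java sketch, assuming integer security levels
(the class and method names are illustrative):

    // Minimal Bell-LaPadula checks over integer security levels.
    class BellLaPadula {
        /** Simple security property: no read up. */
        static boolean canRead(int subjectClearance, int objectClass) {
            return objectClass <= subjectClearance;
        }
        /** *-property: append (write-only) must not write down. */
        static boolean canAppend(int subjectClearance, int objectClass) {
            return objectClass >= subjectClearance;
        }
        /** Read-write only at exactly the subject's current level. */
        static boolean canReadWrite(int subjectClearance, int objectClass) {
            return objectClass == subjectClearance;
        }
    }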

Bell-LaPadula Model (cont.)

Figure: security levels 1 ... n. A subject at level i can read
objects at level i and below (down to level 1), and can write
(append) to objects at level i and above (up to level n).

Figure (star property): a subject with clearance level i has
read-write access to objects classified at exactly level i,
write access to objects classified above it, and read access to
objects classified below it.

3. The Lattice Model

The best-known information flow model.
Based upon the concept of a lattice, whose mathematical meaning
is a structure consisting of a finite partially ordered set
together with least upper bound and greatest lower bound
operators on the set.
A lattice is a directed acyclic graph (DAG) with a single source
and sink.
Information is permitted to flow from a lower class to an upper
class.

The lattice model (continued)

Figure: the lattice of security classes formed by the subsets of
{x, y, z}: the source {} at the bottom, then {x}, {y}, {z}, then
{x,y}, {x,z}, {y,z}, and the sink {x,y,z} at the top.

This satisfies the definition of a lattice: there is a single
source and a single sink. The least upper bound of the security
classes {x} and {z} is {x,z}, and the greatest lower bound of
the security classes {x,y} and {y,z} is {y}.
Flow Properties of a Lattice

The flow relation -> is reflexive, transitive and antisymmetric
for all A, B, C in SC (the set of security classes):
Reflexive: A -> A.
Information flow from an object to another object at the same
class does not violate security.
Transitive: A -> B and B -> C implies A -> C.
This indicates that a valid flow does not necessarily occur only
between two classes adjacent to each other in the partial
ordering.
Antisymmetric: A -> B and B -> A implies A = B.
If information can flow back and forth between two objects, they
must have the same class.

Flow Properties of a Lattice (cont.)

Two other inherent properties are as follows:
Aggregation: A -> C and B -> C implies A U B -> C.
If information can flow from both A and B to C, the information
aggregate of A and B can flow to C.
Separation: A U B -> C implies A -> C and B -> C.
If the information aggregate of A and B can flow to C,
information can flow from either A or B to C.

Application of the Lattice Model

Military security model:
The objects are related to the information which is to be
protected.
The objects are ranked (R) in 4 security categories:
Unclassified: least sensitive, e.g. {}.
Confidential: single entities, e.g. {x}, {y}, {z}.
Secret: next-level combinations, e.g. {x,y}, {y,z}, {x,z}.
Top secret: the most sensitive or highest-level combination,
e.g. {x,y,z}.
Note: the ranks can be assigned as per the needs of the
information flow.

The objects are associated with one or more compartments (C).
The compartments are based on subject relevance and enforce the
need-to-know rule.
The subjects also have security levels and compartments.
A class is associated with each object such that O = (Ro, Co).
A clearance is associated with each subject such that
S = (Rs, Cs).
The dominates relation between the classes of objects and the
clearances of subjects defines a partial order that turns out to
be a lattice.

A lattice for a military security model with two ranks, say
unclassified (1) and confidential (2), and two compartments p
and s:

Figure: the classes (1,{}), (1,{p}), (1,{s}), (1,{p,s}),
(2,{}), (2,{p}), (2,{s}), (2,{p,s}), ordered by rank and by
compartment inclusion.

The largest element is the class (2, {p,s}) and the smallest
element is (1, {}).

Mode of Information Flow:

Information flow from object x to object y is denoted x -> y.
It indicates that the information stored in x is used to derive
information transferred to y.
Information flow can be:
explicit, e.g. the assignment y := x, where y directly depends
on x;
implicit, e.g. a conditional such as "if x = 1 then y := 1",
where y conditionally depends on x.

Question Bank 6
Explain the various implementations of the access matrix with
suitable examples.
Explain the Take-Grant model of information flow with a suitable
example.
How does the Bell-LaPadula model deal with the control of
information flow?
Explain the lattice model of information flow with a suitable
example.
Write short notes on:
(i) Protection state
(ii) Safety in the access matrix model

Chapter 2: Data Security

Introduction
An unauthorized user may gain access to confidential
information.
A user may bypass the protection mechanisms of the system.
For extra protection, techniques are needed to ensure that an
intruder is unable to understand or make use of any information
obtained by wrongful access.
Cryptography can be used for this extra protection: converting a
piece of text into cryptic form before storing it on the
computer.

Model of Cryptography
Terminology:
Plaintext (cleartext, the original message).
Ciphertext (the message in encrypted form).
Encryption (the process of converting plaintext to ciphertext).
Decryption (the process of converting ciphertext to plaintext).
Cryptosystem (a system for encryption and decryption of
information).
Symmetric cryptography: the key is the same for both encryption
and decryption.
Asymmetric cryptography: the keys for encryption and decryption
are different.

General Structure of a Cryptographic System

M = plaintext, C = ciphertext.
Encryption with key Ke: C = EKe(M).
Decryption with key Kd: DKd(C) = M.
SI = side information; CA = cryptanalyst.

Potential threats:
1. Ciphertext-only attack
2. Known-plaintext attack
3. Chosen-plaintext attack

Design Principles
Shannon's principles:
(Support conventional cryptography)
1. Principle of diffusion: spreading the correlations and
dependencies among key-string variables over substrings as much
as possible, so as to maximize the length of plaintext needed to
break the system.
2. Principle of confusion: changing a piece of information so
that the output has no obvious relation with the input.
The exhaustive search principle:
(Supports modern cryptography)
3. Determination of the key needed to break the system should
require an exhaustive search of a huge key space.

Classification of Cryptographic Systems

Cryptographic systems divide into conventional systems and
modern systems; modern systems (based on the open design
principle) divide further into private key systems and public
key systems.

Conventional Cryptography
Based on substitution ciphers:
1. Caesar cipher (number of keys <= 25):
A letter is transformed into the third letter following it in
the alphabetical sequence.
E: M -> (M + 3) % 26, where 0 <= M <= 25.
2. Simple substitution (number of keys = 26!, almost > 10^26):
Any permutation of letters can be mapped to the English letters.
Positional correlation is eliminated.
3. Polyalphabetic ciphers (number of keys = (26!)^n):
Use a periodic sequence of n substitution alphabet ciphers; the
system switches among the n substitution alphabet ciphers
periodically.
E.g. a Vigenere cipher with the periodic sequence of integers
11, 19, 4, 22, 9, 25: positions 1, 7, 13, ... use shift 11,
whereas positions 2, 8, 14, ... use shift 19, and so on.
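A runnable Java sketch of the Caesar transformation
E: M -> (M + 3) % 26 described above; making the shift a
parameter is a small extension for illustration:

    public class CaesarCipher {
        /** Encrypts upper-case letters with shift k; leaves others as-is. */
        static String encrypt(String plain, int k) {
            StringBuilder out = new StringBuilder();
            for (char ch : plain.toCharArray()) {
                if (ch >= 'A' && ch <= 'Z') {
                    out.append((char) ('A' + (ch - 'A' + k) % 26));
                } else {
                    out.append(ch);
                }
            }
            return out.toString();
        }

        public static void main(String[] args) {
            String c = encrypt("ATTACK", 3);
            System.out.println(c);                  // DWWDFN
            System.out.println(encrypt(c, 23));     // decrypt: shift by 26 - 3
        }
    }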

Modern Cryptography
1. Private key cryptography
Based on the Data Encryption Standard developed by IBM.
Two basic operations:
1. Permutation: permutes the bits of a word (to provide
diffusion).
2. Substitution: replaces an m-bit input by an n-bit output,
with no simple correlation between input and output (to provide
confusion). It proceeds as follows:
(i) convert the m-bit input to decimal form;
(ii) the decimal output is permuted to give another decimal
number;
(iii) the final decimal output is converted into the n-bit
output.

Data Encryption Standard (DES)

DES is a block cipher that encrypts 64-bit data blocks using a
56-bit key.
Error detection is provided by adding 8 parity bits.
The basic components involved are:
Plaintext: X
Initial permutation: IP( )
Round i: 1 <= i <= 16
32-bit switch: SW( )
Inverse IP: IP^-1( )
Ciphertext: Y

Three steps:
1. The plaintext undergoes an initial permutation (IP), in which
the 64 bits of the block are permuted.
2. The permuted block goes through a complex transformation
using the key, involving 16 iterations (rounds) and then a
32-bit switch (SW).
3. The output of step (2) goes through a final permutation,
which is the inverse of step (1).
The output of step (3) is the ciphertext.

Figure: 64-bit plaintext (X) -> initial permutation (IP) ->
rounds 1..16, each fed a 48-bit key Ki derived from the 56-bit
key (K) by key generation (KeyGen) -> inverse of the initial
permutation (IP^-1) -> 64-bit ciphertext (Y).

Iterative Transformation

The iterative transformation step consists of 16 functionally
identical iterations.
Let Li be the left 32-bit half and Ri the right 32-bit half
after the ith iteration. Then:
Li = Ri-1 and Ri = Li-1 XOR f(Ri-1, Ki)
where Ki is the 48-bit key of round i.
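The round structure is easy to express in code. A hedged Java
sketch of the Feistel iteration Li = Ri-1, Ri = Li-1 XOR
f(Ri-1, Ki), using a toy round function in place of the real DES
f (expansion, S-boxes and permutation are omitted):

    public class FeistelSketch {
        /** Toy round function standing in for the real DES f. */
        static int f(int r, int k) {
            return Integer.rotateLeft(r ^ k, 3);
        }

        /** Runs one Feistel iteration per key over two 32-bit halves. */
        static int[] encrypt(int l, int r, int[] keys) {
            for (int k : keys) {
                int newL = r;                 // Li = Ri-1
                int newR = l ^ f(r, k);       // Ri = Li-1 XOR f(Ri-1, Ki)
                l = newL; r = newR;
            }
            return new int[] { l, r };
        }

        /** Decryption: the same structure with the keys reversed. */
        static int[] decrypt(int l, int r, int[] keys) {
            for (int i = keys.length - 1; i >= 0; i--) {
                int newR = l;                 // Ri-1 = Li
                int newL = r ^ f(l, keys[i]); // Li-1 = Ri XOR f(Li, Ki)
                l = newL; r = newR;
            }
            return new int[] { l, r };
        }

        public static void main(String[] args) {
            int[] keys = { 0xA5A5, 0x3C3C, 0x0F0F };
            int[] c = encrypt(0x01234567, 0x89ABCDEF, keys);
            int[] p = decrypt(c[0], c[1], keys);
            System.out.printf("%08X %08X%n", p[0], p[1]); // original halves
        }
    }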

Steps of f:
1. The 32-bit Ri-1 is expanded to the 48-bit E(Ri-1), based on a
permutation with duplication.
2. An Ex-OR operation is performed between the 48-bit key Ki and
E(Ri-1). The 48-bit output is partitioned into 8 blocks
S1, S2, ..., S8 of 6 bits each.
3. Each Si, 1 <= i <= 8, is fed into a separate 6-to-4
substitution box (S-box).
4. The 32-bit output of the 8 substitution boxes is fed to a
permutation box, whose 32-bit output is f.

Decryption
Uses the same algorithm as encryption, with the order of the
keys reversed (Key16, Key15, ..., Key1), based on:
Ri-1 = Li
Li-1 = Ri XOR f(Li, Ki)
For example, the first decryption step IP undoes the IP^-1 step
of encryption, and the final decryption step undoes the
permutation IP performed in the first encryption step, yielding
the original plaintext block.

2. Public Key Cryptography

The encryption procedure E is in the public domain; the
decryption procedure D is secret.
The encryption procedure E and decryption procedure D must
satisfy the following properties:
1. For every message M, D(E(M)) = M.
2. E and D can be efficiently applied to any message M.
3. Knowledge of E does not compromise security: it should be
impossible to derive D from E.

Public key cryptography allows two users to have secure
communication even if they have never communicated before.

Rivest-Shamir-Adleman Method
Popularly known as the RSA method.
The binary plaintext is divided into blocks; each block is
represented by an integer between 0 and n-1.
The encryption key is a pair (e, n), where e is a positive
integer.
A message M is encrypted by raising it to the eth power modulo
n:
C = M^e mod n
The ciphertext C is an integer between 0 and n-1; encryption
does not increase the length of the plaintext.
The decryption key (d, n) is a pair where d is a positive
integer.

Rivest-Shamir-Adleman (cont.)
A ciphertext block C is decrypted by raising it to the dth power
modulo n:
M = C^d mod n
Each user possesses an encryption key (eX, nX) and a decryption
key (dX, nX); the encryption key is available in the public
domain, but the decryption key is known to the user only.

Rivest-Shamir-Adleman (cont.)

M --(e, n)--> C = M^e mod n --(d, n)--> C^d mod n = M
<< (e, n) : encryption key of the user >>
<< (d, n) : decryption key of the user >>

Determination of Keys

1. Choose two large prime numbers p and q, and define
n = p * q.
2. p and q should be chosen such that it is practically
impossible to determine them by factoring n.
3. Choose any large integer d such that
gcd(d, (p-1)*(q-1)) = 1.
4. Compute the integer e such that it is the multiplicative
inverse of d modulo (p-1)*(q-1).

Example of RSA
Let p = 5 and q = 11, so that n = p x q = 55.
Therefore (p-1) x (q-1) = 40.
Let d = 23, as 23 and 40 are relatively prime, i.e.
gcd(23, 40) = 1.
Choose e such that d x e (modulo 40) = 1; note e = 7.
Consider any integer between 0 and 55 and execute encryption and
decryption on it:

M    M^7          C = M^7 mod 55    C^23                M = C^23 mod 55
8    2097152      2                 8388608             8
9    4782969      4                 70368744177664      9
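This toy example can be checked mechanically. A short Java
sketch using java.math.BigInteger, which also recomputes e from
d rather than assuming it:

    import java.math.BigInteger;

    public class RsaToyExample {
        public static void main(String[] args) {
            BigInteger p = BigInteger.valueOf(5), q = BigInteger.valueOf(11);
            BigInteger n = p.multiply(q);                            // 55
            BigInteger phi = p.subtract(BigInteger.ONE)
                              .multiply(q.subtract(BigInteger.ONE)); // 40
            BigInteger d = BigInteger.valueOf(23);
            BigInteger e = d.modInverse(phi);                        // 7

            BigInteger m = BigInteger.valueOf(8);
            BigInteger c = m.modPow(e, n);                           // 8^7 mod 55 = 2
            BigInteger back = c.modPow(d, n);                        // 2^23 mod 55 = 8
            System.out.println("e=" + e + " C=" + c + " M=" + back);
        }
    }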

Question Bank 7
What do you mean by data security?
Explain in detail the model of cryptography.
Explain the concept of public key cryptography with a suitable
example.
Explain the concept of private key cryptography with suitable
examples.
Write a note on the Data Encryption Standard.
Discuss the Rivest-Shamir-Adleman method with a suitable
example.

Unit VI: Applications of Distributed Systems

Chapter 1: Distributed Multimedia Systems
Introduction
Characteristics of multimedia systems
Quality of Service Management
Resource Management
Stream Adaptation
Case Study

Chapter 2: Designing Distributed Systems (Google Case Study)
Introducing the Case Study: Google
Overall architecture and Design Paradigm
Communication Paradigm
Data Storage and Coordination Services
Distributed Computation Services

Chapter 1: Distributed Multimedia Systems

Introduction:
Modern computers can handle streams of continuous, time-based
data such as digital audio and video.
This capability has led to the development of distributed
multimedia applications.
The requirements of multimedia applications differ significantly
from those of traditional real-time applications:
Multimedia applications are highly distributed and therefore
compete with other distributed applications for network
bandwidth and computing resources.
The resource requirements of multimedia applications are
dynamic.

A distributed multimedia system

[Figure: a typical distributed multimedia system - a video camera and mike, a video server and a digital TV/radio server on local networks, connected through a wide area gateway.]

The above figure illustrates a typical distributed multimedia system, capable of supporting a variety of applications:
Non-interactive applications: net radio and TV, video-on-demand, e-learning, ... [may be one-way communication]
Interactive applications: voice & video conferencing, ...
673

Basic requirement of such systems

Characteristics of multimedia applications:
Timely delivery of streams of multimedia data to end-users
(audio samples, video frames)
To meet the timing requirements: QoS (quality of service)
Different from traditional real-time systems
674

Typical Multimedia Applications without QoS

Web-based multimedia
Provides best-effort access to streams of audio/video data via the web
Extensive buffering affects the performance
Effective when there is little need for the synchronization of data streams

Network phone and audio conference
Requires relatively low bandwidth
Efficient compression techniques
High interactive latency

Video-on-demand services
Supply video information in digital form, from large online storage systems to the user display
Require sufficient dedicated network bandwidth
Assume that the video server and the receiving stations are dedicated
675

QoS management
Traditional real-time systems
E.g. avionics, air traffic control, telephone switching
Small quantities of data, strict time requirements
QoS management: a fixed schedule that ensures worst-case requirements are always met

Different requirements of multimedia applications:
General environment
Compete with other distributed applications for network bandwidth and computing resources
Dynamic resource requirements
E.g. the number of participants of a video conference may vary
Users participate in the control of resource consumption
676

Highly Interactive Applications

Examples
Videoconference (cooperative, involves several users)
Distributed online ensemble (synchronous, close coordination)

Requirements
Low-latency communication
Round-trip delays of 100-300 ms, so that interaction between users feels synchronous

Synchronous distributed state
If one user stops a video on a given frame, the other users should see it stopped at the same frame

Media synchronization
All participants in a music performance should hear the performance at approximately the same time

External synchronization (other formats)
Sometimes, other information needs to be synchronized with the time-based multimedia streams

Expecting rigorous QoS management
677

The Window of Scarcity

Many of today's computer systems provide some capacity to handle multimedia data, but the necessary resources are very limited.
Especially when dealing with large audio and video streams, many systems are constrained in the quantity and quality of the streams they can support.
This situation is depicted as the Window of Scarcity.
678

The window of scarcity for computing and communication

[Figure: application classes (remote login, network file access, high-quality audio, interactive video) plotted against time (1980, 1990, 2000), moving from insufficient resources through scarce resources to abundant resources.]

A history of computer systems that support distributed data access.
679

The Window of Scarcity: operation

If a certain class of application lies within this window, a system needs to allocate and schedule its resources carefully in order to provide the desired service.
Before the window of scarcity is reached, a system has insufficient resources to execute the relevant applications.
Once an application class has left the window of scarcity, system performance will be sufficient to provide the service even under adverse circumstances.
680

Characteristics of Multimedia Data
Multimedia data (video and audio) is continuous and time-based.
Continuous data is represented as a sequence of discrete values that replace each other over time.
This refers to the user's view of the data:
Video: an image array is replaced 25 times per second
Audio: the amplitude value is replaced 8000 times per second

Time-based (or isochronous) data is so called because the timed data elements in audio and video streams define the semantics or content of the stream.
The time at which the values are played affects the validity of the data. Hence, the timing should be preserved.
The delivery delay for each element is bounded by a value.
681

Multimedia data is often bulky. Hence the data should be moved with high throughput.
[A table of typical data rates and frame/sample frequencies accompanies this slide.]
The resource bandwidth requirements for some media are very large, especially for video of reasonable quality.
A standard TV/video stream requires more than 120 Mbps.
The figures for HDTV are even higher, and in videoconferencing there is a need to handle multiple streams.
682

Data compression
Reduces bandwidth requirements by factors between 10 and 100.
Available in various formats like GIF, TIFF, JPEG, MPEG-1, MPEG-2, MPEG-4.
It imposes substantial additional loads on processing resources at the source and destination.
E.g. the video and audio coders/decoders found on video cards.

The compression method in MPEG video formats is asymmetric, with a complex compression algorithm but simpler decompression algorithms.
The variety of modern devices also requires transcoding approaches for data compression and decompression to maintain the quality of the digital data.
683

QoS Management
When multimedia applications run in networks of PCs, they compete for resources at the workstations running the applications and in the network.
In a multi-tasking operating system, the central processor is allocated to individual tasks in a round-robin or other scheduling scheme.
The key feature of these schemes is that they handle increases in demand by spreading the available resources more thinly between the competing tasks.
The timely processing and transmission of multimedia streams is crucial. In order to achieve timely delivery, applications need guarantees that the necessary resources will be allocated and scheduled at the required times.
The management and allocation of resources to provide such guarantees is referred to as Quality of Service Management (QoS Management).
684

QoS management is based on:

Architecture of a typical system
Provides infrastructure for the various components of multimedia applications:
Source: stream processors
Connections: network connection, in-memory transfer
Target
Each process must be allocated adequate CPU time, memory capacity and network bandwidth.

Resource requirements
Provide QoS specifications for the components of multimedia applications to the QoS Manager.
685

Typical infrastructure components for multimedia applications

[Figure: two PCs/workstations with window systems, a camera, microphones and a screen; codec and mixer components; network connections; and a video file system with codec and video store. Arrows denote multimedia streams.]

White boxes represent media processing components, many of which are implemented in software, including:
codec: coding/decoding filter
mixer: sound-mixing component
686

The above figure shows the most commonly used abstract architecture for multimedia software.
Continuously flowing streams of media data elements are processed by a collection of processes and transferred between those processes by inter-process connections.
The processes produce, transform and consume continuous streams of multimedia data.
The connections link the processes in a sequence from a source of media elements to a target.
For the elements of multimedia data to arrive at their target on time, each process must be allocated adequate resources to perform its task and must be scheduled to use the resources sufficiently frequently to enable it to deliver the data elements in its stream to the next process on time.
687

QoS specifications for components

Component          | Bandwidth                                        | Latency     | Loss rate | Resources required
Camera             | Out: 10 frames/sec, raw video 640x480x16 bits    | Zero        |           |
Codec              | In: 10 frames/sec, raw video; Out: MPEG-1 stream | Interactive | Low       | 10 ms CPU each 100 ms; 10 Mbytes RAM
Mixer              | In: 2 x 44 kbps audio; Out: 1 x 44 kbps audio    | Interactive | Very low  | 1 ms CPU each 100 ms; 1 Mbytes RAM
Window system      | In: various; Out: 50 frames/sec framebuffer      | Interactive | Low       | 5 ms CPU each 100 ms; 5 Mbytes RAM
Network connection | In/Out: MPEG-1 stream, approx. 1.5 Mbps          | Interactive | Low       | 1.5 Mbps, low-loss stream protocol
Network connection | In/Out: audio 44 kbps                            | Interactive | Very low  | 44 kbps, very-low-loss stream protocol

The above table sets out the resource requirements for the main software components and network connections in the previous figure.
The required resources can be guaranteed only if there is a system component responsible for the allocation and scheduling of those resources.
688

QoS Manager's Tasks

The QoS Manager's two main subtasks are:
Quality of Service Negotiation
Applications specify their resource requirements
The QoS manager evaluates the feasibility
Gives a positive or negative response

Admission control
Applications run under a resource contract
Recycle the released resources
689

The QoS manager's task (flowchart)

QoS negotiation:
1. Application components specify their QoS requirements to the QoS manager (flow spec).
2. The QoS manager evaluates the new requirements against the available resources. Sufficient?
   - Yes: reserve the requested resources (resource contract) and allow the application to proceed; the application runs with resources as per the resource contract.
   - No: negotiate a reduced resource provision with the application. Agreement? If yes, reserve; if no, do not allow the application to proceed.

Admission control:
A running application notifies the QoS manager of increased resource requirements, re-entering the negotiation.
690

QoS Negotiation
The application indicates its resource requirements to the QoS manager.
To negotiate QoS between an application and its underlying system, the application must specify its QoS requirements to the QoS manager.
This is done by transmitting a set of parameters.
691

QoS Negotiation Parameters

Bandwidth: the rate at which data flows through a multimedia stream.
Latency: the time required for an individual data element to move through a stream from the source to the destination.
Loss rate: the rate at which data elements are dropped due to untimely delivery.
692

The usage of resource requirements specifications

Describe a multimedia stream
Describe the characteristics of a multimedia stream in a particular environment
E.g. a video conference:
Bandwidth: 1.5 Mbps; delay: 150 ms; loss rate: 1%

Describe the resources
Describe the capabilities of resources to transport a stream
E.g. a network may provide:
Bandwidth: 64 kbps; delay: 10 ms; loss rate: ...
693

Specify the QoS parameters for streams

Bandwidth
Specified as a minimum-maximum value or an average value
The required bandwidth varies according to the compression rate of the video, e.g. 1:50 - 1:100 for MPEG video

Specify burstiness
Different traffic patterns of streams with the same average bandwidth
LBAP model: Rt + B, where R is the rate and B is the maximum size of a burst
694

Specify the QoS parameters for streams (2)

Latency
The frames of a stream should be processed at the same rate at which frames arrive
Low enough not to be perceived by humans
E.g. 150 ms for interactive apps, 500 ms for video-on-demand

No jitter
Jitter: the variation in the period between the delivery of two adjacent frames

Loss rate
Typically expressed as a probability
Calculated based on worst-case assumptions or on standard distributions
695

Traffic Shaping
Traffic shaping is the term used to describe the use of output buffering to smooth the flow of data elements.
The bandwidth parameter of a multimedia stream provides an idealistic approximation of the actual traffic pattern.
The closer the actual pattern matches the description, the better the system will handle the traffic.
696

LBAP Model of bandwidth variations

This calls for the regulation of the burstiness of multimedia streams.
Any stream can be regulated by inserting a buffer at the source and by defining a method by which data elements leave the buffer.
This can be illustrated using the following algorithms:
Leaky Bucket
Token Bucket
697

Leaky Bucket Algorithm

The bucket can be filled arbitrarily with water until it is full. Through a leak at the bottom of the bucket, water flows out at a constant rate.
The algorithm ensures that a stream will never flow at a rate higher than R.
The size of the buffer B defines the maximum burst a stream can incur without losing elements.
This algorithm completely eliminates bursts.
698
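
A minimal leaky-bucket sketch in Python (the class interface is an assumption for illustration): arrivals fill a buffer of size B, and output drains at the constant rate R, so bursts are eliminated.

class LeakyBucket:
    def __init__(self, rate, capacity):
        self.rate = rate            # constant drain rate R (bytes/sec)
        self.capacity = capacity    # buffer size B: the maximum burst absorbed
        self.level = 0.0            # bytes currently queued

    def arrive(self, nbytes):
        # Called when stream data arrives; returns False if elements are lost.
        if self.level + nbytes > self.capacity:
            return False            # buffer overflow: the burst was too large
        self.level += nbytes
        return True

    def drain(self, dt):
        # Called as time advances: output leaves at exactly rate R.
        self.level = max(0.0, self.level - self.rate * dt)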

Token Bucket Algorithm

The elimination of bursts in the previous algorithm is not necessary as long as bandwidth is bounded over any time interval.
The token bucket algorithm allows larger bursts to occur when the stream has been idle for a while.
Tokens are generated at a rate R and collected in a bucket of size B. Data of size S can be sent only when at least S tokens are in the bucket.
This ensures that over any interval t the amount of data sent is not larger than Rt + B.
699
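
A minimal token-bucket sketch (interface again assumed): tokens accumulate at rate R up to the bucket size B, and data is sent only while tokens are available, so the amount sent over any interval t is bounded by Rt + B.

import time

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate = rate              # token generation rate R (bytes/sec)
        self.burst = burst            # bucket size B (bytes)
        self.tokens = burst
        self.last = time.monotonic()

    def try_send(self, nbytes):
        now = time.monotonic()
        # Top up the bucket for the elapsed time, capped at B.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= nbytes:     # enough tokens: send immediately
            self.tokens -= nbytes
            return True
        return False                  # otherwise hold the data back

With an idle stream the bucket fills to B, so a burst of up to B bytes can then be sent at once, which is exactly the behaviour described above.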

The RFC 1363 Flow Spec

Protocol version
Maximum transmission unit
Bandwidth:
  Token bucket rate
  Token bucket size
  Maximum transmission rate
Delay:
  Minimum delay noticed
  Maximum delay variation
Loss:
  Loss sensitivity
  Burst loss sensitivity
  Loss interval
Quality of guarantee
700

Flow Specifications
A collection of QoS parameters is typically known as a flow specification, or flow spec for short.
Several examples of flow specs exist. In Internet RFC 1363, a flow spec is defined as a set of 16-bit numeric values, which reflect the QoS parameters.
701
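
As a sketch, a flow spec can be pictured as a simple record; the Python field names below paraphrase the RFC 1363 parameters and are not the actual wire format.

from dataclasses import dataclass

@dataclass
class FlowSpec:                      # each field is a 16-bit numeric value
    version: int                     # protocol version
    mtu: int                         # maximum transmission unit
    token_bucket_rate: int           # bandwidth: R
    token_bucket_size: int           # bandwidth: B
    max_transmission_rate: int
    min_delay_noticed: int           # delay
    max_delay_variation: int         # jitter bound
    loss_sensitivity: int
    burst_loss_sensitivity: int
    loss_interval: int
    quality_of_guarantee: int        # from best-effort up to guaranteed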

QoS Admission Control

Admission control regulates access to resources to avoid resource overload.
It protects resources from requests that they cannot fulfil.
An admission control scheme is based on the overall system capacity and the load generated by each application.
702

QoS Admission Control

Bandwidth reservation:
A common way to ensure a certain QoS level for a multimedia stream is to reserve some portion of resource bandwidth for its exclusive use.
Used for applications that cannot adapt to different QoS levels, e.g. X-ray video.

Statistical multiplexing:
Reserve the minimum or average bandwidth.
Handle bursts that cause some service-level drop occasionally.
Hypothesis: over a large number of streams, the aggregate bandwidth required remains nearly constant regardless of the bandwidth of individual streams.
703

Resource Management
To provide a certain QoS level to an application, a system needs to have sufficient resources; it also needs to make the resources available to the application when they are needed (scheduling).
Resource Scheduling: processes need to have resources assigned to them according to their priority. The following two methods are used:
Fair scheduling
  Round-robin
  Packet-by-packet
  Bit-by-bit
  Weighted fair queuing
Real-time scheduling
  Earliest-deadline-first (EDF)
704

(i) Fair Scheduling
If several streams compete for the same resource, it becomes necessary to consider fairness and to prevent ill-behaved streams taking too much bandwidth.
A straightforward approach is to apply round-robin scheduling to all streams in the same class, to ensure fairness.
Nagle introduced a method on a packet-by-packet basis that provides more fairness w.r.t. varying packet sizes and arrival times. This is called Fair Queuing.
705
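
A sketch of the straightforward round-robin scheme over per-stream packet queues: each stream in the class sends at most one packet per round, so an ill-behaved stream cannot monopolize the resource.

from collections import deque

def round_robin(streams):
    # streams: a list of per-stream packet queues; yields packets fairly.
    while any(streams):
        for queue in streams:
            if queue:
                yield queue.popleft()

greedy = deque(["a1", "a2", "a3"])           # an ill-behaved, bursty stream
modest = deque(["b1"])
print(list(round_robin([greedy, modest])))   # ['a1', 'b1', 'a2', 'a3']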

(ii) Real-time Scheduling
Several algorithms were developed to meet the CPU scheduling needs of applications.
Traditional real-time scheduling methods suit the model of regular continuous multimedia streams very well.
An Earliest-Deadline-First (EDF) scheduler uses a deadline associated with each of its work items to determine the next item: the item with the earliest deadline goes first.
706
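
A minimal EDF sketch using a priority queue ordered by deadline; the scheduler always hands out the work item whose deadline is nearest.

import heapq

class EDFScheduler:
    def __init__(self):
        self.queue = []                              # (deadline, item) pairs

    def submit(self, deadline, item):
        heapq.heappush(self.queue, (deadline, item))

    def next_item(self):
        deadline, item = heapq.heappop(self.queue)   # earliest deadline first
        return item

sched = EDFScheduler()
sched.submit(120, "video frame")                     # deadlines in ms
sched.submit(40, "audio sample")
print(sched.next_item())                             # 'audio sample'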

Stream Adaptation
The simplest form of adjustment when QoS cannot be guaranteed is adjusting a stream's performance by dropping pieces of information.
Two methodologies are used:
Scaling
Filtering
707

Scaling
Best applied when live streams are sampled.
Scaling algorithms are media-dependent, although the overall scaling approach is the same: to subsample a given signal.
A system to perform scaling consists of a monitor process at the target and a scaler process at the source.
The monitor keeps track of the arrival times of messages in a stream. Delayed messages are an indication of a bottleneck in the system.
The monitor sends a scale-down message to the source; once the bottleneck clears, the stream can be scaled up again.
708
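
A sketch of the monitor/scaler feedback loop just described; the threshold, window and message names are assumptions for illustration only.

LATE_THRESHOLD_MS = 50          # assumed tolerance for a late message

def monitor(arrival_delays_ms, send_to_source):
    # Runs at the target: watches arrival delays in the current window and
    # asks the source to subsample when messages are consistently late.
    late = sum(1 for d in arrival_delays_ms if d > LATE_THRESHOLD_MS)
    if late > len(arrival_delays_ms) // 2:
        send_to_source("scale-down")     # bottleneck suspected
    elif late == 0:
        send_to_source("scale-up")       # bottleneck appears to have cleared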

Filtering

It is a method that provides the best possible QoS to each target by applying scaling at each relevant node on the path from the source to the target.
Filtering requires that a stream be partitioned into a set of hierarchical substreams, each adding a higher level of quality.
A substream is not filtered at an intermediate node if somewhere downstream a path exists that can carry the entire substream.
709

Case study: The Tiger video file server

A video storage system that supplies multiple real-time video streams simultaneously is an important component to support consumer-oriented multimedia applications.
One of the most advanced prototypes of these is the Tiger video file server.
710

Design goals
Video-on-demand for a large number of users
A large stored digital movie library
The delay before receiving the first frame is within a few seconds
Users can perform pause, rewind and fast-forward

Quality of service
Constant rate, a maximum jitter and a low loss rate

Scalable and distributed
Support up to 10,000 clients simultaneously

Low-cost hardware
Constructed from commodity PCs

Fault tolerant
Tolerant to the failure of any single server or disk
711

System architecture
One controller
Connected to each server by a low-bandwidth network

Cubs: the server group
Each cub is attached to a number of disks (2-4)
Cubs are connected to clients by ATM

[Figure: the controller linked to cubs 0..n over the low-bandwidth network; the cubs' disks (numbered n+1 ... 2n+1) feed a high-bandwidth ATM switching network for video distribution to clients; start/stop requests come from clients.]
712

Storage organization
Striping
A movie is divided into blocks
The blocks of a movie are stored on disks attached to different cubs, in sequence of the disk number
Delivering a movie means delivering the blocks of the movie from the different disks in sequence
Load-balances the delivery of hotspot movies

Mirroring
Each block is divided into several portions (secondaries)
The secondaries are stored on the successor disks:
if a block is on disk i, then its secondaries are stored on disks i+1 to i+d
713

Distributed Schedule

Slot
The work to be done to play one block of a movie

Deliver a stream
Deliver the blocks of the stream disk by disk
Can be viewed as a slot moving along the disks step by step

Deliver multiple streams
Multiple slots moving along the disks step by step

Viewer state
Network address of the client
File ID of the current movie
Number of the next block
Viewer's next play slot

[Figure: a schedule of slots 0..7, some free and some holding viewer state (viewers 0-4), annotated with the block play time T and the block service time t.]
714

Distributed schedule (contd.)

Block play time T
The time that will be required for a viewer to display a block on the client computer
Typically about 1 second for all streams
The next block of a stream must begin to be delivered T after the current block began to be delivered

Block service time t (a slot)
Read the next block into a buffer
Deliver it to the client
Update the viewer state in the schedule and pass the updated slot to the next cub
T / t typically results in a value > 4

The maximum number of streams the Tiger system can support simultaneously:
T/t * the number of disks
715
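
A back-of-the-envelope check of the capacity formula; the numbers below are illustrative assumptions, not measurements of the actual system.

T = 1.0            # block play time, seconds (typical value from the slide)
t = 0.25           # assumed block service time per slot, so T/t = 4
disks = 14 * 4     # e.g. 14 cubs with 4 disks each
print(int(T / t) * disks)    # 4 * 56 = 224 simultaneous streams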

Performance and scalability

Initial prototype [1994]
5 cubs: 133 MHz Pentium PCs (48 MB RAM, 2 GB SCSI disk, Windows NT), ATM network
68 simultaneous streams with perfect quality
With one cub failed, the loss rate is 0.02%

14 cubs, each with 4 disks, ATM network [1997]
602 simultaneous streams (2 Mbps)
Loss rate < 0.01%; with one cub failed, loss rate < 0.04%

The designers suggested that Tiger could be scaled to 1000 cubs supporting 30,000 clients.
716

Question bank
Explain the quality of service management and resource management in multimedia applications.
Discuss the importance of Quality of Service negotiation and Admission Control in multimedia applications.
What are the characteristics of multimedia streams?
Explain the impacts of Scaling and Filtering on stream adaptation.
What is the purpose of traffic shaping? What are the various approaches to avoid bursting of streams?
Discuss the impact of distributed multimedia in the Tiger video file server.
717

Chapter 2: Google Case Study

Google is a US-based corporation with its headquarters in Mountain View, CA, offering Internet search and broader web applications and earning revenue largely from advertising associated with such services.
The name is a play on the word googol, the number 10^100 (or 1 followed by a hundred zeros), emphasizing the sheer scale of information on the Internet today.
Google was born out of a research project at Stanford, with the company launched in 1998.
718

Google Distributed System: Design Strategy

Google has diversified: as well as providing a search engine, it is now a major player in cloud computing.
88 billion queries a month by the end of 2010.
The user can expect query results in 0.2 seconds.
Good performance in terms of scalability, reliability, performance and openness.
We will examine the strategies and design decisions behind that success, and provide insight into the design of complex distributed systems.
719

Google Search Engine

Consists of a set of services:
Crawling: to locate and retrieve the contents of the web and pass the content on to the indexing subsystem. Performed by software called Googlebot.
Indexing: produce an index for the contents of the web that is similar to an index at the back of a book, but on a much larger scale. Indexing produces what is known as an inverted index, mapping words appearing in web pages and other textual web resources onto the positions where they occur in documents. In addition, an index of links is also maintained to keep track of links to a given site.
Ranking: relevance of the retrieved links. The ranking algorithm is called PageRank, inspired by citation counts for academic papers. A page will be viewed as important if it is linked to by a large number of other pages.
720
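
A toy power-iteration sketch of the PageRank idea: a page's rank grows with the ranks of the pages linking to it. Google's production ranking is far more elaborate; this only illustrates the principle.

def pagerank(links, iters=20, damping=0.85):
    # links: dict mapping each page to the list of pages it links to
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            for target in outs:                  # a link passes on rank
                new[target] += damping * rank[page] / len(outs)
        rank = new
    return rank

print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))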

Outline architecture of the original Google search engine [Brin and Page 1998]
721

Google as a cloud provider
Google is now a major player in cloud computing, which is defined as a set of Internet-based application, storage and computing services sufficient to support most users' needs, thus enabling them to largely or totally dispense with local data storage and application software.
Software as a service: offering application-level software over the Internet as web applications. A prime example is a set of web-based applications including Gmail, Google Docs, Google Talk and Google Calendar, which aims to replace traditional office suites (more examples in the following table).
Platform as a service: concerned with offering distributed system APIs and services across the Internet, with these APIs used to support the development and hosting of web applications. With the launch of Google App Engine, Google went beyond software as a service and now offers its distributed system infrastructure as a cloud service, allowing other organizations to run their own web applications on the Google platform.
722

Example Google applications

723

Google Physical Model
The key philosophy of Google in terms of physical infrastructure is to use very large numbers of commodity PCs to produce a cost-effective environment for distributed storage and computation.
Purchasing decisions are based on obtaining the best performance per dollar rather than absolute performance. Indeed, Brin and Page built the first Google search engine from spare hardware scavenged from around the lab at Stanford University.
A typical spend is $1k per PC unit, with 2 terabytes of disk storage and 16 gigabytes of memory, running a cut-down version of the Linux kernel.
The physical architecture of Google is constructed as follows:
724

Commodity PCs are organized in racks with between 40 and 80 PCs in a given rack. Each rack has an Ethernet switch.
30 or more racks are organized into a cluster, which is a key unit of management for the placement and replication of services. Each cluster has two switches connecting it to the outside world or to other data centers.
Clusters are housed in data centers that are spread around the world.

Physical model
Organization of the Google physical infrastructure
(To avoid clutter the Ethernet connections are shown from only one of the clusters to the external links)
725

Key Requirements

Scalability: (i) deal with more data, (ii) deal with more queries and (iii) seek better results.
Reliability: there is a need to provide 24/7 availability. Google offers a 99.9% service-level agreement to paying customers of Google Apps, covering Gmail, Google Calendar, Google Docs, Google Sites and Google Talk. The well-reported outage of Gmail on September 1st, 2009 (100 minutes, due to a cascading problem of overloaded servers) acts as a reminder of the challenges.
Performance: low latency of user interaction. Achieving the throughput to respond to all incoming requests while dealing with very large datasets over the network.
Openness: core services and applications should be open to allow innovation and new applications.
726

The overall Google systems architecture
727

Google infrastructure
728

Google Infrastructure

The underlying communication paradigms include services for both remote invocation and indirect communication:
Protocol buffers offer a common serialization format, including the serialization of requests and replies in remote invocation.
Publish-subscribe supports the efficient dissemination of events to large numbers of subscribers.
Data and coordination services provide unstructured and semi-structured abstractions for the storage of data, coupled with services to support access to the data:
GFS offers a distributed file system optimized for Google applications and services, such as large file storage.
Chubby supports coordination services and the ability to store small volumes of data.
Bigtable provides a distributed database offering access to semi-structured data.
Distributed computation services provide means for carrying out parallel and distributed computation over the physical infrastructure:
MapReduce supports distributed computation over potentially very large datasets, for example stored in Bigtable.
Sawzall provides a higher-level language for the execution of such distributed computations.
729

Protocol buffers example

730

Summary of design choices related to communication paradigms - part 1
731

Summary of design choices related to communication paradigms - part 2
732

Data Storage and Coordination Service

The GFS master maintains:
1. The namespace for files
2. Access control
3. The mapping of each file to a set of chunks; each chunk (64 megabytes) is replicated on three chunkservers.

NFS and AFS are general-purpose distributed file systems offering file and directory abstractions. GFS offers similar abstractions but is specialized for the storage of and access to very large quantities of data (not a huge number of files, but each file is massive, 100 megabytes to gigabytes), and for sequential reads and sequential writes (appends) as opposed to random reads and writes. It must also run reliably in the face of any failure condition.
733

Chubby API
Four distinct capabilities:
1. Distributed locks to synchronize distributed activities in a large-scale asynchronous environment.
2. A file system offering reliable storage of small files, complementing the service offered by GFS.
3. Support for the election of a primary in a set of replicas.
4. Use as a name service within Google.
This might appear to contradict the overall design principle of simplicity (doing one thing and doing it well). However, we will see that at its heart is one core service offering a solution to distributed consensus, and the other facets emerge from this core.
734
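
As a sketch of capability 3, a replica might elect a primary roughly as below. The open/try_acquire/set_contents calls are hypothetical shorthand, not the real Chubby client library.

def run_replica(chubby, my_address):
    # Every replica races for the same advisory lock; exactly one wins.
    lock = chubby.open("/ls/cell/service/leader")    # hypothetical path
    if lock.try_acquire():
        lock.set_contents(my_address)    # advertise the primary's address
        return "primary"
    return "replica, primary is " + lock.get_contents()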

Overall architecture of Chubby

735

Message exchanges in Paxos (in absence of failures) - step 1
736

Message exchanges in Paxos (in absence of failures) - step 2
737

Message exchanges in Paxos (in absence of failures) - step 3
738

The table abstraction in Bigtable

For example, a web-pages table uses rows to represent individual web pages, and columns to represent data and metadata associated with a given web page.
For example, Google Earth uses rows to represent geographical segments and columns to represent the different images available for a segment.

GFS offers storage of and access to large flat files, which are accessed relative to byte offsets within a file. It is efficient at storing large quantities of data and performing sequential read and write (append) operations. However, there is a strong need for a distributed storage system that provides access to data that is indexed in more sophisticated ways, related to its content and structure.
Google could have used an existing relational database with a full set of relational operators (union, selection, projection, intersection and join), but performance and scalability would be a problem. Instead Google uses Bigtable [2008], which retains the table model but with a much simpler interface.
A given table is a three-dimensional structure containing cells indexed by a row key, a column key and a timestamp (to save multiple versions).
739

Overall architecture of Bigtable

A Bigtable is broken up into tablets, with a given tablet being approximately 100 to 200 megabytes in size. Bigtable uses both GFS and Chubby for data storage and distributed coordination.
Three major components:
A library component on the client side
A master server
A potentially large number of tablet servers
740

The storage architecture in Bigtable

741

The hierarchical indexing scheme adopted by Bigtable

A Bigtable client seeking the location of a tablet starts the search by looking up a particular file in Chubby that is known to hold the location of a root tablet (containing the root index of the tree structure).
The root tablet contains metadata about other tablets, specifically about other metadata tablets, which in turn contain the locations of the actual data tablets.
742

Summary of design choices related to data storage and coordination
743

Distributed Computation Services
It is important to support high-performance distributed computation over the large datasets stored in GFS and Bigtable. The Google infrastructure supports distributed computation through the MapReduce service and also the higher-level Sawzall language.
Distributed computation is carried out by breaking the data up into smaller fragments and carrying out analyses (sorting, searching and constructing inverted indexes) of those fragments in parallel, making use of the physical architecture.
MapReduce [Dean and Ghemawat 2008] is a simple programming model to support the development of such applications, hiding underlying detail from the programmer, including details related to the parallelization of the computation, monitoring and recovery from failure, data management and load balancing onto the underlying physical infrastructure.
The key principle behind MapReduce is that many parallel computations share the same overall pattern:
Break the input data into a number of chunks
Carry out initial processing on these chunks of data to produce intermediary results (map function)
Combine the intermediary results to produce the final output (reduce function)
744

Distributed Computation Services: MapReduce

For example, searching the web for the phrase "distributed system book":
Assume the map function is supplied with a web page name and its contents as input; it searches linearly through the contents, emitting a key-value pair consisting of the phrase followed by the name of the web document containing the phrase.
The reduce function in this case is trivial, simply emitting the intermediary results ready to be collated together into a complete index.
The MapReduce implementation is responsible for breaking the data into chunks, creating multiple instances of the map and reduce functions, allocating and activating them on available machines in the physical infrastructure, monitoring the computations for any failures and implementing appropriate recovery strategies, dispatching intermediary results and ensuring optimal performance of the whole system.
745
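
A minimal single-machine sketch of the map/reduce pattern in this example: map emits (phrase, document) pairs and reduce collates them into an index. The real MapReduce runs these steps in parallel over chunks, partitioning intermediary keys by hash.

from collections import defaultdict

PHRASE = "distributed system book"

def map_fn(doc_name, contents):
    if PHRASE in contents:
        yield PHRASE, doc_name               # emit a key-value pair

def reduce_fn(key, values):
    return key, sorted(values)               # trivially collate the postings

docs = {"page1": "a distributed system book online",
        "page2": "unrelated content"}
groups = defaultdict(list)
for name, text in docs.items():
    for key, value in map_fn(name, text):
        groups[key].append(value)            # the shuffle/partition stage
print([reduce_fn(k, v) for k, v in groups.items()])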

Google reimplemented the main production indexing system in terms of MapReduce.

Examples of the use of MapReduce

746

The overall execution of a MapReduce program

The first stage is to split the input file into M pieces, with each piece being typically 16-64 megabytes in size (no bigger than a single chunk in GFS). The intermediary results are also partitioned, into R pieces; so there are M map tasks and R reduce tasks.
The library then starts a set of worker machines from the pool available in the cluster, with one being designated as the master and the others being used for executing map or reduce steps.
A worker that has been assigned a map task will first read the contents of the input file allocated to that map task, extract the key-value pairs and supply them as input to the map function. The output of the map function is a processed set of key-value pairs that are held in an intermediary buffer.
The intermediary buffers are periodically written to a file local to the map computation. At this stage the data are partitioned, resulting in R regions: usually a hash function is applied to the key and then taken modulo R to produce the R partitions.
When a worker is assigned to carry out a reduce function, it reads its corresponding partition from the local disks of the map workers using RPC.
747

The overall execution of a Sawzall program
748

Summary of design choices related to distributed computation
749

Question Bank
Discuss the overall Google architecture for distributed computing.
Discuss in detail the data storage and coordination services provided in the Google infrastructure.
What is the purpose of distributed computation services?
Explain how the Google infrastructure supports distributed computation.
Write short notes on:
(i) Chubby
(ii) Communication paradigms
750
