Session 04 - Paper 57

Proceedings of International Conference on Computing Sciences
WILKES100 ICCS 2013

ISBN: 978-93-5107-172-3
Efficient check-pointing techniques for distributed systems
Harjinder Kaur¹* and Rachit Garg²

¹ Lecturer, School of Computer Applications, Lovely Professional University, PB, India
² Assistant Professor, School of Computer Applications, Lovely Professional University, PB, India
Abstract
A checkpoint is the state of a process saved on stable storage, and checkpointing is a technique used to recover a
system to a fault-tolerant state. A global state is said to be consistent if no local checkpoint in it happened before
another, that is, if its checkpoints are pairwise concurrent. In checkpointing, processes take checkpoints that together
form a consistent global state. After a failure, the system restarts execution from the last consistent global state
saved on stable storage, so only the computation done after that checkpoint needs to be redone. In this paper, we
present some inconsistencies in existing checkpointing protocols for distributed systems and give guidelines for
reviewing such protocols [1, 4, 5-10].
2013 Elsevier Science. All rights reserved.
Keywords: Checkpointing, happened before, global state, Clock ordering
1. Introduction
A distributed system is a collection of autonomous, spatially distributed processes that exchange
information over communication channels. Each process in a distributed system executes a sequence of
events [1, 2, 3, 4]. The problem is deciding which of two events occurred first. To address it, we use the
partial ordering defined by the happened-before relationship. For two events it is defined as follows:
If event e1 occurs before event e2, the happened-before relationship is written as e1->e2.
Check-pointing in Distributed systems
A checkpoint is the saved state of a process. Checkpointing is difficult to implement in distributed
systems because there are multiple streams of execution at a time and there is no global clock [5].
In the absence of a global clock, it is difficult to start a checkpoint in all streams at the same instant.
To permit consistent rollback recovery, the checkpoints from the individual streams are selected so that
they are mutually concurrent. The following sections describe how to select one checkpoint per process so
that the selection forms a consistent global checkpoint, which allows global rollback recovery [4, 6].
Corresponding Author: Harjinder Kaur
415 Elsevier Publications, 2013
Ordering of events in distributed systems
In a distributed environment, processors exchange information, which creates dependencies among events on
different processors and makes a total ordering difficult to implement. Lamport proposed a solution to this
problem, the happened-before relation, which introduces a partial ordering of events in distributed systems
as an alternative to total ordering [5-8]. The following definitions describe the events and checkpoints in
a distributed environment:
Lamport's happened-before relation (Definition 1)
1. If a and b are two events in the same process and a occurs before b, then a->b.
2. If a is the event of sending a message and b is the event of receiving the same message in another
process, then a->b.
3. If a->b and b->c, then a->c.
Concurrent events (Definition 2)
Two events a and b are said to be concurrent iff neither a->b nor b->a holds.
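Definitions 1 and 2 can be checked mechanically on a small recorded run. The sketch below is only an illustration: the event names, the two-process run, and the function names are hypothetical, and the transitivity step follows Lamport's original definition of the relation.

```python
from itertools import product

def happened_before(process_events, messages):
    """Return the set of pairs (a, b) such that a -> b.

    process_events: dict mapping a process name to its events in local order.
    messages: list of (send_event, receive_event) pairs.
    """
    hb = set()
    # Rule 1: events of the same process are ordered by local occurrence.
    for events in process_events.values():
        for i, a in enumerate(events):
            for b in events[i + 1:]:
                hb.add((a, b))
    # Rule 2: sending a message happens before receiving it.
    hb.update(messages)
    # Transitivity: repeat until no new pair can be derived.
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(hb), repeat=2):
            if b == c and (a, d) not in hb:
                hb.add((a, d))
                changed = True
    return hb

def concurrent(a, b, hb):
    """Definition 2: neither event happened before the other."""
    return a != b and (a, b) not in hb and (b, a) not in hb

# Hypothetical run: P1 sends one message that P2 receives.
process_events = {
    "P1": ["p1_a", "p1_send"],
    "P2": ["p2_x", "p2_recv", "p2_y"],
}
messages = [("p1_send", "p2_recv")]
hb = happened_before(process_events, messages)

print(("p1_a", "p2_y") in hb)          # ordered through the message
print(concurrent("p1_a", "p2_x", hb))  # no chain connects these two events
```

The closure is computed by brute force, which is adequate for a handful of events but not for long traces.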
Local checkpoint (Definition 3)
A local checkpoint is an event that records the state of a process on a processor at a given instant.
Global checkpoint (Definition 4)
A global checkpoint is a collection of local checkpoints, one from each processor.
Consistent global checkpoint (Definition 5)
A consistent global checkpoint G_c is a global checkpoint in which each local checkpoint is concurrent
with every other local checkpoint.
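One way to test Definition 5 in practice is to stamp each local checkpoint with a vector timestamp and check that no pair is ordered. Vector timestamps are not part of this paper; they are assumed here only because they make the happened-before test a simple comparison. The function names and the sample timestamps are hypothetical.

```python
def vc_happened_before(u, v):
    """True if the event stamped u happened before the event stamped v.

    u and v are vector timestamps (tuples of equal length, one entry per processor).
    """
    return u != v and all(x <= y for x, y in zip(u, v))

def is_consistent(checkpoints):
    """checkpoints: one vector timestamp per processor.

    The global checkpoint is consistent if every pair of local checkpoints is concurrent.
    """
    for i, u in enumerate(checkpoints):
        for v in checkpoints[i + 1:]:
            if vc_happened_before(u, v) or vc_happened_before(v, u):
                return False
    return True

# Three processors that have not yet heard from each other: pairwise concurrent.
independent = [(2, 0, 0), (0, 1, 0), (0, 0, 3)]

# The second checkpoint already reflects the first processor's checkpoint,
# so the first checkpoint happened before the second.
dependent = [(3, 0, 0), (3, 2, 0), (0, 0, 1)]

print(is_consistent(independent))
print(is_consistent(dependent))
```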
The Partial ordering
The occurrence of events is usually described in terms of time. For example, if we say that an event
happened at 2:30, we consider it to have happened if it occurred after the clock read 2:30 and before it read
2:31. If the specification of an event is given in terms of physical clocks, the system must have real clocks
in order to observe events. A system is a collection of processes, each of which is in turn a sequence of
events. In a communication process, for example, sending and receiving a message are two different events.
We can relate these events with the happened-before relationship, written ->. If the sending of a message
occurred before its receipt, this is represented as sending->receiving [8-12].
Logical clocks
The difficulties of physical clocks are overcome by logical clocks. A logical clock is a way of assigning
a number to an event, where the number is taken as the time at which the event occurred. The clock Ci of
process Pi assigns a number Ci(e) to each event e in that process. Logical clocks are implemented with
counters [5, 7]. When there are multiple events, the numbers assigned to them must satisfy the following
Clock Condition:
Clock Condition: For any events e1 and e2, if e1->e2 then C(e1)<C(e2); that is, an event that happened
before another must carry the smaller clock value.
The converse does not hold: concurrent events may receive clock values in either order, so C(e1)<C(e2)
does not imply e1->e2. The Clock Condition is satisfied if the following two conditions hold:
1. If e1 and e2 are events in a process Pi and e1 comes before e2, then Ci(e1)<Ci(e2).
2. If e1 is the sending of a message by process Pi and e2 is the receipt of that message by process Pj,
then Ci(e1)<Cj(e2).
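The two conditions can be met with a per-process counter that is advanced on every event and pushed past the sender's timestamp on every receive. The following is a minimal sketch under that assumption; the class and method names are hypothetical and the two-process run is invented for illustration.

```python
class LamportClock:
    """Counter-based logical clock for a single process."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local or send event: advance the counter and return the timestamp."""
        self.time += 1
        return self.time

    def receive(self, sent_time):
        """Receive event: move past the sender's timestamp, then return the new time."""
        self.time = max(self.time, sent_time) + 1
        return self.time

ci, cj = LamportClock(), LamportClock()

e1 = ci.tick()       # local event in Pi
e2 = ci.tick()       # Pi sends a message stamped with e2
cj.tick()            # unrelated local event in Pj
e3 = cj.receive(e2)  # Pj receives the message

print(e1, e2, e3)
```

Condition 1 holds because `tick` always increases the counter within a process, and condition 2 holds because `receive` returns a value strictly greater than the timestamp carried by the message.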
Checkpoint algorithm assumptions for message passing system
A number of checkpointing algorithms are available for message-passing systems; here we discuss the one
proposed by Chandy and Lamport. According to Chandy and Lamport, a distributed system consists of a finite
set of processors and a finite number of channels that carry the communication between them. The algorithm
is based on the following assumptions:
1. The distributed system consists of a finite set of processors and a finite set of channels.
2. All communication between processors goes through the available communication channels.
3. All channels are fault free.
4. The global state of the system = local states of all the processors + states of the communication channels.
5. The state of a channel is the set of messages sent through the channel but not yet received by the
destination.
6. Infinite-capacity buffers are available.
7. Termination of the algorithm ensures fault-free communication.
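Assumptions 4 and 5 describe how a global state is assembled, which can be sketched as below. This is not the Chandy and Lamport marker algorithm itself, only the bookkeeping it relies on. The `Channel` class, the processor and channel names, and the first-in-first-out delivery order are assumptions made for the illustration; an unbounded Python list stands in for the infinite-capacity buffer.

```python
from dataclasses import dataclass, field

@dataclass
class Channel:
    """Fault-free, one-directional channel with an unbounded buffer."""
    sent: list = field(default_factory=list)  # every message put on the channel
    received: int = 0                         # number of messages already delivered

    def send(self, msg):
        self.sent.append(msg)

    def deliver(self):
        """Hand the oldest undelivered message to the destination (FIFO assumed)."""
        msg = self.sent[self.received]
        self.received += 1
        return msg

    def state(self):
        """Assumption 5: messages sent but not yet received."""
        return self.sent[self.received:]

def global_state(local_states, channels):
    """Assumption 4: local states of all processors plus the channel states."""
    return {
        "processors": dict(local_states),
        "channels": {name: ch.state() for name, ch in channels.items()},
    }

# Hypothetical run: P1 sends two messages to P2, and only the first has arrived.
c12, c21 = Channel(), Channel()
c12.send("m1")
c12.send("m2")
c12.deliver()

snapshot = global_state(
    {"P1": "sent 2", "P2": "received 1"},
    {"C12": c12, "C21": c21},
)
print(snapshot["channels"])
```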
Types of Check-pointing
Different types of checkpointing [5, 7, 8-10] are available in distributed systems. This section describes
each type:
Centralized vs. distributed checkpoints: In centralized checkpointing, a single node initiates the checkpoints
and coordinates with the other participating nodes. The problem with the centralized approach is that all
other participating nodes have to take a checkpoint once the central node decides on one. In distributed
checkpointing there is no single initiating node; any individual node can initiate a checkpoint
independently.
Complete vs. selective checkpointing/rollback: In complete checkpointing, every node has to participate in
every global checkpoint. In selective checkpointing, only groups of nodes that depend on each other
participate in the checkpoint. Complete rollback forces all the nodes in the system to roll back and restart
in order to maintain consistency, whereas in selective rollback only the group of dependent nodes needs to
roll back and the others can continue their operations.
Static vs. dynamic checkpointing: In static checkpointing, the locations of the checkpoints are identified
before program execution starts. Static checkpointing is best suited to uniprocessor systems. In dynamic
checkpointing, the locations of the checkpoints are identified during execution of the program by initiating
the checkpoint algorithm.
Periodic vs. non-periodic checkpoints: A periodic checkpoint algorithm forces the nodes to initiate checkpoints
at predetermined times, whereas a non-periodic algorithm does not. The cost of periodic algorithms lies in
constructing a consistent global state, which is not the case for non-periodic algorithms, in which there are
no concurrent checkpoints.
Example of Distributed System
To illustrate the definition of a distributed system, consider a system consisting of two processes, S1 and S2,
and two channels, D1 and D2.
Fig 1: The simple distributed system
To illustrate the single-token conservation system, consider a system that contains a single token which is
passed from one process to the other. Each process has two states, T1 and T2, where T1 is the state in which
the process does not hold the token and T2 is the state in which it does.
Fig 2: State transition diagram of a process
Fig 3: Global states and transitions of the single token conservation system
In the above example there is only one token, which is passed from one process to the other. Each process
has two events: 1) sending the token, which is the transition from T2 to T1, because T2 is the state that
holds the token; and 2) receiving the token, which is the transition from T1 to T2. The resulting global
states and transitions are shown in Fig 3.
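The global states of the token system can be enumerated with a few lines of code. The sketch below assumes a topology that the text does not spell out: D1 carries the token from S1 to S2 and D2 carries it back. The function names are hypothetical.

```python
# Assumed topology: S1 --D1--> S2 --D2--> S1
NEXT_LOCATION = {"S1": "D1", "D1": "S2", "S2": "D2", "D2": "S1"}

def process_state(process, token_at):
    """T2 if the process holds the token, otherwise T1."""
    return "T2" if token_at == process else "T1"

def global_state(token_at):
    """Local state of each process plus the contents of each channel."""
    return {
        "S1": process_state("S1", token_at),
        "S2": process_state("S2", token_at),
        "D1": ["token"] if token_at == "D1" else [],
        "D2": ["token"] if token_at == "D2" else [],
    }

def run(start="S1", steps=4):
    """Follow the token for a number of send/receive events and record each global state."""
    token_at = start
    trace = [global_state(token_at)]
    for _ in range(steps):
        token_at = NEXT_LOCATION[token_at]
        trace.append(global_state(token_at))
    return trace

def token_count(state):
    """Number of tokens visible in a global state (held by a process or in transit)."""
    held = [state["S1"], state["S2"]].count("T2")
    in_transit = len(state["D1"]) + len(state["D2"])
    return held + in_transit

trace = run()
for state in trace:
    print(state, token_count(state))
```

Every recorded global state contains exactly one token, either held by a process in state T2 or in transit on a channel, and after four events the system returns to its initial global state.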
Conclusion
We have pointed out some inconsistencies in checkpointing protocols for distributed systems, investigated
the underlying problems [1, 8, 9, 11-13], and presented some findings on how to avoid them. This work should
help in designing efficient checkpointing techniques that are consistent in nature [10].
References
[1] Garg, R., Checkpointing with Light Checkpoints for Mobile Distributed Computing System, International Conference on Advanced Computing and Communication Technologies, 16th November, 2013, InderScience Publishers, Geneva, Switzerland and Guangdong University of Technology, China (Accepted).
[2] Malhotra, N., Garg, R., Mahajan, R., Quantitative Detection of AODV against Black Hole and Worm Hole Attacks in MANET, International Journal of Computer Application, Volume 68, Number 11, January 2013. (Foundation of Computer Science, New York, USA).
[3] Thind, T., & Garg, R., "Mobile distributed system: concepts, issues, challenges", National Conference on Emerging Trends in Computer Science & Engineering (ETCSE-2012), 11th-12th May, 2012, Guru Kashi University, Talwandi Sabo, Punjab, India.
[4] Khunteta, A., Sharma, P., & Garg, R., New and Efficient Low Overheads Algorithm for Mobile Distributed Systems, ICWET11, ACM Digital Library, New York, USA, February 2011.
[5] Garg, R., & Kumar, P., A Review of Fault Tolerant Checkpointing Protocols for Mobile Computing System, International Journal of Computer Applications, Vol 3, No 2, June 2010. (Foundation of Computer Science, New York, USA).
[6] Kumar, P., & Garg, R., An Efficient Synchronous Checkpointing Protocol for Mobile Distributed Systems, Global Journal of Computer Science and Technology, Vol. 10, Issue 5, June/July 2010.
[7] Garg, R., & Kumar, P., A Review of Checkpointing Fault Tolerance Techniques in Distributed Mobile Systems, International Journal on Computer Science and Engineering, June/July 2010.
[8] Garg, R., & Kumar, P., A Non-blocking Coordinated Checkpointing Algorithm for Mobile Computing System, International Journal of Computer Science Issues, Vol 3, Issue 3, No 3, May 2010.
[9] Garg, R., & Kumar, P., Low Overhead Checkpointing Protocols for Mobile Distributed Systems: A Comparative Study, International Journal on Engineering Science and Technology, June/July 2010.
[10] Kumar, P., & Garg, R., Soft-Checkpointing Based Coordinated Checkpointing Protocol for Mobile Distributed System, International Journal of Computer Science Issues, Vol 3, Issue 3, No 5, May 2010.
[11] Garg, R., Sensor Networks: Opportunities and Challenges, Computer Society of India, CSI-Communications (Monthly Journal), Volume No. 30, Issue No. 6, September 2006, pp-50-54.
[12] Garg, R., Singh, M. and Singh, Baldev, 2006, Sensor Networks: Technology Trends, Proc. Second National Conference on Electronic Circuits and Communication Systems ECCS-2006, February 9-10, 2006, Thapar Institute of Engineering and Technology (Deemed University), Patiala, India, pp-433,437.
[13] Kuljeet Kaur, Technologies to Overcome from Intimidation of Wireless Network Security, International Journal of Applied Information Systems 2(1):25-29, May 2012. Published by Foundation of Computer Science, New York, USA.