
International Journal of Computer Science and Business Informatics (IJCSBI.ORG)
ISSN: 1694-2507 (Print) | ISSN: 1694-2108 (Online)
VOL 8, NO 1, DECEMBER 2013
Table of Contents VOL 8, NO 1 DECEMBER 2013

An Integrated Distributed Clustering Algorithm for Large Scale WSN
S. R. Boselin Prabhu, S. Sophia, S. Arthi and K. Vetriselvi

An Efficient Connection between Statistical Software and Database Management System
Sunghae Jun

Pragmatic Approach to Component Based Software Metrics Based on Static Methods
S. Sagayaraj and M. Poovizhi

SDI System with Scalable Filtering of XML Documents for Mobile Clients
Yi Yi Myint and Hninn Aye Thant

An Easy yet Effective Method for Detecting Spatial Domain LSB Steganography
Minati Mishra and Flt. Lt. Dr. M. C. Adhikary

Minimizing the Time of Detection of Large (Probably) Prime Numbers
Dragan Vidakovic, Dusko Parezanovic and Zoran Vucetic

Design of ATL Rules for Transforming UML 2 Sequence Diagrams into Petri Nets
Elkamel Merah, Nabil Messaoudi, Dalal Bardou and Allaoua Chaoui

An Integrated Distributed Clustering Algorithm for Large Scale WSN
S. R. BOSELIN PRABHU
Assistant Professor, Department of Electronics and Communication Engineering
SVS College of Engineering, Coimbatore, India.

S. SOPHIA
Professor, Department of Electronics and Communication Engineering
Sri Krishna College of Engineering and Technology, Coimbatore, India.

S. ARTHI & K. VETRISELVI


UG Students, Department of Electronics and Communication Engineering
SVS College of Engineering, Coimbatore, India.

Abstract
Recent advances in wireless communications and electronics have enabled the development of low-cost wireless sensor nodes. Clustering is an effective topology control approach that can prolong the lifetime and improve the scalability of wireless sensor networks. The widely accepted criteria for clustering are to select cluster heads with higher residual energy and to rotate them periodically. Sensors at heavy-traffic locations deplete their energy resources quickly and die much earlier, leaving behind energy holes and network partitions. In this paper, a model of a distributed layer-based clustering algorithm is proposed, based on three concepts. First, the aggregated data is forwarded from a cluster head to the base station through the cluster head of the next higher layer that lies at the shortest distance. Second, the cluster head is elected based on a clustering factor, which combines the residual energy and the number of neighbors of a particular node within a cluster. Third, each cluster has a crisis hindrance node, which takes over the function of the cluster head when the cluster head fails to carry out its work under certain critical conditions. The key aim of the proposed algorithm is to achieve energy efficiency and to prolong the network lifetime. The proposed distributed clustering algorithm is compared with the existing clustering algorithm LEACH.

Keywords: Wireless sensor network (WSN), distributed clustering algorithm, cluster head, residual energy, energy efficiency, network lifetime.

1. INTRODUCTION
A wireless sensor network (WSN) is a collection of a large number of small, low-power, low-cost electronic devices called sensor nodes. Each sensor node consists of four major blocks: a sensing unit, a processing unit, a power unit and a communication unit, which are responsible for sensing, processing and wireless communication (figure 1). These nodes gather the relevant data from the environment and then transfer the gathered data to the base station (BS). Since WSNs have many advantages, such as self-organization, freedom from fixed infrastructure, fault tolerance and locality, they have a wide variety of potential applications, including border security and surveillance, environmental monitoring and forecasting, wildlife protection, home automation, and disaster management and control. Considering that sensor nodes are usually deployed in remote locations, it is impractical to recharge their batteries. Therefore, how to use the limited energy resource wisely to extend the lifetime of the network is a very demanding research issue for these sensor networks.

Figure 1: Various components of a wireless sensor node

Clustering [2-7] is an effective topology control approach that can prolong the lifetime and improve the scalability of these sensor networks. The popular criterion for a clustering technique (figure 2) is to select a cluster head (CH) with higher residual energy and to rotate the role periodically. The basic idea of clustering algorithms is to use the data aggregation mechanism [8-11] at the cluster head to reduce the amount of data transmission.

Clustering also brings several advantages, such as network scalability, localized route setup, efficient use of communication bandwidth [17] and extended network lifetime [12-16]. Through the data aggregation process, unnecessary communication among sensor nodes, cluster heads and the base station is avoided. In this paper, a well-defined model of a distributed layer-based clustering algorithm is proposed based on three concepts: the aggregated data is forwarded from a cluster head to the base station through the cluster head of the next higher layer that lies at the shortest distance; the cluster head is elected based on a clustering factor; and a crisis hindrance node takes over the function of the cluster head when the cluster head fails to carry out its work. The prime aim of the proposed algorithm is to attain energy efficiency and increased network lifetime.

Figure 2: Cluster formation in a wireless sensor network

The rest of this paper is structured as follows. A literature review of existing distributed clustering algorithms, discussing their advantages and shortcomings, is presented in Section 2. An evaluation of the existing clustering algorithm LEACH (Low Energy Adaptive Clustering Hierarchy) and the basic concept behind it is given in Section 3. Section 4 sketches the proposed distributed layer-based clustering algorithm and the concepts behind it. Finally, the last section concludes the paper.


2. A REVIEW OF EXISTING CLUSTERING ALGORITHMS


Bandyopadhyay and Coyle proposed EEHC [18], a randomized clustering algorithm that organizes the sensor nodes into a hierarchy of clusters with the objective of minimizing the total energy spent in the system to communicate the information gathered by the sensors to the information processing center. It supports a variable cluster count; the immobile cluster heads aggregate and relay the data to the BS, and the scheme is applicable to very large-scale networks. The main drawback of this algorithm is that some nodes remain unclustered throughout the clustering process.

Barker, Ephremides and Flynn proposed LCA [19], which was chiefly developed to avoid communication collisions among the nodes by using TDMA time slots. It uses a single-hop scheme, thereby attaining a high degree of connectivity when the CH is selected randomly. A restructured version of LCA, LCA2, was implemented to reduce the number of cluster heads compared with the original LCA algorithm. The key drawback of this algorithm is that single-hop clustering leads to the creation of a larger number of clusters.

Nagpal and Coore proposed CLUBS [20], which forms overlapping clusters with a maximum cluster diameter of two hops. The clusters are created by local broadcasting, and convergence depends on the local density of the wireless sensor nodes. The algorithm can be implemented in an asynchronous environment without losing efficiency. The main difficulty is cluster overlap: when two clusters have their CHs within one-hop range of each other, both clusters collapse and the CH election process is restarted.

Demirbas, Arora and Mittal brought out FLOC [21], which exploits the double-band nature of the wireless radio model for communication. Nodes can communicate reliably with nodes in the inner band and unreliably with nodes in the outer band. The chief disadvantage of the algorithm is that communication between nodes in the outer band is unreliable, so those messages have a high probability of being lost.
Ye, Li, Chen and Wu proposed EECS [22], which is based on the assumption that all CHs can communicate directly with the BS. The clusters have variable sizes: those closer to the BS are larger and those farther from the BS are smaller. It is energy efficient in intra-cluster communication and shows an excellent improvement in network lifetime.

EEUC is proposed for uniform energy consumption within the sensor network. It forms unequal clusters, on the assumption that each cluster can have a variable size. Probabilistic selection of the CH is the focal shortcoming of this algorithm: a few nodes may be left without being part of any cluster.
Yu, Li and Levy proposed DECA, which selects the CH based on residual energy, connectivity and a node identifier. It is highly energy efficient, as it uses fewer messages for CH selection. The main trouble with this algorithm is the high risk of wrong CH selection, which leads to the discarding of every packet sent by the wireless sensor node.

Ding, Holliday and Celik proposed DWEHC, which elects the CH on the basis of a weight that combines a node's residual energy and its distance to its neighboring nodes. It produces well-balanced clusters, independent of the network topology. The node possessing the largest weight in a cluster is designated as the CH. The algorithm constructs multilevel clusters, and the nodes in every cluster reach the CH by relaying through other intermediate nodes. The foremost problem is the considerable energy consumed by the several iterations needed before the nodes settle into the most energy-efficient topology.

HEED is a fully distributed clustering algorithm in which CH selection is done by taking into account the residual energy of the nodes and the intra-cluster communication cost, leading to a prolonged network lifetime. It can have a variable cluster count and supports heterogeneous sensors. The problems with HEED are that its application is narrowed to static networks, and that it employs complex methods and multiple clustering messages per node for CH selection, even though it prevents random selection of the CH.

3. AN EVALUATION OF LEACH ALGORITHM


LEACH [1] is one of the most popular clustering mechanisms for WSNs and is considered the representative energy-efficient protocol. In this protocol, sensor nodes are grouped together to form clusters. In each cluster, one sensor node is chosen arbitrarily to act as the cluster head (CH), which collects data from its member nodes, aggregates them and then forwards them to the base station. The operation is divided into many rounds, and each round consists of two phases: the set-up phase and the steady phase. During the set-up phase, initial clusters are formed and cluster heads are selected. All the wireless sensor nodes produce a random number between 0 and 1; if the number is less than the threshold, the node selects itself as the cluster head for the present round. The threshold for cluster head selection in LEACH for a particular round is given in equation (1).

Upon selecting itself as a CH, the sensor node broadcasts an advertisement message containing its own ID. Each non-cluster-head node then decides which cluster to join based on the strength of the received advertisement signal. After the decision is made, every non-cluster-head node transmits a join-request message to the chosen cluster head to indicate that it will be a member of that cluster. After receiving all the join-request messages, the cluster head creates and broadcasts a time division multiple access (TDMA) schedule so that data can be exchanged with the non-cluster-head nodes without collision.

T(n) = p / (1 - p × (r mod (1/p)))   if n ∈ G
T(n) = 0                             otherwise        (1)

where p is the preferred percentage of cluster heads, r is the current round number and G is the set of nodes which have not been chosen as cluster head in the last 1/p rounds.
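As a brief worked illustration of equation (1) (the numbers are chosen here only for exposition and do not appear in the paper), suppose p = 0.05 and the current round is r = 3. A node that is still in G then has threshold T(n) = 0.05 / (1 - 0.05 × (3 mod 20)) = 0.05 / 0.85 ≈ 0.059, so it elects itself cluster head in this round with a probability of roughly 6 percent, while a node not in G has a threshold of 0 and cannot become cluster head again until the epoch of 1/p = 20 rounds is over.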

The steady phase commences after the clusters are formed and the TDMA schedules are broadcast. Each sensor node transmits its data to the cluster head once per round during its allotted transmission slot in the TDMA schedule; at other times it turns off its radio in order to reduce energy consumption. The cluster heads, however, must stay awake all the time so that they can receive all the data from the nodes within their own clusters. On receiving the data from the cluster, the cluster head performs data aggregation and forwards the result to the base station directly. This is the entire mechanism of the steady-state phase. After a certain predefined time, the network steps into the next round. LEACH is the basic cluster-based protocol, and it can prolong the network lifetime in comparison with multi-hop and static routing. However, there are still some hidden problems that should be considered.

LEACH does not take into account the residual energy when electing cluster heads and constructing the clusters. As a result, nodes with less energy may be elected as cluster heads and then die much earlier. Moreover, since a node selects itself as a cluster head only according to the value of the calculated probability, it is hard to guarantee the number of cluster heads and their distribution. Also, because the cluster heads are selected randomly in LEACH, the weaker nodes drain easily. To overcome

these shortcomings in LEACH, a model of a distributed layer-based clustering algorithm is proposed, in which clusters are arranged into hierarchical layers. Instead of sending the aggregated data directly to the base station, each cluster head sends it to the nearest cluster head in the next higher layer. Those cluster heads send their own data, along with the data received from lower-layer cluster heads, to the nearest cluster heads of the next layer. This cumulative process is repeated until the data from all the layers reach the base station. The proposed model incorporates several careful design choices, focusing on reduced energy utilization and improved network lifetime of the sensor network.
4. THE PROPOSED CLUSTERING ALGORITHM
The proposed clustering algorithm is fully distributed; the sensor nodes are deployed randomly to sense the target environment. The nodes are divided into clusters, with each cluster having a CH. The nodes send their information during their TDMA time slots to their respective CH, which fuses the data through data aggregation to avoid redundant information. The aggregated data is forwarded to the BS. Compared with the existing algorithms, the proposed algorithm has three distinguishing features. First, the aggregated data is forwarded from a cluster head to the base station through the cluster head of the next higher layer that lies at the shortest distance. Second, the cluster head is elected based on the clustering factor, which combines the residual energy and the number of neighbors of a particular node within a cluster. Third, each cluster has a crisis hindrance node, which takes over the function of the cluster head when the cluster head fails to carry out its work under certain conditions.

Figure 3: Aggregated data forwarding in the proposed algorithm

A. Aggregated Data Forwarding
In a network of N nodes, each node is assigned a unique Node Identity (NID). The NID serves only as an identifier of the node and has no relationship to location or clustering. Within a cluster, the CH is placed at the center and the nodes are organized into several layers around the CH. The clusters themselves are arranged into hierarchical layers, and a layer number is assigned to each cluster. The cluster that is farthest from the base station is designated as the lowest layer, and the cluster nearest to the base station is designated as the highest layer. The main characteristic of the proposed algorithm is that the lowest-layer cluster head forwards only its own aggregated data to the next-layer cluster head, whereas the highest layer forwards all the aggregated data from the preceding cluster heads to the base station (figure 3). Thus a lower workload is assigned to the lower layers, while the higher layers carry a greater workload. The workload assigned to a particular cluster head is directly proportional to its energy utilization. In order to balance the energy utilization among the cluster heads, the concept of variable transmission power is employed, where the transmission power decreases as the layer number increases. In LEACH, each cluster head forwards the aggregated data to the base station directly, which uses much energy. The proposed algorithm uses multi-hop forwarding from the cluster heads to the base station, resulting in reduced energy utilization.
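A minimal sketch of this layer-based forwarding rule is given below in R. The data layout (a data frame of cluster heads holding a layer number and x/y coordinates) and the function name are assumptions made purely for illustration; they are not part of the paper.

# Hedged sketch: choose the next-hop cluster head in the layer one step
# higher (nearer the base station) at the shortest distance. NULL means
# the sender is already in the highest layer and forwards to the BS itself.
next_hop <- function(ch, cluster_heads) {
  candidates <- cluster_heads[cluster_heads$layer == ch$layer + 1, ]
  if (nrow(candidates) == 0) return(NULL)
  d <- sqrt((candidates$x - ch$x)^2 + (candidates$y - ch$y)^2)
  candidates[which.min(d), ]   # nearest cluster head in the next higher layer
}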


Figure 4: Mechanism of cluster head selection in the proposed algorithm


B. Cluster Head Selection
The cluster head is elected based on the clustering factor (figure 4), which combines the residual energy and the number of neighbors of a particular node within a cluster. Residual energy is defined as the energy remaining within a particular node after some number of rounds; it is generally regarded as one of the main parameters for CH selection in the proposed algorithm. A neighboring node is a node that remains within one-hop distance of a particular node. Unlike LEACH, which elects cluster heads without regard to residual energy, the proposed algorithm uses residual energy together with an additional parameter, the neighbor count, in order to elect the cluster head properly and thereby reduce the node death rate. Another characteristic of the proposed algorithm, compared with LEACH, is that the base station is not involved in the clustering process, either directly or indirectly. The node with the highest clustering factor is selected as cluster head for the current round.

This is particularly significant in mobile environments: when the sensor nodes move, the number of neighbors varies, which should be taken into account but is not considered in the LEACH clustering mechanism.
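A minimal sketch of this election step is given below in R. The paper states that the clustering factor combines residual energy and neighbor count but does not give the exact combination, so a simple weighted sum of normalized values is assumed here purely for illustration; the column names are likewise assumptions.

# Hedged sketch: compute an assumed clustering factor per node and elect
# the node with the highest factor as cluster head for the current round.
clustering_factor <- function(residual_energy, neighbours, w = 0.5) {
  w * (residual_energy / max(residual_energy)) +
    (1 - w) * (neighbours / max(neighbours))
}

elect_cluster_head <- function(nodes) {
  cf <- clustering_factor(nodes$residual_energy, nodes$neighbours)
  nodes$node_id[which.max(cf)]   # NID of the elected cluster head
}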
C. Alternate Crisis Hindrance Node
In a cluster with a large number of nodes, a cluster crisis does not affect the overall performance of the wireless sensor system. But in a network with a small number of nodes, a cluster crisis greatly affects the system. Care should therefore be taken during the cluster head selection process by applying an alternate recovery mechanism. In addition to the regular cluster head, one additional node in the cluster is assigned the task of secondary cluster head; this node is called the crisis hindrance node. Generally the cluster collapses when the cluster head fails. In such situations, the crisis hindrance node acts as cluster head and recovers the cluster. A key characteristic of the proposed algorithm is that the crisis hindrance node solely performs the recovery function and is not involved in the sensing process. In LEACH, even though the cluster heads are switched periodically, the distribution of the CH role and its load across the nodes of the network is not uniform. Hence there is a high probability of a cluster collapsing, which is avoided in the proposed algorithm with the help of the crisis hindrance node.

5. CONCLUSION AND FUTURE WORK
This paper gives a brief introduction to the clustering process in wireless sensor networks. A study of the well-evaluated distributed clustering algorithm Low Energy Adaptive Clustering Hierarchy (LEACH) is presented. To overcome the drawbacks of the existing LEACH algorithm, a model of a distributed layer-based clustering algorithm is proposed for clustering the wireless sensor nodes. In the proposed distributed clustering algorithm, the aggregated data is forwarded from a cluster head to the base station through the cluster head of the next higher layer that lies at the shortest distance. The cluster head is selected based on the clustering factor, which combines the residual energy and the number of neighbors of a particular node within a cluster. In addition, each cluster has a crisis hindrance node. In future work, the algorithm will be simulated using a network simulator and the simulation results will be compared with two or three existing distributed clustering algorithms.

6. ACKNOWLEDGMENTS
The authors express their sincere gratitude to the management of SVS Educational Institutions and to the research supervisor, Dr. S. Sophia, who served as a guiding light for this research work.


REFERENCES
[1] W.B.Heinzelman, A.P.Chandrakasan, H.Balakrishnan, (2002), An application specific
protocol architecture for wireless microsensor networks, IEEE Transactions on Wireless
Communication Volume 1, Number 4, Pages 660-670.
[2] O.Younis, S.Fahmy, (2004), HEED: A hybrid energy-efficient distributed clustering
approach for adhoc sensor networks, IEEE Transactions on Mobile Computing, Volume 3,
Number 4, Pages 366-379.
[3] S.Zairi, B.Zouari, E.Niel, E.Dumitrescu, (2012), Nodes self-scheduling approach for
maximizing wireless sensor network lifetime based on remaining energy IET Wireless
Sensor Systems, Volume 2, Number 1, Pages 52-62.
[4] I.Akyildiz, W.Su, Y.Sankarasubramaniam, E.Cayirci, (2002), A Survey on sensor
networks, IEEE Communications Magazine, Pages 102-114.
[5] G.J.Pottie, W.J.Kaiser, (2000), Embedding the internet: wireless integrated network
sensors, Communications of the ACM, Volume 43, Number 5, Pages 51-58.
[6] J.H.Chang, L.Tassiulas, (2004), Maximum lifetime routing in wireless sensor
networks, IEEE/ACM Transactions on Networking, Volume 12, Number 4, Pages 609-
619.
[7] S.R.Boselin Prabhu, S.Sophia, (2011), A survey of adaptive distributed clustering
algorithms for wireless sensor networks, International Journal of Computer Science and
Engineering Survey, Volume 2, Number 4, Pages 165-176.
[8] S.R.Boselin Prabhu, S.Sophia, (2012), A Research on decentralized clustering
algorithms for dense wireless sensor networks, International Journal of Computer
Applications , Volume 57, Number 20, Pages 0975-0987.
[9] S.R.Boselin Prabhu, S.Sophia, (2013), Mobility assisted dynamic routing for mobile
wireless sensor networks, International Journal of Advanced Information Technology ,
Volume 3, Number 1, Pages 09-19.
[10] S.R.Boselin Prabhu, S.Sophia, (2013), A review of energy efficient clustering
algorithm for connecting wireless sensor network fields, International Journal of
Engineering Research & Technology, Volume 1, Number 4, Pages 477-481.
[11] S.R.Boselin Prabhu, S.Sophia, (2013), Capacity based clustering model for dense
wireless sensor networks, International Journal of Computer Science and Business
Informatics, Volume 5, Number 1.
[12] J.Deng, Y.S.Han, W.B.Heinzelman, P.K.Varshney, (2005), Balanced-energy sleep
scheduling scheme for high density cluster-based sensor networks, Elsevier Computer
Communications Journal, Special Issue on ASWN04, Pages 1631-1642.
[13] C.Y.Wen, W.A.Sethares, (2005), Automatic decentralized clustering for wireless
sensor networks, EURASIP Journal of Wireless Communication Networks, Volume 5,
Number 5, Pages 686-697.
[14] S.D.Murugananthan, D.C.F.Ma, R.I.Bhasin, A.O.Fapojuwo, (2005) A centralized
energy-efficient routing protocol for wireless sensor networks, IEEE Transactions on
Communication Magazine, Volume 43, Number 3, Pages S8-13.
[15] F.Bajaber, I.Awan, (2009), Centralized dynamic clustering for wireless sensor
networks, Proceedings of the International Conference on Advanced Information
Networking and Applications.

[16] Pedro A. Forero, Alfonso Cano, Georgios B.Giannakis, (2011), Distributed clustering
using wireless sensor networks, IEEE Journal of Selected Topics in Signal Processing,
Volume 5, Pages 707-724.
[17] Lianshan Yan, Wei Pan, Bin Luo, Xiaoyin Li, Jiangtao Liu, (2011), Modified energy-
efficient protocol for wireless sensor networks in the presence of distributed optical fiber
sensor link, IEEE Sensors Journal, Volume 11, Number 9, Pages 1815-1819.
[18] S.Bandyopadhay, E.Coyle, (2003), An energy-efficient hierarchical clustering
algorithm for wireless sensor networks, Proceedings of the 22nd Annual Joint Conference
of the IEEE Computer and Communications Societies (INFOCOM 2003), San Francisco,
California.
[19] D.J.Barker, A.Ephremides, J.A.Flynn, (1984), The design and simulation of a mobile
radio network with distributed control, IEEE Journal on Selected Areas in
Communications, Pages 226-237.
[20] R.Nagpal, D.Coore, (2002), An algorithm for group formation in an amorphous
computer, Proceedings of IEEE Military Communications Conference (MILCOM 2002),
Anaheim, CA.
[21] M.Demirbas, A.Arora, V.Mittal, (2004), FLOC: A fast local clustering service for
wireless sensor networks, Proceedings of Workshop on Dependability Issues in Wireless
Ad Hoc Networks and Sensor Networks (DIWANS04), Italy.
[22] M.Ye, C.F.Li, G.H.Chen, J.Wu, (2005), EECS: An energy efficient clustering scheme
in wireless sensor networks, Proceedings of the Second IEEE International Performance
Computing and Communications Conference (IPCCC), Pages 535-540.


An Efficient Connection between Statistical Software and Database Management System
Sunghae Jun
Department of Statistics, Cheongju University
Chungbuk 360-764 Korea

ABSTRACT
In the big data era, we need to manipulate and analyze big data. As a first step of big data manipulation, we can consider a traditional database management system. To discover novel knowledge from a big data environment, we should analyze the big data. Many statistical methods have been applied to big data analysis, and most statistical analysis work depends on statistical software such as SAS, SPSS, or R. In addition, a considerable portion of big data is stored in diverse database systems. However, the data formats of general statistical software differ from those of database systems such as Oracle or MySQL, so many approaches for connecting statistical software to a database management system (DBMS) have been introduced. In this paper, we study an efficient connection between statistical software and a DBMS. To demonstrate its performance, we carry out a case study using a real application.
Keywords
Statistical software, Database management system, Big data analysis, Database connection,
MySQL, R project.
1. INTRODUCTION
Every day, huge amounts of data are created in diverse fields and stored in computer systems. These big data are extremely large and complex [1], so it is very difficult to manage and analyze them. Nevertheless, big data analysis is an important issue in many fields such as marketing, finance, technology and medicine. Big data analysis is based on statistics and machine learning algorithms. In addition, data analysis depends on statistical software, while the data are stored in database systems. So, for big data analysis, we should manage statistical software and a database system effectively. In this paper, we consider the R project system as the statistical software. R is an environment for statistical computing, including statistical analysis and graphical display of data [2], and it provides most of the statistical and machine learning methods needed for big data analysis. We use MySQL as the database system connected from R. MySQL is a database management system (DBMS) product that is the most popular open source database in the world; in addition, it is free software, like the R system [3]. So, in our research, we use R and MySQL for an efficient connection between statistical software and a DBMS. There is prior work on DB access through R [4]. That work covered
the DB access problems of R and showed the ODBC (open database connectivity) drivers for connecting R to DBMSs such as MySQL, PostgreSQL and Oracle. The authors of that paper also introduced the installation and technological environment for DB access, but they did not illustrate detailed approaches for real applications; that is, their work was a conceptual suggestion for the access of R to MySQL. So, in this paper, we perform a more specific study of the connection between the statistical software R and the DBMS MySQL. In our case study, we show a detailed and efficient connection of R to MySQL using a specific data set from the University of California, Irvine (UCI) machine learning repository [5]. We cover the research background in the next section. Section 3 presents the proposed statistical database system, and section 4 illustrates the efficient connection between the statistical software and the DBMS with a case study. Lastly, we conclude our study and offer future work on statistical database systems.
2. RESEARCH BACKGROUND
2.1 Statistical Software
To analyze data, we can consider diverse approaches using statistical software, and these days there are many statistical software products. SAS (statistical analysis system) is the most popular software for statistical analysis [6], but it is expensive, so few companies other than large ones use it. SPSS (statistical package for the social sciences) is another representative product [7], but it is also expensive. Minitab [8] and S-Plus [9] are widely used statistics packages, and none of them is free. Recently, R has been used in many works for statistical data analysis, and it is free. In addition, R provides most of the statistical functions included in SAS or SPSS. R is an open source program, so we can modify R functions for our own statistical computing; this is a very useful advantage of R. Therefore, we consider R for the connection to the database system in this research.
2.2 Database Management System
A database is a collection of data, and a database management system (DBMS) is software for managing a database using the structured query language (SQL) [10],[11]. Oracle is one of the popular DBMS products [12], but it is expensive. MySQL is another DBMS and is widely used open source software [3]; most functions of MySQL are similar to those of Oracle [3]. So, in this paper, we use MySQL as the DBMS connected to the statistical software R. To use the MySQL DBMS efficiently, we use the RODBC package available from CRAN [13].

3. STATISTICAL DATABASE SYSTEM
The main goal of our study is to solve the cost problem of constructing a statistical database system, namely that we usually have to buy an additional product to connect statistical software to a DBMS. For example, for the connection

between SAS and a DBMS, we need the SAS/Access product as supplementary software, which is in general expensive. So, we tried to make the connection between statistical software and a DBMS without cost; the efficiency considered in this paper is therefore about cost. There are many approaches to connecting statistical software and a DBMS, but to use most of them we have to buy additional products, and there are few free approaches. So, we looked for an approach that connects statistical software and a DBMS without cost. In this paper, we study an efficient connection between a DBMS and statistical software. We select MySQL as the DBMS and R as the statistical software, because they are not only free but also provide good functionality. In addition, R and MySQL have strong performance in statistical computing and database management respectively, which makes them suitable for constructing a statistical database system [14],[15],[16],[17]. In general, big data are transformed into a structured data type for statistical analysis as follows:

Figure 1. From big data to statistical analysis


First, big data are stored in the DB by creating tables. Second, big data are changed into structured data by preprocessing based on text mining. The data obtained from the DB and from text mining are then analyzed statistically. We find that the text mining process is hard work for data preprocessing [18], so table creation is the more effective approach for big data analysis. To construct the MySQL DB, we use the console or graphical user interface (GUI) environments as follows:

Figure 2. User interface of MySQL

In this paper, we use SQL code in the MySQL console. We also use RODBC as an ODBC database interface between R and MySQL [13]. In the R system, a package is a set of additional R functions. R packages are not installed in the basic R system; if we need to use a package, we have to add it to the R system. We can search for all packages on CRAN and install them from there [19]. The RODBC package provides efficient functions for ODBC database access, so our research is based on the RODBC package to connect R to MySQL. To install RODBC in the R system, we should select an R CRAN mirror site. After the RODBC installation, we load the package in the R system as follows:
>library(RODBC)
The R system uses the library function to load a package. With this R code, we can use all the functions provided by the RODBC package, such as odbcConnect, sqlFetch, and sqlQuery; they are used in our research for DB access and connection. To connect to the MySQL DB, we use the odbcConnect function of the RODBC package as follows:
>db_con =odbcConnect("stat_MySQL")
User = , Password = , Database =
The DSN is stat_MySQL, and the db_con object in the R system holds the connection result. In this connecting process, we also supply the user name, the password and the database. If R and MySQL are connected to each other, we can show the tables of the MySQL DB using the sqlTables function as follows:
>sqlTables(db_con)
TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE
REMARKS
The result of this function is information about the connected DB and its tables.
3.1 Structure of DB Connection Software
In general, for connecting DBMS to application software, we should use
ODBC connector [20]. R as a statistical software is also needed to ODBC
driver to access MySQL DBMS. In this paper, we consider RODBC
package for efficient connection between R and MySQL. Figure 3 shows
the ODBC connection between DBMS and statistical software, and their
specific products.


Figure 3. Connection between DBMS and statistical software


Oracle and MySQL are representative DBMS products, and SAS and the R system are popular software for statistical analysis. A general ODBC program is used for connecting application software to a DBMS, so there are many ODBC drivers for diverse DBMS and application products. Our work is focused on the connection between R and MySQL, and we select RODBC as the ODBC driver. RODBC is one of several R packages for DB access; RMySQL is another R package for connecting R and MySQL [21], and it is also an R interface to the MySQL DBMS. In addition to RODBC and RMySQL, there are other packages for connecting R to MySQL. In this paper, we use RODBC for MySQL access. It plays the same role as the ODBC drivers used by SAS to connect to a DBMS, as shown below.

Figure 4. Connection between MySQL/Oracle and SAS


SAS uses ODBC drivers for diverse DBMSs such as MySQL and Oracle, and the drivers use a data source name (DSN). In this research, we also use a DSN for the RODBC package. Next, we show the connection between R and MySQL in more detail.
3.2 Efficient Connection between R and MySQL
The RODBC package of the R system is an efficient ODBC connector. It includes diverse functions to access a DBMS, as follows:
odbcConnect: function for opening a connection to an ODBC data source
sqlFetch: function for fetching tables from the DB

sqlQuery: function for submitting an SQL query
sqlSave: function for writing a data frame to a table in the DB
We can also use further functions for accessing and manipulating the MySQL DB through the RODBC package. The process of connection between R and MySQL is as follows:

Figure 5. Connecting process between R and MySQL


Using the RODBC package, the R system gets the necessary data from the MySQL DB, and we analyze the retrieved data. The R system also accesses MySQL through the sqlQuery function of RODBC and can create a table for storing the analysis results. Our process of connection between R and MySQL is shown as follows:

Figure 6. Connecting process between R and MySQL


Through the RODBC connector, a table of the MySQL DB is transformed into an object in R, so we are able to analyze the object data from the DB table. We also perform online analytical processing (OLAP) for data summarization and visualization.
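A minimal end-to-end sketch of the process in Figure 6 is shown below; the DSN, the table name and the derived result are assumptions made purely for illustration and are not part of the case study that follows.

# Hedged sketch: pull a table from MySQL, derive a result in R, and write
# that result back to MySQL through the same ODBC connection.
library(RODBC)
con <- odbcConnect("stat_MySQL")                  # DSN name assumed for illustration
tab <- sqlQuery(con, "SELECT * FROM some_table")  # fetch a table as an R data frame
res <- data.frame(n_rows = nrow(tab))             # a trivial derived result
sqlSave(con, res, tablename = "analysis_result", rownames = FALSE)
odbcClose(con)                                    # release the ODBC connection

Next, we carry out a case study for verifying our work.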

4. CASE STUDY
To illustrate a case study on a real problem, we used the RODBC package from the R project [13]; this is the software for the ODBC database connection between R and a DBMS such as MySQL. We carried out the experiment using an example data set from the UCI machine learning repository [5].
4.1 UCI Machine Learning Repository
For our case study, we used the Abalone data set from the UCI machine learning repository [5]. The data set consists of eight input variables, one target variable and 4,177 observations (rows). The main goal of the data is to predict the age of an abalone from its physical measurements. The following table shows the variables and their values [5].
Table 1. Variables of the Abalone data set

Variable         Data type    Description
Sex              Nominal      M (male), F (female), I (infant)
Length           Continuous   longest shell measurement
Diameter         Continuous   perpendicular to length
Height           Continuous   with meat in shell
Whole_weight     Continuous   whole abalone
Shucked_weight   Continuous   weight of meat
Viscera_weight   Continuous   gut weight (after bleeding)
Shell_weight     Continuous   after being dried
Rings            Discrete     +1.5 gives the age in years

The last variable (rings) is the target variable, and the others are all input variables. We constructed the MySQL DB using this data set. The original data file from the UCI machine learning repository is comma-separated, but MySQL expects a tab-separated file for the default DB loading, so we transformed the file using Excel as follows.


Figure 7. Data transformation for MySQL loading


To load the text data file into MySQL, we first make a table to hold the data, which we create in the next step.
4.2 DB Creation
We used SQL to create a table for loading the Abalone data set into the MySQL DBMS as follows:
CREATE DATABASE case_study;
USE case_study;
CREATE TABLE abalone(
  Sex CHAR(3), Length FLOAT(10), Diameter FLOAT(10), Height FLOAT(10),
  Whole_weight FLOAT(10), Shucked_weight FLOAT(10), Viscera_weight FLOAT(10),
  Shell_weight FLOAT(10), Rings INT(5));
LOAD DATA INFILE 'd:/data/abalone.txt' INTO TABLE abalone;
SELECT * FROM abalone;
Using the above SQL code, we constructed a table of the Abalone data in the MySQL database case_study. Next, we connected the abalone table in the case_study DB to the R system.
4.3 Connecting R to MySQL
We used the RODBC package for connecting R to MySQL as follows:

>library(RODBC)
>abalone_con=odbcConnect("abalone_ODBC")
>sqlTables(abalone_con)
TABLE_SCHEM TABLE_NAME TABLE_TYPE
case_study abalone TABLE
>vars=sqlQuery(abalone_con, "SELECT sex, diameter, rings FROM
abalone")
Sex Diameter Rings
1 M 0.365 15
2 M 0.265 7
3 F 0.420 9
4 M 0.365 10
5 I 0.255 7

Using the above R code, we saved three variables of the abalone data set to the vars R object. From the result of the SQL query executed through the sqlQuery function, we confirmed that the abalone table had been created correctly; this function enables the use of SQL in the R system. We then analyzed the abalone data using the analytical functions of the R system.
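The whole table can also be pulled into R in a single step with the sqlFetch function listed in section 3.2; a minimal sketch using the connection opened above (the object name abalone_all is chosen here only for illustration) is:
>abalone_all=sqlFetch(abalone_con, "abalone")
Next, the results of the data analysis are shown.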
4.4 Data Analysis
First, we performed data summarization of the three variables using the summary function of the R system as follows:
>summary(vars)
sex diameter rings
F:1307 Min. :0.0550 Min. : 1.000
I:1342 1st Qu.:0.3500 1st Qu.: 8.000
M:1528 Median :0.4250 Median : 9.000
Mean :0.4079 Mean : 9.934
3rd Qu.:0.4800 3rd Qu.:11.000
Max. :0.6500 Max. :29.000
This function provides frequencies or descriptive statistics according to the data type (continuous or nominal). For example, diameter is a continuous variable, so we obtain the minimum, 25th percentile, median, mean, 75th percentile and maximum values. Next, we carried out data visualization as follows:
>boxplot(vars$diameter)


Figure 8. Boxplot: data visualization of MySQL table


This shows a boxplot of the diameter variable of the abalone table. Using the graphical functions supported by the R system, we can also obtain diverse visualization results such as histograms, scatter plots, and so on. Lastly, we constructed a regression model using the lm function as follows:
>regression_result=lm(rings~diameter, data=vars)
>summary(regression_result)
Residuals:
Min 1Q Median 3Q Max
-5.19 -1.69 -0.72 0.91 16.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3186 0.1727 13.42 <2e-16 ***
diameter 18.6699 0.4115 45.37 <2e-16 ***
R-squared: 0.3302, Adj. R-squared: 0.3301
Regression is a popular model in statistical analysis. Here the dependent and independent variables are rings and diameter, respectively, so we obtained the following regression equation: Rings = 2.3186 + 18.6699 × Diameter. Thus, in this section we have illustrated a case study of the connection between R and MySQL.
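As a brief illustration of using the fitted model, the predict function of R can be applied to new data; the diameter value below is chosen arbitrarily. With the reported coefficients, a diameter of 0.5 gives roughly 2.3186 + 18.6699 × 0.5 ≈ 11.7 rings, i.e. an age of about 13 years.
>predict(regression_result, newdata=data.frame(diameter=0.5))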
5. CONCLUSION
In this paper, we studied an efficient connection between a DBMS and statistical software. We used the R system and MySQL as the statistical software and the DBMS, respectively, and the RODBC package was used for the DB connection. After connecting R and MySQL, we analyzed the data of a MySQL table, and this approach can be expanded to big data analysis. In our

case study, we illustrated how our approach can be applied in a real application, using the Abalone data set from the UCI machine learning repository. Our result contributes to work related to big data analysis; in addition, we can analyze the data in a DBMS directly with statistical methods. In future work, we will expand the scope of the connection between DBMSs and statistical software to more products.
6. DISCUSSION
The biggest problem of a statistical database system is the cost of connecting the statistical software and the DBMS. For example, we have to buy the SAS/Access product additionally and install it into the SAS base system for connecting SAS to a DBMS. This supplementary product is generally expensive, so most users have had difficulty using statistical database systems. In this paper, we selected the R system as the statistical software instead of SAS, and we used RODBC as the ODBC connector instead of SAS/Access, because R and RODBC are both free, while their performance is comparable to that of SAS. Moreover, for newer analytical functions such as statistical learning theory and machine learning algorithms, they surpass SAS.

REFERENCES
[1] Sathi, A. Big Data Analytics. An Article from IBM Corporation, 2012.
[2] Heiberger, R. M., and Neuwirth, E. R through Excel: A Spreadsheet Interface for
Statistics, Data Analysis, and Graphics. Springer, 2009.
[3] MySQL, The World's most popular open source database. http://www.mysql.com,
accessed on October 2013.
[4] Sim, S., Kang, H., and Lee, Y. Access to Database through the R-Language. The
Korean Communications in Statistics, 15, 1 (2008), 51-64.
[5] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, accessed on October
2013.
[6] SAS, http://www.sas.com, accessed on October 2013.
[7] SPSS, http://www-01.ibm.com/software/analytics/spss/, accessed on October 2013.
[8] Minitab, http://www.minitab.com, accessed on October 2013.
[9] S-Plus, http://solutionmetrics.com.au/products/splus/, accessed on October 2013.
[10] Wikipedia, the free encyclopedia. http://en.wikipedia.org, accessed on October 2013.
[11] Date, C. J. An Introduction to Database Systems. 7th edition, Addison-Wesley, 2000.
[12] Oracle, http://www.oracle.com, accessed on October 2013.
[13] Ripley, B. Package RODBC. CRAN R-Project, 2013.
[14] R-bloggers, On R versus SAS. http://www.r-bloggers.com/on-r-versus-sas/, accessed on
December, 2013.
[15] LinkedIn, Advanced Business Analytics, Data Mining and Predictive Modeling.
http://www.linkedin.com/groups/SAS-versus-R-35222.S.65098787, accessed on
December, 2013.
[16] Clever Logic, MySQL vs. Oracle Security, http://cleverlogic.net/articles/mysql-vs-
oracle, accessed on December, 2013.

[17] Find The Best, Oracle vs MySQL, http://database-management-
systems.findthebest.com/saved_compare/Oracle-vs-MySQL, accessed on December,
2013.
[18] Han, J., and Kamber, M. Data Mining Concepts and Techniques. Morgan Kaufmann,
2001.
[19] R system, The R Project for Statistical Computing. http://www.r-project.org, accessed
on October 2013.
[20] Spector, P. Data Manipulation with R, Springer, 2008.
[21] James, D. A., and DebRoy, S. Package RMySQL. CRAN R-Project, 2013.


Pragmatic Approach to Component Based Software Metrics Based on Static Methods
S. Sagayaraj
Department of Computer Science
Sacred Heart College, Tirupattur

M. Poovizhi
Department of Computer Science
Sacred Heart College, Tirupattur

ABSTRACT
Component-Based Software Engineering (CBSE) is an emerging technique for software reuse. This paper presents component based software metrics by investigating improved measurement techniques. Two types of metrics are used: static metrics and dynamic metrics. This research work presents the measured values for the complexity and criticality metrics. The static metrics are applied to an E-healthcare application developed from reusable software components, and the value of each metric is analyzed for that application. The measured metric values provide evidence for the reusability and good maintainability of a component based software system.
Keywords
Component Based Software Engineering, Component Based Software Metrics, Component
Based Software System.
1. INTRODUCTION
The demand for new software applications is currently increasing at an exponential rate, while the number of qualified and experienced professionals required for creating new software is not increasing commensurately [1]. In software reuse, applications are built from existing components, primarily by assembling and replacing interoperable parts. Software professionals have therefore recognized reuse as a powerful means of potentially overcoming this software crisis, and it promises significant improvements in software productivity and quality [2].

There are two approaches for code reuse: develop the reusable code from scratch, or identify and extract the reusable code from already developed code [3]. Even for organizations that have experience in developing software, there is an extra cost in developing reusable components from scratch to build and strengthen their reusable software repository. This cost can be saved by identifying and extracting reusable components from already developed and existing software systems or legacy systems [4]. But the problem of how to recognize reusable components from existing systems has remained relatively unexplored.

In both cases, whether the organization is developing software from scratch or reusing code from already developed projects, there is a need to evaluate the quality of the potentially reusable piece of software. Metrics are essential for establishing the quality of the components [5].

Software metrics are an essential part of the state of the practice in software engineering. Goodman describes software metrics as: "The continuous application of measurement-based techniques to the software development process and its products to supply meaningful and timely management information, together with the use of those techniques to improve that process and its products" [6]. Software metrics can serve one of four functions: to understand, evaluate, control or predict.

The various attributes that determine the quality of software include maintainability, defect density, fault proneness, normalized rework, understandability, reusability, etc. [5]. To achieve both the quality and productivity objectives, it is always recommended to opt for software reuse, which not only saves the time taken to develop the product from scratch but also delivers almost error-free code, as the code has already been tested many times during its development [7].

During the last decade, the software reuse and software engineering communities have come to a better understanding of component-based software engineering. The development of a reuse process and repository produces a base of knowledge that improves in quality after every reuse, minimizing the amount of development work necessary for future projects and ultimately reducing the risk of new projects that are based on repository knowledge [8].

CBSD centers on building large software systems by integrating previously existing software components. By enhancing the flexibility and maintainability of systems, this approach can potentially be used to reduce software development costs, assemble systems rapidly, and reduce the spiraling maintenance burden associated with the support and upgrade of large systems [9].

The paper is organized as follows. The related work on component based software metrics is provided in Section 2. The component based static and dynamic metrics are listed in Section 3. The details of the implementation are presented in Section 4. The analysis of the complexity and criticality metrics is described in Section 5. Finally, the last section concludes the paper and offers directions for further research in this area.

2. RELATED WORKS
Much work has been carried out in the area of component based software metrics. Some of this work is summarized below.

In 2006, Nael Salman focused mainly on the complexity that results from factors related to system structure and connectivity [10]. A new set of properties that a component-oriented complexity metric must possess is also defined, and the metrics are evaluated against these properties. A case study was conducted to assess the power of the complexity metrics in predicting integration and maintenance efforts; its results revealed that component-oriented complexity metrics can be of great value in predicting both.

Arun Sharma, Rajesh Kumar and P. S. Grover surveyed a few existing component-based reusability metrics in 2007 [11]. These metrics give a broader view of a component's understandability, adaptability and portability. Their work also presents an analysis, in terms of quality factors related to reusability, of an approach that helps significantly in assessing existing components for reuse.

In 2009, V. Lakshmi Narasimhan, P. T. Parthasarathy and M. Das analyzed, evaluated and benchmarked a series of metrics proposed by various researchers, using several large-scale openly available software systems [12]. A systematic analysis of the values of the various metrics was carried out and several key inferences were drawn from them, including conclusions on the complexity, reusability, testability, modularity and stability of the underlying components.

In 2009, Misook Choi, Injoo J. Kim, Jiman Hong and Jungyeop Kim suggested component-based metrics that apply the strength of dependency between classes; to increase the quality of components, they proposed these metrics so that measurement can be carried out more precisely [13]. In addition, they proved the theoretical soundness of the proposed metrics using the axioms of Briand et al. and demonstrated their accuracy and practicality through a comparison with conventional metrics in the component development phase.

Majdi Abdellatief, Abu Bakar Md Sultan, Abdul Azim Abd Ghani and Marzanah A. Jaba considered the dependency between components as a most important issue affecting the structural design of a Component-

Based Software System (CBSS) in 2011 [14]. Two sets of metrics, Component Information Flow Metrics and Component Coupling Metrics, are proposed based on the concept of component information flow from the CBSS designer's point of view.

In 2011, Jianguo Chen, Hui Wang, Yongxia Zhou and Stefan D. Bruda presented such efforts by investigating improved measurement tools and techniques, i.e., effective software metrics [15]. New coupling, cohesion and interface metrics are proposed and evaluated.

The previous research describes work done with a variety of component based software metrics. This paper deals with the static and dynamic metrics of component based software. The work is extended by developing an E-Healthcare application, and results are obtained for the static metrics.
3. COMPONENT BASED SOFTWARE METRICS
The traditional software metrics focus on non-CBSS and are inappropriate
to CBSS mainly because the component size is normally not known in
advance. Inaccessibility of the source code for some components prevents
comprehensive testing. So, the component based metrics are defined to
evaluate the component based application.

There are two types of metrics considered in this paper for measuring the
values.
Static Metric
Static metrics cover the complexity and the criticality within an integrated
component. Static metrics are collected from static analysis of component
assembly. The complexity and criticality metrics are intended to be used
early during the design stage. The list of static metrics [16] is provided in
Table 1.
Dynamic Metric
Dynamic metrics are gathered during the execution of the complete application. Dynamic metrics are meant to be used at the implementation stage. The dynamic metrics are listed in Table 2 [15].

Table 1. Static Metrics
Sl.no Metric Name Formulae
1 Component Packing Density Metric

2 Component Interaction Density metric

3 Component Incoming Interaction Density

4 Component Outgoing Interaction Density

5 Component Average Interaction Density

6 Bridge Criticality Metrics CRIT bridge =#bridge_component


7 Inheritance Criticality Metrics CRIT inheritance =#root_component
8 Link Criticality Metrics CRIT link =#link_component
9 Size Criticality Metrics CRIT size =#size_component
10 #Criticality Metrics CRIT all = CRIT bridge+ CRIT inheritance
+ CRIT link + CRIT size

Table 2. Dynamic Metrics

Sl.no Metric Name Formulae


1 Number of Cycle (NC) NC = # cycles
2 Average Number of Active Components

3 Active Component Density (ACD)

4 Average Active Component Density

5 Peak Number of Active Components ACt = max { AC1,..,ACn}

4. IMPLEMENTATION
The E-Healthcare application is developed to measure the static metrics. The application is designed with a number of components; the metrics are applied to the application and the values are measured. There are five modules in the E-Healthcare application.

4.1 Admin
The Admin module is used to store user, doctor and admin details. The admin is responsible for managing every record in the database.
4.2 Appointments and payments
This module is used to add and drop doctor details and to help users get appointments. The admin is responsible for adding new doctor details; existing doctors can also be deleted by the admin.
4.3 Diagnosis and health
The Diagnosis and Health module is used to retrieve users' diagnosis details. The information of users who are taking treatment through the application is stored in the database.
4.4 First aid and E-certificate
This module is used to get blood bank details for the required blood group. First-aid medicine details for a particular disease are provided to the users. The user can also get the treatment type, which helps in an emergency.
4.5 Symptoms and alerts
The Symptoms and Alerts module is used to check the blood pressure level of the user. Patient information is retrieved from the database, and the symptoms and causes of a disease help users to prevent it.

The pictographic representation of the modules in the application is shown in Figure 1.

Figure 1. Modules in E-healthcare Application


Components are created to develop the whole application. The components (admin, appointments and payments, diagnosis and health, firstaid and e-certificate, symptoms and alerts, DBHelper, EhealthBL) are required to complete the component-based application called E-Healthcare. The static metrics are applied to these components, and each component's value is measured according to the metric formula. The analysis of the metrics is carried out manually on the application; the metric values are calculated with the help of the database tables, web page forms and components.
5. ANALYSIS
The analysis is made to prove that the CBSS has good reusability, maintainability and independence. The analyses of the Component Packing Density metric, the Component Interaction Density metrics (incoming, outgoing, average) and the criticality metrics are as follows:
5.1 Component Packing Density Metric
CPD is used to measure the number of operations that each component contains. The CPD is defined as the ratio of #constituent to #component:
CPD = #constituent / #component
where #constituent = one of the following: LOC, objects/classes, operations, classes and/or modules, and #component = number of components.

For this metric, the number of operations in each component is listed in Table 3.
Table 3. Component packing Density
S.No Component Name No. of operations
1 Admin 3
2 Appointments and payments 4
3 Diagnosis and health 4
4 Firstaid and e-certificate 6
5 Symptoms and alerts 5
6 DBHelper 1
7 EhealthBL 19
CPD = (3 + 4 + 4 + 6 + 5 + 1 + 19) / 7 = 42 / 7 = 6
Hence, the CPD metric helps to determine the average number of operations that each component contains.
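As a quick illustration, the CPD computation above can be reproduced with a short script; the Python sketch below simply averages the operation counts reported in Table 3 (component names and counts are taken from that table).

# Minimal sketch: Component Packing Density (CPD) as the ratio of
# constituents (here, operations) to the number of components.
# Operation counts are taken from Table 3 of this paper.
operations = {
    "Admin": 3,
    "Appointments and payments": 4,
    "Diagnosis and health": 4,
    "Firstaid and e-certificate": 6,
    "Symptoms and alerts": 5,
    "DBHelper": 1,
    "EhealthBL": 19,
}

cpd = sum(operations.values()) / len(operations)
print(cpd)  # 42 / 7 = 6.0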
5.2 Component Interaction Density Metric
The CID is defined as a ratio of actual interactions over potential ones. A
higher interaction density causes a higher complexity in the interaction [17].
The CID metric is applied on the E-Healthcare application. The measured
value of actual interactions in each component of E-Healthcare is illustrated
in Table 4.

CID = #I / #Imax, where #I = no. of actual interactions and #Imax = no. of maximum available interactions.
Table 4. Actual interactions
S.No Name of the page No. of actual interactions
1 Registration.aspx 4
2 Postquestion.aspx 2
3 Search.aspx 5 i/p, 5 o/p
4 Doctormanagement.aspx 6
5 Diagnosis.aspx 1
6 Searchmedicine.aspx 2 i/p, 3 o/p
7 Medicine.aspx 5
8 Bloodbank.aspx 4
9 Firstaidsuggestion.aspx 2
10 Medicalcertificate.aspx 3
11 Treatmenttype.aspx 1 i/p, 2 o/p
12 Symptoms.aspx 1 i/p, 3 o/p
Total 51
The actual interaction value between components is 51. The maximum number of available interactions with other components is 87.
CID = 51/87 = 0.586
This metric brings out the number of incoming and outgoing interactions available in each component and helps to identify which component has greater connectivity with the other components.
5.3 Component Incoming Interaction Density
CIID is defined as a ratio of number of incoming interactions and maximum
number of incoming interactions. A higher interaction density causes a
higher complexity in the interaction. The no. of actual incoming interactions
in each component is shown in the Table 5.

CIID = #Iin / #Imax in, where #Iin = no. of incoming interactions and #Imax in = maximum no. of available incoming interactions.

Table 5. Incoming Interactions
S.No Name of the page No. of incoming interactions
1 Registration.aspx 4
2 Postquestion.aspx 1
3 Search.aspx 5
4 Doctormanagement.aspx 4
5 Diagnosis.aspx 1
6 Searchmedicine.aspx 2
7 Medicine.aspx 5
8 Bloodbank.aspx 4
9 Firstaidsuggestion.aspx 1
10 Medicalcertificate.aspx 2
11 Treatmenttype.aspx 4
12 Symptoms.aspx 4
Total 37
The number of incoming interactions is 37. The maximum number of available incoming interactions is 51; out of 51 interactions, only 37 actually have links to other components.
CIID = 37/51 = 0.725
The CIID value of 0.725 clearly states that the incoming interaction density with other components is very high.
5.4 Component Outgoing interaction Density
COID is defined as a ratio of number of outgoing interactions and maximum
number of outgoing interactions. A higher interaction density causes a
higher complexity in the interaction. The number of outgoing interaction in
each component is shown in Table 6.

COID = #Iout / #Imax out, where #Iout = no. of outgoing interactions and #Imax out = maximum no. of available outgoing interactions.
Table 6. Outgoing Interactions

S.No Name of the page No. of outgoing interactions


1 Registration.aspx 2
2 Postquestion.aspx 1
3 Search.aspx 5

4 Doctormanagement.aspx 1
5 Diagnosis.aspx 3
6 Searchmedicine.aspx 3
7 Medicine.aspx 1
8 Bloodbank.aspx 3
9 Firstaidsuggestion.aspx 1
10 Medicalcertificate.aspx 1
11 Treatmenttype.aspx 4
12 Symptoms.aspx 3
Total 28
The number of outgoing interactions is 28. The maximum number of available outgoing interactions is 46; only 28 outgoing interactions are actually connected with other components.
COID = 28/46 = 0.608
The calculated value of 0.608 shows that there is a high density of outgoing interactions among the components.
5.5 Component Average Interaction Density
CAID represents the sum of CID for each component divided by the number
of components.

CAID = (sum of the interaction densities of the n components) / #components, where #components = number of components in the system.

Admin: The actual interfaces (incoming and outgoing) of the admin component are listed, and the sum of the interaction density values for the admin component is shown in Table 7.
Table 7. Sum of CID for admin component

S.No Name of the page Sum of CID for admin


component
1 Registration.aspx 4 out of 13 (only 4 interfaces
interact with other components out
of 13 interfaces)
2 Login.aspx 2 out of 2
3 Postquestion.aspx 1 out of 1

The summation of CID for the Admin component is 7/16: seven interactions are actual out of sixteen. This component has greater reliability.
Appointments and payments: The sum of the interaction densities of the appointments and payments component is shown in Table 8. The sum considers both the incoming and outgoing interfaces of the component.

Table 8. Sum of CID for appointments and payments component


S.No Name of the page Sum of CID for appointments and
payments component
1 Search.aspx : 2 out of 2
2 To get appointment : 4 out of 4
3 Doctormanagement.aspx : 4 out of 6

The summation of CID for the Appointments and payments component is 10/12: 10 interfaces out of 12 have links with other components.

Diagnosis and health: The sum of the interaction densities of the diagnosis and health component is shown in Table 9.

Table 9. Sum of CID for diagnosis and health component


S.No Name of the page Sum of CID for diagnosis and health
component
1 Diagnosis.aspx : 1 out of 2
2 : 2 out of 2
Searchmedicine.aspx
3 Medicine.aspx : 4 out of 5

The summation of CID for the Diagnosis and health component is 7/9: 7 interfaces out of 9 represent interactions with other components.

Firstaid and e-certificate: Table 10 shows the sum of the CID values for the firstaid and e-certificates component.

Table 10. Sum of CID for firstaid and e-certificates

S.No Name of the page Sum of CID for firstaid and


e-certificate

1 Bloodbank.aspx : 1 out of 1

: 3 out of 3
2 Firstaidsuggestion.aspx : 1 out of 1
3 Medical certificate.aspx : 2 out of 4
4 Treatmenttype.aspx : 1 out of 1
: 3 out of 7

The summation of CID for the Firstaid and E-certificates component is 11/17: out of 17 interfaces, only 11 interactions are connected with the rest of the components.

Symptoms and alerts: Table 11 shows the sum of the CID values for the symptoms and alerts component.

Table 11. Sum of CID for symptoms and alerts.

S.No Name of the page Sum of CID for symptoms and alerts
1 : 1 out of 1
Searchpatient.aspx : 3 out of 3

The summation of CID for the Symptoms and alerts component is 4/4. This component is completely connected with other components.

The Component Average Interaction Density metric takes the ratio between the sum of the CID values of the components and the number of existing components.
CAID = (7/16 + 10/12 + 7/9 + 11/17 + 4/4) / 7 = 0.5279
The measured value for this metric shows that the components have greater reliability.
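For reference, the interaction-density calculations of Sections 5.2 to 5.5 reduce to simple ratios; the Python sketch below reproduces them using the totals reported above (the CID, CIID, COID inputs and the per-component CID values are assumed to be the ones measured in this study).

# Minimal sketch of the interaction-density metrics used above.
# CID  = actual interactions / maximum available interactions
# CIID = incoming interactions / maximum incoming interactions
# COID = outgoing interactions / maximum outgoing interactions
# CAID = (sum of per-component CID values) / number of components
def density(actual, maximum):
    return actual / maximum

cid = density(51, 87)    # ~0.586
ciid = density(37, 51)   # ~0.725
coid = density(28, 46)   # ~0.608

per_component_cid = [7/16, 10/12, 7/9, 11/17, 4/4]
caid = sum(per_component_cid) / 7   # ~0.528, as reported above
print(cid, ciid, coid, caid)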

5.6 Bridge Criticality Metric


The bridge criticality metric is used to identify the bridge component. A component which acts as a bridge between components is a bridge component.
CRIT bridge = #bridge_component
Out of the 7 components, EhealthBL acts as a bridge component between the other components and from the code-behind to the database. It contains all the queries to store and retrieve the information.

So, the bridge_component value is 1. This value explicitly tells that one component operates as a bridge component for all the other components.

5.7 Inheritance Criticality Metric


Inheritance is deriving a new component from an existing component. The existing component is called the root component.
CRIT inheritance = #root_component
The interface is inherited from the existing (derived) component.

Root components:
Symptoms and alerts (patient info inherited by the diagnosis component)
EhealthBL (query is inherited from the base query)
So, the root component value is 2; this value shows that object-oriented programming concepts are utilized between the components.

5.8 Link Criticality Metric


The link criticality metric is used to identify the link component. A component which provides a link to other components is called a link component.
CRIT link = #link_component
The link component value is 1 (DBHelper). This value shows that the component acts as a link between the code-behind pages and the database.

5.9 Size Criticality Metric


The size criticality metric is used to identify components that exceed the critical size level; such a component is called a size component.
CRIT size = #size_component
The size component value is 0. The size critical level is 60 lines in a component, and no component exceeds this level.

5.10 # Criticality Metric


The sum of the bridge criticality, inheritance criticality, link criticality and size criticality is known as the overall criticality metric.
CRIT all = CRIT bridge + CRIT inheritance + CRIT link + CRIT size
CRIT all = 1 + 2 + 1 + 0 = 4
The combined value of 4 indicates the overall criticality arising in the application.
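The overall criticality figure is just the sum of the four counts measured above; a minimal sketch of the computation for this application follows.

# Minimal sketch: overall criticality as the sum of the four criticality
# counts measured for the E-Healthcare application in Sections 5.6-5.9.
crit_bridge = 1       # EhealthBL acts as the bridge component
crit_inheritance = 2  # Symptoms and alerts, EhealthBL are root components
crit_link = 1         # DBHelper links the code-behind pages to the database
crit_size = 0         # no component exceeds the 60-line critical level

crit_all = crit_bridge + crit_inheritance + crit_link + crit_size
print(crit_all)  # 4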

Threshold Value
The threshold value is fixed at 0.5 and is used for comparison against the computed value of each metric. The comparison with this threshold value checks whether the metric value increases or decreases with respect to reusability and maintainability. Table 12 shows the results of the comparison with the threshold value.

Table 12. Comparison with threshold value.


Metric Name Comparison with Threshold Value
Component Packing Density Metric Increasing
Component Interaction Density Metric Increasing
Component Incoming Interaction Density Increasing
Component Outgoing Interaction Density Increasing
Component Average Interaction Density Increasing
Bridge Criticality Metrics Increasing
Inheritance Criticality Metrics Increasing
Link Criticality Metrics Increasing
Size Criticality Metrics Decreasing

6. CONCLUSIONS
Building software systems with reusable components brings many advantages to organizations. Reusability involves several direct or indirect factors such as cost, effort and time. This paper discussed various aspects of reusability for component-based systems and gave an insight into various reusability metrics for such systems. The qualities of components are measured by applying the metrics to an E-Healthcare application in the electronic commerce domain. The component-based metrics help in improving the quality of design components and in developing component-based systems with good maintainability, reusability and independence. Most of the metrics allow future enhancements, which will help to add features later. The demand for new software applications is currently increasing at an exponential rate, so future enhancements will help to fulfill those requirements. The dynamic metric analysis can be applied to the component-based software application and validated. Based on the applications, enhanced metrics can be proposed for component-based software systems.
REFERENCES
[1] Dr. Nedhal A. Al Saiyd, Dr. Intisar A. Al Said, Ahmed H. Al Takrori, Semantic-Based
Retrieving Model of Reuse Software Component, IJCSNS International Journal of
Computer Science and Network Security, VOL.10 No.7, July 2010.
[2] Joaquina Martín-Albo, Manuel F. Bertoa, Coral Calero, Antonio Vallecillo, Alejandra Cechich and Mario Piattini, CQM: A Software Component Metric Classification Model, IEEE Transactions on Journal Name.

[3] Anas Bassam AL-Badareen, Mohd Hasan Selamat, Marzanah A. Jabar, Jamilah Din,
Sherzod Turaev, Reusable Software Component Life Cycle, International Journal of
Computers, Issue 2, Volume 5, 2011.
[4] Chintakindi Srinivas, Dr.C.V.Guru rao, Software Reusable Components With
Repository System, International Journal of Computer Science & Informatics, Volume-
1, Issue-1,2011
[5] Parvinder S.Sandhu, Harpreet Kaur, and Amanpreet Singh, Modeling of Reusability of
Object oriented Software System, World Academy of Science, Engineering and
Technology 56 2009.
[6] Sarbjeet Singh, Manjit Thapa, Sukhvinder Singh and Gurpreet Singh, International Journal of Computer Applications (0975-8887), Volume 8, No. 12, October 2010.
[7] Linda L. Westfall, Seven steps to designing a software metrics, Principles of software
measurement services.
[8] K.S. Jasmine and R.Vasantha, DRE A Quality metric for Component Based Software
Products, World Academy of Science, Engineering and Technology 34 2007.
[9] Iqbaldeep Kaur, Parvinder S. Sandhu, Hardeep Singh, and Vandana Saini, Analytical
Study of Component Based Software Engineering, World Academy of Science,
Engineering and Technology 50 2009.
[10] Nael Salman, Complexity metrics as predicators of maintainability and integrability of
software components, Journal of arts and science, May 2006.
[11] Arun Sharma, Rajesh Kumar, and P. S. Grover, A critical survey of reusability aspects
for component-Based systems, World academy of science, Engineering and
Technology 33 2007.
[12] V. Lakshmi Narasimhan, P. T. Parthasarathy, and M. Das, Evaluation of a suite of
metrics for CBSE, Issues in informing science and information technology, Vol 6,
2009.
[13] Misook Choi, Injoo J. Kim, Jiman Hong, Jungyeop Kim, Component-Based Metrics
Applying the Strength of Dependency between Classes, ACM Journal, March 2009.
[14] Majdi Abdellatief, Abu Bakar Md Sultan, Abdul Azim Abd Ghani, Marzanah A.Jabar,
Component-based Software System Dependency Metrics based on Component
Information Flow Measurements, ICSEA 2011.
[15] Jianguo Chen, Hui Wang, Yongxia Zhou, Stefan D.Bruda, Complexity Metrics for
Component-based Software Systems, International Journal of Digital Content
Technology and its Applications. Vol.5, No.3, March 2011.
[16] V. Lakshmi Narasimhan, and Bayu Hendradjaya, Theoretical Considerations for Software
Component Metrics, World Academy of Science, Engineering and Technology 10
2005.
[17] E. S. Cho, M.S. Kim, S.D. Kim, Component Metrics to Measure Component Quality,
the 8th Asia-Pacific Software Engineering Conference (APSEC), Macau, 2001, pp.
419-426.


SDI System with Scalable Filtering of


XML Documents for Mobile Clients
Yi Yi Myint
Department of Information and Communication Technology
University of Technology (Yatanarpon Cyber City)
Pyin Oo Lwin, Mandalay Division, Myanmar

Hninn Aye Thant


Department of Information and Communication Technology
University of Technology (Yatanarpon Cyber City)
Pyin Oo Lwin, Mandalay Division, Myanmar

ABSTRACT
As the number of users grows and the amount of available information becomes even bigger, information dissemination applications are gaining popularity for distributing data to end users. A Selective Dissemination of Information (SDI) system distributes the right information to the right users based upon their profiles. Typically, Extensible Markup Language (XML) is used to represent the profiles, and XML query languages support the query indexing techniques employed in SDI systems. As a consequence of these advances, mobile information retrieval is crucial for sharing the vast information from diverse data sources. However, the inherent limitations of mobile devices require the information delivered to mobile clients to be highly personalized, consistent with their profiles. In this paper, we address the issue of scalable filtering of XML documents for mobile clients. We describe an efficient indexing mechanism that enhances the XFilter algorithm using a modified Finite State Machine (FSM) approach that can quickly locate and evaluate relevant profiles. Finally, our experimental results show that the proposed indexing method outperforms the previous XFilter algorithm in terms of filtering time.
Keywords
XML, FSM, scalable filtering, SDI.
1. INTRODUCTION
Nowadays the SDI system is becoming an increasingly important research area and industrial topic. Obviously, there is a trend to create new applications
for small and light computing devices such as cell phones and PDAs.
Amongst the new applications, mobile information dissemination
applications (e.g. electronic personalized newspapers delivery, ecommerce
site monitoring, headline news, alerting services for digital libraries, etc.)
deserve special attention.
Recently, there have been a number of efforts to build efficient large-scale
XML filtering systems. In an XML filtering system [4], constantly arriving
streams of XML documents are passed through a filtering engine that
matches documents to queries and routes the matched documents accordingly. XML filtering techniques comprise a key component of modern SDI applications.
XML [3] is becoming a standard for information exchange and a textual
representation of data that is designed for the description of the content,
especially on the internet. The basic mechanism used to describe user
profiles in XML format is through the XPath query language. XPath is a
query language for addressing parts of an XML document. However, this
technique often suffers from restricted capability to express user interests,
being unable to rightly capture the semantics of the user requirements.
Therefore, expressing deeply personalized profiles require a querying power
just like SQL provides on relational databases. Moreover, as the user
profiles are complex in mobile environment, a more powerful language than
XPath is needed. In this case, the choice is XML-QL. XML-QL [7] has
more expressive power compared to XPath and it is also measured the most
powerful among all XML query languages. XML-QLs querying power and
its elaborate CONSTRUCT statement allows the format of the query results
to be specified.
The rest of the paper is organized as follows: Section 2 briefly summarizes
the related works. Section 3 describes the proposed system architecture and
its components. The operation of the system that is how the query index is
created, the operation of the finite state machine and the generation of the
customized results are explained in Section 4. Section 5 gives the
performance evaluation of the system. Finally Section 6 concludes the
paper.
2. RELATED WORKS
We now introduce some existing XML filtering methods. XFilter [1] was
one of the early works. The XFilter system is designed and implemented for
pushing XML documents to users according to their profiles expressed in
XML Path Language (XPath). XFilter employs a separate FSM per path
query and a novel indexing mechanism to allow all of the FSMs to be
executed simultaneously during the processing of a document. A major
drawback of XFilter is its lack of expressiveness.
In addition, XFilter does not execute the XPath queries to generate partial results. As a result, the whole document is pushed to the user when a document matches a user's profile. This feature prevents XFilter from being used in mobile environments because the limited capability of mobile devices is not enough to handle the entire document. Also, XFilter does not utilize the commonalities between queries, i.e. it produces one FSM per query. This observation motivated us to develop mechanisms that employ only a single FSM for queries which have a common element structure.

YFilter [2] overcomes the disadvantage of XFilter by using
Nondeterministic Finite Automata (NFA) to emphasize prefix sharing. The
resulting shared processing provided tremendous improvements to the
performance of structure matching but complicated the handling of value-
based predicates. However, the ancestor/descendant relationship introduces
more matching states, which may result in the number of active states
increasing exponentially. Post processing is required for YFilter.
FoXtrot [5] is an efficient XML filtering system which integrates the
strengths of automata and distributed hash tables to create a fully distributed
system. FoXtrot also describes different methods for evaluating value-based
predicates. The performance evaluation demonstrates that it can index
millions of queries and attain an excellent filtering throughput. However,
FoXtrot necessitates the extensions of the query language to reach the full
XPath or the powerful expressiveness for user profiles.
The NiagaraCQ system [6] uses XML-QL to express user profiles. It provides measures of scalability through query grouping and caching techniques. However, its query grouping ability is derived from execution plans, which is different from our proposed method. The execution times of the queries do not make such planning a feasible candidate for mobile environments. Accordingly, our system solves the above problems and reduces the filtering time as much as possible.

3. PROPOSED SYSTEM ARCHITECTURE


We first present a high-level overview of our XML filtering system. We
then describe XML-QL language that we use to specify the user profiles in
this work. The overall architecture of the system is depicted in Figure 1.

Figure 1. Overall architecture of the system


User profiles describe the information preferences of individual users. These
profiles may be created by the users themselves, e.g., by choosing items in a
Graphical User Interface (GUI) via their mobile phones. The user profiles are automatically converted into an XML-QL format that can be efficiently
stored in the profile database and evaluated by the filtering system. These
profiles are effectively standing queries, which are applied to all incoming
documents. The filtered engine first creates query indices for the user profiles and then parses the incoming XML documents to obtain the query results. The results are stored in a special content list, so that the whole document need not be sent; extracting parts of an XML document can save bandwidth in a mobile environment. After that, the filtered engine sends the filtered XML documents to the related mobile clients.
3.1 Defining User Profiles with XML-QL
XML-QL has a SELECT WHERE construct, like SQL, that can express
queries, to extract pieces of data from XML documents. It can also specify
transformations that, for example, can map XML data between Document
Type Definitions (DTDs) and integrate XML data from different sources.
Profiles defined through a GUI are transformed into XML documents which
contain XML-QL queries as shown in Figure 2.
<Profile>
<XML-QL>
WHERE<course>
<major>
<name>ICT</name>
<program>First Year</program>
<syllabus>$n</syllabus>
</major></course> IN course.xml
CONSTRUCT<result><syllabus>$n</syllabus></result>
</XML-QL>
<PushTo> <address></address> </PushTo>
</Profile>
Figure 2. Profile syntax represented in XML containing XML-QL query
3.2 Filtered Engine
The basic components of the filtered engine are 1) An event-based XML
parser which is implemented using SAX API for XML documents; 2) A
profile parser that has an XML-QL parser for user profiles and creates the
Query Index; 3) A Query Execution Engine which contains the Query Index
which is associated with Finite State Machines to query the XML
documents; 4) Delivery Component which pushes the results to the related
mobile clients (see Figure 3).

Figure 3. Filtered engine


4. OPERATION OF THE SYSTEM
The system operates as follows: subscriber informs the filtered engine when
a new profile is created or updated; the profiles are stored in an XML file
that contains XML-QL queries and addresses to transmit the results (see
Figure 2). Profiles are parsed by the profile parser component and XML-QL
queries in the profile are parsed by an XML-QL parser. While parsing the
queries, the XML-QL parser generates FSM representation for each query if
the query does not match to any existing query group. Otherwise, the FSM
of the corresponding query group is used for the input query. FSM
representation contains state nodes of each element name in the queries
which are stored in the Query Index.
When a new document arrives, the system alerts the filtered engine to parse
the related XML document. The event-based XML parser sends the events encountered to the query execution engine. The handlers in the query execution engine move the FSMs to their next states after the current states have passed level checking or character data matching. Meanwhile, the data in the document which matches the variables is kept in the content lists, so that all the necessary partial data for producing the results can be formatted and pushed to the related mobile clients when the FSM reaches its final state.
4.1 Creating Query Index
Consider an example XML document and its DTD given in Figure 4.
<!-- DTD for Course -->
<!ELEMENT root (course*)>

<!ELEMENT course (degree, major*)>
<!ELEMENT degree (#PCDATA)>
<!ELEMENT major(name, program, semester, syllabus*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT program (#PCDATA)>
<!ELEMENT semester (#PCDATA)>
<!ELEMENT syllabus (sub-code, sub-title, instructor)>
<!ELEMENT sub-code (#PCDATA)>
<!ELEMENT sub-title (#PCDATA)>
<!ELEMENT instructor (#PCDATA)>
<root> <course>
<degree>Bachelor</degree>
<major><name>ICT</name>
<program>First Year</program>
<semester>First Semester</semester>
<syllabus>
<sub-code>EM-101</sub-code>
<sub-title>English</sub-title>
<instructor>Dr. Thiri</instructor>
</syllabus>
</major>
</course></root>
Figure 4. An example XML document and its DTD (course.xml)
The example queries and their FSM representations are shown in Figure 5.
Note that there is a node in the FSM representation corresponding to each element in the query, and the FSM representation's tree structure follows from the XML-QL query structure.

Query 1: Retrieve all syllabuses of the first year program for the ICT major.
WHERE <major> <name>ICT</><program>First Year</><syllabus>$n</>
</> IN course.xml
CONSTRUCT <result><syllabus>$n</></>
(FSM for Query 1: nodes Q1.1 to Q1.4)

Query 2: Find the instructor name of the subject code EM-101.
WHERE <syllabus> <sub-code>EM-101</><instructor>$s</>
</> IN course.xml
CONSTRUCT <result><syllabus>$s</></>
(FSM for Query 2: nodes Q2.1 to Q2.3)

Query 3: Retrieve all the instructors for the first year program in the ICT major.
WHERE <major> <name>ICT</><program>First Year</><syllabus> <instructor>$s</></>
</> IN course.xml
CONSTRUCT <result><syllabus>$s</></>
(FSM for Query 3: nodes Q3.1 to Q3.5)

Figure 5. Example queries and their FSM representations
We also substitute constants in a query with parameters to create syntactically equivalent queries, which leads to the use of the same FSM for them. The state changes of an FSM are handled through the two lists associated with each node in the Query Index (see Figure 6). The current nodes of each query are placed on the Candidate List (CL) of their related element name. In addition, all of the nodes representing the future states are stored in the Wait Lists (WL) of their related element name. A state transition in the FSM is represented by copying a query node from the WL to the CL. Notice that the node copied to the CL also remains in the WL so that it can be reused by the FSM in future executions of the query, as the same element name may reappear at another level in the XML document. When the query index is initialized, the first node of each query tree is placed on the CL of the index entry of its relevant element name. The remaining elements in the query tree are placed in the relevant WLs. Query nodes in the CL designate that the state of the query might change when the XML parser processes the relevant elements of these nodes. When the XML parser catches a start element tag and a node in the CL of that element satisfies level checking or character data matching, the immediate child elements of this node in the Query Index are copied from the WL to the CL. The purpose of the level checking is to handle the possibility that this element name will reappear in the document.
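To make the Candidate List / Wait List mechanics concrete, the following Python sketch (with hypothetical class and method names, not the system's actual implementation) shows how a query's first element node could be placed on the CL while the remaining nodes wait in the WLs, and how a node is copied from a WL to a CL on a state transition.

# Minimal sketch of the Query Index described above: each element name owns
# a Candidate List (CL) of nodes whose state may change next, and a Wait
# List (WL) of nodes representing future states.
from collections import defaultdict

class QueryIndex:
    def __init__(self):
        self.cl = defaultdict(list)  # element name -> candidate nodes
        self.wl = defaultdict(list)  # element name -> waiting nodes

    def add_query(self, query_nodes):
        """query_nodes: ordered list of (element_name, node_id) pairs."""
        first_element, first_node = query_nodes[0]
        self.cl[first_element].append(first_node)
        for element, node in query_nodes[1:]:
            self.wl[element].append(node)

    def promote(self, element, node):
        """Copy a node from the WL to the CL on a state transition; the node
        also stays in the WL so the FSM can be reused if the element name
        reappears at another level."""
        if node in self.wl[element]:
            self.cl[element].append(node)

index = QueryIndex()
index.add_query([("major", "Q1.1"), ("name", "Q1.2"),
                 ("program", "Q1.3"), ("syllabus", "Q1.4")])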

The initial CL and WL entries for the example queries are as follows:

Element name / CL / WL
major / Q1.1, Q3.1 / -
name / - / Q1.2, Q3.2
program / - / Q1.3, Q3.3
syllabus / Q2.1 / Q1.4, Q3.4
sub-code / - / Q2.2
instructor / - / Q2.3, Q3.5

Figure 6. Initial states of the query index for example queries

4.2 Operation of the Finite State Machine


When a new XML document activates the SAX parser, it starts generating events. The following event handlers handle these events:
Table 1. Sample SAX API

An XML Document SAX API Events


<?xml version="1.0"?> start document
<course> start element: course
<major> start element: major
<name> start element: name
ICT characters: ICT
</name> end element: name
</major> end element: major
</course> end element: course
end document

Start Element Handler checks whether the query element matches the
element in the document. For this purpose it performs a level and an
attribute check. If these are satisfied, it either enables data comparison or
starts variable content generation. As the next step, the nodes in the WL that
are the immediate successors of this node are moved to CL.

End Element Handler evaluates the state of a node by considering the states
of its successor nodes. Moreover, it generates the output when the root node
is reached. It also deletes the nodes from CL which are inserted in the start
element handler of the node. This provides backtracking in the FSM.
Element Data Handler is implemented for data comparison in the query. If
the expression is true, the state of the node is set to true and this value is
used by the End Element Handler of the current element node.
End Document Handler signals the end of result generation and passes the
results to the Delivery Component.
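A minimal sketch of such event handlers, built on the SAX parser bundled with Python, is shown below; the level checks, attribute checks and profile matching performed by the real query execution engine are abstracted away and only placeholder bookkeeping is kept.

# Minimal sketch (assumed structure) of the event handlers listed above.
import xml.sax

class FilterHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.level = 0
        self.content = []          # variable content collected so far

    def startElement(self, name, attrs):
        self.level += 1            # level/attribute checks would go here

    def characters(self, data):
        if data.strip():
            self.content.append(data)  # data comparison / content generation

    def endElement(self, name):
        self.level -= 1            # backtracking: undo CL insertions here

    def endDocument(self):
        print("results:", self.content)  # hand results to the delivery component

xml.sax.parseString(b"<course><major><name>ICT</name></major></course>",
                    FilterHandler())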
4.3 Generating Customized Results
Results are generated when the end element of the root node of the query is
encountered. Therefore, content lists of the variable nodes are traversed to
obtain content groups. These content groups are further processed to
produce results. This process is repeated until the end of the document is
reached. The results are then formatted as defined in the CONSTRUCT clause. Finally, the query results are sent to the related mobile clients.
5. PERFORMANCE EVALUATION
In this section, we conducted three sets of experiments to demonstrate the
performance of the architecture for different document sizes and query
workloads. The graph shown in Figure 7 contains the results for different
query groups, that is, the queries have the same FSM representation but
different constants, for the document course.xml (1MB). When the number
of queries on the same XML document is very large, the probability of
having queries with the same FSM representation increases considerably.

Figure 7. Comparing the performance by varying the number of queries


The above experiment indicates that our proposed architecture is highly scalable, that a very important factor in the performance is the number of query groups, and that generating a single FSM per query group rather than
per query is well justified.

Figure 8. Comparing the performance by varying depth


The depth of XML documents and queries in the user profiles varies
according to application characteristics. Figure 8 shows the execution time
for evaluating the performance of the system as the maximum depth is
varied. Here, we fixed the number of profiles at 25000 and varied the
maximum depth of the XML document and queries from 1 to 10.

Figure 9. Execution time of queries for different number of query groups and
document sizes
Figure 9 shows the execution times of the queries as the number of query groups and the size of the documents are varied. The results indicate that performance is more sensitive to document size when the number of query groups increases. Therefore, this result also confirms the importance of query grouping.
As final conclusion we can say that FSM approach proposed in this paper
for executing XML-QL queries on XML documents is a very promising
approach to be used in the mobile environments.


6. CONCLUSIONS
Mobile communication is blooming and access to Internet from mobile
devices has become possible. Given this new technology, researchers and
developers are in the process of figuring out what users really want to do
anytime from anywhere and determining how to make this possible. In
addition, a high degree of personalization is a very important requirement for developing SDI services in a mobile environment, as the limited capability of mobile devices is not enough to handle entire documents. This paper
attempts to develop an efficient and scalable SDI system for mobile clients
based upon their profiles. We anticipate that one of the common uses of
mobile devices will be to deliver the personalized information from XML
sources. We believe that a querying power is necessary for expressing
highly personalized user profiles and for the system to be used for millions
of mobile users, it has to be scalable. Since the critical issue is the number
of profiles compared to the number of documents, indexing queries rather
than documents makes sense. We expect that the performance of the system
will still be acceptable for mobile environments for millions of queries since
the results of the experiments show that the system is highly scalable.

7. ACKNOWLEDGMENTS
The authors wish to acknowledge Dr. Soe Khaing for her useful comments
on earlier drafts of the paper. Our heart-felt thanks to our family, friends and
colleagues who have helped us for the completion of this work.

REFERENCES
[1] M. Altinel and M. Franklin, Efficient filtering of XML documents for selective
dissemination of information, Proc of the Intl Conf on VLDB, pp. 53-64, Sept 2000.
[2] Y. Diao, M. Altinel, M. Franklin, H. Zhang and P.M. Fischer, Path sharing and
predicate evaluation for high-performance XML filtering, ACM Trans. Database
Syst., 28(4), Dec 2003, pp. 467-516.
[3] Extensible Markup Language, http://www.w3.org/XML/.
[4] I. Miliaraki, Distributed Filtering and Dissemination of XML Data in Peer-to-Peer
Systems, PhD Thesis, Department of Informatics and Telecommunications, National
and Kapodistrian University of Athens, July 2011.
[5] I. Miliaraki and M. Koubarakis, FoXtrot: distributed structural and value XML
filtering, ACM Transactions on the Web, Vol. 6, No. 3, Article 12, Publication date:
September 2012.
[6] J. Chen, D. DeWitt, F. Tian and Y. Wang, NiagaraCQ: a scalable continuous query
system for internet databases, ACM SIGMOD, Texas, USA, June 2000, pp.379-390.
[7] XML-QL: A Query Language for XML, http://www.w3.org/TR/1998/NOTE-xml-ql-
19980819.


An Easy yet Effective Method for Detecting


Spatial Domain LSB Steganography
Minati Mishra
Department of Information and Communication Technology
Fakir Mohan University, Balasore, Odisha, India

Flt. Lt. Dr. M. C. Adhikary


Department of Applied Physics and Ballistics
Fakir Mohan University, Balasore, Odisha, India

ABSTRACT
Digitization of images was a revolutionary step for the fields of photography and image processing, as it made the editing of images much easier. Image editing was not an issue as long as it was limited to corrective editing procedures used to enhance the quality of an image, such as contrast stretching, noise filtering and sharpening. But it became a headache for many fields when image editing became manipulative. Digital images have become an easy target of tampering and forgery during the last few decades. Today, users and editing specialists, equipped with easily available image editing software, manipulate digital images with varied goals. Photo journalists often tamper with photographs to give dramatic effect to their stories. Scientists and researchers use this trick to get their work published. Patients' diagnoses are misrepresented by manipulating medical imagery. Lawyers and politicians use tampered images to direct the opinion of people or a court in their favor. Terrorists and anti-social groups use manipulated Stego images for secret communication. In this paper we present an effective method for detecting spatial domain Steganography.
Keywords
Digital Image, Steganography, Secret Message, Cover image, Stego Image, Encryption,
Decryption, Steganalysis, Tampering, Histogram, JPEG.

1. INTRODUCTION
Significant advancements in digital imaging during the last decade have
added a few innovative dimensions to the field of image processing.
Steganography and watermarking are a few such creative dimensions of image processing that have gained wide popularity among researchers.
Digital image watermarking techniques are generally used for authentication
purposes and are achieved by embedding a small piece of information into
copyrighted digital information. Steganography, on the other hand, hides a
large amount of data secretly into a digital medium and is generally used for
secret communications. It is one of the effective means of data hiding that
protects data from unauthorized or unwanted disclosure and can be used in
various fields such as medicine, research, defence and intelligence for secret data storage, confidential communication, protection of data from alteration and disclosure, and access control in digital distribution.


Just as every coin has two sides, all technological developments are associated with both good and bad applications, and Steganography is no exception [1]. Though there are many good reasons to use data hiding techniques and they should be used for legitimate applications only, unfortunately Steganography can also be used for illegitimate purposes [2]. For instance, someone trying to steal data can conceal it in another file and send it out through an innocent-looking email. The information stolen and passed may be a patient's confidential test reports, the tender information of a company or organization, or even the defence plans of a country. No doubt, terrorists and criminals can use this method to secretly spread their action plans and, though no evidence has yet been established for this claim, it is alleged that Steganography was used to pass the execution plan of the 9/11 WTC attack [3]. Since Steganography mainly targets innocent-looking digital images because of their huge size and popularity, it has become an important requirement of the time to differentiate real innocent images from innocent-looking Stego images. In this paper we suggest an easy method that can be used to detect spatial domain LSB Steganography by analyzing the histograms of digital images.

The organization of the paper is as follows. The following section explains


the process of Steganography. Steganalysis is discussed in section 3.
Experimental results are given in section 4 followed by the summary and
conclusion at the end.

2. STEGANOGRAPHY
Steganography is a process of secret communication where a piece of
information (a secret message) is hidden into another piece of innocent
looking information, called a cover. The message is hidden inside the cover
in such a way that the very existence of the secret information remains
concealed without raising any suspicion in the minds of the viewers [4, 5].

Steganographic procedures make use of several media such as text, audio, video or images as covers, but digital image Steganography is more popular among researchers, as images are the most common form of medium used worldwide for data transmission and also offer a high data hiding capacity. The embedding (M) and extraction (X) processes in digital image Steganography are mappings given by the following equations.
M : {C × (K) × S} → C′ ----------- (eq. 1)

and

X : {C′ × (K)} → S ----------- (eq. 2)

where M and X are the embedding and extraction functions respectively, C is the cover image, C′ is the Stego image, K is an optional set of keys (K in M and X may be the same or different depending upon the encryption algorithm used), and S is a set of secret messages [6].

Just like the image tampering techniques, Steganography is also an image


manipulation technique. This too manipulates digital images from their
original captures like the image tampering methods but with a different
purpose. Tampering techniques manipulate images to fake a fact and
mislead the viewer to misbelieve the truth behind a scene whereas
Steganography manipulates it for covert communication. Because of its
inherent purpose of data hiding, Steganography requires the original and the
Stego image to look alike but a tampered images need to look as natural as
possible keeping the tampering undetectable to human vision.

Steganography is broadly classified into two categories: spatial domain Steganography and transform domain Steganography. Spatial
domain methods take advantage of the human visual system and directly
embed data by manipulating the pixel intensities whereas in the transform
domain procedures, the image is first transformed into frequency domain
and then the message is embedded. The transform domain procedures are
more robust against statistical attacks and manipulations in comparison to
the spatial domain methods but spatial domain techniques are more popular
due to their simplicity and ease of use.

2.1 Cover Image Selection

As imperceptibility and message security are the important criteria of


steganography, choice of cover image plays an important role. Images with
large areas of uniform color, with a few colors or with a unique semantic
content, such as fonts, computer generated arts, charts etc are poor
candidates for covers as they will easily reveal the secret content. Although computer-generated fractal images can be considered good covers due to their complexity and irregularity, they are generated by strict deterministic rules that may be easily violated by message embedding. Therefore, scans of photographs or images obtained with a digital camera that contain a high number of colors are usually considered safe for Steganography. In general, grayscale images are considered the best
covers by many experts [7]. Again, raw uncompressed BMP format images are generally preferred as candidates for covers over lossy compressed format images such as JPEG, Wavelet, JPEG2000 and palette formats (such as GIF), as they offer the highest capacity and best overall
security [8].

2.2 Bit-plane slicing


Digital images can be monochrome (bi-tone), grayscale (true-tone) or color
depending upon the permissible intensity levels of each pixel i.e. whether
each pixel is represented by only one bit, 8-bits or 24-bits. Like the gray-
level ranges, the contribution made by a single bit to the total image
appearance is also important for specific applications such as Steganography
and watermarking. If each pixel of an image is represented by m bits then
the image can be imagined to be composed of m numbers of 1-bit planes
ranging from bit-plane 0 through bit-plane m-1. For example, a
monochrome image can have only one bit-plane whereas there are 8 bit-
planes in a gray scale image and 24 bit-planes (8 bits each, with respect to
the 3 channels R, G and B) in a color image. Figure 1 shows the bit-plane
representation of a 8x7 pixel gray scale image.

Figure 1: Bit-plane representation of a gray scale image

The 0th bit-plane or the least significant bit-plane (LSB plane) is the plane that consists of bits with the minimum positional value (2^0 = 1), and the MSB plane (most significant bit-plane, or the 7th bit-plane) consists of all the high order bits. Therefore a pixel with gray value 225 has the bits 1,1,1,0,0,0,0,1 in the 7th to 0th bit-planes respectively. Figure 2 shows different bit-planes of the grayscale Lena image.


Figure 2: The 8 bit-planes of Lena image

Information hiding in Steganography is generally carried out in two steps.


Firstly, identifying redundant bits in a cover medium which can be modified
without degrading the quality of the cover medium and secondly, selecting a
subset of the redundant bits to be replaced with data from secret messages.
The stego medium is created by replacing the selected redundant bits with
message bits. It can be noticed from the above figure that the higher order
bits (especially the top four bits) contain majority of the visually significant
image information. The other bits contribute to the finer details in the
image. In spatial Steganography, a digital image is generally sliced into
different bit-planes and the lower order bit-planes are replaced by secret
messages. As the lower-order bits do not carry much visually significant
image information, replacement of those bits does not make any visible
disruption to the image and the covert message remains secret inside the
Stego image. Figure 3 shows the Stego Lena image and its bit-planes. It can be seen that the Stego image completely conceals the secret messages inside without revealing anything to the viewers. The peak signal-to-noise ratio (PSNR), which is considered an approximation to human perceptual quality, also remains high, 43.3342 in this case, without raising any suspicion in the minds of the viewers.

Figure 3: The Stego Lena image and its 8 bit-planes

As Stego images look exactly the same as the original images, it becomes a challenge for viewers to differentiate the Stegos from their original counterparts.
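For illustration, a minimal sketch of spatial-domain LSB embedding and extraction is given below, assuming an 8-bit grayscale cover held as a NumPy array; it replaces only the 0th bit-plane with message bits and is not the specific embedding tool used for the Stego Lena image above.

# Minimal sketch of spatial-domain LSB embedding (assumed, illustrative only).
import numpy as np

def embed_lsb(cover, message_bits):
    """Replace the least significant bit of the first len(message_bits)
    pixels (row-major order) with the message bits."""
    stego = cover.copy().ravel()
    bits = np.array(message_bits, dtype=np.uint8)
    stego[:bits.size] = (stego[:bits.size] & 0xFE) | bits
    return stego.reshape(cover.shape)

def extract_lsb(stego, n_bits):
    # Read back the 0th bit-plane of the first n_bits pixels.
    return (stego.ravel()[:n_bits] & 1).tolist()

cover = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
secret = [1, 0, 1, 1, 0, 0, 1, 0]
stego = embed_lsb(cover, secret)
assert extract_lsb(stego, len(secret)) == secret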

3. STEGANALYSIS
Steganalysis is the art and science of detecting messages hidden using
Steganography [9]. It can be considered to be a two-class pattern
classification problem which aims to determine whether a medium under
test is a cover or Stego [2]. Hence, the goal of steganalysis is to identify
suspected packages, determine whether or not they have a payload encoded
into them, and, if possible, recover that payload. Steganalysis is similar to
cryptanalysis with a slight difference. In cryptanalysis, it is obvious that the intercepted data contains a message; therefore, the task is just to find the underlying message by decrypting the encrypted text. On the other hand, steganalysis generally starts with a huge set of suspected data files without any prior information about whether they contain any hidden message or not, and, if a secret message is there, which file contains it. Therefore, steganalysis can be viewed as a two-step process: one is to reduce this large set to a smaller subset of files that are most likely to have been altered, and the second is to separate the covert signal from the carrier. Detection of suspected files is straightforward when the original, unmodified carriers are available for comparison, but when only a single image is available, it becomes a tough problem to say whether it is manipulated or not, as steganography attempts to make the Stego indistinguishable from the cover.

Distinguishing a Stego file from a real innocent file is a major task in steganalysis. Because of its global nature, Steganography generally leaves detectable traces in the medium's characteristics, and careful examination of the modified medium can reveal the existence of some secret content in it, defeating the very purpose of Steganography even if the secret content is not exposed. This paper illustrates how Stego images can be easily separated from real innocent ones through histogram analysis (Histanalysis).

3.1 Histogram
A digital image is a two-dimensional function f(x, y), where x and y are spatial coordinates, f is the amplitude at (x, y), also called the intensity or gray level of the image at that point, and x, y and f are finite, discrete quantities.

The histogram of a digital image with gray levels from 0 to L-1 is a discrete
function h(rk)=nk, where, rk is the kth gray level, nk is the number of pixels
in the image with that gray level, n is the total number of pixels in the image
and k = 0, 1, 2, ..., L-1 [10]. In other words, a histogram plot gives the
number of counts of each gray level. In a histogram plot, the horizontal axis
corresponds to gray level values, rk and the vertical axis corresponds to the
values of h(rk)=nk or p(rk)=nk/n if the values are normalized. Histograms are

ISSN: 1694-2108 | Vol. 8, No.1 . DECEMBER 2013 6


International Journal of Computer Science and Business Informatics

IJCSBI.ORG
the basis to various spatial domain processing and provide useful image
statistics. Information inherent in histograms is quite useful for a number of
image processing applications. Figure.4 shows a grayscale image and its
corresponding histogram.

Figure 4: Grayscale Lena image and the corresponding Histogram
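A minimal sketch of the histogram computation described above is given below, assuming an 8-bit grayscale image stored as a NumPy array.

# Minimal sketch: the histogram h(r_k) = n_k counts how many pixels take
# each gray level r_k; p(r_k) = n_k / n gives the normalized version.
import numpy as np

def histogram(image, levels=256):
    counts = np.zeros(levels, dtype=np.int64)
    for value in image.ravel():
        counts[value] += 1
    return counts

image = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
h = histogram(image)
p = h / image.size            # normalized histogram
assert h.sum() == image.size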

3.2 JPEG Compression


JPEG (Joint Photographic Experts Group) is an international compression
standard for continuous-tone grayscale and color still images. JPEG
standard has two basic compression methods. The popular DCT-based
encoding and decoding method offers lossy compression whereas the
predictive method is specified for lossless compression. The ISO/ITU-T
standard defines several modes of JPEG compression such as baseline
sequential, progressive and hierarchical. In baseline mode, the image is
divided into 8x8 non-overlapping blocks, each of which is discrete cosine
transformed. The transformed block coefficients are quantized with a
uniform scalar quantizer, zig-zag scanned and entropy coded with Huffman
coding. In the case of the interlaced progressive JPEG format, data is compressed in multiple passes of progressively higher detail. This is ideal for large images that will be displayed while downloading over a slow
connection, allowing a reasonable preview after receiving only a portion of
the data. Figure 5 shows the block diagrams of baseline sequential JPEG
encoding and decoding steps. [11]

Figure 5: Baseline sequential JPEG encoding and decoding process

ISSN: 1694-2108 | Vol. 8, No.1 . DECEMBER 2013 7


International Journal of Computer Science and Business Informatics

IJCSBI.ORG
3.2.1 Discrete Cosine Transform
Given an image f(x, y) of size NxN, the forward discrete transform F(u, v), and,
given F(u, v), the inverse discrete transform f(x, y), can be obtained by the
general relations:
F(u,v) = \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y)\, g(x,y,u,v)

f(x,y) = \sum_{u=0}^{N-1} \sum_{v=0}^{N-1} F(u,v)\, h(x,y,u,v)

where g(x, y, u, v) and h(x, y, u, v) are called the forward and inverse
transform kernels respectively. In the case of the discrete cosine transform
(DCT), both the forward and inverse transform kernels are given by a single
relation [8]:

g(x,y,u,v) = h(x,y,u,v) = \alpha(u)\,\alpha(v)\,
\cos\frac{(2x+1)u\pi}{2N}\, \cos\frac{(2y+1)v\pi}{2N}

where

\alpha(k) = \sqrt{1/N} \;\text{ for } k = 0, \qquad
\alpha(k) = \sqrt{2/N} \;\text{ for } k = 1, 2, \ldots, N-1.

In a DCT-transformed block F(u, v), the upper left coefficient F(0, 0)
represents the DC component and the rest of the F(u, v) are called AC
components. Significant signal energy of an image lies in the low-frequency
components that appear in the upper left corner of the DCT block, while the
lower right values, representing higher frequencies, are small enough to be
neglected without causing much visible distortion to the image; this is why
DCT offers compression.
3.2.2 Quantization
Quantization is the process of approximating a continuous signal by a set of
discrete values. In this step the quantized blocks C(i, j) corresponding to the
DCT coefficient blocks D(i, j) are obtained using the formula

C(i, j) = \mathrm{round}\!\left(\frac{D(i, j)}{Q(i, j)}\right)
where Q(i, j) is a quantization table of the selected quality. After this
quantization step, the values of the high-frequency AC coefficients of the DCT
block usually become zero, providing a lossy compression. The non-zero DC
coefficients of each block are then coded separately using Huffman's lossless
entropy encoding algorithm. Figure 6(b) shows the DCT coefficients of the
matrix A (given in Figure 6(a)) after quantization by Q50 (Figure 6(c)) and
rounding.

Figure 6: An 8x8 matrix and its coefficients after the DCT and Quantization steps
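To make the baseline pipeline concrete, the following Python/NumPy sketch (an editorial illustration, not the authors' code) applies the DCT kernel defined above to one 8x8 block and then quantizes it; the table Q50 reproduced here is the commonly quoted JPEG luminance quantization table for quality 50.

import numpy as np

def dct2(block):
    """2-D DCT of an NxN block using the kernel g(x, y, u, v) defined above."""
    n = block.shape[0]
    a = np.array([np.sqrt(1.0 / n)] + [np.sqrt(2.0 / n)] * (n - 1))
    x = np.arange(n)
    # c[u, x] = alpha(u) * cos((2x + 1) u pi / (2N))
    c = a[:, None] * np.cos((2 * x[None, :] + 1) * np.arange(n)[:, None] * np.pi / (2 * n))
    return c @ block @ c.T

# Commonly quoted JPEG luminance quantization table (quality 50).
Q50 = np.array([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
])

def quantize(d, q):
    """C(i, j) = round(D(i, j) / Q(i, j))."""
    return np.round(d / q).astype(int)

block = np.full((8, 8), 140.0)                    # a flat 8x8 block as a toy example
coeffs = quantize(dct2(block - 128.0), Q50)       # level-shift, transform, quantize
print(coeffs)                                     # only the DC position is non-zero for a flat block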

4. EXPERIMENTAL RESULTS AND DISCUSSIONS

Natural photographs are true-tone images, so the gray levels in these images
generally show a continuous variation between a minimum and a maximum
gray value. For example, in the Lena image of Figure 4 there are pixels with
every gray value between 11 and 242. But when one or more bit-planes of an
image are replaced by some other data, the gray counts for certain bins
increase or decrease at random, giving rise to a kind of discontinuity in the
gray value counts. In Figure 7, image (a) is an original image (without any
secret message embedded) captured in PNG format and figure (b) shows the
histogram of gray level counts of this image, where the x-axis represents the
bins and the y-axis the gray level counts. Figure (e) shows the histogram of
the same image after saving it in JPEG. In both these histograms (given in
(b) and (e) below), it can be seen that the gray counts vary continuously
between the minimum and maximum bin. Figure 7(c) is a stego image formed
by replacing the LSB bits of image (a) with a secret message and saving it in
TIFF format.

Figure 7: Histograms of original and Stego images saved in PNG, TIFF and JPEG formats

The histogram plot of this image is given in Figure 7(d), and it is clear from
the plot that the gray counts for certain bins have increased (e.g., for some
bins between 50 and 100) whereas for some other bins they have decreased,
introducing a kind of discontinuity to the histogram plot, which looks as if
two different histograms were superimposed on each other. This discontinuity
increases further, as shown in Figure 7(e), when the Stego image is saved in
JPEG format and the histogram is plotted again. The gray level counts are
then clearly partitioned into different gray zones and the histogram shows a
clear sign of discreteness, with alternating peaks and valleys, which is not a
usual phenomenon for photographic images.

To establish this finding, we plotted the histograms of a number of grayscale
images after replacing one, two and three lower-order bit-planes with text
messages and then saving the stego images both in .tiff and .jpg formats.
Since the luminance component of a color image is equivalent to a grayscale
image, and since grayscale images are generally considered suitable for
steganography, we used grayscale images for our experiments. MATLAB was
used for bit-slicing, embedding and creating the .TIF stego images. The TIFF
images were then converted to .jpg using the Windows Paintbrush tool.
Histograms were plotted using MATLAB's imhist function, giving the bins on
the x-axis and the gray counts on the y-axis. Figure 8 shows the results of
this operation performed on the 128x128 grayscale Lena image.

Figure 8: Histograms of the 128x128 Lena (tiff) image without and with Stego performed
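The paper's bit-slicing and embedding were done in MATLAB; the sketch below is a rough Python/NumPy approximation (added in editing, with hypothetical file names lena.png and stego.tif) of replacing the k lowest bit-planes of a grayscale image with message bits and saving the result losslessly, which is the operation whose histogram effects are shown in Figure 8.

import numpy as np
from PIL import Image

def embed_lsb(img, message: bytes, k: int = 1):
    """Replace the k lowest bit-planes of img (uint8) with message bits."""
    bits = np.unpackbits(np.frombuffer(message, dtype=np.uint8))
    flat = img.ravel().copy()
    capacity = flat.size * k
    bits = np.resize(bits, capacity)                  # repeat/truncate to fill every pixel
    payload = bits.reshape(-1, k)
    # pack the k bits per pixel into an integer 0 .. 2^k - 1
    values = payload @ (1 << np.arange(k)[::-1])
    stego = (flat & ~np.uint8((1 << k) - 1)) | values.astype(np.uint8)
    return stego.reshape(img.shape)

# Hypothetical usage: 'lena.png' and 'stego.tif' are placeholder file names.
cover = np.array(Image.open("lena.png").convert("L"))
stego = embed_lsb(cover, b"secret message", k=1)
Image.fromarray(stego).save("stego.tif")              # lossless TIFF keeps the LSBs intact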

Figure 8(a) shows the histogram of the original Lena (tiff) image; (b) is the
histogram of stegolena after the LSB plane is replaced with text and saved in
.tif format; (c) shows the histogram of the LSB-replaced stegolena saved in
JPEG format. (d), (e), (f) and (h) show the histograms of the 2-bit-plane and
3-bit-plane replaced stegolena images (TIFF) and their JPEG counterparts,
respectively. Histogram counts of the original Lena image and of the
stegolena are plotted in (j), and those of the stegolena after saving it in JPEG
are given in Figure 8(k).

Though JPEG images with a high compression ratio (low quality factor) also
produce histograms of the type of Figure 8(b), as discussed in section 2.1
JPEG images are considered very poor candidates for spatial Steganography
and are generally not used for this purpose. Secondly, after data is embedded
into bit-planes, Stego images are never saved in lossy compression formats
such as JPEG, as these formats discard the lower-order bits (and with them
the secret messages embedded in those bits!) in order to achieve compression.
Thirdly, uncompressed natural images never show a histogram pattern like
that of Figure 8(b) or 8(d). Therefore, when a raw-format image such as BMP
or TIFF produces a histogram with such an unexpected pattern, it can be
suspected to be a Stego medium.

5. SUMMARY AND CONCLUSION

Today, digital images not only provide forged information but also work as
agents of secret communication. With a wide range of easy Steganographic
methods available, these popular digital media are used for secret data
transmission, sometimes with legitimate goals and sometimes for immoral
purposes. A lot of work has been done on steganalysis and tamper detection
techniques, and researchers worldwide are still working to successfully
detect manipulations made to digital images. In this paper we have used an
extremely easy but highly effective histanalysis method to detect spatial
domain digital image Steganography.

REFERENCES
[1] Minati Mishra, P. Mishra, MC Adhikary, Digital Image Data Hiding Techniques:
A Comparative Study, ANVESA- the journal of F. M. University, (ISSN 0974-
715X), Vol. 7, Issue 2, Pp.105-115, Dec 2012.
[2] Bin Li et al., A Survey on Image Steganography and Steganalysis, Journal of
Info. Hiding and Multimedia Signal Processing, ISSN 2073-4212, Vol-2, No-2,
pp142-172, Apr 2011.
[3] http://en.wikipedia.org/wiki/Steganography
[4] Neil F. Johnson: Steganography: Seeing the Unseen, George Mason University,
http://www.iitc.com/stegdoc/sec202.html

[5] M. Mishra et al., High Security Image Steganography with modified Arnolds cat
map, IJCA, Vol.37, No.9:16-20, January 2012.
[6] M. Mishra, M. C. Adhikary, Digital Image Tamper Detection Techniques: A
Comprehensive Study, International Journal of Computer Science and Business
Informatics (ISSN: 1694-2108), Vol. 2, No. 1, Pp. 1-12, JUNE 2013.
[7] Aura, T., Invisible communication, Proc. of the HUT Seminar on Network
Security 95, Espoo, Finland, November 1995. Telecommunications Software and
Multimedia Laboratory, Helsinki University of Technology.
[8] Fridrich, J. and Du, R., Secure Steganographic Methods for Palette Images,
Proc. The 3rd Information Hiding Workshop, September 2830, Dresden,
Germany, LNCS vol. 1768, Springer-Verlag, New York, pp. 4760, 1999.
[9] http://en.wikipedia.org/wiki/Steganalysis
[10] RC Gonzalez, RE Wood: Digital Image Processing, 2nd Ed, PHI, New Delhi,
2006.
[11] http://en.wikipedia.org/wiki/JPEG


Minimizing the Time of Detection of Large (Probably) Prime Numbers
Dragan Vidakovic
Assistant Gimnazija
Ivanjica, Serbia

Dusko Parezanovic
Assistant Gimnazija
Ivanjica, Serbia

Zoran Vucetic
Assistant Gimnazija
Ivanjica, Serbia

ABSTRACT
In this paper we present experimental results that suggest, more clearly than any theory,
an answer to the question: when, in the detection of large (probably) prime numbers,
should the very resource-demanding Miller-Rabin algorithm be applied? Or, to put it
another way, when should dividing by the first several tens of prime numbers be replaced
by primality testing? As an innovation, the procedure above is supplemented by
considering the use of the well-known Goldbach conjecture in solving this and some other
important questions about the RSA cryptosystem, always guided by the motto "harm
neither the security nor the time spent".
Keywords
Public key cryptosystems, Prime numbers, Trial division, Miller-Rabin algorithm,
Goldbach conjecture.
1. INTRODUCTION
In asymmetric schemes [1] for protecting the confidentiality and integrity of
data there is a need for large prime numbers. For some tasks the required
number of bits now exceeds 15,000, and even that is just a passing figure in
the endless game between those who protect data and those who attack it. It
is therefore quite clear that the time spent on detection of large prime numbers
must be as short as possible.
It would be best to check the divisibility of a number n by all prime numbers
less than or equal to sqrt(n). However, with so many bits this is not realistic.
Therefore, the number being tested for primality is first divided by the first
several tens of prime numbers and then, if it is not divisible by any of them, it
is handed to the Miller-Rabin algorithm [1].
It is a very difficult task to find, theoretically, an optimum ratio between the
time required for dividing the number and for testing it with the Miller-Rabin
algorithm.

It is perhaps also a redundant task in terms of our needs, since in practical
tasks such as ours we have no reason to pretend that computers do not exist,
or that experimentally obtained, very useful results are less valuable than
values obtained theoretically.
As a useful tool for our task (minimizing the time required for detection of
prime numbers) we see the Goldbach conjecture [2], which states that every
(large for us) even number is the sum of two prime numbers. It may forever
remain a conjecture, or one day some talented mathematician may write a
book of hundreds of pages that will prove its truth, or some computer may
find the number for which it is not valid, and with that break the conjecture.
For those of us looking for large prime numbers none of these three matters.
We will, in any case, generate a random large even number 2n, of, say,
1024 bits, and detect a much smaller random prime number of, say, 128 or
256 bits, which is negligible in terms of time, and then verify that the
difference of those two numbers is a prime number. If so, we have a large
prime number. If not, we will repeat the procedure, or we will use this
difference to generate the prime number closest to it by a combination of
division by the first prime numbers and the Miller-Rabin algorithm. We will
show experimentally that the above procedure can also save the time required
to detect a large prime number.
2. WHY WE NEED PRIME NUMBERS
Public key cryptography (PK) [1][3], a major breakthrough in the field of
data secrecy and integrity protection, is mostly based on the assurance, never
mathematically proved, that some mathematical problems are difficult to
solve. Two of them are particularly prominent and widely used.
Since we opted for the RSA [1][3] mechanism, we will point to one of them.
The multiplication of two large prime numbers is a one-way function [1],
which means that we can easily compute the product. However, factoring that
product back into its prime factors has turned out to be very difficult. This
factoring problem and the problem of identifying the private key d in public
key cryptography, when the public key pair (n, e) is known, are two
equivalent problems [1][3].
Certainly, there are many other PK schemes and asymmetric algorithms apart
from RSA. They are based on the same kind of problem, difficult to solve in
practice if the number of digits is large enough, and by means of these
schemes a one-way function with a trap door is created.
Technological development and progress in algorithms for integer
factorization have demonstrated the need for larger and larger prime
numbers. This means that their product will consist of more and more digits.
The competition between those who attack unprotected data and those who
protect them using the RSA mechanism requires the creation of faster
operations for dealing with large numbers. The new arithmetic requires more
efficient code for addition, subtraction, multiplication and division of large
numbers, and what is particularly significant is to perform modular
exponentiation in the most efficient way possible [4].
This makes sense only if special attention is paid to the creation of one's own
large (probable) prime numbers, since the use of such numbers available on
the Internet or obtained in any other way is not in accordance with the very
aim of data protection. Since the process of large prime generation requires a
lot of time and computer resources [1][4], it is of particular interest to us to
find a way to avoid applying the primality testing algorithm to the number as
much as possible.
3. EXPERIMENTAL RESULTS
In order to avoid unnecessary applications of the Miller-Rabin algorithm to
the number in question, we resort to trial division by a few initial prime
numbers, since such a division takes less time.
How far we should go with such a division is the question we are trying to
answer in this paper. In theory the matter is fully resolved; in practice,
however, that is of little use to us.
Trial division takes less time than exponentiation [1][4], but it would
certainly be wrong to conclude that we should divide the number by as many
primes as possible. It is very difficult to determine the real relation between
the two, since everything depends on the number we start with and on the
odd numbers we examine in order to generate a probable prime.
Therefore, we present two solutions that are probably irrelevant to theorists,
but very useful to people who have spent many nights producing large
(probably) prime numbers using their own software [4].
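The authors use their own implementation [4]; the following Python sketch (added for illustration, not the authors' code) shows the combined procedure discussed here: trial division by the first k primes, and only then the Miller-Rabin test.

import random

def first_primes(k):
    """Return the first k primes by a simple incremental search."""
    primes = []
    n = 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

def miller_rabin(n, rounds=20):
    """Standard Miller-Rabin probabilistic primality test for odd n > 3."""
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False            # a composite witness was found
    return True                     # probably prime

def is_probable_prime(n, trial_primes):
    """Trial division by the first primes; hand survivors to Miller-Rabin."""
    for p in trial_primes:
        if n % p == 0:
            return n == p
    return miller_rabin(n)

# Example: search for the next probable prime above a random 512-bit odd number,
# dividing by the first 60 primes before each Miller-Rabin call.
primes60 = first_primes(60)
n = random.getrandbits(512) | 1
while not is_probable_prime(n, primes60):
    n += 2
print(n)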
3.1 Dividing by the First Several Tens of Prime Numbers
In this section we show the results of detecting prime numbers of 513, 1024
and 1500 bits, namely: without dividing by prime numbers, and dividing by
the first 10, 20, 30, ..., 100 prime numbers.
Example 1
If we start with number c with ones in places: c[512]:=1; c[255]:=1;
c[200]:=1;c[127]:=1; c[100]:=1; c[50]:=1; c[10]:=1; c[9]:=1; c[8]:=1;
c[7]:=1; c[2]:=1; c[1]:=1; c[0]:=1; by dividing and testing we intend to
detect first prime number with ones in places: c[512]:=1; c[255]:=1;
c[200]:=1;c[127]:=1; c[100]:=1; c[50]:=1; c[11]:=1; c[9]:=1; c[6]:=1;
c[3]:=1; c[2]:=1; c[1]:=1; c[0]:=1; as a result we have the following table:

ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 3


International Journal of Computer Science and Business Informatics

IJCSBI.ORG
TABLE 1. The Timing of Detection of a Prime Number
a 353 110 91 81 73 72 67 66 65 65
b 0 10 20 30 40 50 60 70 80 90
c 1455 466 398 361 337 342 329 330 334 337
d 1455 453 375 334 301 297 276 272 268 268
e 0 13 23 27 36 45 53 58 66 69

a 63 62 61 61 61 60 58 57 57 56
b 100 110 120 130 140 150 160 170 180 200
c 326 328 331 337 343 345 343 344 353 358
d 260 255 251 251 251 247 239 235 235 231
e 66 73 80 86 92 98 104 109 118 127

The row labels have the following meaning:
a - number of passes through the Miller-Rabin (MR) algorithm
b - number of first prime numbers by which we divide the tested number before passing it through the MR algorithm
c - total time needed for detection of a prime number
d - time spent on the MR algorithm
e - time spent on division by the first prime numbers
(all times are expressed in seconds).
It is clear from the table that searching for prime numbers without dividing
by the first prime numbers is not an option; it is an unnecessary waste of
time. Dividing by the first ten prime numbers is a minimum. Dividing by 60,
70, ... would be a good choice. Choosing a few tens of primes more or fewer
makes only a small saving or a small loss of time and does not significantly
affect the quality of our task.
Example 2
If we start with number c with ones in places: c[1023]:=1; c[767]:=1;
c[512]:=1; c[255]:=1; c[127]:=1; c[100]:=1; c[50]:=1; c[10]:=1; c[9]:=1;
c[8]:=1; c[7]:=1; c[2]:=1; c[1]:=1; c[0]:=1; by dividing and testing we
intend to detect the first prime number with ones in places: c[1023]:=1;
c[767]:=1; c[512]:=1; c[255]:=1; c[127]:=1; c[100]:=1; c[50]:=1; c[11]:=1;
c[10]:=1; c[4]:=1; c[2]:=1; c[1]:=1; c[0]:=1; as a result we have the
following table:
TABLE 2. The Timing of Detection of a Prime Number

a 584 178 129 115 111 107 105 101
b 0 10 30 50 60 70 80 100
c 18144 5583 4251 3872 3778 3711 3734 3792
d 18144 5518 3999 3565 3441 3317 3255 3131
e 0 65 252 307 337 394 479 661

A minimum below which we should not go in generating 1024-bit prime
numbers is dividing by the first ten primes, which reduces the time of
detection of a prime number about three times. Dividing by the first 60 to
100 primes reduces the time of detection about five times, so these values
may be a good choice.
Example 3
If we start with number c with ones in places: c[1499]:=1; c[1023]:=1;
c[767]:=1; c[512]:=1; c[255]:=1; c[127]:=1; c[100]:=1; c[50]:=1; c[10]:=1;
c[9]:=1; c[8"1; c[7]:=1; c[2]:=1; c[1]:=1; c[0]:=1; by dividing and testing
we intend to detect first prime number with ones in places: c[1499]:=1;
c[1023]:=1; c[767]:=1; c[512]:=1; c[255]:=1; c[127]:=1; c[100]:=1;
c[50]:=1; c[10]:=1; c[9]:=1; c[8]:=1; c[7]:=1; c[6]:=1; c[5]:=1; c[3]:=1;
c[2]:=1; c[1]:=1; c[0]:=1; as a result we have the following table:
TABLE 3. The Timing of Detection of a Prime Number

a 50 16 15 14 12 12 12
b 0 10 20 30 40 50 60
c 4805 1551 1458 1387 1175 1178 1256
d 4805 1538 1442 1345 1153 1153 1153
e 0 13 16 42 22 25 103

We may draw similar conclusions in the case of generating a 1500-bit
number: dividing by the first 50-60 prime numbers is a very good choice.
3.2 Using the Goldbach Conjecture
Goldbach set out the conjecture that "every even number (2n) larger than four
is the sum of two (odd) prime numbers" [2].
Our idea is to detect a prime number p less than n and then to examine
whether the difference (2n - p) is a prime number. We choose p to be less
than n (in order to detect it faster) and 2n to be a large number; if 2n - p is
prime it gives us a large prime number, and we avoid searching through the
upper part (numbers larger than n), which is a far more time-demanding job
than detecting prime numbers in the lower part (numbers less than n).
Of course, all of this is possible only if there are enough pairs with prime
coordinates among all pairs of numbers (p, q), where p + q = 2n, p is prime
and q is odd.
TABLE 4. Number of GC Pairs

Even number           2^20     2^21     2^25      2^26      2^27
Number of GC pairs    4244     7492     83543     153881    283830
Number of (*1) pairs  43458    82125    1078257   2064123   3958400
% (GC) in (*1)        9.77%    9.12%    7.75%     7.45%     7.17%

Table 4 shows that the Goldbach conjecture can be a useful tool in our task,
because there is a probability, though not a large one, of guessing the large
prime number directly. A possible loss of time in detecting the prime number
for the first coordinate is negligible, because it is a number less than n, and it
is particularly negligible compared to the possibility of immediately detecting
the other prime coordinate, a large prime number. We can get a more
favorable result if we consider the set (*2) = {(p, r)} for a given number 2n,
p ≤ n-1, r ≥ n+1, where p is a prime number, r is an odd number from the set
{6k+1, 6k-1} and p + r = 2n. In any case, it is clear that this process cannot
increase the time of detecting a large prime number, while under favourable
conditions it can significantly reduce it.
With our own software we conducted an experiment whose aim was to find
all pairs (p, q) for a given number 2n, p ≤ n-1, q ≥ n+1, where p is a prime
number, q is an odd number and p + q = 2n (the set (*1)); then, among these
pairs, to find those in which the second coordinate is also prime (pairs of the
Goldbach conjecture (GC)); and to measure the time of finding the number of
representations of 2n which satisfy the Goldbach conjecture.
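As a rough illustration of this idea (added in editing, not the authors' code; it reuses the is_probable_prime and first_primes sketches from section 3), the following Python fragment generates a random even number 2n, detects a much smaller prime p, and checks whether 2n - p is itself a probable prime.

import random

def goldbach_candidate(even_bits=1024, small_bits=128, trial_primes=None):
    """Try to obtain a large probable prime as 2n - p for a small prime p."""
    two_n = random.getrandbits(even_bits) & ~1 | (1 << (even_bits - 1))   # random even number of the requested size
    p = random.getrandbits(small_bits) | 1
    while not is_probable_prime(p, trial_primes):       # cheap: p is small
        p += 2
    q = two_n - p
    if is_probable_prime(q, trial_primes):              # lucky: 2n - p is already prime
        return q
    while not is_probable_prime(q, trial_primes):       # otherwise search near the difference
        q += 2
    return q

large_prime = goldbach_candidate(trial_primes=first_primes(60))
print(large_prime.bit_length(), "bits")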
4. SOME FURTHER OBSERVATIONS
If we consider the time of finding all GC pairs of some even number 2n, we
can see that, as n increases, the time needed to find them increases
significantly.

TABLE 5. Time to Find the GC Pairs

Even number          2^24           2^25            2^26            2^27
Number of GC pairs   45752          83543           153881          283830
Time to find
the GC pairs         56 min 42 s    2 h 1 min 14 s  3 h 29 min 48 s 7 h 18 min 17 s

4.1 Point Multiplication
One of the most important operations for all applications of elliptic curve
cryptosystems (ECC) [5] is scalar point multiplication. Scalar point
multiplication consists of calculating the value of an integer multiplied by a
point, by doing a series of point doublings and additions until the product
point is reached.
Example
For the NIST recommended prime field Fp with p = 2^192 - 2^64 - 1 and base
point P(XG, YG) [6]:
XG = 0x 188da80eb03090f67cbf20eb43a118800f4ff0afd82ff1012
YG = 0x 07192b95ffc8da78631011ed6b24cdd53f977a11e794811, and
k=10000000000000000000000000000000000000000000000000000000000
1000000000000000000000000000000000000000000000000010000000000
0000000000000000000000000000010001101011
The coordinates of the point Q(X, Y) = k * P(XG, YG) are [6]:
X: 0x a9355c37074c8195faa23d1d071997fe4ea7a2bdec781047
Y:0x 911a232aabeba33b5e8f29743f2837955cd5bf1f74aa9a24
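Scalar point multiplication is usually implemented with the double-and-add method; the sketch below is an editorial Python illustration (not the authors' implementation) of that method for a short Weierstrass curve y^2 = x^3 + ax + b over Fp, using affine coordinates and Python's modular inverse.

def point_add(P, Q, a, p):
    """Add two curve points in affine coordinates; None denotes the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                        # P + (-P) = infinity
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # tangent slope (doubling)
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # chord slope (addition)
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)

def scalar_mult(k, P, a, p):
    """Compute k*P by a series of doublings and additions over the bits of k."""
    R = None
    for bit in bin(k)[2:]:
        R = point_add(R, R, a, p)                          # double
        if bit == '1':
            R = point_add(R, P, a, p)                      # add
    return R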
Very similarly: even if we know 2n = 2^27, it takes more than seven hours to
find the number of its GC representations. And if we know only the number
of GC representations, it is impossible to guess the number whose
representations those are, which resembles a Goldbach "point multiplication";
the "similarities" with ECC end there, because we cannot define the addition
of points. But the question of using this sort of Goldbach point transformation
(from a starting point to the point (2n, (*1), (GC))) remains open for further
research.

4.2 The Possibility of Applying This to RSA

The public key technique developed by Rivest, Shamir and Adleman is
known as the RSA algorithm. The security of this approach is based on the
fact that it is relatively easy to multiply large primes together but almost
impossible to factor the resulting product. RSA has become the algorithm
that most people associate with the notion of public key cryptography. The
technique literally produces public keys that are tied to specific private keys.
If Alice has a copy of Bob's public key she can encrypt a message to him, and
he uses his private key to decrypt it. RSA also allows the holder of a private
key to encrypt data with it so that anyone with a copy of the public key can
then decrypt it. While public decryption obviously doesn't provide secrecy,
the technique does provide a digital signature, which attests that a particular
crypto transform was performed by the owner of a particular private key
[1][3].
RSA keys consist of three special numeric values that are used in pairs to
perform encryption or decryption. The public key value is generally a
selected constant, recommended to be either 3 or 65537. After choosing the
public key value we generate two large prime numbers P and Q. The private
key value is derived from P, Q and the public key value. The distributed
public keying material includes the constant public key value and the
modulus N, which is the product of P and Q. The modulus is used in both the
encryption and decryption procedures, whether the public or the private key
is used. The original primes P and Q are discarded [1][3].
Key generation for RSA encryption:
Each entity creates an RSA public key and a corresponding private key [1].
Algorithm
Each entity A should do the following:
- Generate two large distinct random primes p and q, each roughly the same size.
- Compute n = pq and φ = (p - 1)(q - 1).
- Select a random integer e, 1 < e < φ, such that gcd(e, φ) = 1.
- Use the extended Euclidean algorithm [1] (Algorithm 2.107) to compute the unique integer d, 1 < d < φ, such that ed ≡ 1 (mod φ).
- A's public key is (n, e); A's private key is d.
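A direct Python transcription of this algorithm (an editorial sketch, not the authors' implementation; it reuses the is_probable_prime and first_primes helpers from the section 3 sketch) could look as follows.

import math
import random

def generate_rsa_key(bits=1024, e=3, trial_primes=None):
    """Generate (n, e) and d following the key generation algorithm above."""
    def random_prime(b):
        while True:
            candidate = random.getrandbits(b) | (1 << (b - 1)) | 1    # b-bit odd number
            if is_probable_prime(candidate, trial_primes):
                return candidate

    while True:
        p = random_prime(bits // 2)
        q = random_prime(bits // 2)
        phi = (p - 1) * (q - 1)
        if p != q and math.gcd(e, phi) == 1:
            break
    n = p * q
    d = pow(e, -1, phi)              # modular inverse, i.e. ed ≡ 1 (mod φ)
    return (n, e), d

# Example: encrypt and decrypt a small message m < n.
(public, d) = generate_rsa_key(512, trial_primes=first_primes(60))
n, e = public
m = 0b10100100110000101110100       # the example message bits from Example 1 below
assert pow(pow(m, e, n), d, n) == m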
Example 1
Let the message m [4] be: rat, or in binary: m = (0)10100100110000101110100
Select two 128-bit primes:
p:
1000000000000000000000000001000000000000000000000000000000000
0000000000000000100000000000000000000000000000000000000010101
011111.

q:
1000000000000000000000000001000000000000000000000000000000000
0000000000000000100000000000000000000000000000100000000000111
001011
We compute
n = pq:
1000000000000000000000000010000000000000000000000000001000000
0000000000000001000000000000000000000000001000100000000011100
1010100000001000000000111001010110000000000000000000000000000
0100000000011100101010000000000000000000010101011111100110100
00101010101.
φ = (p - 1)(q - 1):
1000000000000000000000000010000000000000000000000000001000000
0000000000000001000000000000000000000000001000100000000011100
1010000000001000000000111001010010000000000000000000000000000
0100000000011100101000000000000000000000010101011110100110011
01000101100.
Public key e = 3 or binary 11.
Secret key d:
1010101010101010101010101101010101010101010101010101100000000
0000000000000001010101010101010101010101100000101010101111011
1000000000001010101011110111000010101010101010101010101010101
1010101011010000110101010101010101010101110001111110001000100
0101110011.
Encrypted message m:
1000100001111110101010110100101100001100010110010011100000010
1000000.
Example 2
Let the message m [4]: Ratne godine!
binary:
m=(0)10100100110000101110100011011100110010100100000011001110
11011110110010001101001011011100110010100100001
Encrypted message m:
1011111001011001110110101110101110011111011010011010100110000
0011011001000101010101000010100101001110001011010101101100111
1011000001001001100000100100110111110100100101110001001001010
0111101110011001100001000111000010101011010010101010010000010
1011110101.

Now we point out two possible connections between RSA and the Goldbach
conjecture.
4.2.1 The First Possibility
Only powerful computers can calculate the (GC) numbers for values of 1024,
2048 or more bits. We have no reason not to believe that (*1) and (GC)
become larger and larger numbers and that, at the same time, they are
(probably) unique for a given number 2n. Even if various even numbers had
the same representation it would not matter for us, because we will create a
table that contains, in each row for a given even number, the hash values of
(*1) and (GC).
For large, probably prime numbers p1 and q1 we will calculate the number
2n = p1 * q1 + 1.
For this number we will find (*1) and (GC) and their hash values h(*1) and
h(GC). The procedure will be repeated k times and the table of k rows (each
of which contains an even number and its corresponding values h(*1) and
h(GC)) will be delivered to users in a safe manner.
Instead of the pair (2n - 1, e) as a public key, we suggest that the first part of
the public key, instead of 2n - 1, be h(*1) and h(GC), based on which the
user would determine the number 2n by reading the table, and therefore the
number 2n - 1 (the RSA modulus) would be known.
It is clear that this procedure does not weaken RSA. It just makes it more
difficult for those who intend to reveal the secret key, because before using
algorithms for finding the prime factors of the number 2n - 1 = p1 * q1, that
number must first be determined, which is very difficult for large numbers if
only h(*1) and h(GC) are known.
4.2.2 The Second Possibility
Another possibility would be to publish the number 2n, which implicitly
publishes (GC) too. (GC) may then be the other part of the key pair, the
public key for RSA (in the standard notation e), if gcd((GC), φ) = 1, or else
the first number greater than it that is relatively prime to φ, where
φ = (p1 - 1)*(q1 - 1). This would be a semi-public key cryptosystem, as the
users, in addition to the secret key d, obtain in the same safe way the public
key e, while others, who have bad intentions, must first find the (implicitly
published) public key e, which is a very time-demanding job, and only then
can they attempt the disclosure of the secret key d. It is clear that in the
meantime we can change the parameters of RSA and thus further complicate
efforts to breach the confidentiality and integrity of our data.
5. CONCLUSIONS
This work, too, is in line with our belief that it is necessary for each country
to protect the confidentiality and integrity of its data using its own software
[7]. Good experts are a prerequisite for this, and they cannot exist without an
increased interest of young people in cryptography. We believe that there is
no such interest without a more interesting approach to cryptography, and
implementing cryptographic algorithms and experimenting with one's own
software is the best way to achieve that. To this end we have written this
paper, dealing with a topic important for practical cryptography: minimizing
the time of detection of large (probably) prime numbers.
The consideration of our problem naturally led to Goldbach's conjecture [2].
We have noticed that Goldbach's conjecture can find its place in cryptography
because its assumed property can only reduce the detection time, never
increase it. It is quite possible that Goldbach's conjecture can play an
important role in hindering the intentions of an unauthorized user to find out
the secret key mathematically, and thus to compromise the integrity and
confidentiality of our data.
Regarded more widely, we believe that the use of Goldbach's conjecture can
slow down the trend of massive transition from RSA to ECC. In that case, the
increase in the number of bits would no longer be the only asset of RSA.

REFERENCES
[1] A. Menezes, P.C. Van Oorschot, S. Vanstone, Handbook of Applied Cryptography,
CRC Press, New York, 1997.
[2] Goldbach, C., Letter to L. Euler, June 7, 1742.
[3] R. Smith, Internet Cryptography, Addison-Wesley, Reading, MA, October 1997.
[4] D. Vidakovic, Analysis and implementation of asymmetric algorithms for data
secrecy and integrity protection, Master Thesis (mentor J. Golic), Faculty of
Electrical Engineering, Belgrade, Serbia, 1999.
[5] Koblitz, N., Elliptic Curve Cryptosystems, Mathematics of Computation, 48, pp. 203-209, 1987.
[6] D. Vidakovic, D. Parezanovic, Generating keys in elliptic curve cryptosystems,
International Journal of Computer Science and Business Informatics, Vol. 4, No. 1,
August 2013.
[7] D. Vidakovic, D. Simic, A Novel Approach to Building Secure Systems, Second
International Conference on Availability, Reliability and Security, 1st IEEE
International Workshop on Secure Software Engineering (SecSE 2007), Vienna,
Austria, 2007, pp. 1074-1081.


Design of ATL Rules for Transforming UML 2 Sequence Diagrams into Petri Nets
Elkamel Merah
University of Khenchela, MISC Laboratory, University of Constantine, Algeria

Nabil Messaoudi
University of Khenchela, MISC Laboratory, University of Constantine, Algeria

Dalal Bardou
University of Khenchela, Algeria

Allaoua Chaoui
MISC Laboratory, University of Constantine, Algeria

ABSTRACT
UML 2 sequence diagrams are a well-known graphical language and are widely
used to specify the dynamic behaviors of transaction-oriented systems. However,
sequence diagrams are expressed in a semi-formal modeling language and need a well-
defined formal semantic base for their notations. This formalization enables analysis and
verification tasks. Many efforts have been made to transform sequence diagrams into
formal representations including Petri Nets. Petri Nets are a mathematical tool
allowing formal specification of the system dynamics and they are commonly used in
Model Checking. In this paper, we present a transformation approach that consists of
a source metamodel for UML 2 sequence diagrams, a target metamodel for Petri Nets and
transformation rules. This approach has been implemented using the Atlas
Transformation Language (ATL). A Cellular Phone System is considered as a case
study.
Keywords
UML 2, Sequence diagrams, Petri Nets, Model checking, Model transformation,
Metamodeling, Transformation rules, ATL.

1. INTRODUCTION
The Unified Modeling Language (UML) [2] is a general-purpose graphical
object-oriented modeling language that is designed to visualize, specify,
construct and document software systems in both their structural and
behavioral aspects. UML is intended to be a common way of capturing and
expressing relationships and behaviors in a notation that is easy to learn and
efficient to write [17].


In 1997, UML [2] was accepted by the Object Management Group (OMG)
[13][17]. Since then, UML has gone through several revisions and refinements
leading up to the current UML, revision 2 [13][17]. This revision represents
the cleanest, most compact version of UML. Today, UML is widely accepted
by the software engineering community as a standard in industry and
research.
UML 2 provides several categories of diagrams to specify different aspects of
the system, such as the structural or behavioral aspect. For behavior-intensive
systems, the dynamic behavior is the most critical aspect to take into account.
Sequence Diagrams (SDs), which are considered in this paper, belong to the
behavioral diagrams, like communication diagrams. They are collectively
known as interaction diagrams. Communication diagrams are used to
understand and document the interactions between the objects and also to
show how the classes work together to achieve a goal [11]. Sequence
diagrams emphasize the type and order of messages passed between elements
during execution [17]. We selected SDs from the UML 2 interaction diagrams
since they are the most common type of interaction diagram and are very
intuitive to new users of UML [17].
UML models of the interaction category are generally transformed for
verification and validation purposes. This is because dynamic models, such as
SDs, lack sufficient formal semantics [9]. Moreover, UML was created as a
semi-formal modeling language; it does not include a formal semantics [4].
This limitation makes rigorous analysis difficult, which leads to ambiguous
models and to problems with modeling process concurrency, synchronization
and mutual exclusion [21]. On the other hand, one of the most important
problems of the design phase in software engineering is to verify everything
that has been designed before going to the implementation phase, because
starting the implementation phase before verifying the design phase is a big
risk in big projects [11].
Thus, the production of new technologies for verification and validation of
UML models seems very crucial, and converting UML to some mathematical
model, in order to formalize and validate it, can be a very important task.
Much research has been performed in order to transform UML models into a
formal model [11]. In our approach, the formal model is Petri Nets (PNs)
[12][18]. Petri Nets can model, like automata and other formalisms, the
behavior of systems exhibiting concurrency. Since PNs are a formal model
and have a mathematical representation with a well-defined syntax and
semantics, they do not carry any ambiguity and are thus able to be validated,
verified and simulated.


The suggested approach is mainly based on the technique of metamodel
transformations [5]. Such an approach consists in defining the source
metamodel of sequence diagrams, defining the target metamodel of Petri
Nets, and defining the transformation rules. Our transformation contributes to
the ongoing attempt to develop a formal semantics of UML [13] based on
model transformations [5]. On the basis of this transformation it is possible
to accomplish verification of the dynamic model of the real system. All these
reasons motivate the work to map, or transform, UML 2 sequence diagrams
to Petri Nets. To achieve this goal, this paper proposes a set of rules for this
transformation.
The rest of this paper is organized as follows. In section 2, we discuss related
work. In section 3, we briefly review the features of UML 2 interactions and
sequence diagrams and we briefly introduce Petri Nets. Section 3 also
describes both the source and target metamodels suited to the transformation.
We then show in section 4 how we translate a sequence diagram into a
behaviorally equivalent Petri Net (PN). Section 5 presents the application of
the proposed transformation rules to a Cellular Phone System. Section 6
presents the implementation of the system design transformation process. We
finally conclude our work in section 7 with some remarks and future work.
2. RELATED WORKS
Much research has been done on model transformations, and especially on
transforming sequence diagrams into Petri Nets in order to perform formal
verification. UML sequence diagrams have received much attention and many
works propose a rule-based approach to automatically translate sequence
diagrams into Petri Nets.
Kessentini [9] describes an automated SD to colored Petri Net transformation
method, which finds the combination of transformation fragments that best
covers the SD model, using heuristic search in a base of examples. To achieve
this goal, he combines two algorithms for global and local search, namely
Particle Swarm Optimization (PSO) and Simulated Annealing (SA).
Ait-Oubelli [15] uses graph transformation to transform SDs to Promela
code. Ribeiro [19] proposed a set of rules that allow software engineers to
transform UML 2.0 sequence diagrams into a Colored Petri Net; he also used
graph transformation to specify the transformation rules. Chaoui [3] proposed
an approach to translate SD models to their equivalent ECATNets models; the
resulting models can be subjected to various Petri Net analysis techniques. In
his approach, ECATNets models are graphs.
In another work, we have proposed a similar approach [10], but it deals with
UML 2.0 communication diagrams.


This paper deals with transforming UML 2 sequence diagrams into Petri Net
models for analysis and verification purposes, using transformation rules
expressed in the ATL language. Our work is a step forward in a project that
is exploring means to define a semantics for UML 2 communication
diagrams.
3. THE BASIC METAMODELS
3.1 UML 2 Diagrams for Interaction
UML 2 divides diagrams into two categories: structural modeling diagrams
and behavioral modeling diagrams.
Structural diagrams illustrate the static features of a model. Static features
include classes, objects, interfaces and physical components. In addition, they
are used to model the relationships and dependencies between elements.
Structural diagrams include the Class diagram, the Object diagram, and some
others.
Behavioral diagrams describe how the modeled resources in the structural
diagrams interact and how they execute each other's capabilities. The
behavioral diagram puts the resources in motion, in contrast to the structural
view, which provides a static definition of the resources [16]. Behavioral
diagrams include the Interaction diagrams, the Use Case diagram, the Activity
diagram, the State Machine diagram and others.

Interaction diagrams [17] are defined by UML 2 to emphasize the
communication between objects, not the data manipulation associated with
that communication. Interaction diagrams focus on specific messages between
objects and how these messages come together to realize functionality [17].
An interaction can be displayed in several different kinds of diagrams:
Sequence Diagrams, Communication Diagrams, Interaction Overview
Diagrams, and Timing Diagrams.
Sequence diagrams are one kind of interaction diagram; they emphasize the
type and order of messages passed between elements during execution [17].
Communication diagrams are another kind of interaction diagram; they focus
on the elements involved in a particular behavior and on the messages they
pass back and forth [17]. Communication diagrams emphasize the objects
involved more than the order and type of the messages exchanged [17].
Interaction overview diagrams are simplified versions of activity diagrams
[17]. Instead of emphasizing the activity at each step, interaction overview
diagrams emphasize which element or elements are involved in performing
that activity [17].


Timing diagrams are designed to specify the time constraints on messages
sent and received in the course of an interaction. They are often used to
model real-time systems such as satellite communication or hardware
handshaking [17].
Both sequence and communication diagrams concentrate on the presentation
of dynamic aspects of a software system, each from a different perspective.
Sequence diagrams stress time ordering while communication diagrams focus
on organization. Despite their different emphases, they share a common set of
features. Booch, Rumbaugh and Jacobson [2] claim that they are semantically
equivalent, since they are both derived from the same sub-model of the UML
metamodel, which gives a systematic description of the syntax and semantics
of the UML. In this work, we concentrate on sequence diagrams.
3.1.1 Sequence Diagrams
Sequence Diagrams (SDs) and Communication Diagrams (CDs) are two views
of the same scenario, where the SD gives the temporal view of the scenario
and the CD gives the structural one. SDs record the same information as CDs
and, hence, scenarios. They just provide a different view, one that focuses on
the temporal ordering of the object interactions rather than on their structure.
The communication links are implicit in an SD, rather than explicitly
represented as in a CD. Some tools even generate SDs from CDs (or vice
versa).
3.1.2 Sequence Diagram Metamodel
A sequence diagram expresses interactions between objects by exchanging
messages. We provide a metamodel for UML 2 sequence diagrams, which
graphically displays the abstract syntax in terms of a class diagram. The
metamodel complies with the interaction metamodel provided by the OMG
[13], while showing only the essential syntax constructs of a sequence
diagram, in order to facilitate the mapping to Petri Nets. In the metamodel,
the syntax elements are represented as classes, shown as boxes, and relations
are represented as associations, shown as lines among classes, as in a class
diagram. A hollow diamond on an association represents an aggregation
relationship (has-a), while a filled diamond represents a composition
relationship (part-of). A triangle on an association represents a generalization
between a superclass and its subclass. The numbers attached to an association
are called multiplicities, which describe how many objects may exist in the
association. A star denotes zero or more. If no multiplicity is present, a
one-to-one relationship is implied [20].
In this work, we propose a sequence diagram metamodel inspired by the
OMG [13] metamodel. It describes all the concepts and the relations existing
between them. Figure 1 shows our simplified metamodel for UML 2 sequence
diagrams. The important concepts in an interaction are life lines, messages,
and combined fragments.
Description of the metamodel:
- The class Interaction: It is the root, which represents an interaction. Each
interaction has a name (attribute name of type String). An interaction
consists of a set of life lines and a set of messages.
- The class LifeLine: A LifeLine represents the operations executed by an
object. Each life line has a set of incoming and outgoing messages. It can be
covered by interaction operands.

Figure 1. A simplified metamodel for UML 2 sequence diagrams

- The class Message: A message defines a particular communication between
two objects. Each message has a name; the attribute IsPart of type Boolean is
used for the transformation of the operator Alt. A Message consists of Send
and Receive actions, which are placed on two different Occurrences, Start and
End.
- The class OccurrenceSpecification: It describes the scheduling of messages
through the order attribute of type Int. The attribute IsTheLast of type
Boolean is used to differentiate the last message from the others.


- The class CombinedFragment: It consists of a set of operands of type
InteractionOperand. Each CombinedFragment has a kind; it takes a value
from the enumeration Kind.
- The class InteractionOperand: It represents an operand of an operator; it
has a name and it can have a constraint of type InteractionConstraint. An
interaction operand covers a set of life lines.
- The class InteractionConstraint: It has an attribute name of type String
that represents the value of the constraint.
3.2 Petri Nets Metamodel
Petri Nets are a graphical and mathematical representation of discrete
distributed systems. They are also known as Place/Transition nets or P/T
nets. Petri Nets consist of places, transitions and directed arcs connecting
them. There are two sorts of arcs: from a place to a transition, or from a
transition to a place.
A Petri Net is a 4-tuple PN = (P, T, Pre, Post) where:
1. P is a finite set of places,
2. T is a finite set of transitions,
3. Pre: P × T → N is the application of previous places (input arcs),
4. Post: P × T → N is the application of following places (output arcs).
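As a minimal illustration of this structure (an editorial sketch in Python, not part of the paper's ATL tooling), a marked P/T net can be represented directly from the 4-tuple definition; a marking is simply a dictionary from places to token counts.

from dataclasses import dataclass, field

@dataclass
class PetriNet:
    """A P/T net as the 4-tuple (P, T, Pre, Post); Pre/Post map (place, transition) to arc weights."""
    places: set
    transitions: set
    pre: dict = field(default_factory=dict)    # (p, t) -> weight of arc p -> t
    post: dict = field(default_factory=dict)   # (p, t) -> weight of arc t -> p

    def enabled(self, marking, t):
        """A transition t is enabled if every input place holds enough tokens."""
        return all(marking.get(p, 0) >= w for (p, tt), w in self.pre.items() if tt == t)

    def fire(self, marking, t):
        """Fire t: consume Pre tokens and produce Post tokens, returning the new marking."""
        m = dict(marking)
        for (p, tt), w in self.pre.items():
            if tt == t:
                m[p] = m.get(p, 0) - w
        for (p, tt), w in self.post.items():
            if tt == t:
                m[p] = m.get(p, 0) + w
        return m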
Figure 2 shows a metamodel for Petri Nets.

Figure 2. Petri Nets metamodel


Description of the metamodel:
- The class PetriNet: It is the root which represents a Petri Net; it has one
attribute name of type String, which takes the name of the Petri Net.
- The class Place: It has two attributes:
  Name: of type String; it represents the content of the place.
  Id: of type Int; it is used for scheduling the set of places.
  Each place has a set of outgoing PlaceToTransArc and a set of incoming
  TransToPlaceArc.
- The class Transition: It has one attribute name of type String. It represents
the action executed by the transition (Send or Receive action). Each transition
has a set of outgoing TransToPlaceArc and a set of incoming PlaceToTransArc.
- Arcs are PlaceToTransArc or TransToPlaceArc. The class Arc is an abstract
class; it is only used for inheritance. Both PlaceToTransArc and
TransToPlaceArc inherit from the class Arc.
- Each PlaceToTransArc has a place as its source and a transition as its target.
- Each TransToPlaceArc has a transition as its source and a place as its target.
4. TRANSFORMATION APPROACH
4.1 The Transformation Process
To make the specification of the transformation rules easier, our efforts
address the transformation at the metamodel level of UML 2. This also allows
the mapping between the concepts of both the source and target metamodels.
The metamodeling-based transformation approach for transforming UML 2
sequence diagrams into Petri Nets is shown in Figure 3. Sequence diagrams
are assumed to be syntactically and statically semantically correct. The
transformation process is achieved by the application of rules. A
transformation rule consists in transforming a concept outlined in the source
metamodel to a corresponding concept in the target metamodel.
4.2 Transformation Rules
In the following, we define the rules for transforming sequence diagrams into
Petri Nets. The transformation rules describe the interactions that exist
between classes of the sequence diagram metamodel and the Petri Net
metamodel. These rules consist essentially of:


Figure 3. Overview of the Sequence Diagrams to Petri Nets ATL transformation

Basic Interaction Transformation Rules
- Rule 1: The name of the Petri Net is the name of the Interaction.
- Rule 2: Each Message is transformed into two sub-Petri nets (Figure 4).
Each sub-Petri net describes the behavior of an object (the status of the object
before and after the Send (Receive) action). These sub-Petri nets are connected
by a place labeled with the message name.
Alt Transformation Rules
Alt is transformed as shown in Figure 5.
- Rule 1: The role of this rule is the verification of the kind and the number
of operands. From the class CombinedFragment, the places and transitions
corresponding to the Alt transformation are initialized.
- Rule 2: It is the same as Rule 2 of the Basic Interaction Transformation; the
only difference is that we handle all the cases needed to connect the
operand-begin transition with the places corresponding to the first send and
receive messages, and the places corresponding to the last send and receive
messages with the operand-end transition (for example, AltCase1 below).


Figure 4. A message transformation

Figure 5. Alt transformation


Parallel Transformation Rules
Parallel is transformed as shown in Figure 6.

Figure 6. Parallel transformation

- Rule 1: The role of this rule is the verification of the kind; the number of
operands does not matter. From the class CombinedFragment, the places and
transitions corresponding to the Parallel transformation are initialized.
- Rule 2: It is the same as Rule 2 of the Basic Interaction Transformation; the
only difference is that we handle all the cases needed to connect the
Parallel-begin transition with the places corresponding to the first send and
receive messages, and the places corresponding to the last send and receive
messages with the Parallel-end transition.


Basic Interaction Transformation Rules

rule Interaction2PetriNet {
  from
    s : SequenceDiagram!Interaction
  to
    p : PetriNet!PetriNet (name <- s.name) -- It produces the Petri Net name
}

rule Interaction {
  from s : SequenceDiagram!Message
  to
    l : PetriNet!Place ( -- It produces the initial send place
      name <- 'Send' + s.name + '.Before = ' + s.SourceLifeLineName + '.Begin',
      id <- s.MessageSendOrder),
    n : PetriNet!Transition ( -- It produces the send transition
      name <- 'Send ' + s.name + ' (' + s.SourceLifeLineName + ', ' + s.TargetLifeLineName + ')'),
    r : PetriNet!Place ( -- It produces the final send place
      name <- s.SourceLifeLineName + ':Send' + s.name + '.After',
      id <- s.MessageSendOrder + 1),
    m : PetriNet!Place ( -- It produces the middle place
      name <- s.name),
    t : PetriNet!Place ( -- It produces the initial receive place
      name <- 'Receive' + s.name + '.Before = ' + s.TargetLifeLineName + '.Begin',
      id <- s.MessageReceiveOrder),
    p : PetriNet!Transition ( -- It produces the receive transition
      name <- 'Receive ' + s.name + ' (' + s.SourceLifeLineName + ', ' + s.TargetLifeLineName + ')'),
    d : PetriNet!Place ( -- It produces the final receive place
      name <- s.TargetLifeLineName + ':Receive' + s.name + '.After',
      id <- s.MessageReceiveOrder + 1),
    isp-st : PetriNet!PlaceToTransArc ( -- It produces, for each arc, its source and target nodes (the same for the rest below)
      source <- l,
      target <- n),
    ...
    st-mp : PetriNet!TransToPlaceArc (source <- n, target <- m)
}


Alt Transformation Rules

rule Alt {
  from
    c : SequenceDiagram!CombinedFragment (c.IsAlt() and c.Has2Operand())
    -- It checks that the combined fragment is an Alt and that it has two operands
  to
    pl : PetriNet!Place ( -- It produces the first operand begin place
      name <- 'Alt part one:Begin'),
    sl : PetriNet!Place ( -- It produces the second operand begin place
      name <- 'Alt part two:Begin'),
    rl : PetriNet!Place ( -- It produces the first operand end place
      name <- 'Alt part one:End'),
    tl : PetriNet!Place ( -- It produces the second operand end place
      name <- 'Alt part two:End'),
    nl : PetriNet!Transition ( -- It produces the first operand begin transition, containing the operand's name
      name <- 'ConditionOne : ' + c.getFirstOperandName),
    bl : PetriNet!Transition ( -- It produces the second operand begin transition, containing the operand's name
      name <- 'ConditionTwo : ' + c.getSecondOperandName),
    al : PetriNet!Transition ( -- It produces the first operand end transition
      name <- 'ConditionOne:' + c.getFirstOperandName + '.End'),
    dl : PetriNet!Transition ( -- It produces the second operand end transition
      name <- 'ConditionTwo:' + c.getSecondOperandName + '.End')
    -- The arcs' source and target nodes are set as in the Basic Interaction Transformation
}


Alt Transformation Rules

rule AltCase1{
from
s: SequenceDiagram!Message(s.FirstSendMessage() and
s.FirstReceiveMessage() and s.IsPartOne() and s.IsTheLastSend() and
s.IsTheLastReceive())
-- It checks to which part the message belongs and where its extremities lie,
-- so as to attach it to the right arcs
to
l: PetriNet!Place( name <- 'Send' + s.name + '.Before =' +
s.SourceLifeLineName + '.Begin', id <- s.MessageSendOrder),
n: PetriNet!Transition(
name <- 'Send' + s.name + '(' + s.SourceLifeLineName + ',' +
s.TargetLifeLineName + ')'),
r: PetriNet!Place( name <- 'Send' + s.name + '.After =' + s.SourceLifeLineName
+ '.End',
id <- s.MessageSendOrder + 1), m: PetriNet!Place( name <- s.name),
t: PetriNet!Place(
name <- 'Receive' + s.name + '.Before =' + s.TargetLifeLineName + '.Begin',
id <- s.MessageReceiveOrder),
p: PetriNet!Transition(
name <- 'Receive' + s.name + '(' + s.SourceLifeLineName + ',' +
s.TargetLifeLineName + ')'), d: PetriNet!Place(
name <- 'Receive' + s.name + '.After =' + s.SourceLifeLineName + '.End',
id <- s.MessageReceiveOrder + 1),
isp_st: PetriNet!PlaceToTransArc(source <- l, target <- n), ...
fsp1_ft: PetriNet!PlaceToTransArc(
source <- r,
target <- thisModule.resolveTemp(thisModule.root, 'al')),
frp1_ft: PetriNet!PlaceToTransArc(
source <- d, target <- thisModule.resolveTemp(thisModule.root, 'al'))
}
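
The message guards used in AltCase1 (and in ParallelCase1 below) are likewise only named in the excerpt. Under the assumption that MessageSendOrder and MessageReceiveOrder number the send and receive events of the diagram, one plausible sketch is the following; in the actual transformation these checks are presumably evaluated relative to the operand containing the message, a refinement this sketch ignores:

helper context SequenceDiagram!Message def: FirstSendMessage() : Boolean =
	SequenceDiagram!Message.allInstances()
		->forAll(m | m.MessageSendOrder >= self.MessageSendOrder);

helper context SequenceDiagram!Message def: FirstReceiveMessage() : Boolean =
	SequenceDiagram!Message.allInstances()
		->forAll(m | m.MessageReceiveOrder >= self.MessageReceiveOrder);

helper context SequenceDiagram!Message def: IsTheLastSend() : Boolean =
	SequenceDiagram!Message.allInstances()
		->forAll(m | m.MessageSendOrder <= self.MessageSendOrder);

helper context SequenceDiagram!Message def: IsTheLastReceive() : Boolean =
	SequenceDiagram!Message.allInstances()
		->forAll(m | m.MessageReceiveOrder <= self.MessageReceiveOrder);

IsPartOne() additionally depends on how messages are linked to the operands of the combined fragment in the source metamodel, which the excerpt does not show, so it is left out of this sketch.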


Parallel Transformation Rules

rule Parallel{
from
c: SequenceDiagram!CombinedFragment (c.IsParallel())
to
pl: PetriNet!Place( name <- 'Operator Parallel'),
pl2: PetriNet!Place( name <- 'Operator Parallel'),
nl: PetriNet!Transition( name <- 'Operator Parallel Begin'),
tl: PetriNet!Place( name <- 'Operator Parallel'),
tl2: PetriNet!Place( name <- 'Operator Parallel'),
kl: PetriNet!Transition( name <- 'Operator Parallel End'),
pl_nl: PetriNet!PlaceToTransArc( source <- pl, target <- nl),
pl2_nl: PetriNet!PlaceToTransArc( source <- pl2, target <- nl),
tl_kl: PetriNet!TransToPlaceArc( source <- kl, target <- tl),
pl2_nll: PetriNet!TransToPlaceArc( source <- kl, target <- tl2)
}

rule ParallelCase1{
from
s: SequenceDiagram!Message(s.FirstSendMessage() and
s.FirstReceiveMessage() and s.IsTheLastSend() and s.IsTheLastReceive()),
c: SequenceDiagram!CombinedFragment (c.IsParallel())
to
l: PetriNet!Place(
name <- 'Send' + s.name + '.Before =' + s.SourceLifeLineName + '.Begin',
id <- s.MessageSendOrder),
n: PetriNet!Transition(
name <- 'Send' + s.name + '(' + s.SourceLifeLineName + ',' +
s.TargetLifeLineName + ')'),
r: PetriNet!Place(
name <- 'Send' + s.name + '.After =' + s.SourceLifeLineName + '.End',
id <- s.MessageSendOrder + 1),
m: PetriNet!Place( name <- s.name),
t: PetriNet!Place(
name <- 'Receive' + s.name + '.Before =' + s.TargetLifeLineName + '.Begin',
id <- s.MessageReceiveOrder),
p: PetriNet!Transition(
name <- 'Receive' + s.name + '(' + s.SourceLifeLineName + ',' +
s.TargetLifeLineName + ')'),
d: PetriNet!Place(
name <- 'Receive' + s.name + '.After =' + s.SourceLifeLineName + '.End',
id <- s.MessageReceiveOrder + 1),
tl_kl: PetriNet!TransToPlaceArc( source <- kl, target <- tl),
pl2_nll: PetriNet!TransToPlaceArc( source <- kl, target <- tl2), ...
d_kl: PetriNet!PlaceToTransArc( source <- d,
target <- thisModule.resolveTemp(thisModule.root, 'kl'))
}
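
Both AltCase1 and ParallelCase1 resolve target elements of other rules through resolveTemp: resolveTemp(e, 'v') returns the target element produced by the pattern variable v of the rule that matched e, so here it retrieves the Alt end transition al or the Parallel end transition kl created for the combined fragment. The helper thisModule.root is not defined in the excerpt; one plausible definition, assuming a single combined fragment per diagram, would be:

-- Module-level helper: the (assumed unique) combined fragment of the source model
helper def: root : SequenceDiagram!CombinedFragment =
	SequenceDiagram!CombinedFragment.allInstances()->asSequence()->first();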


5. A CASE STUDY: A PHONE SYSTEM


To validate the proposed transformation, we choose a Phone System as
a case study. The sequence diagram shown in Figure 7 illustrates a basic
interaction between three objects: Caller, Phone and Receiver. The use
case of this interaction is carried out as follows:
Caller lifts the Phone.
Dial-tone is heard by the Caller.
Caller composes the number.
Caller is connected to the network (Connect tone).
Ring-tone is heard by the Caller (Receiver is not busy).
Receiver's phone rings.
Receiver answers the Caller.
Caller is talking to the Receiver.
Receiver is talking to the Caller.
Disconnection operation.
Caller hangs up.
This simple procedure is depicted in the sequence diagram in Figure 7,
while Figure 8 shows a possible abstract syntax of the same diagram
according to the metamodel we have defined above. The Phone System
model obtained after applying the transformation steps is shown in Figure 9;
we have now reached a Petri Net corresponding to the Phone System.

Figure 7. Sequence diagram of the phone system


6. IMPLEMENTATION
We have chosen the Atlas Transformation Language (ATL) [8][7] under the
Eclipse development platform [6] to express the transformation rules.
ATL is a model transformation language that contains a mixture of
declarative and imperative constructs, and it is accompanied by a set
of tools built on top of the Eclipse platform. According to the adopted
transformation process, the implementation requires the following steps:
1. The representation of the source metamodel, describing UML 2 sequence
diagrams, in the Ecore Diagram Tool, which generates an Ecore file named
SequenceDiagram.ecore described in the XMI language [14].
2. The representation of the target metamodel, describing Petri Nets, in the
Ecore Diagram Tool, which generates an Ecore file named PetriNet.ecore
described in the XMI language.
3. The representation of a model instance of the source metamodel, i.e. a
sequence diagram, as an Ecore/XMI file.
4. Applying the model transformation rules specified in the ATL language
to the source model. This process generates an XMI file containing a Petri
Net that formally describes the behavior of the source sequence diagram.
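
For step 4, the ATL tooling needs to know where the two Ecore metamodels from steps 1 and 2 are located. One common way to do this is to add path annotations above the module declaration sketched earlier; the project path below is a placeholder, not taken from the paper:

-- @path SequenceDiagram=/SD2PN/SequenceDiagram.ecore
-- @path PetriNet=/SD2PN/PetriNet.ecore

The transformation is then run from an ATL launch configuration in Eclipse, with the sequence diagram XMI file bound as the IN model and a new XMI file as the OUT model.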


Figure 8. Sequence diagram for the phone system in abstract syntax


Figure 9. An extract from the Petri Net for the phone system in abstract syntax

7. CONCLUSION AND PERSPECTIVES


In this paper, we proposed a transformation from UML 2 sequence
diagrams into Petri Nets. A set of rules was defined to govern the
transformation process. On the basis of this transformation, it is
possible to verify the dynamic model of the real system expressed
by a sequence diagram. Our approach was implemented using the ATL
language, and a Phone System case study was used to illustrate the
transformation technique. This work is still in progress, so we plan to
extend it further. First, one direction for future work is to extend this
transformation to other operators such as ignore and loop. Second, we need
to better tune the rules to determine whether they can be automated [1].
Third, we plan to generate Java code automatically from UML 2 sequence
diagrams [22].


REFERENCES

[1] W. Alouini, O. Guedhami, S. Hammoudi, M. Gammoudi, and D. Lopes.
Semiautomatic Generation of Transformation Rules in Model Driven
Engineering: The Challenge and First Steps. International Journal of Software
Engineering and Its Applications (IJSEA), 5(1), (2011).
[2] G. Booch, J. Rumbaugh, and I. Jacobson. Unified Modeling Language, Version
1.0. Rational Software Corporation, (2010).
[3] A. Chaoui, R. Elmansouri, W. Saadi, and E. Kerkouche. From UML Sequence
Diagrams to ECATNets: a Graph Transformation based Approach for modelling
and analysis. In Proceedings of the 4th International Conference on Information
Technology (ICIT 2009), June 3rd, (2009).
[4] H.Y. Chen, C. Li, and T.H. Tse. Transformation of UML Interaction Diagrams into
Contract Specifications for Object-Oriented Testing. In Proceedings of the 2007
IEEE International Conference on Systems, Man, and Cybernetics, (2007).
[5] K. Czarnecki and S. Helsen. Classification of Model Transformation Approaches.
In OOPSLA'03, Workshop on Generative Techniques in the Context of Model-
Driven Architecture. Anaheim, USA, (2007).
[6] Eclipse Official Site: http://www.eclipse.org.
[7] F. Jouault, F. Allilaire, J. Bezivin, I. Kurtev, and P. Valduriez. ATL: a QVT-like
Transformation Language. In Proceedings OOPSLA06 Companion to the 21st
ACM SIGPLAN symposium on Object-oriented programming systems, languages,
and applications, (2006).
[8] F. Jouault and I. Kurtev. Transforming Models with ATL. In J.M. Bruel, editor,
MODELS Workshop, LNCS3844. Montego Bay, Jamaica, (2005).
[9] M. Kessentini, A. Bouchoucha, H. Sahraoui, and M. Boukadoum. Example-Based
Sequence Diagrams to Colored Petri Nets Transformation Using Heuristic
Search. In T. Kuhne et al. (Eds): ECMFA 2010, LNCS 6138. Springer-Verlag
Berlin Heidelberg, (2010).
[10] E. Merah, N. Messaoudi, H. Saidi, and A. Chaoui. Design of ATL Rules for
Transforming UML 2 Communication Diagrams into Buchi Automata. In
International Journal of Software Engineering and Its Applications, Vol. 7, No. 2,
March, (2013).
[11] H. Motameni and T. Ghassempouri. Transforming Fuzzy Communication
Diagram to Fuzzy Petri Net. American Journal of Scientific Research, 16,
(2011).
[12] T. Murata. Petri Nets: Properties, Analysis and Applications. In Proceedings of
the IEEE, volume 77, (1989).
[13] Object Management Group. OMG Unified Modeling Language (OMG UML),
Superstructure, V2.1.2. (2007).
[14] Object Management Group XMI Specification. http://www.omg.org/spec/XMI/2.4.1.
(2011).
[15] M. Ait Oubelli, N. Younsi, A. Amirat, and A. Menasria. From UML 2.0 Sequence
Diagrams to PROMELA code by Graph Transformation using AToM3. In CIIA,
volume 825 of CEUR Workshop Proceedings, CEUR-WS.org, (2011).


[16] T. Pender. UML Bible. John Wiley & Sons, (2003).


[17] D. Pilone and N. Pitman. UML 2.0 in a Nutshell. O'Reilly Publisher, (2005).
[18] W. Reisig. Petri nets - An Introduction. Springer, (1985).
[19] O.R. Ribeiro and J.M. Fernandes. Some Rules to Transform Sequence Diagrams
into Coloured Petri Nets. In 7th Workshop and Tutorial on Practical Use of
Coloured Petri Nets and the CPN Tools, (2006).
[20] H. Shen, A. Virani, and J. Niu. Formalize UML 2 Sequence Diagrams. In 11th
IEEE High Assurance Systems Engineering Symposium (HASE 2008), (2008).
[21] I. Trickovic. Transformation of the State Diagram of the Unified Modeling
Language into a Petri Nets Model. NOVI SAD J. MATH, 28(3), (1998).
[22] M. Usman and A. Nadeem. Automatic Generation of Java Code from UML
Diagrams using UJECTOR. International Journal of Software Engineering and
Its Applications (IJSEA), 3(2), (2009).

