
IJITWE Editorial Board

Editors-in-Chief: Ghazi I. Alkhatib, Princess Sumaya U. for Technology, Jordan
Ernesto Damiani, U. of Milan, Italy

Editorial Board Coordinator: Zakaria Maamar, Zayed U., UAE

Publicity Coordinator: Wael Toghuj, Isra U., Jordan

Associate Editors: Michael Berger, Siemens Corporate Technology, Germany
Walter Binder, EPFL, Switzerland
Schahram Dustdar, Vienna U. of Technology, Austria
N.C. Narendra, IBM Software Labs, India
Andres Iglesias Prieto, U. of Cantabria, Spain
David Taniar, Monash U., Australia

IGI Editorial: Jamie M. Wilson, Managing Editor
Adam Bond, Editorial Assistant
Danielle Adams, Editorial Assistant
Chris Hrobak, Journal Production Manager
Christen Croley, Journal Production Assistant

International Editorial Review Board: Jamal Bentahar, Concordia U., Canada
Sara Comai, Politecnico di Milano, Italy
Marie-Pierre Gleizes, IRIT - Université Paul Sabatier, France
Yih-Feng Hwang, America On Line (AOL), USA
Ali Jaoua, Qatar U., Qatar
Leon Jololian, Zayed U., UAE
Seok-won Lee, The U. of North Carolina, USA
Farid Meziane, The U. of Salford, UK
Soraya Kouadri Mostfaoui, The British Open U., UK
Michael Mrissa, Claude Bernard Lyon 1 U., France
Manuel Núñez, Universidad Complutense de Madrid, Spain
Quan Z. Sheng, Adelaide U., Australia
Amund Tveit, Norwegian U. of Science and Technology, Norway
Leandro Krug Wives, Federal U. of Rio Grande do Sul, Brazil
Kok-Wai Wong, Murdoch U., Australia
Hamdi Yahyaoui, King Fahd U. of Petroleum and Minerals, Saudi Arabia

IGI Publishing
www.igi-global.com
CALL FOR ARTICLES

International Journal of Information Technology and Web Engineering
An official publication of the Information Resources Management Association
ISSN 1554-1045; eISSN 1554-1053; published quarterly

The Editors-in-Chief of the International Journal of Information Technology and Web Engineering (IJITWE) would like to invite you to consider submitting a manuscript for inclusion in this scholarly journal.

Mission:
The main objective of the International Journal of Information Technology and Web Engineering (IJITWE) is to publish refereed papers in the area covering information technology (IT) concepts, tools, methodologies, and ethnography in the contexts of global communication systems and Web engineered applications. In accordance with this emphasis on the Web and communication systems, this journal publishes papers on IT research and practice that support seamless end-to-end information and knowledge flow among individuals, teams, and organizations. This end-to-end strategy for research and practice requires emphasis on integrated research among the various steps involved in data/knowledge (structured and unstructured) capture (manual or automated), classification and clustering, storage, analysis, synthesis, dissemination, display, consumption, and feedback. The secondary objective is to assist in the evolving and maturing of IT-dependent organizations, as well as individuals, in information- and knowledge-based culture and commerce, including e-commerce.
Coverage:
Case studies validating Web-based IT solutions
Competitive/intelligent information systems
Data analytics for business and government organizations
Data and knowledge capture and quality issues
Data and knowledge validation and verification
Human factors and cultural impact of IT-based systems
Information filtering and display adaptation techniques for wireless devices
Integrated heterogeneous and homogeneous workflows and databases within and across organizations, suppliers, and customers
Integrated user profile, provisioning, and context-based processing
IT education and training
IT readiness and technology transfer studies
Knowledge structure, classification, and search algorithms or engines
Metrics-based performance measurement of IT-based and Web-based organizations
Mobile, location-aware, and ubiquitous computing
Ontology and Semantic Web studies
Quality of service and service level agreement issues among integrated systems
Radio frequency identification (RFID) research and applications in Web engineered systems
Security, integrity, privacy, and policy issues
Software agent-based applications
Strategies for linking business needs and IT
Virtual teams and virtual enterprises: communication, policies, operation, creativity, and innovation
Web systems architectures, including distributed grid computers and communication systems processing
Web systems engineering design
Web systems performance engineering studies
Web user interfaces design, development, and usability engineering studies
All submissions should be emailed to:
Ghazi Alkhatib and Ernesto Damiani, Editors-in-Chief
Alkhatib@psut.edu.jo; ernesto.damiani@unimi.it

Ideas for Special Theme Issues may be submitted to the Editors-in-Chief.

Please recommend this publication to your librarian. For a convenient, easy-to-use library recommendation form, please visit: http://www.igi-global.com/ijitwe
International Journal of Information Technology and Web Engineering
Table of Contents
April-June 2012, Vol. 7, No. 2

Guest Editorial Preface
Special Issue on Applied Web Engineering Platforms
Mousa Al-akhras, University of Jordan, Jordan

Research Articles

New Fields in Classifying Algorithms for Content Awareness
Radu-Dinel Miru, Politehnica University of Bucharest, Romania
Cosmin Stanuic, Politehnica University of Bucharest, Romania
Eugen Borcoci, Politehnica University of Bucharest, Romania

Analyzing the Effect of Node Density on the Performance of the LAR-1P Algorithm
Hussein Al-Bahadili, Petra University, Jordan
Ali Maqousi, Petra University, Jordan
Reyadh S. Naoum, Middle East University, Jordan

Using Permutations to Enhance the Gain of RUQB Technique
Abdulla M. Abu-ayyash, Central Bank of Jordan, Jordan
Naim Ajlouni, Al-Balqa University, Jordan

Autonomic Healing for Service Specific Overlay Networks
Ibrahim Al-Oqily, Hashemite University, Jordan
Bassam Subaih, Al-Balqa' Applied University, Jordan
Saad Bani-Mohammad, Al al-Bayt University, Jordan
Jawdat Jamil Alshaer, Al-Balqa' Applied University, Jordan
Mohammed Refai, Zarqa Private University, Jordan

New Entropy Based Distance for Training Set Selection in Debt Portfolio Valuation
Tomasz Kajdanowicz, Wroclaw University of Technology, Poland
Slawomir Plamowski, Wroclaw University of Technology, Poland
Przemyslaw Kazienko, Wroclaw University of Technology, Poland
GUEST EDITORIAL PREFACE

Special Issue on Applied Web Engineering Platforms
Mousa Al-akhras, University of Jordan, Jordan
The 2011 IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT) was held at the University of Jordan, Amman, Jordan, from 6 to 8 December 2011. The AEECT theme was "Technology for Solving National Problems." This theme is a symbolic reflection of the importance of electrical engineering and computing technologies in solving many technical problems that face Jordan and other countries. This international conference provides a unique forum to discuss practical approaches and state-of-the-art findings in using these technologies to solve national problems related to energy, information and communications, management, health, business, and other areas. This conference is the first of a series of conferences to be organized by the IEEE Jordan Section. The section, which was established in Jordan in 1999, intends to hold this conference biennially, and it plans to hold the next AEECT conference in November 2013.
AEECT 2011 received papers in its nine tracks, which cover various aspects of electrical engineering and computing technologies. Each of these papers was sent to at least three expert reviewers. Out of the 182 received papers, only 62 were accepted for inclusion in the printed conference proceedings and for presentation in the thirteen technical sessions of the conference. The papers that were successfully presented are also published in the IEEE Xplore Digital Library.

The authors of the papers that received high scores in the reviewers' evaluation were invited to submit an extended version for this special issue. Each of these papers was sent to three expert reviewers; a single reviewer rejection was enough to reject the paper. Fifty percent of the invited papers were accepted, after guaranteeing at least a 40% extension over the conference version.
This issue contains a diverse set of articles that deal with Web engineering platforms, such as networking/security and applications. It includes three network-related papers: a multi-dimensional packet classifier for an edge router, route discovery in mobile ad hoc networks (MANETs), and autonomic healing for service overlays; one security-related paper on quantum key distribution; and, finally, one application-related paper on debt portfolio valuation based on machine learning.

The lead article, "New Fields in Classifying Algorithms for Content Awareness," authored by R. Miru, C. Stanuic, and E. Borcoci, discusses a solution for a new multi-dimensional packet classifier of an edge router. The technique is applicable to content-aware networks. The proposed classification algorithm uses three new packet fields: 1) Virtual content aware network, 2) Service type, and 3) U (unicast/multicast), which are part of the Content Awareness Transport Information (CATI) header. A CATI header is inserted into the transmitted data packets at the Service/Content Provider server side, in accordance with the media service definition, and enables the content-awareness features at a new overlay content-aware network layer. The functionality of the CATI header within the classification process is then analyzed. Two possibilities are considered: the adaptation of the Lucent Bit Vector algorithm and the tuple space search.
In the second paper, titled "Analyzing the Effect of Node Density on the Performance of the LAR-1P Algorithm," H. Al-Bahadili, A. Maqousi, and R. S. Naoum combine the location-aided routing scheme 1 (LAR-1) and probabilistic algorithms into a new algorithm for route discovery in mobile ad hoc networks (MANETs), called LAR-1P. Simulation results reported by the authors demonstrate that the LAR-1P algorithm reduces the number of retransmissions compared to LAR-1 without sacrificing network reachability. Furthermore, on a sub-network (zone) scale, the algorithm provides excellent performance in high-density zones, while in low-density zones it preserves the performance of LAR-1.
A. Abu-ayyash and N. Ajlouni, in the third paper, "Using Permutations to Enhance the Gain of RUQB Technique," focus on preserving confidentiality during communications, a problem traditionally solved using encryption. Quantum key distribution (QKD) techniques are discussed, and it was found that they usually suffer from a gain problem when comparing the final key to the generated pulses of quantum states. The authors propose to permute the sets that RUQB uses in order to increase the gain. The effects of both randomness and permutations are studied. While the RUQB technique improves the gain of BB84 QKD by 5.5%, it was also shown that the higher the randomness of the initial key, the higher the gain that can be achieved. This work concluded that the use of around 7 permutations results in 30% gain recovery in ideal situations.
I. Al-Oqily, B. Subaih, S. Bani-Mohammad, J. Alshaer, and M. Refai propose, in the paper titled "Autonomic Healing for Service Specific Overlay Networks," a self-healing system for Service Specific Overlay Networks (SSONs). SSONs have recently attracted great interest and have been extensively investigated in the context of multimedia delivery over the Internet. SSONs are virtual networks constructed on top of the underlying network. They have been proposed to provide and to improve services not provided by other traditional networks. The increased complexity and heterogeneity of these networks, in addition to the ever-changing conditions in the network and the different types of fault that may occur, make their control and management by human administrators more difficult. Therefore, the self-healing concept was introduced to handle these changes and to assure highly reliable and dependable network system performance. Self-healing aims at ensuring that the service will continue to work regardless of defects that might occur in the network.
The last paper, by T. Kajdanowicz, S. Plamowski, and P. Kazienko, titled "New Entropy Based Distance for Training Set Selection in Debt Portfolio Valuation," investigates the use of a new distance measure for training set selection in machine learning. In their proposal, the distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real domain datasets from the debt portfolio valuation process, and the prediction performance is examined.

Mousa Al-akhras
Guest Editor
IJITWE
International Journal of Information Technology and Web Engineering, 7(2), 1-15, April-June 2012
Copyright © 2012, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

New Fields in Classifying Algorithms for Content Awareness

Radu-Dinel Miru, Politehnica University of Bucharest, Romania
Cosmin Stanuic, Politehnica University of Bucharest, Romania
Eugen Borcoci, Politehnica University of Bucharest, Romania

DOI: 10.4018/jitwe.2012040101

ABSTRACT

The content aware (CA) packet classification and processing at the network level is a new approach leading to a significant increase in the delivery quality of multimedia traffic in the Internet. This paper presents a solution for a new multi-dimensional packet classifier of an edge router, based on new content-related fields embedded in the data packets. The technique is applicable to content aware networks. The classification algorithm uses three new packet fields, named Virtual Content Aware Network (VCAN), Service Type (STYPE), and U (unicast/multicast), which are part of the Content Awareness Transport Information (CATI) header. A CATI header is inserted into the transmitted data packets at the Service/Content Provider server side, in accordance with the media service definition, and enables the content awareness features at a new overlay Content Aware Network layer. The functionality of the CATI header within the classification process is then analyzed. Two possibilities are considered: the adaptation of the Lucent Bit Vector algorithm and, respectively, of the tuple space search, in order to respond to the suggested multi-field classifier. The results are very promising, and they prove that the theoretical model of inserting new packet fields for content aware classification can be implemented and can work in a real-time classifier.

Keywords: Content Aware Classification, Content Provider, Future Internet, Media Services, Multi-Dimensional Packet Classifier, Multimedia Distribution

INTRODUCTION

The current trend in network management is to add more intelligence toward the edges of the network infrastructure, while the core continues to provide fast switching and forwarding. The aim is to sustain the increasing speed and availability demands within communication networks. The appearance of new protocols and applications illustrates that the data carried by future networks will be generated by a large number of non-standardized applications. Even if the current Internet is a great success in terms of connecting people and communities, it was designed 40 years ago for purposes quite unlike today's heterogeneous application needs and user expectations.
The rapid increase of hardware capabilities, both in terms of CPU power and memory availability, is an enabling factor in designing network equipment capable of identifying the content transferred within a packet and applying specific routing policies. At the same time, multimedia content is anticipated to increase by at least a factor of 6 before the beginning of 2012, rising to more than 990 exabytes, mainly fueled by the users themselves (Media Delivery Platforms Cluster, 2008). In order to cope with this versatile content and these services, future network equipment should become content-aware. This capability will offer more possibilities to the network management entities to efficiently manage the network and to provide flexible and dynamic traffic handling and QoS assurance functions (Vorniotakis, Xilouris, Gardikis, Zotos, Kourtis, & Pallis, 2011).
The packet classification issue is very important because it offers a way to differentiate packets as they traverse different networks, and also because many different services can be provided through packet classification.

Given the above mentioned considerations, this paper describes a new solution for classifying data packets by taking into consideration not only the common fields, but also content aware features.
Due to the importance of this subject, many algorithms for packet classification already exist in the literature. As the designer of a new algorithm based on content awareness, it is very important to understand the requirements of the problem and how these demands are expected to evolve. Similar to address lookup algorithms, the two most widely used metrics for packet classification are the search speed and the memory storage space occupied by the data structures of the algorithm. There are other important metrics as well, as follows:
Link speed: As physical links are getting faster, the need for efficient classification algorithms is greater than ever. This requirement translates into making a classification decision for an arbitrary packet during the time allocated for handling a minimum-sized packet (Internet traffic measurements (Thompson, Miller, & Wilder, 1997) show that approximately 50% of the packets that arrive at a router are TCP-acknowledgment packets, which are typically 40-byte packets). This issue is much more severe for very high speed links. For example, at OC-768 rates (i.e., 40 Gbps) with a minimum packet size of 40 bytes, we need to handle 125 million packets per second:

$$\frac{40 \times 10^{9}}{8 \times 40} = 125 \times 10^{6} \ \text{packets/sec} \quad (1)$$

Hence, a decision must be made in 8 nanoseconds:

$$\frac{8 \times 40}{40 \times 10^{9}} = 8 \times 10^{-9} \ \text{sec} \quad (2)$$

Because memory access time is expensive and represents the true bottleneck in a real system, thus dictating the worst-case scenario, link speed is usually measured in terms of the number of memory accesses required to accomplish a packet transfer.
Memory space: Minimizing, at the algorithm level, the size of the memory required to run the algorithm is paramount. This is because the smaller the required memory, the higher the chances of it being implementable using fast memory technologies, such as Static Random Access Memory (SRAM). On-chip SRAM provides the fastest access time (around 1 nanosecond) and can be used as an on-chip cache for software-based algorithm implementations. To meet very aggressive timing, any efficient hardware-based implementation must locally embed SRAM (on-chip) (Medhi & Ramasamy, 2007).
Database updates: As classification requirements often change (new rules are added, existing rules are deleted or modified), the data structures used by an algorithm need to be updated. The data structures can be categorized into those updated incrementally and those that need to be rebuilt from scratch every time there is a change in the database's rule set. Generally, such updates are not very demanding in the case of core routers, where the rules change infrequently. However, edge routers that support dynamic packet filtering or packet intrusion detection need to identify and track certain patterns; thus, faster updates are required.

Number of header fields: An ideal classifying algorithm should be able to handle an arbitrarily large number of header fields for classification, in the minimum period of time.
The organization of the paper is as follows. Section 1 presents samples of related work. Section 2 discusses existing problems and an impact analysis. Section 3 is an overview of the ALICANTE architecture. Section 4 is the main part of the paper; it is focused on the new packet fields for content awareness. Section 5 addresses deep packet inspection, Section 6 presents a performance analysis, and Section 7 compares DPI-based and normal routing classification efficiency. Conclusions, open issues, and future work are briefly outlined in Section 8.
1. RELATED WORK
Packet classification has undergone a continuous evolution over time, up to the optimizations achieved today. The first solutions were based on simple naïve algorithms, where all the rules are stored as a linked list sorted by cost (in increasing order) and an incoming packet is checked against each rule sequentially until a rule that matches all relevant fields is found. By now, the algorithms have significantly evolved, enabled by sophisticated QoS requirements and hardware developments for communication networks.

Because checking a packet against one rule takes non-zero processing time, the naïve solutions have poor scaling properties relative to the number of rules that must be matched. Therefore, other more efficient approaches, such as hierarchical tries, tuple spaces, and divide and conquer (with or without special hardware accelerators), were introduced in order to meet specific requirements (Medhi & Ramasamy, 2007).
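For reference, here is a minimal sketch (ours, in the Java used later in the paper, not code from any cited work) of the naïve baseline just described: rules kept sorted by increasing cost, checked sequentially, with the first full match winning:

import java.util.List;
import java.util.function.Predicate;

// Naive linear classification: rules are kept sorted by increasing cost,
// and the first rule matching all relevant fields wins (illustrative only).
final class NaiveClassifier {
    record Rule(String name, int cost, Predicate<String[]> matchesAllFields) {}

    static Rule classify(List<Rule> rulesSortedByCost, String[] packetFields) {
        for (Rule r : rulesSortedByCost) {
            if (r.matchesAllFields().test(packetFields)) {
                return r; // first match has the lowest cost
            }
        }
        return null; // no rule matched
    }
}

The O(N) cost per packet of this loop is exactly the poor scaling property discussed above.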
The Lucent bit vector scheme, which is part of the divide and conquer approach, provides comparable performance for medium-sized classifiers with a few thousand rules, and it consumes much less memory than Recursive Flow Classification (RFC) (Medhi & Ramasamy, 2007). Since it employs bitwise AND operations on all the bits representing the rules in order to identify the matching ones, it does not scale well with the rule set for large classifiers. The aggregated bit vector scheme attempts to mitigate this lack of scalability by using summary bits. However, it suffers from false positives (detecting a rule match when there isn't one), leading to an unpredictable average time to perform a complete search. Dynamic updates for the Lucent bit vector scheme are slow, as many bit vectors need to be reconstructed (practically, the entire data structure is rebuilt). For larger classifiers, the decision tree approaches seem more attractive, providing a good trade-off between speed and memory. These approaches work well in practice, with the exception of databases that contain large numbers of wildcards in one or more fields. However, the overall performance of the decision tree approaches is governed by various parameters that are not characterized. In general, for a real system, the average search time achieved by an algorithmic approach is based on exploiting certain assumptions and characteristics of the classifiers. It is not clear whether these assumptions will continue to hold true for the non-algorithmic approaches, unless they are extensively validated. Hence, the worst-case performance of these algorithmic approaches might degrade in reality. However, solutions based on Ternary Content-Addressable Memory (TCAM) are independent of such assumptions, because TCAM performs an exhaustive search of all the rules and, at the same time, provides the fastest search speed. Even so, TCAM solutions have their own disadvantages of high power consumption and rule blowup due to port ranges. Recent research in this direction seems to mitigate the importance of some of these issues (Medhi & Ramasamy, 2007).
While all these schemes work well, and some of them are implemented on different machines according to particular classification requirements, none of them addresses the content aware issue: they cannot take into consideration the content of the packet in the classification process. In this paper, content awareness is made part of the classification process, and this represents the novelty of our proposal. There are also other approaches considering content classification. Zhou and Song (2011) proposed a method through which a file can be identified in the associated flows. Based on this idea, ISPs can adopt more efficient strategies to manage P2P traffic, but nothing more is said about the packet content itself.
Commex Technologies, with the Vulcan Architecture (Engle, 2009), addresses a content aware routing concept. In order to classify the incoming traffic, a classification engine is used to locate specific patterns in the packet (e.g., the text string "john smith"). Based on the fields and the extracted patterns, the Commex Vulcan NIC sends the relevant data to a specific host core or other network port. The Vulcan architecture differentiates between high and lower priority data by labeling the payload with specific text strings such as "buy", "sell", or "stop order", in the case of an automatic trading application (Engle, 2009). Thus, Commex Technologies approaches content-based routing, but in a restricted manner, because of the few text strings that can label the payload. Another weak aspect of this approach is the necessity to inspect the packet payload.

This research is part of the new European FP7 ICT research project Media Ecosystem Deployment Through Ubiquitous Content-Aware Network Environments, ALICANTE (2010), which adopted the Network-Awareness at Application layers (NAA) / Content-Awareness at Network layer (CAN) approach while acting to define, design, and implement a multi-domain Media Ecosystem.
2. EXISTING PROBLEMS
AND IMPACT ANALYSIS
As we mentioned before, by adding the CATI header to any Layer 3 protocol header (IPv4 or IPv6), we increase the packet size by up to 4 bytes. The number of bits required by the CATI header, as presented in this paper, is 26; still, in order to respect other restrictions, we need to add extra bits (padding) so that the length of the content header is a multiple of 4 bytes (for the 26-bit CATI, 6 padding bits are therefore appended, yielding a 32-bit option). This restriction originates from the fact that the IP header length must be a multiple of 32 bits. Even if this extra information does not look like a major change in the packet structure, there are some problems that could appear as a result of adding the content aware fields:

The maximum length of an Ethernet frame is 1518 bytes. By adding 4 more bytes to the Ethernet frame, there will be situations in which the maximum Ethernet frame length is exceeded. Frames that exceed the pre-established value by a small margin are called "little giants", and they will be classified as transmission errors by the Ethernet switches. This issue was also addressed in the implementation of the 802.1Q networking standard, where a 32-bit encapsulation (the Q-Tag includes 802.1Q information and 802.1p priority information) was needed. In order to overcome this limitation, the IEEE 802.3ac standard extended the Ethernet frame maximum size to 1522 bytes. In our experiments, based on the fact that the CATI header is used during the routing process (Layer 3), we placed the content aware group of bits in the Options field of the Layer 3 protocol in use (IPv4, for example). The impact of such a decision is that some network equipment and technologies will treat this type of packet (packets with data inserted in the Layer 3 Options field) in a different manner. An example is the Cisco Multi Layer Switches running Cisco Express Forwarding (the fastest packet forwarding method in use for this type of equipment), where such packets will not be candidates for hardware processing and will instead be processed in software. Because of the processing time differences between the two methods, this could be considered an important drawback, and it needs to be addressed in collaboration with the networking equipment producers. A possible solution for this issue could be the insertion of the CATI in the RTP extension header, which would solve the problem of network equipment that does not recognize packets with the IP Options field set.
3. OVERVIEW OF THE
ALICANTE ARCHITECTURE
The ALICANTE project proposes a novel concept and architecture, Future Internet (FI) oriented, towards the deployment of a networked Media Ecosystem, based on flexible cooperation between Service Providers (SP), Network Providers (NP), and End-Users (EU). The solution enables EUs to access the media services offered in various contexts, and also to share and deliver their own media content and services dynamically, seamlessly, and transparently relative to other users. Figure 1 shows a high level view of the ALICANTE concept. The environments and layers are shown in a vertical decomposition. Innovative entities like the Media-Aware Network Element (MANE) and the Home-Box (HB), along with environment interactions, are also depicted (Mevel, 2010).

Figure 1. ALICANTE concept and system architecture (Grafl & Timmerer, 2010)

The architecture and main concepts are defined in Mevel (2010), Grafl and Timmerer (2010), and Borcoci, Negru, and Timmerer (2010); we present here only a brief summary. To enable the described key concepts, two novel virtual layers are suggested in ALICANTE on top of the traditional network layer, virtualizing the network nodes: one layer essentially for packet processing (the CAN layer) and the other mainly for content delivery to End-Users (the Home-Box layer). Some further details are as follows:

Virtual Content-Aware Network (VCAN) is an overlay network offering enhanced support for packet payload inspection, processing, and caching in network nodes.
The specific components in charge of creating a VCAN are the MANEs, i.e., the new CAN routers. A CAN manager entity controls the VCAN logical infrastructure via Intra-domain Network Resource Managers (to preserve each domain's independence). A VCAN can improve data delivery by classifying and controlling messages in terms of content, application, and individual subscribers. It improves QoS assurance by classifying the packets and associating them with the appropriate CANs. It may apply content/name-based routing and forwarding. An important feature is flow adaptation, which can be done per flow in network nodes, to respond to changes in the end-to-end chain (terminal, access network, or core network). The network security level can be increased via content-based monitoring and filtering. Therefore, CAN NAA can provide higher levels of performance and End-User experience, and can enable application- and subscriber-specific data forwarding. For scalability reasons, in a first phase of deployment, the MANE nodes can be installed at the edges of IP core domains only.

The Virtual Home-Box layer is a middleware layer which uses CAN services and takes into account network-aware information delivered upward by the CAN layer (e.g., it can obtain network distances in order to improve P2P peering for HBs working in such a mode (the ALTO problem) (Xie, Krishnamurthy, Silberschatz, & Yang, 2010)). The HBs cooperate with the User and Service environments. They provide functions such as adaptation, service mobility, security, and the entire management of services and content. A new specific middleware is thus proposed, working in conjunction with the other layers. The content-aware network router (i.e., the MANE) is an innovative intelligent network node. It detects the content type in order to perform appropriate processing (filtering, routing, adaptation, security operations, etc.) according to the content properties (described by metadata or extracted by protocol field analysis) and also depending on network properties and current status. The results of the content related information analysis consist of some metrics, which allow choosing the best strategy to adopt for the content flow processing (Mevel, 2010).
4. NEW FIELDS FOR CONTENT
AWARE CLASSIFIERS
Because the MANE router from ALICANTE must classify data packets considering content based information, its classifier should also be capable of processing the content information of the packet header, in addition to the classical headers.
A. Content Awareness
Transport Information (CATI)
The main purpose of the Content Awareness Transport Information is to enable the content awareness features at the CAN layer. More specifically, for the delivery of each individual media service, its related CATI is first constructed at the SP/CP server side according to the media service definition, the SLA with the End User, and the SLA with the CAN; at the Virtual Home-Box layer, the CATI is inserted into the transmitted data packets. At the CAN layer, the transmitted data packets are inspected and the CATI is used to enable the content awareness features. The CATI header consists of a fixed-length, well structured bit string, as depicted in Table 1.
Each field of the header is interpreted as follows:

U is defined for multicast or unicast services, taking the value 0 for unicast and 1 for multicast, respectively. It is necessary for the multicast framework in order to identify whether a flow is multicast or not. The overlay multicast approach does not maintain the multicast address in the core part of the network.

M indicates the presence of the CATI extension (0: no extension, 1: extension present). If 1, the MANE also reads the data available in the header extension field.

A represents the adaptability of the service (0: no adaptation, 1: adaptation). If 1, adaptation may be applied at the MANE.

STYPE is the acronym for the service type field. The combinations of its 4 bits represent types of services, such as push content, user interaction, web based services, live TV, VoD, content download, etc.

SST represents the service subtype, such as big size, small size, HTTP streaming, instant messaging, HD, 3D, SD, or other necessary characteristics of a finer granularity.

SPR defines the service priority, where the four variants are as follows: 00 - Default, 01 - Silver, 10 - Gold, and 11 - Platinum. The MANE consults the appropriate table and marks the corresponding flow with the indicated DSCP.

VCANID is a value that corresponds to different VCANs.

EXT represents an extension header, dedicated to future usage.
B. Using CATI Fields for
Creating a Multifields
Content Aware Classifier
In order to use the content feature in the packet classification process, our solution proposes the utilization of some CATI fields by a new multidimensional packet classifier. Thus, one field of the suggested classifier is represented by the 6 bits associated with the VCAN ID. These 6 bits may generate 64 possible values and are inserted at CATI initialization according to the following rule (service priority is assumed to be established in advance, with Platinum the most important and Default the least important):

Platinum services VCAN ID: 48-63;
Gold services VCAN ID: 32-47;
Silver services VCAN ID: 16-31;
Default services VCAN ID: 0-15.

Hereby, all the packets containing a bit sequence beginning with 11 in the VCAN ID field belong to the Platinum service type, and the ones beginning with 00 belong to the Default service type. The number of bits to be initially inspected depends on the number of existing VCANs in the system (if there are only a few VCANs, it is enough to inspect only the first 2 bits; otherwise, the first three need to be examined in order to obtain finer granularity).
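For illustration, a minimal sketch (ours, not ALICANTE code) of this two-bit prefix rule in Java:

// Map a 6-bit VCAN ID (0-63) to its service priority class, following the
// ranges above: the two most significant bits select the class
// (00 Default, 01 Silver, 10 Gold, 11 Platinum).
public static String priorityOf(int vcanId) {
    switch (vcanId >> 4) {          // keep only bits 5-4
        case 3:  return "Platinum"; // 48-63, prefix 11
        case 2:  return "Gold";     // 32-47, prefix 10
        case 1:  return "Silver";   // 16-31, prefix 01
        default: return "Default";  // 0-15,  prefix 00
    }
}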
Another field used in the content aware classification is the U field. By inspecting only one bit, the classifier relieves the overall router processing and determines whether the flow is multicast or unicast. It is very efficient for the router to verify only a single bit and thus conclude the type of the flow.

The STYPE field of the CATI header is the third one used in the suggested multi-field classifier. Using the 4-bit combinations of the STYPE field, 16 different service types can be created. Thus, a finer granularity in terms of content results for the packet classification. The association between each 4-bit sequence and the corresponding service type is assumed to be known in the entire network.
Table 1. CATI header structure

Bits (0-25): U | M | A | STYPE | SST | SPR | VCANID | EXT
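As an aid to reading Table 1, the following sketch extracts the fields from a 26-bit CATI string. The U, M, A, STYPE (4-bit), SPR (2-bit), and VCANID (6-bit) widths follow the text above; the SST width used here is a placeholder assumption, since the paper does not state it explicitly:

// Illustrative CATI field extraction from a 26-bit string (bit 0 first).
// The SST width (8 bits) is an assumption; EXT takes whatever remains.
public final class CatiHeader {
    public final int u, m, a, stype, sst, spr, vcanId, ext;

    public CatiHeader(String bits) {           // e.g., a 26-character "0"/"1" string
        int p = 0;
        u      = take(bits, p, 1); p += 1;     // unicast/multicast flag
        m      = take(bits, p, 1); p += 1;     // extension present
        a      = take(bits, p, 1); p += 1;     // adaptation allowed
        stype  = take(bits, p, 4); p += 4;     // service type
        sst    = take(bits, p, 8); p += 8;     // service subtype (assumed width)
        spr    = take(bits, p, 2); p += 2;     // service priority
        vcanId = take(bits, p, 6); p += 6;     // VCAN identifier
        ext    = take(bits, p, bits.length() - p); // remaining extension bits
    }

    private static int take(String bits, int from, int len) {
        return Integer.parseInt(bits.substring(from, from + len), 2);
    }
}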
C. Lucent Bit Vector Algorithm
Adaptation for the New
Multifields Classifier
As the Lucent Bit Vector algorithm uses, in a first phase, hierarchical tries to obtain the best matching rule when classifying a new incoming packet, the new solution processes the three new header fields presented (VCAN ID, U, and STYPE). To demonstrate the functionality of the content aware classification algorithm, we consider the classifier from Table 2. Unlike the others, this classifier contains the newly defined fields needed to achieve a content aware classification.

By keeping the working principle of the Lucent bit vector algorithm (Medhi & Ramasamy, 2007), and after identifying the prefixes of the U, STYPE, and VCAN ID fields, the field sets associated with the classifier and three binary tries are built, as illustrated in Figure 2.
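To make the data structure concrete, here is a minimal sketch (our illustration, not the authors' implementation) of one per-field binary trie whose nodes store the rule bitmap for their prefix; lookup keeps the bitmap of the longest matching prefix seen while descending:

// Minimal binary trie node for one field. Each prefix node stores the
// bitmap (as a "0"/"1" string) of rules whose prefix matches there.
final class TrieNode {
    TrieNode zero, one;   // children for bits '0' and '1'
    String bitmap;        // rule bitmap at this prefix, or null

    // Insert a prefix (e.g., "101") together with its rule bitmap.
    static void insert(TrieNode root, String prefix, String bitmap) {
        TrieNode n = root;
        for (char c : prefix.toCharArray()) {
            if (c == '0') n = (n.zero != null) ? n.zero : (n.zero = new TrieNode());
            else          n = (n.one  != null) ? n.one  : (n.one  = new TrieNode());
        }
        n.bitmap = bitmap;
    }

    // Walk the trie along the packet's field bits, remembering the bitmap
    // of the longest (best) matching prefix on the way down.
    static String lookup(TrieNode root, String fieldBits) {
        String best = root.bitmap;
        TrieNode n = root;
        for (char c : fieldBits.toCharArray()) {
            n = (c == '0') ? n.zero : n.one;
            if (n == null) break;
            if (n.bitmap != null) best = n.bitmap;
        }
        return best;
    }
}

The three per-field bitmaps returned by lookup are then combined with a bitwise AND, as in the walkthrough below.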
Table 2. Classifier with content aware information

Rule | U Field | STYPE Field | VCAN_ID Field
R1   | 0       | 00*         | 01*
R2   | 1       | 0*          | 10*
R3   | 0       | 0*          | 101*
R4   | 1       | 10*         | 10*
R5   | 1       | 11*         | 00*
R6   | 1       | 11*         | 00*
R7   | 0       | 0*          | 01*
R8   | 1       | *           | 11*

Figure 2. Tries for U, STYPE, and VCAN ID fields

We consider an incoming packet with U=0, STYPE=011, and VCAN ID=1011. After first performing the U search, the binary vector 10100010 is found, corresponding to prefix 0. In the next step, the STYPE field is checked and the binary vector 01100011 is found, corresponding to the 0* prefix. Finally, we search the VCAN ID trie and obtain the binary vector 00100000, which corresponds to the 101* prefix. Then, by performing a bitwise AND operation over the three vectors, the final result is obtained in the form of the bit vector 00100000. Since the lowest set bit position in the resulting vector is three, the best matching rule is R3.

D. Tuple Space Adaptation for the New Multifields Classifier in Order to Obtain Content Awareness
In order to demonstrate the possibility of using the three new fields as an improvement in different classifying algorithms, we illustrate how the content aware packets can be classified using the tuple space approach (Medhi & Ramasamy, 2007). We use the classifier example from Table 2. The unique length of the U field is 1, indicating whether the packet is unicast or multicast. The possible lengths of the STYPE field are 0, 1, and 2, while the distinct lengths of the VCAN ID field are 2 and 3. Thus, the tuple space consists of the set T = {(0,2); (0,3); (1,2); (1,3); (2,2); (2,3)}. A rule maps to a tuple T if, for all i, the length of the prefix specified in field F_i consists of exactly T[i] bits (Table 3).

Example: In order to underline the content based classification process, we consider the same incoming packet as in Section C, having U=0, STYPE=011, and VCAN ID=1011 (Table 4).
Results

Table 3. Mapping of rules from Table 2 to tuples

Rule | U Field | STYPE Field | VCAN_ID Field | Tuple
R1   | 0       | 00*         | 01*           | (1, 2, 2)
R2   | 1       | 0*          | 10*           | (1, 1, 2)
R3   | 0       | 0*          | 101*          | (1, 1, 3)
R4   | 1       | 10*         | 10*           | (1, 2, 2)
R5   | 1       | 11*         | 00*           | (1, 2, 2)
R6   | 1       | 11*         | 00*           | (1, 2, 2)
R7   | 0       | 0*          | 01*           | (1, 1, 2)

R8   | 1       | *           | 11*           | (1, 0, 2)

For example, rule R7 maps to tuple (1, 1, 2), since its prefix length for U is 1 bit, for STYPE is 1 bit, and for VCAN ID is 2 bits.

Table 4. Contents of tuple hash tables

Tuple     | Hash table entries
(1, 0, 2) | R8
(1, 1, 2) | R2, R7
(1, 1, 3) | R3
(1, 2, 2) | R1, R4, R5, R6

The search begins with tuple (1, 0, 2) by using the key 010, generated from the first bit of the U field and the first 2 bits of the VCAN ID field. Note that tuple (1, 0, 2) does not involve any bit of the STYPE field. The key 010 is used to probe the hash table for tuple (1, 0, 2) and does not match any rule. For the next tuple (1, 1, 2), the key 0010 is generated. Because this key does not match any
rule, the search moves to tuple (1, 1, 3). At this stage, the key 00101, which matches rule R3, is generated. For the moment, R3 is recorded as Rbest, and the search moves to the next tuple (1, 2, 2). The generated key 00110 does not match any rule. There are no more tuples to be examined and, therefore, R3 is declared to be the best matching rule.
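To make the probing procedure concrete, the following self-contained sketch (our illustration, not the paper's code) replays this example. Each tuple owns a hash table keyed by the concatenation of the first T[i] bits of each field, and tuples are visited from shorter to longer prefixes so that later matches refine Rbest (R6, sharing R5's prefixes, collapses onto the same key, so only the lower-cost R5 is kept):

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal tuple-space search over the classifier of Table 2.
public final class TupleSpaceDemo {

    public static void main(String[] args) {
        // tuple -> (key -> rule), in increasing prefix-length order
        Map<int[], Map<String, String>> space = new LinkedHashMap<>();
        space.put(new int[]{1, 0, 2}, Map.of("111", "R8"));
        space.put(new int[]{1, 1, 2}, Map.of("1010", "R2", "0001", "R7"));
        space.put(new int[]{1, 1, 3}, Map.of("00101", "R3"));
        space.put(new int[]{1, 2, 2}, Map.of(
                "00001", "R1", "11010", "R4", "11100", "R5")); // R6 = R5's key

        System.out.println(search(space, "0", "011", "1011")); // prints R3
    }

    static String search(Map<int[], Map<String, String>> space,
                         String u, String stype, String vcan) {
        String best = null; // Rbest
        for (Map.Entry<int[], Map<String, String>> e : space.entrySet()) {
            int[] t = e.getKey();
            // Build the probe key from the first t[i] bits of each field.
            String key = u.substring(0, t[0])
                       + stype.substring(0, t[1])
                       + vcan.substring(0, t[2]);
            String hit = e.getValue().get(key);
            if (hit != null) best = hit; // later tuples refine the match
        }
        return best;
    }
}

Running this reproduces the walkthrough: keys 010, 0010, 00101, and 00110 are generated in turn, and R3 remains the final Rbest.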
5. DEEP PACKET INSPECTION
In the previous paragraphs we described the content aware packet classification situation in the case of CATI insertion at the content server side. There is no doubt that classification through the CATI simplifies the MANE analysis. Next, we address the content aware packet classification situation when there is no content server inserting a CATI header, and classification must be done based on a deep packet inspection (DPI) analysis. Before continuing, we assume that every incoming data packet that arrives at the ingress MANE is analyzed and classified according to the content carried in the packet payload. In order not to apply this intensive work to every single packet of a new incoming flow, the first few packets of the flow identify the content type, and this information is kept in memory along with the 5-tuple of protocol header information. This information, along with a unique hash key calculated from each 5-tuple, defines the flow (Mevel, 2011). For every other incoming packet that belongs to an existing flow, only the hash key needs to be calculated. This allows the handling of incoming traffic per flow and not per packet, increasing efficiency and minimizing possible delays.
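A minimal sketch of this per-flow shortcut (our illustration; the class and member names are assumptions, not the ALICANTE implementation): the 5-tuple is hashed once, and the content type learned from the first packets is reused for the rest of the flow:

import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Cache the content type per flow, keyed by a hash of the 5-tuple, so that
// the expensive DPI step runs only on the first packets of a flow.
public final class FlowCache {
    private final Map<Integer, String> contentTypeByFlow = new HashMap<>();

    static int flowKey(String srcIp, String dstIp,
                       int srcPort, int dstPort, int protocol) {
        return Objects.hash(srcIp, dstIp, srcPort, dstPort, protocol);
    }

    /** Returns the cached content type, or runs DPI once and caches it. */
    String classify(String srcIp, String dstIp, int srcPort, int dstPort,
                    int protocol, byte[] payload) {
        int key = flowKey(srcIp, dstIp, srcPort, dstPort, protocol);
        return contentTypeByFlow.computeIfAbsent(key, k -> deepInspect(payload));
    }

    private String deepInspect(byte[] payload) {
        // Placeholder for the signature-based DPI step described below.
        return "unknown";
    }
}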
The DPI implies a special inspection of the network flows, with the result of information extraction from higher layers (up to the application layer). By exploiting this rich information provided through DPI, the MANE offers many options in packet flow processing with respect to QoS policies.

Figure 3 shows how the classifier of an ingress MANE performs the classification based on DPI or, if present, on the CATI header. Depending on the classification results, the packets are forwarded inside the suitable VCAN.

Figure 3. MANE classifier based on DPI/CATI

The development of DPI methods for traffic classification is built around two basic assumptions (Mevel, 2011):

Third parties unaffiliated with either the source or the recipient are able to inspect each IP packet's payload;
The classifier knows the syntax of each application's packet payloads.
For this second assumption, a library of protocol signatures and filter strings has to be built and consulted so that the protocol can be detected. An inevitable issue is that some protocols or traffic flows encrypt the content, or even some upper layer protocol headers. In such situations, DPI is not applicable for traffic classification.
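For illustration only, such a signature library could be organized as follows (the patterns below are invented placeholders, not the actual library):

import java.util.Map;
import java.util.regex.Pattern;

// Toy protocol-signature library: the first pattern found in the payload
// names the protocol; unrecognized or encrypted traffic stays "unknown".
public final class SignatureLibrary {
    private static final Map<String, Pattern> SIGNATURES = Map.of(
            "HTTP",       Pattern.compile("^(GET|POST|HEAD) "),
            "SIP",        Pattern.compile("^(INVITE|REGISTER) "),
            "BitTorrent", Pattern.compile("\\x13BitTorrent protocol"));

    static String detect(String payload) {
        for (Map.Entry<String, Pattern> e : SIGNATURES.entrySet()) {
            if (e.getValue().matcher(payload).find()) return e.getKey();
        }
        return "unknown";
    }
}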
6. PERFORMANCE ANALYSIS
In order to analyze the possibilities that the newly introduced packet fields offer, we decided to create a classifier that, based on the content fields, is able to map each particular packet (placed at its entry) to one of the rules specified by the user or, in a real environment, by the system administrator. At the current time, we center our attention on working with only one CATI field at a time: the Multicast bit (U), the Service Type field (STYPE), or the Virtual CAN ID (VCANID). We intend to prove that by using the content fields we are able to gain significant information regarding each packet, while the complexity of the implementation is kept at an acceptable level. In the following sections, we present the platform used for implementing the Lucent Bit Vector and Tuple Space algorithms, followed by the data structures used and the generated results, represented as the average processing time per packet for each of the algorithms.
We implemented the two algorithms in Java, using NetBeans 7.0.1 (http://netbeans.org) as the development environment. The packets placed at the input of the classifier were created using a packet generation tool, PackETH (http://packeth.sourceforge.net/), and the content information was placed in the IPv4 Options field.
The hardware platform used, relevant for the obtained results (average processing time per packet), is characterized by: Processor: Intel Core i5 M480; Memory: 4 GB. For simulating the exact behavior of a real router, we decided to adopt a packet processing technique that follows the main steps performed by a common router when receiving a frame:

Decapsulating the L2 information;
Extracting the relevant L3 information (here we are including the CATI information);
Processing the packet based on the extracted information, mapping it to one of the user specified rules.
In order to extract the relevant information, we use one object per packet, each object representing an instance of a Java class. We use the IPv4 Header Length field in order to identify the existence of the CATI fields. As a result, if the header length exceeds 20 bytes, we assume that the IPv4 header contains the content fields in the IPv4 Options field; a simple verification is thus needed in order to identify the packets containing the new content aware fields. However, in our experiment, all the packets used include the CATI fields. The overhead consists of 32 content bits, the total header length growing from 160 to 192 bits:

$$\frac{192 - 160}{160} = 20\%$$
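A minimal sketch of that check (our illustration): the IHL nibble of the first IPv4 header byte gives the header length in 32-bit words, so any value above 5 (i.e., 20 bytes) signals that Options, here carrying the CATI fields, are present:

// Detect CATI presence via the IPv4 IHL field (illustrative sketch).
// IHL is the low nibble of byte 0, counted in 32-bit words; 5 words
// = 20 bytes = no options, so anything larger carries Options data.
public static boolean hasCatiOptions(byte[] ipv4Packet) {
    int ihlWords = ipv4Packet[0] & 0x0F;   // header length in 32-bit words
    return ihlWords > 5;                   // > 20 bytes -> Options present
}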
A. Analysis Parameters for Lucent
Bit Vector Adaptation Created in
the New Multifields Classifier
The tree nodes for each of the fields are calculated only once, based on the rules provided by the user, before the first packet is read. We consider the rules illustrated in Table 2. Based on these bitmaps, we are able to make a decision for each of the incoming packets. As a result, when a packet is processed, we extract the field of interest (we are working with only one field) and, after identifying the best matching prefix, we also obtain the associated bitmap. By analyzing the bitmap bits, using a method named getRuleNumber included in the LucentBitVector class, we are able to identify the best matching rule. Below we present the method for obtaining the best matching rule, included in the LucentBitVector class (see Listing 1).
We must also specify that the results provided by the specified algorithm include the best matching rule and the processing time per packet, measured from the point at which the file is read until the best matching rule is determined. In the Results section, we use this processing time in order to compare the two algorithms implemented on our platform.
B. Analysis Parameters for Tuple
Space Adaptation Created in
the New Multifields Classifier
Similar to the Lucent Bit Vector, before any packet is processed, we construct the hash associated with every rule. Using the current implementation, we work with only one field at a time, so the hash associated with every rule is created based on the field that the user selected.

At the beginning of the program, a method included in the TupleSpace Java class is used to create a space of dimensions associated with the relevant field. After this process ends, each of the rules is mapped to one of the dimensions of the tuple space. When a packet is received, we extract the relevant information and compare the extracted values to the hash of each of the rules. The rules are processed sequentially, starting with the rules associated with the lowest dimension of the tuple space:

After a match is identified, we stop the search for the current dimension;
We advance the search to a higher dimension of the tuple space, in order to search for a better match.

The two steps specified earlier are implemented in a method called getRuleNumber, included in the TupleSpace class, a method based on the following algorithm (see Listing 2).
We measure the processing time per packet in the same manner as the one adopted for the Lucent Bit Vector. Based on these premises, we compared the average processing times of the two methods. To obtain relevant data, from which we can extract sufficient information to accurately compare the two methods implemented on this platform, we used the following procedure: for each of the analyzed fields, we placed 50 identical packets at the packet classifier entry. In order to avoid exceptional results (caused by various system related reasons), we consider only the average packet processing time for both the Lucent Bit Vector and the Tuple Space approach.

As a first test, we analyzed packets based on the Multicast CATI field (U). After running the two algorithms for a set of 50 packets with the CATI U field set to 0, and another set of packets with the value 1 for the specified field, we obtained the results shown in Figure 4.
As can be observed from Figure 4, the Tuple Space algorithm offers the best results, being about 1.74 times faster. Also, since the tuple space is composed of only one dimension (length 1), only one rule (for U = 0) or two rules (for U = 1) will be verified. We also observe that the difference between the average time obtained when the packets match Rule 1 and when they match Rule 2 is not significant (at most ~400 ns).

Figure 4. Proportional representation of average processing time (ns) for the CATI U field: LBV vs. Tuple Space

Listing 1. The method for obtaining the best matching rule

public static int getRuleNumber(String sone) {
    // sone is the final bit vector after ANDing the per-field vectors;
    // the first '1' (lowest set bit position) identifies the best rule.
    int firstOne = 0;
    for (int i = 0; i < sone.length(); i++) {
        if (sone.charAt(i) == '1') {
            firstOne = i;
            break;
        }
    }
    return firstOne + 1; // rules are numbered from 1; assumes a match exists
}
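As a quick sanity check, reusing the method above on the AND result from the Lucent Bit Vector walkthrough earlier in the paper:

// Lowest set bit of 00100000 is at 1-based position 3, so the method
// reports rule R3, matching the walkthrough's result.
int best = getRuleNumber("00100000"); // returns 3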
7. DPI VS. NORMAL ROUTING
CLASSIFICATION EFFICIENCY
When the CATI header is present, our approach to Content Aware Classification is based on normal routing behaviour, to which we add a set of rules and policies that should be applied to specific packets if they obey a certain rule. This way, if we are able to treat the CATI fields as normal packet fields (which implies overcoming the two main problems specified in Section 2), the complexity of the routing process will suffer insignificant changes. Also, when referring to scalable classification algorithms, the efficiency of the classification itself remains at the current level even if the CATI fields are considered; the efficiency should not suffer a big impact once the number of analyzed fields is increased by 1-3 fields.

In order to quantify the advantages of using the CATI header, we need to determine the limits defined by a Normal Routing Process (NR) and a Routing Process based on Deep Packet Inspection (DPIR), points that we consider to be, respectively, the lowest and highest in current processing complexity
and processing time needed.

Listing 2. The algorithm that supports getRuleNumber in the TupleSpace class

public static int getRuleNumber(String sone) {
    // dimensions, rulesInDimension, ruleHash, and HashCreator are members
    // of the TupleSpace class (not shown). Dimensions are iterated from
    // lowest to highest, so a later match overrides the earlier best rule.
    String packetHash;
    int bestRule = 0;
    for (int i = 0; i < dimensions.length; i++) {
        packetHash = HashCreator(sone, dimensions[i]); // key for dimension i
        for (int j : rulesInDimension[i]) {            // candidate rules
            if (ruleHash[j].equals(packetHash)) {
                bestRule = j;
                break; // stop searching the current dimension
            }
        }
    }
    return bestRule + 1;
}

Based on a Cisco NBAR (Network Based Application Recognition) performance analysis, we concluded that, when using the Protocol Discovery features at a rate of 60% of the maximum allowed traffic (where no packet is dropped), two unwanted effects appear: throughput degrades and CPU utilization increases. To be more precise, when using the Protocol Discovery features at around the specified traffic rate, the throughput decreases by up to 5-7% compared to the Normal Routing Process, while the CPU utilization increases by more than 30% for Cisco gigabit-ethernet platforms such as (Cisco, 2008):
Cisco 3745 (CPU: NR 61%, DPIR 93%; Throughput: NR 197 Mbps, DPIR 187 Mbps);
Cisco 7206 (CPU: NR 38%, DPIR 91%; Throughput: NR 555 Mbps, DPIR 529 Mbps);
Cisco 7301 (CPU: NR 36%, DPIR 81%; Throughput: NR 556 Mbps, DPIR 546 Mbps).
Judging by the presented results, obtained in a normal environment (mixed traffic), and by the fact that the level of complexity and the classification/routing efficiency should not change significantly with the addition of the CATI header, classification based on the proposed content aware fields will be located very close to the Normal Routing Process results, and it provides important improvements over the existing classification methods based on Deep Packet Inspection.
8. CONCLUSION
This paper suggested a new possibility for classifying data packets based on content information, as part of the ALICANTE project work. ALICANTE proposes a new architecture for content aware networking, and this approach could be a solution for packet differentiation/classification based on content. Even if the standard classification algorithms are the same, the innovative idea of inserting the CATI header and then taking its fields into account to obtain a content aware classification represents one of the key aspects of this paper. More than that, implementing this in Java according to the proposed theory (the CATI header structure, with bit positions and their service associations: Platinum, Gold, ..., unicast, multicast, service type, service subtype, etc.) represents a step forward in packet classification based on content. Firstly, new fields, namely U, STYPE, and VCAN ID, were introduced in the context of content aware packet classification. Then, the Lucent Bit Vector algorithm and also the tuple space approach were modified in order to respond to the new fields' utilities and, thus, to the content aware classification of the packets. It has been demonstrated that, in the context of using the new VCAN ID, U, and STYPE fields, modifying the Lucent Bit Vector algorithm (case 1) and searching in the tuple space (case 2) produce the same result, but with a different efficiency, as shown in the last part of this paper. According to the various results obtained with our implementation, while using different packet data and decision fields, we concluded that the Tuple Space algorithm provides superior efficiency in terms of packet processing time for a classifier comparable to ours. In our implementation, all these benefits generate a very small overhead, consisting of 32 bits, which increases the IPv4 header by 20%. Further, the possibility of using the proposed VCAN ID, U, and/or STYPE fields with other algorithms that take into consideration the content aware information of the packets, and the possibility of implementing another classifier that checks the other proposed new fields, will be analyzed.

The content aware classification solution with CATI insertion is in line with the Future Internet approach, Information-Centric Networking, and Content-Centric Networking, where it is accepted that, for a finer classification granularity, the routers have to perform more complicated tasks.
REFERENCES
ALICANTE. (2010). Media ecosystem deployment
through ubiquitous content-aware network envi-
ronments. Retrieved from http://www.ict-alicante.
eu/public/include/files/Alicante_flyer_FINAL.pdf
Borcoci, E., Negru, D., & Timmerer, C. (2010, June).
A novel architecture for multimedia distribution based
on content-aware networking. In Proceedings of the
Third International Conference on Communication
Theory, Reliability and Quality of Service, Athens,
Greece (pp. 162-168).
Cisco. (2008). Network based application recogni-
tion performance analysis. Retrieved from http://
www.cisco.com/en/US/technologies/tk543/tk759/
technologies_white_paper0900aecd8031b712_
ps6616_Products_White_Paper.html
Engle, Y. (2009). Commex-Vulcan in-server content
aware routing network interface card. Retrieved
from http://www.hypertransport.org/docs/uploads/
Commex_Content_Aware_Routing_WhitePaper.pdf
Grafl, M., & Timmerer, C. (Eds.). (2010). Service/
content adaptation definition and specification (De-
liverable D2.2). Brussels, Belgium: ICT ALICANTE.
Medhi, D., & Ramasamy, K. (2007). Network rout-
ing- Algorithms, protocols and architectures. San
Francisco, CA: Morgan Kaufmann.
Media Delivery Platforms Cluster. (2008). Multi-
media delivery in the future Internet: A converged
network perspective. Retrieved from http://www.
ist-sea.eu/Dissemination/MDP_WhitePaper.pdf
Mevel, A. (Ed.). (2010). Overall system and com-
ponents definition and specifications (Deliverable
D2.1). Brussels, Belgium: ICT ALICANTE.
Mevel, A. (Ed.). (2011). Content-aware network
infrastructure and elements Intermediate (Deliver-
able D6.1.1). Brussels, Belgium: ICT ALICANTE.
Thompson, K., Miller, G., & Wilder, R. (1997). Wide-area Internet traffic patterns and characteristics. IEEE Network, 11(6), 10-23. doi:10.1109/65.642356
Vorniotakis, N., Xilouris, G., Gardikis, G., Zotos,
N., Kourtis, A., & Pallis, E. (2011). A preliminary
implementation of a content-aware network node.
In Proceedings of the International Workshop on
Multimedia and Expo (pp. 1-6).
Xie, H., Krishnamurthy, A., Silberschatz, A., &
Yang, Y. (2010). P4P: Explicit communications
for cooperative control between P2P and network
providers. Retrieved from http://www.dcia.info/
documents/P4P_Overview.pdf
Zhou, Z., & Song, T. (2011, December). File-aware
P2P traffic classification. In Proceedings of the
International Workshop on Global Communications
(pp. 636-640).
Analyzing the Effect of Node Density on the Performance of the LAR-1P Algorithm

Hussein Al-Bahadili, Petra University, Jordan
Ali Maqousi, Petra University, Jordan
Reyadh S. Naoum, Middle East University, Jordan

ABSTRACT

The location-aided routing scheme 1 (LAR-1) and probabilistic algorithms are combined together into a new algorithm for route discovery in mobile ad hoc networks (MANETs) called LAR-1P. Simulation results demonstrated that the LAR-1P algorithm reduces the number of retransmissions as compared to LAR-1 without sacrificing network reachability. Furthermore, on a sub-network (zone) scale, the algorithm provides an excellent performance in high-density zones, while in low-density zones it preserves the performance of LAR-1. This paper provides a detailed analysis of the performance of the LAR-1P algorithm through various simulations, where the actual numerical values for the number of retransmissions and reachability in high- and low-density zones were computed to demonstrate the effectiveness and significance of the algorithm and how it provides better performance than LAR-1 in high-density zones. In addition, the effect of the total number of nodes on the average network performance is also investigated.

Keywords: Flooding Optimization Algorithms, LAR-1P, Location-Aided Routing Scheme 1 (LAR-1), Mobile Ad Hoc Networks (MANETs), Probabilistic Algorithm, Pure Flooding, Route Discovery

DOI: 10.4018/jitwe.2012040102
INTRODUCTION
A mobile ad hoc network (MANET) is a
self-configuring infrastructureless network of
low-battery powered mobile devices (nodes)
connected by wireless links. Each node in a
MANET is free to move independently in any
direction, and will therefore change its links
to other nodes on the network frequently (Al-
Bahadili, 2012). Each node usually acts as a
router forwarding traffic unrelated to its own
use. One of the main challenges in building a
MANET is equipping each node to continuously
maintain the information required for efficient
and reliable traffic routing (Land, 2008).
The data packets in MANETs are forwarded
to other mobile nodes in the network through
reliable and efficient dynamic routing protocols,
which are part of the network layer software
(Land, 2008). These protocols are responsible
for deciding which output route a packet should
be transmitted on. Dynamic routing protocols
(e.g., the dynamic source routing (DSR), ad
hoc on-demand distance vector (AODV), zone
routing protocol (ZRP)) consist of two main
phases; these are: route discovery and route
maintenance.
Route discovery is used when a source node
desires to send a packet to some destination
and does not already have a valid route to that
destination; in which the source initiates a route
discovery process to locate the destination. It
broadcasts a route request (RREQ) packet to its
neighbours, which then forward the request to
their neighbours, and so on until the expiration
of the packet. During the forwarding process,
the intermediate nodes record the address of
the node from which first copy of the broadcast
packet is received in their routing tables. Once
the RREQ reaches the destination, it responds
with a route reply (RREP) packet back to the
source through the route from which it first
received the RREQ. Otherwise, if the RREQ
packet expired before reaching its destination,
then the node, at which it expires, sends a route
error (RERR) packet back to the source to initi-
ate a new route discovery process.
Pure flooding is the earliest, simplest,
and most reliable mechanism proposed in
the literature for route discovery in MANETs
(Bani-Yassein & Ould-Khaoua, 2007; Bani-
Yassein et al., 2006). Although it is simple and reliable, pure flooding is costly: it requires N transmissions in a network of N reachable nodes. In addition, pure flooding results in
serious redundancy, contention, and collisions
in the network; such a scenario has often been
referred to as the broadcast storm problem (BSP)
(Tseng et al., 2002).
A variety of flooding optimization al-
gorithms have been developed to alleviate
the effects of BSP during route discovery in
MANETs aiming at reducing the number of
redundant retransmissions without significantly
affecting network reachability. Examples of
such algorithms include the LAR-1 (Ko & Vaidya, 2000; Boleng & Camp, 2004) and
probabilistic (Al-Bahadili, 2010a; Haas et al.,
2006) algorithms.
The performance of the LAR-1 algorithm suffers significantly from the large number of redundant retransmissions in high-density networks (zones) (Ko & Vaidya, 2000). In contrast, the probabilistic algorithm usually provides excellent performance in high-density zones, by appreciably reducing the number of retransmissions with almost no effect on network reachability. Of course, this is subject to proper adjustment of the retransmission probability (p_t) (Bani-Yassein & Ould-Khaoua, 2007; Bani-Yassein et al., 2006; Al-Bahadili & Kaabneh, 2010).
A new algorithm combining the LAR-1 and probabilistic algorithms, called LAR-1P, was proposed in Al-Bahadili (2012). In this algorithm, when receiving a RREQ, an intermediate node within the request zone rebroadcasts it with a dynamically adjusted p_t. In a low-density zone, a node is assigned a high p_t (close to unity), so that the algorithm behaves like LAR-1, while a reasonable p_t is assigned to the node in a high-density zone, which means the algorithm behaves like the probabilistic algorithm. In that way, LAR-1P combines the best of the two algorithms.
The simulation results in Al-Bahadili
(2012) demonstrated that for a uniform random
node distribution in specific network area and
for a certain simulation setup, LAR-1P reduces
the number of retransmissions with slight reduc-
tion in reachability. Furthermore, LAR-1P can
provide better average performance if the route
discovery process is performed more often in
high-density zones than in low-density zones.
This paper analyzes in detail the performance of the LAR-1P algorithm through a number of simulations using the MANSim network simulator (Al-Bahadili, 2010b). First, the effect of node density (total number of nodes on the network divided by the network area) on the performance of LAR-1P is investigated and compared against other algorithms, such as pure flooding, dynamic probabilistic, and LAR-1. Second, for a certain node density (4x10^-4 node/m^2), the effects of zone node densities on the performance of the LAR-1 and LAR-1P algorithms are investigated.
The rest of this paper is organized as fol-
lows. First, we review the most recent and
related work on route discovery in MANETs.
The LAR-1, probabilistic, LAR-1P algorithms,
and computation procedures are described in
the section afterwards. Simulation results are
presented and discussed in the next section.
Finally, based on the obtained results, a number of conclusions are drawn and a number of recommendations for future work are pointed out.
LITERATURE REVIEW
In this section, we identify some of the work
that is related to the LAR-1P, LAR-1, and
probabilistic algorithms to provide the reader
with background on the most recent develop-
ment steps in those algorithms. LAR-1P was first proposed by Al-Bahadili et al. (2007); it was initially proposed with a fixed p_t. Later on, Al-Bahadili (2012) developed a new version of LAR-1P with dynamic probability, which will be described in the next section.
Location information was first used by
Ko and Vaidya (2000) to develop two different
LAR schemes (LAR-1 and LAR-2) to reduce
flooding overhead in ad hoc networks. Li et al. (2000) presented a modified version of the LAR protocol, called LAKER (location-aided knowledge extraction routing). Boleng and Camp (2004) and Vyas and Mahgoub (2003) combined location information with mobility feedback and patterns to create new routing protocols that reduce routing overheads.
The probabilistic algorithm was first used by Haas et al. (2006) for route discovery in ad hoc networks; they called it a gossip-based (GOSSIP1) protocol. They used a predefined p_t to decide whether or not a node forwards the RREQ packets. Later on, Haas et al. developed a modified protocol, in which nodes gossip with p_t = 1 for the first h hops before continuing to gossip with p_t < 1. Their results showed savings of up to 35% in message overhead compared to simple flooding.
Tseng et al. (2002) investigated the performance of probabilistic flooding for various network densities. They presented results for three network parameters, namely reachability, saved rebroadcast, and average latency, as a function of p_t and network density. Sasson et al. (2003) developed a probabilistic algorithm in which nodes dynamically adjust their p_t based on local topology information. Kim et al. (2004) introduced a probabilistic scheme in which a node dynamically adjusts its p_t according to its additional coverage area. The additional coverage is estimated from the distance to the sender.
Scott and Yasinsac (2004) presented a dynamic probabilistic solution appropriate for solving BSPs in dense mobile networks. Barrett et al. (2005) introduced a probabilistic routing protocol for sensor networks, in which a sensor decides to forward a message with a p_t that depends on various parameters, such as the distance of the sensor to the destination, the distance of the source sensor to the destination, or the number of hops a packet has already traveled.
Viswanath and Obraczka (2005) developed an analytical model to study the performance of plain and probabilistic flooding in terms of reliability and reachability in delivering packets, showing that probabilistic flooding can provide reliability and reachability guarantees similar to plain flooding at a lower overhead. A probabilistic scheme that dynamically adjusts p_t according to node distribution and node movement was proposed in Zhang and Agrawal (2005). The scheme combines the probabilistic and counter-based approaches.
Kyasanur et al. (2006) proposed a smart
gossip protocol, which is a probabilistic pro-
tocol that offers a broadcast service with low
overheads. It automatically and dynamically
adapts transmission probabilities based on the
underlying network topology. The protocol
is capable of coping with wireless losses and
unpredictable node failures that affect network
connectivity over time. The resulting protocol
is completely decentralized. They presented
some evaluation results and demonstrated the
benefits of their protocol over existing protocols.
Abdulai et al. (2006) studied the performance of the AODV protocol over a range of p_t values. They focused on the route discovery part of the routing algorithm and modified the AODV implementation to incorporate p_t. Simulations showed that setting an efficient p_t has a significant effect on the performance of the protocol. The results also revealed that the optimal p_t for efficient performance is affected by prevailing network conditions such as traffic load, node density, and mobility. Abdulai et al. (2007) also proposed two probabilistic methods for on-demand route discovery that are simple to implement and can significantly reduce the overhead involved in the dissemination of RREQs. The two methods are the adjusted probabilistic (AP) and the enhanced AP (EAP).
Bani-Yassein et al. (2007) proposed a dynamic probabilistic algorithm to improve network reachability and saved rebroadcast. It determines p_t by considering node density and speed. Bani-Yassein et al. (2006) combined probabilistic and knowledge-based approaches on top of the AODV protocol to enhance the performance of the existing protocol by reducing the communication overhead incurred during the route discovery process. Hanash et al. (2009) proposed a dynamic probabilistic approach that can reduce broadcast redundancy in MANETs. Al-Bahadili (2010a) developed a new p_t-adjusting model, in which the neighborhood densities are divided into three regions (low, medium, and high). The performance of the new model was compared against the pure flooding and probabilistic algorithms.

Bani-Yassein et al. (2010) proposed a new probabilistic method to improve the performance of an existing on-demand routing protocol by reducing the RREQ overhead during the route discovery operation. The simulation results showed that the combination of AODV and a suitable probabilistic route discovery can reduce the average end-to-end delay as well as overhead, while still achieving a low normalized routing load, compared with AODV using a fixed probability and blind flooding.
DESCRIPTION OF THE
LAR-1P ALGORITHM
This section presents a detailed description of the LAR-1P algorithm; but before that, we provide a brief description of the LAR-1 and probabilistic algorithms.
The LAR-1 Algorithm
The LAR-1 algorithm is based on two important
concepts, namely, the expected zone (EZ) and
the request zone (RZ). For EZ, consider a source
node S that needs to find a route to a destination node D. Assume that S knows that D was at location L at time t_0, and that the current time is t_1. Then, the EZ of D, from the viewpoint of S at time t_1, is the region that S expects to contain D at time t_1. S can determine the EZ based on the knowledge that D was at location L at time t_0 and travels with speed u; S may then assume that the EZ is the circular region of radius R_e = u(t_1 - t_0), as illustrated in Figure 1a.
If the actual speed happens to be larger than the average, then the destination may actually be outside the EZ at time t_1, and vice versa. Thus, the EZ is only an estimate made by S to determine a region that potentially contains D at time t_1. If S does not know a previous location of D, then S cannot reasonably determine the EZ; in this case, the entire region that may potentially be occupied by the ad hoc network is assumed to be the EZ, and the algorithm reduces to the basic flooding algorithm. In general, having more information regarding the mobility of D results in a more appropriate EZ.
For the RZ, again consider that S needs to determine a route to D. LAR-1 uses pure flooding with one modification: S defines (implicitly or explicitly) a RZ for the RREQ that contains both S and D (e.g., with the locations of S and D as opposite corners of a rectangle). A node forwards a RREQ only if it belongs to the RZ (unlike pure flooding). To increase the probability that the RREQ will reach D, the RZ should include the EZ (Figure 1a). Additionally, the RZ may also include other regions around it (Ko & Vaidya, 2000). Note that the probability of finding a path (in the first attempt) can be increased by increasing the size of the RZ. However, the route discovery overhead also increases with the increasing size of the RZ. Thus, there exists a trade-off between the latency of route determination and the message overhead.
Now, let us discuss the LAR-1 scheme itself. It uses an RZ that is rectangular in shape (Figure 1). It is assumed that S knows that D was at location (X_d, Y_d) at time t_0. At time t_1, S initiates a new route discovery for D. It is also assumed that S knows the speed u with which D can move. Using this, S defines the EZ at time t_1 to be the circle of radius R_e = u(t_1 - t_0) centred at location (X_d, Y_d). As stated before, instead of the average speed, u may be chosen to be the maximum speed or some other function of the speed distribution.

The RZ is defined to be the smallest rectangle that includes the current location of S and the EZ, such that the sides of the rectangle are parallel to the X and Y axes. In Figure 1a, the RZ is the rectangle whose corners are S, A, B, and C, whereas in Figure 1b the rectangle has corners at points A, B, C, and G. In Figure 1, the current location of node S is denoted (X_s, Y_s); consequently, S can determine the four corners of the RZ.
Figure 1. The LAR-1 scheme

S includes their coordinates in the RREQ packet transmitted when initiating route discovery. When a node receives a RREQ, it discards the request if it is not within the rectangle specified by the four corners included in the RREQ. For instance, in Figure 1a, if node
i receives the RREQ from another node, node i
forwards the request to its neighbours, because
i determines that it is within the rectangular RZ.
However, when node j receives the RREQ, node
j discards the request, as node j is not within RZ.
When D receives the RREQ packet, it replies by sending a RREP packet (as in pure flooding). However, in the case of LAR-1, D includes its current location and current time in the RREP packet. When S receives this RREP packet (ending its route discovery), it records the location of D. S can use this information to determine the RZ for a future route discovery. It is also possible for D to include its current speed, or average speed over a recent time interval, in the RREP packet. This information could be used in a future route discovery. Further details on the LAR schemes can be found in Ko and Vaidya (2000).
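To make these geometric definitions concrete, the following Java sketch (with hypothetical helper names; it is not taken from the paper's implementation) computes the EZ radius, builds the rectangular RZ from S's location and the EZ, and tests whether a node lies inside the RZ:

// Minimal sketch of the LAR-1 geometry on a 2-D plane, assuming
// hypothetical class/field names (not from the paper).
class Lar1Zones {

    // EZ radius: Re = u * (t1 - t0), with u the assumed speed of D.
    static double expectedZoneRadius(double u, double t0, double t1) {
        return u * (t1 - t0);
    }

    // RZ: smallest axis-parallel rectangle containing S at (xs, ys)
    // and the EZ circle of radius re centred at D's last known
    // location (xd, yd). Returned as {minX, minY, maxX, maxY}.
    static double[] requestZone(double xs, double ys,
                                double xd, double yd, double re) {
        double minX = Math.min(xs, xd - re), maxX = Math.max(xs, xd + re);
        double minY = Math.min(ys, yd - re), maxY = Math.max(ys, yd + re);
        return new double[] { minX, minY, maxX, maxY };
    }

    // A node forwards a RREQ only if it lies inside the RZ rectangle.
    static boolean insideRequestZone(double[] rz, double x, double y) {
        return x >= rz[0] && x <= rz[2] && y >= rz[1] && y <= rz[3];
    }
}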
The Probabilistic Broadcast Algorithm
In this algorithm, when receiving a RREQ, a node retransmits it with a certain probability p_t and discards it with probability (1 - p_t). A node is allowed to retransmit a given RREQ only once; i.e., if a node receives a RREQ, it checks whether it has retransmitted it before: if so, it simply discards it; otherwise it performs its probabilistic retransmission check. Nodes usually identify the RREQ through its sequence number. The source node's p_t is always set to 1, to enable the source to initiate a new RREQ, while the p_t's of the intermediate nodes (all nodes except the source and the destination) are determined using a static or a dynamic approach. In the static approach, a pre-determined p_t (0 <= p_t <= 1) is set for each node on the network, while in the dynamic approach each node on the network locally calculates p_t using a certain probability distribution function of one or more independent variables.
The distribution function discussed in Al-Bahadili (2010a) is used in this paper, because it demonstrates an excellent performance in comparison with other distribution functions under various network conditions. It is expressed as:

$$p_t(k) = \begin{cases} p_{\max} & \text{for } k \le N_1 \\ p_1 - \dfrac{k - N_1}{N_2 - N_1}\,(p_1 - p_2) & \text{for } N_1 < k < N_2 \\ p_{\min} & \text{for } k \ge N_2 \end{cases} \quad (1)$$
where p_t(k) is the dynamic retransmission probability; k is the number of first-hop neighbors of the transmitting node; p_min and p_max are the minimum and maximum p_t's that can be assigned to a node; N_1 is the number of neighbors at or below which p_t is equal to p_max; N_2 is the number of neighbors at or above which p_t is equal to p_min; and p_1 and p_2 are the p_t's assigned to intermediate nodes surrounded by k = N_1 + 1 and k = N_2 - 1 nodes, respectively. p_1 and p_2 should lie between p_max and p_min (i.e., p_max >= p_1 and p_2 >= p_min), and also p_1 >= p_2. Figure 2 shows the variation of p_t with k. In general, the selection of a satisfactory distribution in the interval [N_1 + 1, N_2 - 1] and of the values of p_max, p_min, p_1, p_2, N_1, and N_2 depends on a number of factors and needs to be carefully considered for every network condition. Details and an investigation of the above function are presented in Al-Bahadili (2010a).
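As an illustration, Equation (1) can be transcribed almost directly into Java. The sketch below uses illustrative parameter values in the example call, not a calibration taken from the paper:

// Dynamic retransmission probability pt(k) of Equation (1): pMax for
// sparse neighborhoods (k <= N1), pMin for dense ones (k >= N2), and
// a linear interpolation between p1 and p2 for N1 < k < N2.
static double pt(int k, int n1, int n2,
                 double pMax, double pMin, double p1, double p2) {
    if (k <= n1) return pMax;
    if (k >= n2) return pMin;
    return p1 - ((double) (k - n1) / (n2 - n1)) * (p1 - p2);
}

For example, pt(9, 4, 14, 1.0, 0.5, 0.9, 0.6) returns 0.75, halfway between p1 = 0.9 and p2 = 0.6, since k = 9 lies halfway between N1 = 4 and N2 = 14.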
The LAR-1P Algorithm
The description of the LAR-1P algorithm is straightforward: when receiving a broadcast message, a node within the RZ rebroadcasts the message with a dynamically calculated p_t, and each node is allowed to rebroadcast the received message only once. This contributes a certain reduction in the number of retransmissions and consequently in the node average duplicate reception, and thereafter in the number of collisions and contentions at the receiver. The reduction in the number of retransmissions will of course depend on the assigned p_t, and the algorithm reduces to LAR-1 if the nodes' p_t is set to unity.
It is important to recognize that reducing the number of retransmissions may introduce some reduction in network reachability: if the number of intermediate nodes within the RZ is small (a low-density zone) and some of them do not rebroadcast the RREQ, the RREQ may fail to reach the destination, so that the source has to reinitiate a new one. However, if the number of intermediate nodes within the RZ is large (a high-density zone), the reduction in reachability is expected to be insignificant. Figure 3 outlines the calculation procedure of the LAR-1P algorithm.
In dynamic-probability LAR-1P, there are two values of k that can be considered when calculating p_t. The first is the total number of first-hop neighbors, regardless of whether they are inside or outside the RZ (as in the dynamic probabilistic algorithm), while the second counts only the nodes inside the RZ, which is usually equal to or smaller than the first value. Normally, every node on the network updates the lists (records) of its first-hop neighbors at the start of each time interval, which usually lasts for what is referred to as the pause time (τ).
Figure 2. Retransmission probability (p_t) as a function of the number of first-hop neighbors (k)
Figure 3. The LAR-1P algorithm
It can be seen from the above description of LAR-1P that in high-density zones the nodes will be given low p_t's, inflicting a reduction in the number of retransmissions that depends on the assigned p_t's, which means the probabilistic algorithm is effectively in control. In low-density zones, the nodes will be allowed to retransmit the RREQ with a high p_t; if it is unity or close to unity, then LAR-1P acts as LAR-1.
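Combining the RZ membership test of LAR-1 with the dynamic probability above, the LAR-1P forwarding decision can be sketched as follows (hypothetical names; pt is the function sketched in the previous section, and kInsideRz counts only the first-hop neighbors positioned inside the RZ):

import java.util.Random;

// LAR-1P forwarding check: a node rebroadcasts a given RREQ at most
// once, only if it lies inside the RZ, and then only with probability
// pt computed from the neighbors inside the RZ.
static boolean shouldForward(boolean alreadyForwarded, boolean insideRz,
                             int kInsideRz, Random rng) {
    if (alreadyForwarded || !insideRz) return false;      // LAR-1 behaviour
    double p = pt(kInsideRz, 4, 14, 1.0, 0.5, 0.9, 0.6);  // illustrative values
    return rng.nextDouble() < p;  // probabilistic retransmission check
}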
The Computational Procedures
Two computational procedures have been
implemented in MANSim (Al-Bahadili, 2010b).
The first computational procedure computes
the average parameters for a pre-specified S-D
pair. In this procedure, the source initiates Q
RREQs each time interval. The computation is
repeated for a number of time intervals (M); in each time interval, the nodes on the network are allowed to change their positions.
Therefore, the computed parameters should be
averaged over Q and M. The computed average
values represent the average parameters as-
sociated with this particular S-D pair, but they
may not reflect the average behavior of other
S-D pairs. However, it has been found that for
a network that has no probabilistic behavior,
i.e., p
t
=1, Q has no effect on the computed
parameters and Q can be set to 1.
The second computational procedure com-
putes the average network parameters that can
be considered as the average parameters for any
S-D pair. This is calculated by looping over all
nodes on the network as source nodes, and each
source tries to establish a route to all other nodes
on the network as destination nodes. Further-
more, for each S-D pair, the source initiates Q
RREQs as described in the first computation
procedure. The computed parameters for each
source are averaged over Q(N-1), and then the
average values are averaged over (N). In other
words, the computed parameters are averaged
over (QN(N-1)).
In order to consider node mobility, the
above calculations are repeated, in an outer
loop of size M, and the computed parameters
are averaged over M. In this case, the computed
parameters may well represent the average
behavior of any of the nodes on the network.
For non-mobile (fixed) nodes, M has no effect
on the computed parameters and can be set to
1. Figure 4 outlines the second computation
procedure.
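The nesting of the averaging loops in this second procedure can be summarized with the following Java skeleton (a sketch only; runOneDiscovery is a hypothetical stand-in for a single MANSim route-discovery run):

// Averages a metric over M mobility intervals, all N(N-1) S-D pairs,
// and Q RREQs per pair, i.e., over Q*N*(N-1)*M samples in total.
static double averageNetworkMetric(int m, int n, int q) {
    double sum = 0.0;
    for (int interval = 0; interval < m; interval++) {  // nodes move each interval
        for (int s = 0; s < n; s++) {                   // every node as source
            for (int d = 0; d < n; d++) {               // every other node as destination
                if (s == d) continue;
                for (int r = 0; r < q; r++) {           // Q RREQs per S-D pair
                    sum += runOneDiscovery(s, d);
                }
            }
        }
    }
    return sum / ((double) q * n * (n - 1) * m);
}

// Hypothetical placeholder for one simulated route discovery between
// source s and destination d, returning the sampled metric value.
static double runOneDiscovery(int s, int d) { return 0.0; }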
RESULTS AND DISCUSSIONS
The LAR-1P is implemented on MANSim,
which is a network simulator especially de-
veloped for evaluating the performance of
various route discovery algorithms in MANETs
(Al-Bahadili, 2010b). The performance of the
algorithms is evaluated in terms of two param-
eters: the number of retransmissions (RET)
and reachability (RCH).

Figure 4. Second computational procedure in MANSim

RET is defined as the average number of retransmissions normalized
to the total number of nodes on the network
(N). RCH is defined as the average number of
reachable nodes by any node on the network
normalized to N; or the probability by which
a RREQ packet will be successfully delivered
from source to destination (Al-Bahadili, 2012).
In order to investigate the effect of node density (N_d = N/A) on the performance of LAR-1P, a number of simulations were performed using MANSim. The network environment and simulation setup can be described as follows: each simulation starts by generating N nodes randomly distributed across a network area (A) of 500x500 m using the uniform probability distribution function described in Al-Bahadili (2010b). All nodes are assumed to transmit with the same radio transmission range (R) of 100 m and to move with an average velocity (u) of 5 m/sec.
One important parameter that needs to be carefully specified to obtain an adequate network performance is the pause time τ, defined as a period of time during which all nodes on the network are assumed motionless but continue transmitting (Al-Bahadili, 2012). In MANSim, it is calculated as τ = 0.75*R/u. The simulation time (T_sim) is divided into a number of intervals (we call them mobility loops (M)) of period τ. In all simulations, T_sim is 1800 sec. The input data for this scenario are summarized in Table 1.
Furthermore, in order to evaluate and compare the performance of LAR-1P with other route discovery algorithms, we use MANSim to estimate the performance (RET and RCH) of pure flooding, dynamic probabilistic, LAR-1 (LAR-1P with p_t = 1), and LAR-1P using the same network environment and simulation setup. The results obtained are listed in Table 2, and the results for RET and RCH are plotted in Figure 5 and Figure 6.
Before we proceed with the discussion of the results, let us recall three important points. The first is that the main requirement of any route discovery algorithm is to reduce RET without sacrificing RCH. The second is that the results in Table 2 present the average network behavior over the 1800 sec simulation time. Finally, p_t in LAR-1P is calculated considering only the first-hop neighbors positioned inside the RZ.
For each node density, the percentage reduction in RCH is calculated as:

$$R_{RCH} = \frac{RCH_{LAR\text{-}1} - RCH_{LAR\text{-}1P}}{RCH_{LAR\text{-}1}} \times 100 \quad (2)$$

Similarly, the percentage reduction in RET is calculated as:

$$R_{RET} = \frac{RET_{LAR\text{-}1} - RET_{LAR\text{-}1P}}{RET_{LAR\text{-}1}} \times 100 \quad (3)$$
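For example, for N = 100 in Table 2, R_RET = (0.147 - 0.115) / 0.147 x 100 ≈ 21.77% and R_RCH = (0.772 - 0.720) / 0.772 x 100 ≈ 6.74%, matching the values reported in the last column of the table.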
Table 1. Input data

Parameter                   Value
Geometrical model           Random node distribution
Network area (A)            500x500 m
Number of nodes (N)         50, 75, 100, 125, 150 nodes
Transmission radius (R)     100 m
Average node speed (u)      5 m/sec
Simulation time (T_sim)     1800 sec
Pause time (τ)              15 sec
Mobility loop size (M)      120
It can be seen from Table 2 that the performance of LAR-1P varies in the same pattern as the other algorithms. Furthermore, it can easily be deduced that, in comparison with LAR-1, R_RCH fluctuates around a very small percentage with increasing node density, while R_RET increases, which means a greater reduction in the number of collisions and contentions. Table 2 also shows that, for the dynamic probabilistic algorithm, RET decreases as node density increases. This is because, when the number of nodes increases, the node density and consequently the number of first-hop neighbors increase, so the p_t of the transmitting nodes decreases, which inflicts a smaller number of retransmissions.
Table 3 presents the average p_t (p_t,avg) and the average k (k_avg) for the dynamic probabilistic and LAR-1P algorithms. It shows that k_avg increases with increasing node density and that, for each node density, k_avg for the dynamic probabilistic algorithm
Table 2. Comparing the performance of LAR-1P for various node densities

RET
N     N_d (N/A)   Pure Flooding   Dynamic Probabilistic   LAR-1   LAR-1P   R_RET (%)
50    2x10^-4     0.715           0.652                   0.087   0.084    3.45
75    3x10^-4     0.932           0.771                   0.129   0.108    16.28
100   4x10^-4     0.975           0.718                   0.147   0.115    21.77
125   5x10^-4     0.983           0.643                   0.159   0.115    27.67
150   6x10^-4     0.987           0.589                   0.161   0.109    32.30

RCH
N     N_d (N/A)   Pure Flooding   Dynamic Probabilistic   LAR-1   LAR-1P   R_RCH (%)
50    2x10^-4     0.767           0.750                   0.396   0.394    0.51
75    3x10^-4     0.961           0.946                   0.641   0.612    4.52
100   4x10^-4     0.995           0.981                   0.772   0.720    6.74
125   5x10^-4     0.999           0.990                   0.857   0.813    5.13
150   6x10^-4     1.000           0.995                   0.894   0.859    3.91
Figure 5. Variation of RET against N for the various algorithms
is greater than that for LAR-1P. This is because in LAR-1P we only consider the first-hop neighbors positioned inside the RZ. As a result, p_t,avg decreases with increasing node density for both algorithms, and for the same node density, p_t,avg for the dynamic probabilistic algorithm is less than that for LAR-1P.
However, for the sake of comparison, we also implemented in MANSim a procedure to estimate the node retransmission probabilities considering all first-hop neighbors (inside and outside the RZ). The results, for 100 nodes and all other parameters as given in Table 1, are summarized in Table 4. It is obvious that, because we consider all first-hop neighbors inside and outside the request zone, k will be higher (giving a k_avg of 10.405) and consequently p_t will be lower (giving a p_t,avg of 0.765). Since p_t is lower, a lower number of retransmissions of 8.6% is expected and, as a result, a lower reachability of 60.6%, which is nearly 17% less than the reachability attained by LAR-1. Given that, it is recommended to consider only the first-hop neighbors positioned inside the RZ when calculating p_t.
In what follows we analyze the performance of the algorithms for a single node density value of 4x10^-4 nodes/m^2. In this case, pure flooding provides the highest RCH of 99.5% (almost 100%), with 97.5% of the nodes on the network engaged in RREQ packet retransmission. The dynamic probabilistic algorithm achieves a very high RCH of 98.1% but only reduces RET by ~28%, to around 72%.
Figure 6. Variation of RCH against N for the various algorithms
Table 3. Comparing p_t,avg and k_avg for dynamic probabilistic and LAR-1P for various node densities

                  p_t,avg                        k_avg
N     N_d (N/A)   Dynamic Prob.   LAR-1P         Dynamic Prob.   LAR-1P
50    2x10^-4     0.935           0.981          4.913           3.418
75    3x10^-4     0.844           0.946          7.262           4.945
100   4x10^-4     0.749           0.904          9.562           6.247
125   5x10^-4     0.661           0.848          12.092          7.865
150   6x10^-4     0.601           0.796          14.373          9.401
LAR-1 reduces RET to a very low level of 14.7%, at the cost of reducing RCH to 77.2%. In other words, LAR-1 introduces a significant reduction of ~85% in RET but at the same time reduces RCH by ~22%. These results represent an average behavior over a uniform distribution of zone densities; but if most route discovery processes involve searching for the route in high-density zones, then RET will definitely be higher than 14.7%. For example, for RZs that confine all nodes, RCH and RET will be the same as or close to those of pure flooding. Furthermore, in high-density zones the delay is much higher than in low-density zones.
It can be seen in Table 2 that, for LAR-1P, RCH is 5.2% less than for LAR-1 and the associated RET is also less, by 3.2%. However, when we look at the zone scale, for high-density zones LAR-1P attains the performance of the dynamic probabilistic algorithm (whose RCH is 98.1%). Thus, if most route discovery processes involve high-density zones, the reduction in RET is always equal to or close to 30%. On the other hand, for low-density zones, the estimated p_t's are always close to 1, i.e., inflicting pure flooding within the RZ, which means LAR-1 is the acting algorithm.
In order to demonstrate the advantage of LAR-1P more closely, let us consider the following example. Assume node 1 is the source and all other nodes are examined as destinations, one at a time. For each S-D pair we count the number of nodes within the RZ (N_z), including the source and the destination. Then, we count the number of cases in which N_z lies within the ranges 2-10, 11-20, 21-30, etc. For each range, we estimate RCH and RET for both the LAR-1 and LAR-1P algorithms; the results obtained are summarized in Table 5.
Now, let us explain the results in Table 5. Z in the second column gives the number of RZs that contain a number of nodes within the range indicated in the first column. For example, the number of RZs containing between 2 and 10 nodes is 40, and 27 RZs contain between 11 and 20 nodes.

For LAR-1, consider the case of the low-density zone (2-10): in 32 out of the 40 RZs, the source and the destination were able to establish a connection, and in 8 of them they were not, giving an RCH of 80%. Due to the small number of nodes within the RZs, RET was only 4.8%.
Table 4. Comparing the performance of LAR-1P considering different approaches for calculating first-hop neighbors

Algorithm                                                 RET     RCH     p_t,avg   k_avg
LAR-1P (k = first-hop neighbors inside RZ only)           0.115   0.720   0.904     6.247
LAR-1P (k = first-hop neighbors inside and outside RZ)    0.086   0.606   0.765     10.405
Table 5. Comparison between LAR-1 and LAR-1P (Z is the number of RZs in the given N_z range; Z_RCH is the number of those RZs in which a connection was established)

              LAR-1                       LAR-1P
N_z     Z     RET     Z_RCH   RCH         RET     Z_RCH   RCH
2-10    40    0.048   32      0.800       0.043   31      0.775
11-20   27    0.127   22      0.815       0.113   21      0.778
21-30   11    0.222   10      0.909       0.186   10      0.909
31-40   11    0.328   11      1.000       0.263   11      1.000
41-50   10    0.431   10      1.000       0.310   10      1.000
LAR-1P provides slightly lower RET and slightly lower RCH. For the high-density zone (41-50), LAR-1P accomplishes the same RCH of 10 out of 10 with a significant reduction in RET, from 43.1% for LAR-1 to 31%, and that is what LAR-1P is all about: in high-density zones, it reduces RET significantly while almost maintaining the same RCH.
CONCLUSION
This paper investigated the effect of node density on the performance of the new LAR-1P algorithm. The simulation results demonstrated that the performance of LAR-1P improves with increasing node density, i.e., a greater reduction in the number of retransmissions can be achieved at the cost of an insignificant reduction in network reachability. Moreover, the LAR-1P algorithm outperformed LAR-1 in high-density zones, while almost matching the performance of LAR-1 in low-density zones.
REFERENCES
Abdulai, J., Ould-Khaoua, M., & Mackenzie, L. (2007). Improving probabilistic route discovery in mobile ad hoc networks. In Proceedings of the 32nd IEEE Conference on Local Computer Networks, Dublin, Ireland (pp. 739-746).

Abdulai, J., Ould-Khaoua, M., Mackenzie, L., & Bani-Yassin, M. (2006). On the forwarding probability for on-demand probabilistic route discovery in MANETs. In Proceedings of the 22nd Annual UK Performance Engineering Workshop, Poole, Dorset, UK (pp. 9-15).

Al-Bahadili, H. (2010a). Enhancing the performance of adjusted probabilistic broadcast in MANETs. The Mediterranean Journal of Computers and Networks, 6(4), 138-144.

Al-Bahadili, H. (2010b). On the use of discrete-event simulation in computer networks analysis and design. In Abu-Taieh, E., & El-Sheikh, A. (Eds.), Handbook of research on discrete-event simulation environments: Technologies and applications (pp. 414-442). Hershey, PA: Information Science Reference. doi:10.4018/978-1-60566-774-4.ch019

Al-Bahadili, H. (2012). A new route discovery algorithm in MANETs combining location-aided routing and probabilistic algorithms. Submitted for publication to The Mediterranean Journal of Computers and Networks.
Al-Bahadili, H., Al-Basheer, O., & Al-Thaher, A. (2007). A location aided routing-probabilistic algorithm for flooding optimization in MANETs. In Proceedings of the Mosharaka International Conference on Communications, Networking and Information Technology, Amman, Jordan.

Al-Bahadili, H., & Kaabneh, K. (2010). Analyzing the performance of probabilistic algorithm in noisy MANETs. International Journal of Wireless & Mobile Networks, 2(3), 83-95. doi:10.5121/ijwmn.2010.2306

Bani-Yassein, M., Bani-Khalaf, M., & Al-Dubai, A. Y. (2010). A new probabilistic broadcasting scheme for mobile ad hoc on-demand distance vector (AODV) routed networks. The Journal of Supercomputing, 53(1), 196-211. doi:10.1007/s11227-010-0408-0

Bani-Yassein, M., & Ould-Khaoua, M. (2007). Applications of probabilistic flooding in MANETs. International Journal of Ubiquitous Computing and Communication, 1(1), 1-5.

Bani-Yassein, M., Ould-Khaoua, M., Mackenzie, L. M., & Papanastasiou, S. (2006). Performance analysis of adjusted probabilistic broadcasting in mobile ad hoc networks. International Journal of Wireless Information Networks, 13(2), 127-140. doi:10.1007/s10776-006-0027-0

Barrett, C., Eidenbenz, S., Kroc, L., Marathe, M., & Smith, J. (2005). Parametric probabilistic routing in sensor networks. Journal of Mobile Networks and Applications, 10(4), 529-544. doi:10.1007/s11036-005-1565-x

Boleng, J., & Camp, T. (2004). Adaptive location-aided mobile ad hoc network routing. In Proceedings of the 23rd IEEE International Performance, Computing, and Communications Conference, Phoenix, AZ (pp. 423-432).

Haas, Z. J., Halpern, J. Y., & Li, L. (2006). Gossip-based ad hoc routing. Transactions on Networking, 14(3), 479-491. doi:10.1109/TNET.2006.876186

Hanash, A., Siddique, A., Awan, I., & Woodward, M. (2009). Performance evaluation of dynamic probabilistic broadcasting for flooding in mobile ad hoc networks. Journal of Simulation Modeling Practice and Theory, 17(2), 364-375. doi:10.1016/j.simpat.2008.09.012
Kim, J. S., Zhang, Q., & Agrawal, D. P. (2004). Probabilistic broadcasting based on coverage area and neighbor confirmation in mobile ad hoc networks. In Proceedings of the IEEE Global Telecommunications Conference Workshops, Dallas, TX (pp. 96-101).

Ko, Y. B., & Vaidya, N. H. (2000). Location-aided routing (LAR) in mobile ad hoc networks. Wireless Networks, 6(4), 307-321. doi:10.1023/A:1019106118419

Kyasanur, P., Choudhury, R. R., & Gupta, I. (2006). Smart gossip: An adaptive gossip-based broadcasting service for sensor networks. In Proceedings of the International Conference on Mobile Ad Hoc and Sensor Systems, Vancouver, BC, Canada (pp. 91-100).

Land, D. (2008). Routing protocols for mobile ad hoc networks: Classification, evaluation, and challenges. Berlin, Germany: VDM Verlag.

Li, J., Jannotti, J., Couto, D., Karger, D., & Morris, R. (2000). A scalable location service for geographic ad hoc routing. In Proceedings of the 6th International Conference on Mobile Computing and Networking (pp. 120-130).

Sasson, Y., Cavin, D., & Schiper, A. (2003). Probabilistic broadcast for flooding in wireless mobile ad hoc networks. In Proceedings of the IEEE Wireless Communications and Networking Conference, New Orleans, LA (Vol. 2, pp. 1124-1130).

Scott, D., & Yasinsac, A. (2004). Dynamic probabilistic retransmission in ad hoc networks. In Proceedings of the International Conference on Wireless Networks, Las Vegas, NV (Vol. 1, pp. 158-164).

Tseng, S., Ni, S., Chen, Y., & Sheu, J. (2002). The broadcast storm problem in a mobile ad hoc network. Journal of Wireless Networks, 8(2), 153-167. doi:10.1023/A:1013763825347

Viswanath, K., & Obraczka, K. (2005). Modeling the performance of flooding in wireless multi-hop ad hoc networks. Journal of Computer Communications, 29(8), 949-956. doi:10.1016/j.comcom.2005.06.015

Vyas, N., & Mahgoub, I. (2003). Location and mobility pattern based routing algorithm for mobile ad hoc wireless networks. In Proceedings of the International Symposium on Performance Evaluation of Computer and Telecommunication Systems.

Zhang, Q., & Agrawal, D. P. (2005). Dynamic probabilistic broadcasting in MANETs. Journal of Parallel and Distributed Computing, 65(2), 220-233. doi:10.1016/j.jpdc.2004.09.006
Using Permutations to Enhance the Gain of RUQB Technique

Abdulla M. Abu-ayyash, Central Bank of Jordan, Jordan
Naim Ajlouni, Al-Balqa University, Jordan

ABSTRACT

Quantum key distribution (QKD) techniques usually suffer from a gain problem when comparing the final key to the generated pulses of quantum states. This research permutes the sets that RUQB (Abu-ayyash & Ajlouni, 2008) uses in order to increase the gain. The effects of both randomness and permutations are studied. While the RUQB technique improves the gain of BB84 QKD by 5.5%, it was also shown that the higher the randomness of the initial key, the higher the gain that can be achieved; this work concluded that the use of around 7 permutations results in 30% gain recovery in ideal situations.

Keywords: Cryptography, Key Distribution, Pseudo-Random Bit Generator (PRBG), Quantum Cryptography, Quantum Key Distribution

DOI: 10.4018/jitwe.2012040103
INTRODUCTION
Preserving confidentiality during communications has always been considered a hard task; encryption is one solution to this problem. The simplest encryption method, yet one proved secure (Shannon, 1949), is the one-time pad (Vernam, 1926), which uses symmetric keys between the communicating parties. The two main problems of the one-time pad are (i) the need to always generate new keys, and (ii) the need to securely distribute such keys between the communicating parties; while the first problem can be solved using any real random number generator, the second is harder to solve and is known as the key distribution (KD) problem.
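As an aside, the one-time pad itself is a single XOR pass, which the following minimal Java sketch (an illustration, not from the paper) shows; decryption is the same operation because XOR is its own inverse:

// One-time pad: out = data XOR key; applying the same call again with
// the same key restores the data. The key must be truly random, at
// least as long as the data, kept secret, and never reused.
static byte[] oneTimePad(byte[] data, byte[] key) {
    byte[] out = new byte[data.length];
    for (int i = 0; i < data.length; i++) {
        out[i] = (byte) (data[i] ^ key[i]);
    }
    return out;
}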
Diffie and Hellman (1976) were the first to solve the KD problem, utilizing a mathematical problem known as the discrete logarithm (DL) problem (Menezes, Oorschot, & Vanstone, 1997). Based on the DL problem, and utilizing another mathematical problem known as the factorization problem (FP), Rivest, Shamir, and Adleman (1978) introduced the asymmetric encryption technique RSA, which uses two correlated keys; multiple methods were introduced to generate such keys, see FIM (Abu-Ayyash & Jabbar, 2003).
Another, more recent, solution for key distribution was achieved by utilizing a well-known scientific problem related to quantum physics known as the uncertainty principle (Price, Chissick, & Heisenberg, 1977), whereby two correlated properties of a quantum particle cannot both be measured with high precision at the same time. Wiesner (1983) was the first to suggest using it, followed by Bennett and Brassard (1984). Since then, lots
of quantum key distribution (QKD) protocols have been proposed (Nung & Kuo, 2002; Bennett, 1992; Ekert, 1991; Kak, 2006; Kanamori, Yoo, & Al-Shurman, 2005; Bostrom & Felbinger, 2002; Lucamarini & Mancini, 2004; Wang, Koh, & Han, 1997; Barrett, Hardy, & Adrian, 2005). Some well-known protocols, as well as their implementations, suffer from large losses when comparing the size of the final key to the number of quantum states (particles) used. The loss is due to the protocol implementation steps, in addition to the characteristics and implementations of the physical devices and channels used (Abu-ayyash & Ajlouni, 2008; Bennett & Brassard, 1992).
Researchers have already tried to solve this problem along multiple dimensions: first, by enhancing the physical devices, channels, parameters, and implementations (Chou, Polyakov, Kuzmich, & Kimble, 2004; Santori et al., 2004; Tisa, Tosi, & Zappa, 2007); second, by increasing the information content of the quantum particle states used (Groblacher, Jennewein, Vaziri, Weihs, & Zeilinger, 2005; Kuang & Zhoul, 2004); third, by using other quantum phenomena such as EPR (Einstein, Podolsky, & Rosen, 1935; Ekert, 1991; Kuang & Zhoul, 2004); and fourth, by changing or enhancing the way the protocol works (Abu-ayyash & Ajlouni, 2008; Nung & Kuo, 2002; Kak, 2006; Kanamori, Yoo, & Al-Shurman, 2005; Barrett, Hardy, & Adrian, 2005).
For example, Ching and Chen (Nung & Kuo, 2002) enhanced the gain of the Bennett protocol B92 (Bennett, 1992) by adding a stage back from Bob to Alice, in which Bob sends back new qubits, using the same bases he used initially, at the positions where he failed to measure a qubit sent by Alice; this increases the key size by around 3.6% at the expense of more qubits. As another example, RUQB uses a different technique for improving the gain, based on discovering the relationships among the original random bits that were used during the protocol (improving the gain of BB84 (Bennett & Brassard, 1984) by 5.5%). In this research, we investigate permuting the RUQB sets to increase the gain, and we study the effect of this permutation on the security of the method.
First, the QKD idea is presented along with RUQB; then we describe the studied P-RUQB method; afterwards we discuss the gain analysis, followed by a discussion of the security aspects of P-RUQB; finally, we conclude with the results.
QUANTUM KEY
DISTRIBUTION AND RUQB
The basic elements of quantum key distribution will be illustrated using the original four-state QKD protocol developed by Bennett and Brassard in 1984, known as the BB84 protocol. Assume that individual photons, precisely the polarization states of photons, serve as the quantum bits for the protocol. The protocol
a sequence of photons to the other party. The
parties publicly agree to make use of the two
distinct polarization bases which are chosen to
be maximally non-orthogonal. In a completely
random order, a sequence of photons are pre-
pared in states of definite polarization in one
or other of the two chosen bases and transmit-
ted by one of the parties to the other through
a channel that preserves the polarization. The
photons are measured by the receiver in one or
the other of the agreed upon bases, again chosen
in a completely random order. The choices of
bases made by the transmitter and receiver thus
comprise two independent random sequences.
Since they are independent random sequences of
binary numbers, about half of the basis choices
will be the same and are called the compatible
bases, and the other half will be different and
are called the incompatible bases. The two
parties compare publicly, making use for this
purpose of a classical communication channel,
the two independent random sets of polariza-
tion bases that were used, without revealing the
polarization states that were observed.
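The sifting step can be illustrated classically: when both parties draw their bases uniformly at random, about half of the positions survive. The Java sketch below (an illustration only, with no quantum channel modeled) simulates the basis agreement:

import java.security.SecureRandom;
import java.util.ArrayList;
import java.util.List;

// Classical illustration of BB84 sifting: returns the positions where
// Alice's and Bob's independently chosen bases coincide, which is
// about n/2 of the positions on average.
static List<Integer> siftedPositions(int n) {
    SecureRandom rng = new SecureRandom();
    List<Integer> agreed = new ArrayList<>();
    for (int i = 0; i < n; i++) {
        boolean aliceDiagonal = rng.nextBoolean();  // rectilinear vs. diagonal
        boolean bobDiagonal = rng.nextBoolean();
        if (aliceDiagonal == bobDiagonal) {
            agreed.add(i);  // compatible bases: bit survives sifting
        }
    }
    return agreed;
}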
Cryptographic protocols, in the absence of a real random bit generator (RBG), use a pseudo-random bit generator (PRBG). For that, the PRBG used for cryptography is required to pass what is known as the next-bit test, in which it must be unlikely to predict the next bit the PRBG generates given the previously generated sequence; a PRBG that passes this test is classified as a cryptographically secure pseudo-random bit generator (CSPRBG) (Menezes, Oorschot, & Vanstone, 1997). For that reason, and for maximum security, QKD protocols use RBGs based on physical phenomena such as radiation.
RUQB is based on the idea that, in spite of the bits being generated at random, some relationships exist between the generated random bits, and those relationships are approximately different for each bit. The RUQB technique seeks to find such a relationship based on dividing the original random sequence of bits into subsets; P-RUQB further seeks to find other relationships by permuting the subsets.
PERMUTED RUQB (P-RUQB)
Most QKD algorithms start by using random (not pseudo-random) numbers generated by some physical phenomenon, and many of these random numbers are consumed in testing for the presence of an eavesdropper, which reduces the gain of the protocol. P-RUQB is based on the idea of permuting the subsets that RUQB uses for discovering the relationships among the random bits; the actual recovery process concentrates on recovering such random numbers by making use of the agreed bits from the random number list. The following steps specify P-RUQB:
1. Alice generates a list of $n$ random bits, $A = \{a_0, a_1, \ldots, a_{n-1}\}$, where $n = 2^{l-1}$, $l \in \mathbb{Z}$.
2. Alice encodes the bits into quantum states to obtain a list $Q = \{|q_0\rangle, |q_1\rangle, \ldots, |q_{n-1}\rangle\}$ and sends the quantum states to Bob.
3. Bob measures each quantum state received based on any QKD method to obtain the list $B = \{b_0, b_1, \ldots, b_{n-1}\}$.
4. Alice and Bob start a sifting process based on the same QKD method, resulting in a list $I$ of agreed-on bit locations.
5. Alice and Bob directly compare $\frac{|I|}{2}$ bits to estimate the number of errors $e_T$ and its percentage, given by $\varepsilon = \frac{2 e_T}{|I|}$. If the percentage exceeds an agreed-on threshold $t$, they abort the protocol and start again; if not, they start an error-correction process that eliminates the list $E$ of erroneous bits to obtain the initial key strings $K_A = \{a_i \mid a_i \in F \setminus E,\ i \in I_A\}$ and $K_B = \{b_i \mid b_i \in F \setminus E,\ i \in I_B\}$; note that $I_A = I_B$ and $|I_A| = |I_B| < n$.
6. P-RUQB: Alice constructs multiple lists $L_p$ and sends them to Bob as follows:
a. Let $l$ denote the number of bits needed to represent any location of the originally (Alice-) generated random bits, i.e., $l = \lfloor \log_2 n \rfloor + 1$.
b. Let $P$ denote the set of permutations that Alice and Bob agreed on, such that $p \in P$, $0 \le p \le l!$.
c. For each permutation number $p$, Alice does the following:
i. Construct multiple lists of bits of size $l$ from the initial key bits, $S_{ip}^{A} = \{c_j \mid c_j \in K_A,\ i\,l \le j < (i+1)\,l\}$, $0 \le i \le \frac{|K_A|}{l}$.
ii. Let $\sigma_{ip}^{A}$ denote the permutation of the set $S_{ip}^{A}$ using permutation number $p$.
iii. Construct a temporary set of results $T_{p}^{A} = \{(r_i, l_i, i)\}$, $0 \le i \le \frac{|I_A|}{l}$, where $r_i$ is the result of XORing the elements of the set $\sigma_{ip}^{A}$ and $l_i$ is the binary location calculated from the set $\sigma_{ip}^{A}$, as shown in equations (1) and (2) respectively:

$r_i = \bigoplus_{j=0}^{l-1} c_j, \quad c_j \in \sigma_{ip}^{A}$ (1)

$l_i = \left(\sum_{j=0}^{l-1} c_j\, 2^{j}\right) \bmod n, \quad c_j \in \sigma_{ip}^{A}$ (2)
iv. Construct an indexing list $L_p$ for each permutation $p$:

$L_p = \{\, i \mid (r_i, l_i, i) \in T_p^{A},\ l_i \notin F \setminus E,\ a_{l_i} = r_i,\ l_i \notin L_w\ \forall\, w < p \,\}$

noting that the location $l_i$ must not fall within any previous list.
d. Alice's recovered key is

$\mathcal{K}_A = \{\, k_{l_i} \mid k_{l_i} = a_{l_i} = r_i,\ (r_i, l_i, i) \in T_p^{A},\ i \in L_p,\ 0 \le p \le l! \,\}$ (3)
7. Once Bob receives all lists $L_p$ he calculates the following:
a. $l = \lfloor \log_2 n \rfloor + 1$.
b. For each permutation number $p$ used, $0 \le p \le l!$, do the following:
i. $S_{ip}^{B} = \{c_j \mid c_j \in K_B,\ i\,l \le j < (i+1)\,l\}$, $i \in L_p$.
ii. $\sigma_{ip}^{B}$ is a permutation of the set $S_{ip}^{B}$ using permutation number $p$, and $T_{p}^{B} = \{(r_i, l_i, i)\}$, $i \in L_p$, where

$r_i = \bigoplus_{j=0}^{l-1} c_j, \quad c_j \in \sigma_{ip}^{B}$ (4)

$l_i = \left(\sum_{j=0}^{l-1} c_j\, 2^{j}\right) \bmod n, \quad c_j \in \sigma_{ip}^{B}$ (5)
c. Bob's recovered key is

$\mathcal{K}_B = \{\, k_{l_i} \mid k_{l_i} = r_i,\ (r_i, l_i, i) \in T_p^{B},\ i \in L_p,\ 0 \le p \le l! \,\}$ (6)
Based on points (a), (b), and (c) of step (6) above, the complexity of this algorithm is governed by three nested loops: the number of permutations used $p$, the size of each set $l = \lfloor \log_2 n \rfloor + 1$, and the number of key bits used, $|K_A| = \frac{|F|}{2}$; i.e., the complexity is $O\!\left(p\,\frac{|F|}{2}\,\log n\right)$.
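A minimal Python sketch of the per-set computation in step 6(c), i.e., equations (1) and (2) (the function name and list handling are illustrative assumptions, and the bookkeeping that discards already-used locations is omitted):

```python
from functools import reduce

def p_ruqb_sets(key_bits, n, perm):
    """Split the error-corrected key into chunks of l bits, permute each
    chunk with the agreed permutation, then derive the XOR result r_i
    (eq. (1)) and the candidate location l_i (eq. (2))."""
    l = n.bit_length()          # equals floor(log2 n) + 1
    assert len(perm) == l
    triples = []
    for i in range(len(key_bits) // l):
        chunk = key_bits[i * l:(i + 1) * l]
        sigma = [chunk[p] for p in perm]                    # permuted set
        r_i = reduce(lambda x, y: x ^ y, sigma)             # eq. (1)
        l_i = sum(c << j for j, c in enumerate(sigma)) % n  # eq. (2)
        triples.append((r_i, l_i, i))
    return triples

# Example: identity permutation over 13-bit chunks for n = 2**12.
import secrets
key = [secrets.randbelow(2) for _ in range(2048)]
print(p_ruqb_sets(key, 2**12, tuple(range(13)))[:3])
```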
Gain Analysis
Calculating the gain of P-RUQB requires a deeper analysis of the gain of RUQB. The new analysis is based on finding the probability of obtaining a single bit of gain after each iteration, and then extending the analysis to P-RUQB.
RUQB Analysis
The analysis of the gain $G$ done by Abu-ayyash and Ajlouni (2008) for RUQB showed that it can be calculated as

$G = \frac{|F| - |E|}{2\,l}\, P_{clu}\, P_{cv}\, \left(1 - P_{l_r}\right)$ (7)
where $|F|$ is the size of the sifted key (for BB84 it is approximately $\frac{n}{2}$ and for B92 approximately $\frac{n}{4}$); $|E|$ is the expected number of bits to be eliminated if there are errors: in the best case the algorithm eliminates only the errors, but in general this depends on the error-correction algorithm used; Shannon estimated the minimum number of bits that must be communicated to eliminate errors, for a per-bit error probability $\varepsilon$, as $n\,h(\varepsilon)$, and some algorithms need more, $y\,n\,h(\varepsilon)$ with $y \ge 1$ (see Gilbert & Hamrick, 2000); $l = \lfloor \log_2 n \rfloor + 1$; $P_{clu}$ is the probability that the location calculated by RUQB is not used within the initial key (approximately $\frac{1}{2}$ for BB84 and $\frac{3}{4}$ for B92); $P_{cv}$ is the probability of finding the correct value at an unused location, approximately $\frac{1}{2}$; and $P_{l_r}$ is the probability of redundant locations, which depends on the random number generator (RNG).
RUQB starts with no gain (the original QKD key length); after the first iteration, the gain either increases by 1 or stays the same, with probabilities $p_{01}$ and $p_{00}$ respectively. These probabilities are based on the random result of interpreting the sets used, with respect to the ratio of the original key length to the total number of qubits used. From Figure 1, a more comprehensive analysis gives the probability of having gain $x$ after $m$ iterations, where $P_m(x)$ is the sum, over all paths from 0 (the topmost node) to node $x$ at level (iteration) $m$, of the products of the transition probabilities:
$P_m(x) = \sum_{\substack{y_0, \ldots, y_x \ge 0 \\ y_0 + \cdots + y_x = m - x}} \left(\prod_{i=0}^{x-1} p_{i(i+1)}\right)\left(\prod_{k=0}^{x} p_{kk}^{\,y_k}\right)$ (8)
where $p_{i(i+1)}$ is the probability of having a gain value of $(i+1)$ after a single iteration when the previous gain was $i$, and $p_{kk}$ is the probability of no change in gain after a single iteration.
A simplified recursive version of equation (8) can be defined as:

$P_m(x) = P_{m-1}(x)\,p_{xx} + P_{m-1}(x-1)\,p_{(x-1)x}$ (9)
where

$p_{i(i+1)} = 1 - p_{ii}$ (10)

and

$p_{i(i+1)} = p_{01} - iD$ (11)
Figure 1. Probability diagram for the gain values $0, 1, \ldots, m$ after iterating $m$ times, where $p_{ij}$ is the probability of having gain $j$ if the previous gain was $i$
where $D$ is the decrease in probability due to an increase of the gain by 1, typically equal to $\frac{1}{n}$.
For example, the probability of having no gain (0) after $m$ iterations is

$P_m(0) = \underbrace{(p_{00})(p_{00})\cdots(p_{00})}_{m\ \text{times}} = (p_{00})^m = (1 - p_{01})^m$

Hence, the probability of having no gain after one iteration is $P_1(0) = p_{00} = 1 - p_{01}$.
And the probability of having the maximum gain $m$ after $m$ iterations is

$P_m(m) = (p_{01})(p_{12})\cdots(p_{(m-1)m}) = \prod_{i=0}^{m-1} p_{i(i+1)}$

Hence, the probability of having a gain of one after one iteration is $P_1(1) = p_{01}$.
The expected gain after $m$ iterations is

$E = \sum_{i=0}^{m} i\,P_m(i)$ (12)

$= p_{01} \sum_{j=0}^{m-1} \left(1 - \frac{1}{2n}\right)^{j}$ (13)

$= 2\,n\,p_{01} \left(1 - \left(1 - \frac{1}{2n}\right)^{m}\right)$ (14)
Note that the expected gain depends on three factors: $n$, the total number of qubits used by the original QKD protocol; $p_{01}$, the probability of gaining one bit after one iteration when no gain was previously found (which is also QKD-protocol dependent); and $m$, the number of iterations used by the RUQB protocol.
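The closed form (14) can be cross-checked numerically against the recursion (9)-(11); the sketch below assumes $D = \frac{1}{2n}$, the value implied by the geometric sum in equations (13)-(14):

```python
def expected_gain(n, p01, m):
    """Evolve the gain distribution with the recursion (9)-(11),
    using D = 1/(2n), the value implied by the closed form (13)-(14)."""
    P = [1.0]                                   # P_0(0) = 1
    for _ in range(m):
        nxt = [0.0] * (len(P) + 1)
        for x, prob in enumerate(P):
            up = p01 - x / (2 * n)              # eq. (11)
            nxt[x]     += prob * (1 - up)       # eq. (10): no change
            nxt[x + 1] += prob * up             # gain one bit
        P = nxt
    return sum(i * p for i, p in enumerate(P))  # eq. (12)

n, p01, m = 4096, 0.25, 1024
closed = 2 * n * p01 * (1 - (1 - 1 / (2 * n)) ** m)   # eq. (14)
print(expected_gain(n, p01, m), closed)                # the two agree
```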
Note that $m$ depends on $n$, the total number of original random bits; on the size of the final sifted key, $\frac{|F|}{2}$, after eliminating further bits due to errors found within the key, $\frac{|F|\,h(\varepsilon)}{2}$; on the list size $l = \lfloor \log_2 n \rfloor + 1$; and on $I_e$, the information gained by the eavesdropper. Summing all of these divided by $n$ gives the function

$f(\varepsilon) = \frac{|F|}{2n} - \frac{|F|\,h(\varepsilon)}{2n} - \frac{\lfloor \log_2 n \rfloor + 1}{n} - \frac{I_e}{n}$ (15)

$= \frac{|F|}{2n}\left[1 - h(\varepsilon)\right] - \frac{\lfloor \log_2 n \rfloor + 1}{n} - \frac{I_e}{n}$ (16)
where $h(\varepsilon)$ is the Shannon limit and $I_e$ is the information gained by an eavesdropper. Replacing the maximum number of iterations $m$ by $n\,f(\varepsilon)$, the expected gain is

$E = 2\,n\,p_{01} \left(1 - \left(1 - \frac{1}{2n}\right)^{n f(\varepsilon)}\right)$ (17)

$\approx 2\,n\,p_{01} \left(1 - e^{-\frac{f(\varepsilon)}{2}}\right)$ (18)
For BB84 with RUQB, $p_{01} = \frac{1}{4}$ and

$f(\varepsilon) \approx \frac{1}{4} - \frac{h(\varepsilon)}{4} - \frac{\lfloor \log_2 n \rfloor + 1}{n}$

where $h(\varepsilon)$ is the Shannon limit. In general, for no errors and large values of $n$, equation (18) yields

$E_{RUQB} = \frac{n}{2}\left(1 - e^{-\frac{1}{8}}\right)$ (19)

which means that the expected gain for BB84 using RUQB, for large values of $n$, is

$E_{RUQB} \approx 0.0588\,n$ (20)
In Table 1 of Abu-ayyash and Ajlouni (2008) no errors were assumed and $n$ was not large; therefore, using equation (17) and solving for $b$ (the gain percentage):

$E_{BB84} = n\,b = \frac{n}{2}\left(1 - \left(1 - \frac{1}{2n}\right)^{\,n\left(\frac{1}{4} - \frac{\log_2 n + 1}{n}\right)}\right)$ (21)

$\approx \frac{n}{2}\left(1 - e^{-\frac{1}{8} + \frac{\log_2 n + 1}{2n}}\right)$ (22)

$b \approx \frac{1}{2}\left(1 - e^{-\frac{1}{8} + \frac{\log_2 n + 1}{2n}}\right)$ (23)
$b \approx 0.0564$ (24)

This result closely matches the results obtained by Abu-ayyash and Ajlouni (2008).
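A quick numeric check of equations (19)-(24) (illustrative only, taking the floor of $\log_2 n$ as in the definition of $l$):

```python
import math

# Large-n, no-error limit for BB84 + RUQB, eqs. (19)-(20):
print((1 - math.exp(-1 / 8)) / 2)         # ~0.0588

# Finite n = 2**12 with the log-term correction, eqs. (21)-(24):
n = 2 ** 12
f = 0.25 - (math.floor(math.log2(n)) + 1) / n
b = 0.5 * (1 - (1 - 1 / (2 * n)) ** (n * f))
print(b)   # ~0.058, close to the 0.0564 reported in eq. (24)
```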
P-RUQB Analysis
Replacing $m$ in equation (14) by $x\,n\,f(\varepsilon)$, where $x$ is a multiplication factor representing the number of permutations to be used, results in

$E = 2\,n\,p_{01}\left(1 - e^{-\frac{x f(\varepsilon)}{2}}\right)$ (25)
(25)
From equation (25), as the multiplication
factor x increase the gain E increase, i.e.:
lim
( )
x
xf
np e np

= 2 1 2
0 1
2
0 1

(26)
In this case the gain cannot be 2
0 1
np

; for
which, an infinity permutations is used, while
the maximum available permutations is only
Table 1. Average of 10 runs for each x without errors, $n = 2^{12}$

x   G       G%     E     E%     r        x    G       G%     E     E%     r
1   236.5   5.70   240   5.87   0.02896  16   1738.8  42.45  1770  43.23  0.01804
2   450.9   11.00  453   11.06  0.00542  17   1765.6  43.10  1803  44.02  0.0209
3   615.7   15.03  640   15.63  0.03839  18   1785.0  43.57  1832  44.73  0.02593
4   797.7   19.47  805   19.67  0.01017  19   1814.2  44.29  1857  45.34  0.02316
5   916.1   22.36  951   23.23  0.03745  20   1835.6  44.81  1879  45.89  0.02353
6   1056.4  25.79  1080  26.38  0.02237  21   1843.9  45.01  1899  46.37  0.02933
7   1158.3  28.27  1194  29.15  0.03019  22   1877.3  45.83  1917  46.80  0.02073
8   1249.7  30.51  1294  31.61  0.0348   23   1898.5  46.35  1932  47.17  0.01738
9   1340.1  32.71  1383  33.76  0.0311   24   1881.8  45.94  1946  47.75  0.03791
10  1431.2  34.94  1461  35.67  0.02047  25   1920.7  46.89  1958  47.80  0.01904
11  1492.0  36.42  1530  37.36  0.02516  26   1922.1  46.92  1968  48.06  0.02372
12  1547.9  37.79  1591  38.84  0.02703  27   1957.2  47.78  1977  48.28  0.01036
13  1609.6  39.29  1644  40.15  0.02142  28   1947.6  47.54  1986  48.49  0.01959
14  1644.2  40.14  1692  41.31  0.02832  29   1969.3  48.07  1993  48.66  0.01212
15  1686.9  41.18  1733  42.33  0.02717  30   1975.0  48.23  1999  48.82  0.01209
Note also that the gain does not increase linearly with the multiplication factor; see Figure 2.
Assume that the best (optimal) value $x^{*}$ of the multiplication factor (an integer) is reached when the rate of change of the gain $E$ at iteration $x$ is half its initial rate of change (i.e., at $x = 1$). Differentiating equation (25) with respect to $x$:

$\frac{dE}{dx} = n\,p_{01}\,f(\varepsilon)\,e^{-\frac{x f(\varepsilon)}{2}}$ (27)
Setting this equal to half the derivative at $x = 1$ and solving for the optimal $x$:

$n\,p_{01}\,f(\varepsilon)\,e^{-\frac{x^{*} f(\varepsilon)}{2}} = \frac{n}{2}\,p_{01}\,f(\varepsilon)\,e^{-\frac{f(\varepsilon)}{2}}$ (28)

$x^{*} = 1 + \frac{2 \ln 2}{f(\varepsilon)}$ (29)
Note that the optimal value of $x^{*}$ depends on the error $\varepsilon$; the following two subsections discuss running P-RUQB in the absence and in the presence of errors.
P-RUQB without Errors
Assume the quantum channel is error free and Eve is not eavesdropping on it. For BB84, $|F| = \frac{n}{2}$ and $f(\varepsilon) \approx \frac{1}{4}$, so the optimal value based on equation (29) with no errors is

$x^{*}_{ne} \approx 8 \ln 2 + 1 \approx 7$ (30)

and the optimal expected gain $E^{*}$ is

$E^{*} \approx 2\,n\,p_{01}\left(1 - e^{-\frac{7}{8}}\right)$ (31)

$\approx 1.17\,n\,p_{01}$ (32)
BB84 with RUQB has $p_{01} = \frac{1}{4}$, so the optimal gain is estimated as

$E^{*}_{ne} \approx 0.29\,n$ (33)
Figure 2. Gain values for multiple P-RUQB iterations, up to 30 iterations, for a number of photons $n = 2^{12}$ (see Table 1)
See Figure 2. Table 1 lists the average of 10 runs for each selected $x$.
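The no-error optimum can be evaluated directly from equations (29) and (25); a small sketch (rounding $x^{*}$ up to a whole number of permutations is our own assumption):

```python
import math

# Optimal multiplication factor for BB84 + P-RUQB with no errors:
# f ~ 1/4, so eq. (29) gives x* = 1 + 8 ln 2 ~ 6.5, rounded up to 7,
# and eqs. (31)-(33) then give an expected gain of roughly 0.29 n.
f, p01 = 0.25, 0.25
x_opt = 1 + 2 * math.log(2) / f            # eq. (29)
x_use = math.ceil(x_opt)                   # whole permutations only
gain_per_qubit = 2 * p01 * (1 - math.exp(-x_use * f / 2))   # eq. (25) / n
print(x_opt, x_use, gain_per_qubit)        # ~6.55, 7, ~0.29
```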
If Eve is eavesdropping on the quantum channel but no errors are detected, this may indicate that she is using a beam-splitting (photon number splitting, PNS) attack, where each pulse may contain two or more photons. The estimated number of leaked bits according to Bennett and Brassard (1992) is related only to the physical (and statistical) characteristics of the photon pulse generator, namely the mean number of photons per pulse $\mu$: the leak is $\mu |F|$ plus a suggested safety factor of five standard deviations, $5\sqrt{\mu |F| (1 - \mu)}$, so the information gained by an eavesdropper is

$I_e = \mu |F| + 5\sqrt{\mu |F| (1 - \mu)}$ (34)
A more comprehensive analysis by Gilbert and Hamrick (2000) includes other physical parameters: the attenuation of the quantum channel between Alice, Eve, and Bob; Eve's control parameters, namely the degree to which she can adjust the transparency of the quantum channel and the number of photons $u$ that she chooses to remove from a multi-photon pulse for future use; and the efficiency of Bob's detector (denoted here $\eta$). The leak is related to the number of pulses containing more than two photons, decreased by the inefficiency controlled by a factor $S$ that depends on these parameters, so the information gained by an eavesdropper becomes

$I_e = \frac{\mu^{2} M}{2}\left(1 - (1 - \eta)\,S\right)$ (35)

where $M$ is the total number of laser pulses. This leak is included in equation (16).
Returning to equation (29), the optimal multiplication factor of P-RUQB for BB84, where $f(\varepsilon) = \frac{1}{4} - \frac{I_e}{n}$, is

$x^{*}_{ne,I_e} = 1 + \frac{8 \ln 2}{1 - \frac{4 I_e}{n}}$ (36)
P-RUQB with Errors
If the quantum channel is not error free (but Eve is not eavesdropping, and ignoring the physical properties of the equipment), then errors will appear within the sifted key and must be eliminated. The error rate (percentage) $\varepsilon$ can be estimated by comparing random bits of the key; Shannon's theorem (Shannon, 1948) estimates the minimum number of bits to be eliminated by error correction as $\frac{|F|\,h(\varepsilon)}{2}$, where $h(\varepsilon) = -(\varepsilon \log_2 \varepsilon + (1 - \varepsilon)\log_2(1 - \varepsilon))$ and $|F| = \frac{n}{2}$ for BB84; for large values of $n$,

$f(\varepsilon) \approx \frac{1}{4}\left(1 - h(\varepsilon)\right)$ (37)
Therefore the optimal $x$ is

$x^{*}_{e} \approx 1 + \frac{8 \ln 2}{1 - h(\varepsilon)}$ (38)

and the optimal expected gain $E^{*}_{e}$ is

$E^{*}_{e} \approx \frac{n}{4}\left(2 - e^{-\frac{1 - h(\varepsilon)}{8}}\right)$ (39)
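The sketch below evaluates equations (37)-(39) for a few error rates (the helper names are illustrative):

```python
import math

def h(eps):
    """Binary entropy, the Shannon limit used in eqs. (37)-(39)."""
    if eps in (0.0, 1.0):
        return 0.0
    return -(eps * math.log2(eps) + (1 - eps) * math.log2(1 - eps))

def p_ruqb_error_case(eps):
    f = 0.25 * (1 - h(eps))                        # eq. (37)
    x_opt = 1 + 2 * math.log(2) / f                # eq. (38)
    gain = (2 - math.exp(-(1 - h(eps)) / 8)) / 4   # eq. (39), per qubit
    return x_opt, gain

for eps in (0.0, 0.02, 0.05):
    print(eps, p_ruqb_error_case(eps))   # x* grows as the error rate grows
```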
The effect of the error percentage on the optimal $x$ and the expected gain can be seen in Figure 3 and Figure 4, respectively. Note that, due to errors, the optimal $x$ must be increased to maintain approximately the same expected gain percentage. Gilbert and Hamrick (2000) include the physical parameters of the devices in the estimation of errors without eavesdropping:

$\varepsilon \approx \frac{r_c}{2} + \frac{r_d}{2}\left(\frac{M}{\cdots}\right)e^{\cdots}$ (40)
where $r_c$ is the intrinsic quantum-channel error rate and $r_d$ is the dark-count rate of Bob's detector.
Following Bennett and Brassard (1992), if Eve is eavesdropping on the quantum channel and the channel is not error free (using only $\mu$ and $\varepsilon$), then

$I_e = (\mu + 2\varepsilon)\,|F| + 5\sqrt{|F|\left(\mu(1 - \mu) + 4\varepsilon(2 - 2\varepsilon)\right)}$ (41)

This value should be used in equation (36); a more comprehensive analysis can be found in Gilbert and Hamrick (2000).
SECURITY ANALYSIS
It is necessary to identify whether P-RUQB leaks more information than RUQB regarding both the initial and final keys. Both RUQB and P-RUQB allow Eve to possess all the sets $L_p$ containing the indexes of locations within the initial key, and such sets contribute additional bits to the final key. In this research the security of RUQB is analyzed first and then compared to that of P-RUQB.
RUQB Security Analysis
In the RUQB algorithm an eavesdropper is able to obtain the list of indexes that Alice sends to Bob. Let $L = \{i_1, i_2, \ldots, i_s\}$ denote the list of indexes, $x_{l_j}$ the new bit value calculated from the list $L$ starting at index $i_j$ using a maximum of $l$ sequential bits, and $l_j$ the calculated location for $x_{l_j}$. An eavesdropper can then construct the following system of equations:
$x_{l_1} = x_{i_1} \oplus x_{i_1+1} \oplus x_{i_1+2} \oplus \cdots \oplus x_{i_1+l-1}$
$l_1 = \left(x_{i_1} 2^{0} + x_{i_1+1} 2^{1} + x_{i_1+2} 2^{2} + \cdots + x_{i_1+l-1} 2^{l-1}\right) \bmod n$

$x_{l_2} = x_{i_2} \oplus x_{i_2+1} \oplus x_{i_2+2} \oplus \cdots \oplus x_{i_2+l-1}$
$l_2 = \left(x_{i_2} 2^{0} + x_{i_2+1} 2^{1} + x_{i_2+2} 2^{2} + \cdots + x_{i_2+l-1} 2^{l-1}\right) \bmod n$

$\vdots$

$x_{l_s} = x_{i_s} \oplus x_{i_s+1} \oplus x_{i_s+2} \oplus \cdots \oplus x_{i_s+l-1}$
$l_s = \left(x_{i_s} 2^{0} + x_{i_s+1} 2^{1} + x_{i_s+2} 2^{2} + \cdots + x_{i_s+l-1} 2^{l-1}\right) \bmod n$

Figure 3. The effect of the error percentage on the optimal $x^{*}$
Note that the size of the list is $|L| = s = g\,n$, where $g$ is the RUQB gain (every element of the list contributes one bit of gain). From the above system, the number of equations is

$\mathbb{E} = 2s$ (42)

where each element of the list $L$ yields two equations, one for the location $l_j$ and one for the result $x_{l_j}$. The number of variables in the system is

$\mathbb{V} = s\,(2 + d\,l)$ (43)

where each element of $L$ contributes two variables on the left-hand side of the equations plus, on average, $d\,l$ variables per equation, $l = \lfloor \log_2 n \rfloor + 1$, and $d$ is a percentage that depends on the distance between the elements of $L$ and can be calculated as
$d = \begin{cases} \bar{d} & \text{if } \bar{d} < 1 \\ 1 & \text{if } \bar{d} \ge 1 \end{cases}$ (44)

where

$\bar{d} = \frac{1}{l\,(s - 1)} \sum_{k=1}^{s-1} \left(i_{k+1} - i_{k}\right)$ (45)
It has a maximum value of 1 (see equation (44)), denoting that there is no intersection between the variables used by the different equations. The ratio $R$ between the number of equations and the number of variables indicates the ability of an eavesdropper to solve this system and gain full knowledge of the original key and the gained bits:

$R = \frac{\mathbb{E}}{\mathbb{V}}$ (46)
$= \frac{2}{2 + d\,(\log_2 n + 1)}$ (47)

Figure 4. The effect of the error percentage on the gain percentage
Since $d$ depends on the distance between the elements of $L$, it is never 0, as each element differs from the next by at least one location; it is also uncommon for it to be 1, since then all elements would be far from each other, so $0 < d \le 1$. The limit of the ratio as $n$ approaches infinity is 0: the number of variables far exceeds the number of equations, and hence the eavesdropper cannot solve the system of equations.
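Equation (47) can be evaluated to see how the ratio shrinks with $n$; a small illustrative computation:

```python
import math

# Equations-to-unknowns ratio faced by an eavesdropper, eq. (47):
# R = 2 / (2 + d * l) with l = floor(log2 n) + 1 and 0 < d <= 1.
def ratio(n, d):
    l = math.floor(math.log2(n)) + 1
    return 2 / (2 + d * l)

for n in (2**12, 2**20, 2**30):
    print(n, ratio(n, d=0.5))   # shrinks toward 0 as n grows
```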
P-RUQB Security Analysis
For P-RUQB the number of equations is far larger than for RUQB. Applying the same reasoning, if the lists of indexes are

$L_j = \{i_{j1}, i_{j2}, \ldots, i_{j s_j}\}$ (48)

then the number of equations is

$\mathbb{E} = 2 \sum_{j} |L_j| = 2\,g'\,n$ (49)
where $|L_j|$ is the length of the list $L_j$ and $g'$ is the P-RUQB gain. The number of variables is

$\mathbb{V} = h\,\frac{|F|}{2} + 2 \sum_{j} |L_j| = h\,\frac{|F|}{2} + 2\,g'\,n$ (50)
where $|F|$ is the initial sifted key size and $h$ is the fraction of the initial key used, $0 < h \le 1$, so the ratio is

$R = \frac{\mathbb{E}}{\mathbb{V}} = \frac{2\,g'\,n}{h\,\frac{|F|}{2} + 2\,g'\,n}$ (51)
The limit of $R$ as $n$ reaches infinity depends on $h$; since $|F| = \frac{n}{2}$ for BB84, the limit is

$\lim_{n \to \infty} R = \frac{8 g'}{8 g' + h}$ (52)
and since $g'$ for BB84 with P-RUQB is 0.29, the limit is

$\lim_{n \to \infty} R = \frac{2.32}{2.32 + h}$ (53)
Based on $h$, the ratio $R$ ranges from 0.6987 (for $h = 1$, i.e., all initial key bits are used) to 0.958 (for $h = 0.1$, i.e., only 0.1 of the initial key bits are used). In practice approximately all initial key bits will be used, so the ratio is about 0.7. This gives an eavesdropper more information than RUQB but still not enough to solve the system of equations for large values of $n$, unless one or both of the following conditions hold:

1. All the locations within the initial key that happen to be within $L_j$ are known (i.e., a specific 11% of the initial key is known by Eve).
2. A sequence of $l$ or more bits within the initial key is known and the start of this sequence lies within $L_j$ (i.e., with a probability of 11% this sequence contributes a new bit).

Then the system of equations becomes solvable.
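A one-line numeric check of equation (53) for several values of $h$:

```python
# Limiting equations-to-unknowns ratio for P-RUQB over BB84, eq. (53).
for h in (0.1, 0.5, 1.0):
    print(h, round(2.32 / (2.32 + h), 4))   # 0.9587, 0.8227, 0.6988
```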
Randomness Effect
Both algorithms (RUQB and P-RUQB) depend on interpreting the sequence of bits that contribute to the initial key (the original key). The higher the randomness of the generated bits, the higher the expected gain, because randomness makes the probability of interpreting any location equal; see equation (11), where both $p_{01}$ and $D$ are constants that depend on randomness: the former is QKD-protocol dependent, while the latter is only randomness dependent (the randomness of the generated bits) and equal to $\frac{1}{n}$.
If the bits are not completely random, then some interpretations (for new locations) are more likely than others, which results in redundant interpretations and hence degrades the obtained gain. Returning to equation (11), both $p_{01}$ and $D$ are degraded by the same redundancy factor (percentage) $r \in [0, 1]$ (where 0 means no redundancy in location interpretations and 1 means full redundancy, i.e., all interpretations point to the same location; in general $r$ is close to 0, not 1), so

$p_{i(i+1)} = (1 - r)\left(p_{01} - \frac{i}{n}\right)$ (54)
Replacing $(1 - r)$ by $c$ and recalculating the expected gain (equation (12)) introduces the same factor into the expected gain:

$E_r = 2\,c\,n\,p_{01}\left(1 - \left(1 - \frac{1}{2n}\right)^{m}\right)$ (55)
where $c$ is the inverse of the redundancy factor (i.e., 0 means full redundancy and 1 means no redundancy; $c$ is now close to 1, not 0). As Table 1 and Figure 2 show, there is a small difference between the calculated and the actually obtained gain; $r$ varies between 0.0054 and 0.0384 with an average of 0.0234. See Figure 5.
Figure 5. Calculated r based on the difference between expected and calculated gain as a result of randomness for each x

Randomness is used to improve security; however, the gain is affected by randomness, and at the same time the numbers of both variables and equations are also degraded. As a source of randomness, some QKD schemes and implementations use entangled (EPR-based) quantum states as the source of both the information carrier and the randomness. Pironio et al. (2010) noted that random numbers generated and used during a QKD protocol based on EPR states can be certified by the violation of Bell's inequalities (Bell, 1964); Gerhardt et al. (2011) showed that this certification can be refuted experimentally if the violation is not loophole-free, i.e., if there is classical communication between the communicating parties used to fake results or some kind of shared randomness between them; both result in a statistical violation of Bell's inequalities and hence a false sense of randomness and security. Another work (Bouda et al., 2012) showed that the security of QKD is ruined if an eavesdropper has even partial and limited access to the source of randomness used by the protocol.
CONCLUSION
QKD protocols use quantum phenomena as a source of randomness to generate bits or qubits, and they sometimes suffer from a gain problem. In this research it was found that P-RUQB achieves an enhancement over RUQB by recovering around 50-60% of the unused bits, which equals around 30% gain; in other words, some quantum bits that would otherwise be lost are recovered. The recovery is achieved by re-examining the initial 50% of the bits assumed to be lost, based on the process of finding relationships between the generated random bits.

A comprehensive analysis of both the gain and the security of the two algorithms (RUQB and P-RUQB) was presented; it was shown that P-RUQB achieves higher gain than RUQB, while the latter maintains better security. Both RUQB and P-RUQB are randomness dependent: the higher the randomness, the higher the gain. The complexity of P-RUQB was shown to be $O\!\left(x^{*}\,\frac{|F|}{2}\,\log n\right)$; for BB84, with a near-optimal number of permutations $x^{*} \approx 8$, the algorithm is of the order $O(n \log n)$.
REFERENCES
Abu-ayyash, A. M., & Ajlouni, N. (2008). QKD: Recovering unused quantum bits. In Proceedings of the 3rd IEEE International Conference on Information and Communication Technologies, Damascus, Syria (pp. 1-5).

Abu-Ayyash, A. M., & Jabbar, S. (2003). Fraction integer method: Calculating multiplicative inverse. In Proceedings of the 7th World Multi-conference on Systemics, Cybernetics and Informatics, Orlando, FL (pp. 49-53).

Barrett, J., Hardy, L., & Kent, A. (2005). No signalling and quantum key distribution. Physical Review Letters, 95, 010503. doi:10.1103/PhysRevLett.95.010503

Bell, J. S. (1964). On the Einstein Podolsky Rosen paradox. Physics, 1, 195.

Bennett, C. (1992). Quantum cryptography using any two non-orthogonal states. Physical Review Letters, 68(21), 3121-3124. doi:10.1103/PhysRevLett.68.3121

Bennett, C. H., & Brassard, G. (1984). Quantum cryptography: Public key distribution and coin tossing. In Proceedings of the IEEE International Conference on Computers, Systems, and Signal Processing, Bangalore, India (pp. 175-179).

Bennett, C. H., Brassard, G., Salvail, L., & Smolin, J. (1992). Experimental quantum cryptography. Journal of Cryptology, 5, 3. doi:10.1007/BF00191318

Bostrom, K., & Felbinger, T. (2002). Deterministic secure direct communication using entanglement. Physical Review Letters, 89, 187902. doi:10.1103/PhysRevLett.89.187902

Bouda, J., Pivoluska, M., Plesch, M., & Wilmott, C. (2012). Weak randomness completely trounces the security of QKD. Retrieved from http://arxiv.org/abs/1206.1287

Chou, C. W., Polyakov, S. V., Kuzmich, A., & Kimble, H. J. (2004). Single-photon generation from stored excitation in an atomic ensemble. Physical Review Letters, 92, 213601. doi:10.1103/PhysRevLett.92.213601

Diffie, W., & Hellman, M. E. (1976). New directions in cryptography. IEEE Transactions on Information Theory, 22, 644-654. doi:10.1109/TIT.1976.1055638

Einstein, A., Podolsky, B., & Rosen, N. (1935). Can quantum-mechanical description of physical reality be considered complete? Physical Review, 47, 777-780. doi:10.1103/PhysRev.47.777

Ekert, A. K. (1991). Quantum cryptography based on Bell's theorem. Physical Review Letters, 67, 661-663. doi:10.1103/PhysRevLett.67.661
Gerhardt, I., Liu, Q., Lamas-Linares, A., Skaar, J., Scarani, V., Makarov, V., & Kurtsiefer, C. (2011). Experimentally faking the violation of Bell's inequalities. Physical Review Letters, 107, 170404. doi:10.1103/PhysRevLett.107.170404

Gilbert, G., & Hamrick, M. (2000). Practical quantum cryptography: A comprehensive analysis (Tech. Rep.). Bedford, MA: MITRE.

Gröblacher, S., Jennewein, T., Vaziri, A., Weihs, G., & Zeilinger, A. (2005). Experimental quantum cryptography with qutrits. New Journal of Physics, 8, 75. doi:10.1088/1367-2630/8/5/075

Hwang, W. Y., Koh, I. G., & Han, Y. D. (1997). Quantum cryptography without public announcement of bases. Physics Letters. [Part A], 244(6), 489-494. doi:10.1016/S0375-9601(98)00358-2

Kak, S. (2006). A three-stage quantum cryptography protocol. Foundations of Physics Letters, 19, 293-296. doi:10.1007/s10702-006-0520-9

Kanamori, Y., Yoo, S.-M., & Al-Shurman, M. (2005, March 18-20). A quantum no-key protocol for secure data communication. In Proceedings of the 43rd Annual Southeast Regional Conference, Kennesaw, GA (pp. 92-93).

Kuang, L.-M., & Zhoul, L. (2004). Generation of atom-photon entangled states in atomic Bose-Einstein condensate via electromagnetically induced transparency. Retrieved from http://arxiv.org/pdf/quant-ph/0402031.pdf

Lucamarini, M., & Mancini, S. (2004). Secure deterministic communication without entanglement. Physical Review Letters, 94(14), 140501. doi:10.1103/PhysRevLett.94.140501

Menezes, A., van Oorschot, P., & Vanstone, S. (1997). Handbook of applied cryptography. Boca Raton, FL: CRC Press.

Pironio, S., Acín, A., Massar, S., Boyer de la Giroday, A., Matsukevich, D. N., & Monroe, C. (2010). Random numbers certified by Bell's theorem. Nature, 464, 1021-1024. doi:10.1038/nature09008

Price, W., Chissick, S., & Heisenberg, W. (1977). The uncertainty principle and foundations of quantum mechanics: A fifty years survey. New York, NY: John Wiley & Sons.

Rivest, R. L., Shamir, A., & Adleman, L. (1978). A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21, 120-126. doi:10.1145/359340.359342

Santori, C., Fattal, D., Vuckovic, J., Solomon, G. S., & Yamamoto, Y. (2004). Generation of single photons and correlated photon pairs using InAs quantum dots. Fortschritte der Physik, 52(11-12), 1180-1188. doi:10.1002/prop.200410188

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379-423, 623-656.

Shannon, C. E. (1949). Communication theory of secrecy systems. The Bell System Technical Journal, 28, 656-715.

Tisa, S., Tosi, A., & Zappa, F. (2007). Fully-integrated CMOS single photon counter. Optics Express, 15(6), 2873-2887. doi:10.1364/OE.15.002873

Vernam, G. S. (1926). Cipher printing telegraph systems for secret wire and radio telegraphic communications. Journal of the IEEE, 55, 109-115.

Wiesner, S. (1983). Conjugate coding. Sigact News, 15(1), 78. doi:10.1145/1008908.1008920

Yang, C.-N., & Kuo, C.-C. (2002). A new efficient quantum key distribution protocol: Quantum optics in computing and communications. In Proceedings of the SPIE Conference (Vol. 4917).
Abdulla M. Abu-Ayyash is currently the information security officer at CBJ. He holds BSc and MSc degrees in computer science from Jordan University (1987 and 1999, respectively) and a PhD in CIS from the Arab Academy - Jordan (2008). He was formerly responsible for the design and implementation of the initial internal and external network structure, and for establishing and managing the technical team at CBJ. He has published papers in conferences related to cryptographic key generation and QKD, and is a member of IEEE, ACM, and the IEEE Computer Society.

Naim Ajlouni is currently the Chairman of the Assuit University Program in Jordan and the Palestinian territory. Dr. Ajlouni obtained a BSc in Electrical and Electronics Engineering from Bolton University, UK, in 1983, an MS in Robot Control from Salford University, UK, in 1992, and a PhD in Intelligent Control from Salford University, UK, in 1995. Dr. Ajlouni worked for a company called Dextralog Ltd. in the UK as project manager (1983 to 1991) and in a number of educational institutes including Salford University, Applied Science University, Amman Arab University, and Al Balqa' Applied University. He has held a number of academic positions including Head of Department, Dean, Vice President, and President. Dr. Ajlouni has published 35 papers in international refereed journals and 43 papers in international conferences and is a member of IEEE and the International Affairs Council Jordan.
Autonomic Healing for Service Specific Overlay Networks

Ibrahim Al-Oqily, Hashemite University, Jordan
Bassam Subaih, Al-Balqaa Applied University, Jordan
Saad Bani-Mohammad, Al al-Bayt University, Jordan
Jawdat Jamil Alshaer, Al-Balqa Applied University, Jordan
Mohammed Refai, Zarqa Private University, Jordan

ABSTRACT

Service Specific Overlay Networks (SSONs) have recently attracted great interest and have been extensively investigated in the context of multimedia delivery over the internet. SSONs are virtual networks constructed on top of the underlying network, and they have been proposed to provide and improve services not provided by other traditional networks to the end users. The increased complexity and heterogeneity of these networks, in addition to ever-changing conditions in the network and the different types of fault that may occur, make their control and management by human administrators more difficult. Therefore, the self-healing concept was introduced to handle these changes and to assure highly reliable and dependable network system performance. Self-healing aims at ensuring that the service will continue to work regardless of defects that might occur in the network. This paper introduces literature in the area of self-healing overlay networks and presents their basic concepts, requirements, and architectures. In addition, the authors present a proposed self-healing architecture for multimedia delivery services. The proposed solution is oriented to discovering new approaches for monitoring, diagnosing, and recovering services, thus achieving self-healing.

DOI: 10.4018/jitwe.2012040104

Keywords: Autonomic Computing, Overlay Network, Quality of Service, Self-Healing, Service Specific Overlay Network

1. INTRODUCTION
A service-specific overlay network (Shmid, Hartung, & Kampmann, 2005; Capone, Elias, & Martignon, 2008) is an overlay network built on top of the physical network, designed to provide end-to-end quality-of-service guarantees in the internet and to facilitate the creation and deployment of value-added functionality for the service without needing support from the underlying network. It consists of Media Ports (MPs), Media Servers (MSs), and Media Clients (MCs). MPs are specific intermediate nodes that perform service-specific data forwarding and control functions in order to enable the correct media delivery of the service to the end user as required. The SSON also comprises the MS, the provider of the requested service, and the MC,
which requests the service. However, designing services to meet users' specific requirements implies that a huge number of services will exist in the network, so managing them is not an easy task. Management complexity can be tackled using the IBM self-management concept (IBM Corporation, 2006). IBM introduced this concept through Autonomic Computing (AC) to enable systems to manage themselves according to administrative objectives. The term autonomic is inspired by the human autonomic nervous system.
An AC system simplifies the design and development of systems that can adapt themselves to changes in their environment to meet requirements of performance, fault tolerance, reliability, and security with minimum human intervention. The result is a great improvement in management costs and reduced time and skill requirements to perform the tasks; hence IT professionals can focus on improving their overall services rather than on managing them. AC divides self-management into four functional areas (Kephart & Chess, 2003; Parashar & Hariri, 2005): 1) self-configuration, where an autonomic system should be able to configure components automatically to adapt them to varying conditions; 2) self-healing, where an autonomic system should be able to detect, diagnose, and repair potential problems resulting from failures in hardware and software; 3) self-optimization, where an autonomic system should be able to monitor and seek ways to improve its operations and ensure optimal functioning; and 4) self-protection, where an autonomic system should be able to detect, identify, and protect its resources from malevolent attacks and cascading failures.
In this paper, self-healing systems for overlay networks are surveyed. Self-healing is the property that allows a system to perceive that it is not working properly and, without human intervention, make the necessary adjustments that automatically restore the services affected by a failure in a manner that is seamless to the end systems. Defects may occur because overlay nodes join or leave the network, because of congestion on overlay links, or because of ever-changing routing information. This is in addition to the rapid evolution of overlay network technologies and the various proposed schemes that are based on self-healing. As the number of mobile users increases, the demand for self-healing overlay services will also increase. This paper serves to capture a snapshot of current design trends and techniques in self-healing overlay networks. The goal is not to compare one solution with another, but to identify the common design goals and put them in context. In this paper we also propose a self-healing mechanism for services that are built to deliver media and designed to meet users' particular requirements. Our proposed solution is oriented to discovering new approaches for monitoring, diagnosing, and recovering services, thus achieving self-healing.
The rest of this paper is organized as fol-
lows. Section 2 discusses self-healing archi-
tecture models and requirements. Section 3
outlines and reviews the proposed self-healing
approaches. Section 4 briefly introduces our
proposed overlay self-healing architecture while
Section 5 provides the results of the evaluation
experiments. In Section 6 we conclude the paper.
2. SELF-HEALING
ARCHITECTURE MODELS
AND REQUIREMENTS
The architecture and quality requirements of self-healing systems are proposed in Neti and Muller (2007), which presents a framework that can facilitate both the design and the maintenance of self-healing systems. It identifies the quality requirements for any self-healing system, characterized as traditional quality attributes and autonomic-specific attributes. Traditional attributes include reliability and maintainability (i.e., the main properties of any self-healing system). Reliability means that the self-healing system should be fault tolerant: the specified service has to be delivered in spite of the presence of faults, and the system must be robust enough to use a suitable recovery technique to restore its normal operation. Maintainability means that the system architecture must be scalable and flexible, which allows self-managing systems to be modified without breaking them. The autonomic-specific attributes include a set of requirements such as: 1) support for detecting exceptional system behavior, the ability to monitor and recognize deviating behavior with respect to quality of service; 2) support for failure diagnosis, the ability to find the source of failures; and 3) support for testing of correct behavior, the ability to test and verify that autonomic elements are working correctly.
The authors in Mikic-Rakic, Mehta, and Medvidovic (2002) have identified a set of requirements that an architectural style for self-healing systems should satisfy: 1) adaptability, meaning that the style should be easily modified in either its structure or its interactions; 2) dynamicity, meaning that the system should adapt to any changes during execution; 3) robustness, meaning that the style must be able to respond effectively to exceptional conditions such as internal failures and external malicious attacks; and 4) awareness, meaning that the style should be able to check system performance and identify any violation of that performance.
Garlan and Schmerl (2002) have proposed a self-healing architecture model for system monitoring, fault detection, and executing repair actions. Their model concentrates on using a number of external components to monitor system run-time behaviour and determine when a system is functioning normally by comparing the monitored values with the properties of an architecture model. A constraint violation is used to induce an adaptation process when the system deviates from the expected ranges, making a particular repair and choosing that repair based on architectural styles.
Another architecture-based approach to self-healing systems is proposed in Dashofy, Hoek, and Taylor (2002). Creating self-healing systems here is based on a software architecture that uses software components and connectors for repair, where changes and repairs to a running software system are done at the architecture level. It presents tools and methods for implementing architecture-based self-healing systems. This approach focuses on describing and executing architectural changes after the system has been deployed, so it does not require pre-specified repair operations.
To ensure that the overlay network continues to achieve its goals and objectives and the service continues to work regardless of any fault that might occur in the network, a self-healing system should satisfy the following requirements:

1. The system should provide support for monitoring its performance and recognizing anomalies with respect to its performance parameters.
2. The system should have the ability to locate the source of a failure.
3. The system should be able to respond to varying conditions and have flexible mechanisms to deal with these violations.
4. The system should execute the appropriate mechanisms to bring itself back to its normal state of operation.
3. OVERLAY NETWORKS
SELF-HEALING APPROACHES
An overlay network is an application-layer network implemented on top of a physical network (Doval & O'Mahony, 2003; Al-Oqily & Karmouch, 2007). The nodes of the overlay are end-hosts of the underlying network; their function is to receive and forward packets in an application-specific way, and each host in the overlay is connected to other hosts by logical connections. Many overlay networks can be built on top of the same physical network. Overlay networks are exposed to different types of failures: network nodes may fail, links may get congested, routing information may change over time, and new users may join or leave the network dynamically (Dingo, Balasingham, & Bouvry, 2009).
Some approaches intend to present a generic repair approach for overlay networks. For instance, the authors in Porter, Taiani, and Coulson (2006) propose a self-repair mechanism to detect and bypass failures of overlay network nodes. The proposed scheme is formulated as an algorithm that operates in three main stages and depends on the existence of the following three major services (Porter, Coulson, & Hughes, 2006): 1) a distributed backup service, used to restore an overlay node's state on another overlay node in case that node fails; this state is represented using two basic elements, Accessinfo and Nodestate records, where an Accessinfo record may represent a node ID or a link to a neighboring node in the overlay, and a Nodestate record represents other state that an overlay node is interested in having restored when the node is recovered; 2) a failure detection service, used to monitor and detect node failures and to inform the recovery service about a failure; and 3) a recovery service, used to execute the appropriate repair on the failed section of the overlay. It is assumed that these sub-service instances exist in each overlay node and can be implemented in a form the overlay can use to monitor the nodes of another overlay network.
The proposed scheme in Porter, Taiani, and Coulson (2006) is executed by each node in the overlay network. To enter the first stage of the algorithm, the node must be informed by its failure detection service instance that one of its neighbors has failed; it then tries to discover whether there are other failed nodes based on the information provided by the distributed backup service and the failure detection service. Executing this procedure determines the failed nodes neighboring that node (the "failed section") and the living nodes bordering this failed section (the "border nodes"). In the next stage a repair coordinator is selected depending on specific parameters. Finally, the node chosen as repair coordinator must execute the repair using one of two strategies: i) node restoration, selecting a suitable alternative node on which to restore the failed node's state; or ii) structured adaptation, adapting the overlay to perform the same functions without the failed nodes. To make overlay nodes interact with their generic self-repair service, the authors suggest using an API (Application Programming Interface) (Porter, Coulson, & Taiani, 2006). All management and guidance operations are executed through this API, so the proposed self-repair service can inspect the characteristics of each node and adapt the required procedures to carry out repairs, and the overlay node can provide the self-repair service with the required data (Accessinfo and Nodestate records).
A self-repair mechanism was also introduced by Stoica, Morris, Karger, Frans Kaashoek, and Balakrishnan (2001): Chord is a distributed hash-table (DHT) overlay network designed to be scalable and resilient to node failures. It assigns keys to data items and organizes the nodes into a graph that maps each data key to a node. Each node maintains a continuously updated list of the successor nodes following it in a ring structure, together with a pointer to its immediate successor; if the successor fails, the node becomes linked to the next live node in the list instead. Furthermore, Chord redundantly stores data on multiple nodes in case a node fails.
Overcast (Jannotti, Gifford, Johnson, Frans Kaashoek, & O'Toole, 2000) is a tree-based overlay multicast system that employs a self-repair mechanism to maintain its tree structure when one of its nodes fails. Overcast organizes its nodes into a distribution tree rooted at the source, which can efficiently adapt itself to changing network conditions such as congestion problems and node failures. Each node maintains a list of its ancestors, so that if its current parent fails and communication with the network is lost, it can find an alternative parent from the list and attempt to locate and rejoin a surviving ancestor.
Another generic repair approach for overlay networks was proposed in Baud and Bellot (2009). It was developed to ensure and enhance the routing capabilities between overlay nodes in the presence of underlying network failures. The proposed approach can automatically reorganize itself and dynamically reconfigure the virtual links between the nodes without any additional human intervention for its maintenance. The authors suppose that the network consists of a set of groups, each group being a set of fully virtually connected nodes. The groups are organized into a chain, in such a way that two sequential groups in the chain have a common node. To obtain an efficient and flexible routing property, the group density must be high, where the density is the minimum number of physical network element failures that will isolate a node from the other nodes; achieving this requires reorganizing the groups by splitting the high-density groups and allowing them to join and enlarge the low-density groups.
Self-Initiated and Self-Maintained Overlay Networks (SIMON) (Elaoud, McAuley, Kim, & Chennikara-Varghese, 2005) were proposed to provide additional enhancement features such as multipath routing. SIMON consists of a set of domains where each domain has two servers: a Local Group Server (LGS), which maintains a list of its local domain members, and a Domain Virtual Network Service (DVNS), which translates a SIMON name into an IP address. In addition, a Global Group Server (GGS) manages the work of the local group servers in each domain. To maintain connectivity between the member nodes and their server, the LGS periodically sends a probe message to its members every pre-specified amount of time. If a member node does not receive a probe message from its LGS because of server or link failures, it executes a set of steps. First, every node sets a timeout timer to a predefined value, tries to declare itself an LGS (some member nodes have the ability to be a server), and sends a probe message to the other member nodes. Any node that receives the probe message stops its timer and responds by reporting to the new LGS and forwarding the probe message to its neighbours. The node that declares itself an LGS also informs the GGS of that declaration; if no other LGS is reachable by the GGS, it accepts the request, otherwise the GGS responds by sending the IP address of the existing LGS. If the timer expires and no new LGS has been declared, the member node contacts the DVNS to find an appropriate LGS.
Moreover, there are several self-healing techniques used in other fields, such as self-healing wireless sensor networks, mobile ad-hoc networks, and web services. All the presented algorithms assume specific parameters drawn from their targeted environment, which renders them unsuitable for our proposed environment, because each SSON consists of an MC, an MS, and a set of MPs that provide a specific service, and each SSON is responsible for providing a certain specific service that should meet the user requirements. In addition, a repair in a certain SSON might affect the normal operation of other coexisting SSONs. As a result, this research work suggests new ways to overcome the limitations of the existing algorithms by solving the self-healing problem. A detailed survey of self-healing architectures is presented in Al-Oqily, Subaih, Bani-Mohammad, and Alshaer (2012).
4. PROPOSED ARCHITECTURE
In this section we briefly present our proposed self-healing architecture. It is based on the work proposed in Al-Oqily, Karmouch, and Glitho (2008), where an autonomic overlay networks architecture was introduced. The proposed self-healing architecture targets the kinds of applications that involve multimedia delivery services.

Our architecture is oriented to the service-specific overlay network (SSON) interaction between the media client that requests the service and the media server that provides it, in addition to the main components, the media ports (MPs), which allow processing of multimedia data along the end-to-end media delivery path and are the components most exposed to failures. An SSON must be efficient and deliver the optimal service to the end user, and since the SSON is dynamic, the behavior of each component needs continuous monitoring.
As shown in Figure 1, the proposed self-healing architecture consists of a system monitoring unit, which is responsible for detecting failures that have an impact on network and service performance; a fault management unit, which is responsible for evaluating the data provided by the system monitoring unit and consists of a Quality of Service (QoS) analysis unit, a node failure analysis unit, and a node joining analysis unit; and, finally, a recovery unit that selects and executes a set of actions to bring the system back to the normal state.
The system monitoring unit continuously monitors the behaviour of each component in the network and detects any interruption that has a negative effect on the service (presence of faulty nodes, node joining processes, etc.). It is also used for collecting and storing relevant QoS parameter values such as end-to-end delay, throughput, and delay jitter. We assume that the system monitoring unit is handled by the Service Specific Overlay Network Autonomic Manager (SSON-AM), the node responsible for managing the overlay service. The system monitoring unit alerts the fault management module when it detects faulty nodes, QoS degradation in the overlay service, or node joining.
There are two types of node failures. Unpredicted node failure results from node malfunction or system failures. In such a case, we suggest using periodic transmission of keep-alive packets (a heartbeat message mechanism) to indicate the aliveness of every node in the overlay service. The system periodically sends an are-you-alive message (i.e., a probe message) to the nodes participating in the service. If it does not receive a reply (i.e., an ACK message) from a node after a certain number of consecutive probe messages, it considers the node failed. The other type is predictive departure, in which the node informs the system before leaving the network; this is reported to the fault management unit, which takes the necessary actions to correct the situation.
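To make the mechanism concrete, a minimal sketch of such a heartbeat-based detector is given below. The class, the probe period, and the threshold of three missed probes are illustrative assumptions, not values taken from this paper.

import time

MAX_MISSED = 3        # consecutive unanswered probes before declaring failure (assumed)
PROBE_PERIOD = 2.0    # seconds between probe rounds (assumed)

class HeartbeatMonitor:
    """Detects unpredicted node failures via periodic are-you-alive probes."""

    def __init__(self, nodes, send_probe):
        # send_probe(node) -> True if an ACK message was received, False otherwise
        self.send_probe = send_probe
        self.missed = {node: 0 for node in nodes}

    def probe_round(self):
        """One probing pass; returns the nodes considered failed in this round."""
        failed = []
        for node, count in self.missed.items():
            if self.send_probe(node):
                self.missed[node] = 0          # ACK received: node is alive
            else:
                self.missed[node] = count + 1  # one more consecutive miss
                if self.missed[node] == MAX_MISSED:
                    failed.append(node)        # report to the fault management unit
        return failed

    def run(self):
        while True:                            # continuous monitoring loop
            for node in self.probe_round():
                print(f"node {node} considered failed")
            time.sleep(PROBE_PERIOD)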
The next function of the system monitoring unit is to detect QoS constraint violations. QoS is an important feature when dealing with interactive real-time services. Maintaining the service QoS is challenging due to the dynamic nature of the underlying network and the various parameters that are involved, such as end-to-end delay statistics and available bandwidth. The unit needs to compare the monitored QoS parameters against the expected performance, detect possible QoS degradation, and then adjust network resources accordingly to preserve the delivered QoS.
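As a rough illustration, the comparison of monitored values against the expected performance can be sketched as follows; the metric names and threshold values are illustrative assumptions, not figures from the paper.

# Expected performance targets (illustrative thresholds, not values from the paper)
QOS_TARGETS = {
    "end_to_end_delay_ms": 150.0,   # must stay below
    "jitter_ms": 30.0,              # must stay below
    "throughput_kbps": 512.0,       # must stay above
}

def qos_violations(measured):
    """Return the list of QoS parameters violating their targets."""
    violations = []
    if measured["end_to_end_delay_ms"] > QOS_TARGETS["end_to_end_delay_ms"]:
        violations.append("end_to_end_delay_ms")
    if measured["jitter_ms"] > QOS_TARGETS["jitter_ms"]:
        violations.append("jitter_ms")
    if measured["throughput_kbps"] < QOS_TARGETS["throughput_kbps"]:
        violations.append("throughput_kbps")
    return violations  # a non-empty list triggers the fault management module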
The last function of the system monitoring unit is to detect newly joining nodes. Media Port (MP) services are dynamic and change over time. A node that joins the network is directly useful only within its locality. This implies that services collocated with the new node's location are the only candidate services expected to benefit from it, which results in a low-cost and fast discovery service. However, new services may indeed use
Figure 1. Overlay services self-healing system architecture
new nodes that are outside their vicinity. When a node x wants to join the SSON, it goes through the procedure outlined in Figure 2 to join the network.
Any deviation detected by the system monitoring unit is reported to the fault management module, which evaluates and analyzes the provided data. It reacts to the alert sent by the system monitoring unit and, depending on the alert, performs the required evaluation of the system and determines whether the system requires a change. It consists of a QoS analysis unit, a node failure analysis unit, and a node joining analysis unit. The QoS analysis unit analyzes the QoS parameters; if there is any violation, the fault management module instructs the recovery unit to take a suitable action. The node failure analysis unit verifies whether the failed node detected by the system monitoring unit is part of another SSON; in both cases the recovery unit is notified of the deviation so that it can execute the proper action. Finally, the node joining analysis unit analyzes the node joining process and determines whether the node is participating in a service and can improve or increase performance.
The recovery unit is triggered by the fault management module. It provides and adapts the appropriate mechanisms that reflect the required change and controls the execution of a set of actions (a repair plan) stored in the policy repository to recover from a failure or to prevent QoS degradation. For example, in case of QoS degradation, the recovery unit executes the necessary mechanism to bypass or recover from the problem by finding an alternative overlay path that meets the requirements of the requested service. Moreover, this unit adapts the appropriate mechanism to bring the service back to a consistent state after a fault has been detected. Therefore, services must replace the failed node with a new overlay node. To this end, the system requires that each node participating in an overlay service back up its information state on another node. This allows the SSON-AM to restore the backup data onto the node replacing the faulty one. When a failed node is detected and the node is part of the SSON, the AM executes the procedure shown in Figure 3 to find a suitable alternative node.
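A minimal sketch of this replacement step is given below, assuming the state backup described above has already been taken. The names find_replacement and restore_state are hypothetical placeholders; the actual discovery procedure is the one outlined in Figure 3.

def replace_failed_node(sson, failed_node, backups, find_replacement):
    """Replace failed_node in the SSON and restore its backed-up state."""
    state = backups[failed_node]                     # state saved earlier on another node
    candidate = find_replacement(sson, failed_node)  # discovery procedure (Figure 3)
    if candidate is None:
        return None                                  # no suitable alternative found
    candidate.restore_state(state)                   # SSON-AM restores the backup data
    sson.replace(failed_node, candidate)             # splice candidate into the media path
    return candidate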
5. EXPERIMENTAL EVALUATION

We have conducted extensive simulation experiments to evaluate the performance and efficiency of the proposed algorithms, using a discrete event simulator. A large-scale network was used to test measurements such as network load, success rate, stretch, and disruption time. The bandwidth assigned to each node was randomly selected between 512 Kbit/s and 5 Mbit/s. The link propagation delay was fixed at 1 ms. We randomly selected 650 nodes from the network topology to act as MPs; all MPs provide the same type of service, namely an adaptation service.
Figure 2. Outline of the node joining procedure
All the presented algorithms assume specific parameters drawn from their targeted environment, which renders them unsuitable for our proposed environment. Therefore, in order to evaluate the proposed scheme, we conducted a set of experiments for two cases, the leave case and the fault case. Each experiment has its specific settings.
5.1. Leave Case Experiment
For this experiment, the number of MPs is fixed at 650 and 95 SSONs were created; every SSON delivers a different service, and the MPs that participate in one SSON cannot be used in another SSON. The number of MPs used to construct an SSON was in the range 1-4. In our simulation, 5%, 15%, 25%, and 35% of MPs, respectively, left the network, and any number of MPs in each created SSON could leave the network.
5.1.1. Network Load
Network load quantifies the cost of using the self-healing approach: the total number of generated messages used to self-heal the SSON. In Figure 4, the x-coordinate shows the percentage of departed MPs and the y-coordinate the average number of generated messages. It can be seen that the network load increases with the number of MPs that left the network. This is due to the fact that the MPs are randomly distributed across the network and each MP has a unique type of service. To find an alternative
Figure 3. Replacement discovery algorithm
Figure 4. Network load
solution, the algorithm is forced to search over longer hop distances, which increases the number of created messages. In addition, there is a probability that more than one MP leaves the same SSON; in such a case the algorithm tries to rebuild the whole SSON, which requires more messages. Moreover, increasing the number of MPs that left the network restricts the number of alternatives and their locations.
5.1.2. Disruption Time
Disruption time is the difference between the failure detection time and the completion of recovery. Figure 5 shows the disruption time versus the percentage of MPs that have left the network. The plot clearly shows that the disruption time increases with the percentage of departed MPs. Increasing the number of departed MPs decreases the available alternatives, which forces the algorithm to search over longer hop distances and negatively affects the amount of time needed for the discovery process.
5.1.3. Success Rate
Success rate measures the accuracy of the proposed approach. It is defined as the number of SSONs that have been recovered correctly divided by the number of SSONs that have a permanent fault; a high success rate indicates better performance. Figure 6 presents the success rate with respect to the percentage of MPs that left the network. As the number of departed MPs increases, the success rate decreases: the number of alternative MPs available for the replacement operation shrinks as more MPs leave, so in some cases it is difficult to find resources. However, this decrease is not significant.
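Written as a formula, the definition above reads:

$$\mathrm{success\;rate}=\frac{\#\{\mathrm{SSONs\;recovered\;correctly}\}}{\#\{\mathrm{SSONs\;with\;a\;permanent\;fault}\}}$$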
5.1.4. Stretch
Stretch is defined as the number of hops taken by an overlay packet divided by the number of hops the packet takes on the IP-layer path between the same source and destination. In Figure 7 the stretch is plotted against the service density, where the dotted line represents the stretch of the SSON after a fault occurred and the continuous line represents the stretch of the SSON before the fault. It shows that the stretch after healing the SSON is greater than the stretch before healing and that it increases with the number of departed MPs. This is basically due to the fact that if an MP leaves the SSON, the algorithm starts looking for alternative solutions because each MP has a unique type of service. Therefore, the
Figure 5. Disruption time
algorithm may use two or more MPs instead of the departed one.
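In formula form, the definition above reads:

$$\mathrm{stretch}=\frac{\mathrm{hops\;on\;the\;overlay\;path}}{\mathrm{hops\;on\;the\;direct\;IP\mbox{-}layer\;path}}$$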
5.2. Fault Case Experiment
In this experiment, the number of MPs is fixed at 650 and 95 SSONs were created; every SSON delivers a different service, and the MPs that participate in one SSON cannot be used in another SSON. The number of MPs used to construct an SSON was in the range 1-4. In our simulation, only one MP in each created SSON can become faulty. These hard simulation settings were intended to measure the behavior of the proposed algorithm in the worst case; good results indicate an efficient algorithm.
5.2.1. Network Load
As shown in Figure 8, the network load is plotted against the service density (faulted SSONs). The number of generated messages increases as the number of faulted SSONs increases. This is due to the fact that each MP service exists only once in the network; once a certain MP becomes faulty, the algorithm tries to discover an
Figure 6. Success rate
Figure 7. Stretch
alternative solution rather than a replacement. In addition, the restriction that an MP may be used only once by any SSON in the network increases the number of messages the algorithm generates while searching for an alternative solution. The slight decrease in the curve from 30 to 45 services is due to the type of the service, its location, and the faulted MP. Each service showed a different behavior while looking for a solution, and some services find the solution faster with a lower number of messages; however, considering the above restrictions, the number of messages is not high.
5.2.2. Disruption Time
Figure 9 shows the disruption time with respect to the service density. The amount of time required for the SSON to restore its normal operation after failure detection varies for different service densities. This variation is primarily due to the random distribution of MPs across the network at different hop distances and to the restricted number of available alternative MPs. MPs at the longest distance (a large number of hops) require more time than MPs at the smallest distance. The average disruption time is 2388 ms.
Figure 8. Network load
Figure 9. Disruption time
5.2.3. Discovery Distance
Discovery distance is defined as the number of hops needed to reach the appropriate MP in the network. It indicates how far the search process travels in the network to find a solution. Figure 10 shows that the number of hops increases as the number of faulted SSONs increases. This is because a large number of faulted SSONs requires a large number of MPs to recover and repair them, which forces the search to travel further to find the appropriate MPs; however, the average number of hops is 9.
5.2.4. Average Number of MPs
The average number of MPs in an SSON is defined as the number of MPs used to construct the SSON. As shown in Figure 11, the average number of MPs increased after applying our scheme. This is because each MP service exists only once in the network: when an MP becomes faulty, the solution will most likely consist of multiple MPs chained together to reproduce the faulted service.
Figure 10. Discovery distance
Figure 11. Average number of MPs in SSON
6. CONCLUSION
Management of overlay networks is becoming more and more difficult, as these networks are exposed to different kinds of faults such as nodes joining and leaving, network congestion, and node and link failures. Therefore, a self-healing ability is required; it aims at helping systems autonomously control themselves and is essential to ensure network coverage and continued network functionality. This paper reviewed different design trends for self-healing overlay networks and proposed a self-healing system architecture for overlay networks designed specifically for applications that involve multimedia delivery services and for meeting individual users' requirements. Extensive simulations were conducted to test the validity of the proposed algorithms; the results show the efficiency and validity of our approach in terms of disruption time and its effectiveness in terms of success rate and network load. Detailed design factors, distributed algorithms, and the implementation of the proposed system are planned as future work.
REFERENCES
Al-Oqily, I., & Karmouch, A. (2007). Policy-based
context-aware overlay networks. In Proceedings of
the Global Information Infrastructure Symposium
(pp. 85-92).
Al-Oqily, I., Karmouch, A., & Glitho, R. (2008). An
architecture for multimedia delivery over service
specific overlay networks. In Proceedings of the
International Federation for Information Processing
Conference on Wireless Sensor and Actor Networks
II (Vol. 264).
Al-Oqily, I., Subaih, B., Bani-Mohammad, S., &
Alshaer, J. J. (2012, March). A survey for self-healing
architectures and algorithms. In Proceedings of the
9th International Multi-Conference on Systems,
Signals and Devices (pp. 1-5, 20-23).
Baud, L., & Bellot, B. (2009). ROSA: A step towards a global virtual network. In Proceedings of the Ultra Modern Telecommunications & Workshops (pp. 1-6).
Capone, A., Elias, J., & Martignon, F. (2008). Optimal design of service overlay networks. In Proceedings of the 4th International Telecommunication Networking Workshop on QoS in Multiservice IP Networks (pp. 46-52).
IBM Corporation. (2006). An architectural blueprint for autonomic computing (4th ed.). Armonk, NY: IBM.
Dashofy, E. M., van der Hoek, A., & Taylor, R. N. (2002). Towards architecture-based self-healing systems. In Proceedings of the First Workshop on Self-Healing Systems (pp. 21-26).
Dingo, J., Balasingham, I., & Bouvry, P. (2009). Management of overlay networks: A survey. In Proceedings of the 3rd International Conference on Mobile Ubiquitous Computing, Systems, Services and Technologies (pp. 249-255).
Doval, D., & O'Mahony, D. (2003). Overlay networks: A scalable alternative for P2P. IEEE Internet Computing, 7, 79-82. doi:10.1109/MIC.2003.1215663
Elaoud, M., McAuley, A., Kim, G., & Chennikara-
Varghese, J. (2005). Self-Initiated and Self-Main-
tained Overlay Networks (SIMON) for enhancing
military network capabilities. In Proceedings of the
Military Communications Conference (Vol. 2, pp.
1147-1151).
Garlan, D., & Schmerl, B. (2002). Model-based
adaptation for self-healing systems. In Proceedings
of the ACM SIGSOFT Workshop on Self-healing
Systems (pp. 27-32).
Jannotti, J., Gifford, D. K., Johnson, K. L., Kaashoek, M. F., & O'Toole, J. W. (2000). Overcast: Reliable multicasting with an overlay network. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (pp. 197-212).
Kephart, J. O., & Chess, D. M. (2003). The vision of autonomic computing. IEEE Computer Magazine, 36, 41-50.
Mikic-Rakic, M., Mehta, N., & Medvidovic, N.
(2002). Architecture style requirements for self-
healing systems. In Proceedings of the ACM First
Workshop on Self-healing Systems (pp. 49-54).
Neti, S., & Müller, H. A. (2007). Quality criteria and an analysis framework for self-healing systems. In Proceedings of the IEEE International Workshop on Software Engineering for Adaptive and Self-Managing Systems (p. 6).
Parashar, M., & Hariri, S. (2005). Autonomic computing: An overview. In J.-P. Banâtre, P. Fradet, J.-L. Giavitto, & O. Michel (Eds.), Proceedings of the International Workshop on Unconventional Programming Paradigms (LNCS 3566, p. 97).
Porter, B., Coulson, G., & Hughes, D. (2006). Intelligent dependability services for overlay networks. In F. Eliassen & A. Montresor (Eds.), Proceedings of the 6th International IFIP Workshop on Distributed Applications and Interoperable Systems (LNCS 4025, pp. 199-212).
Porter, B., Coulson, G., & Taïani, F. (2006). A generic self-repair approach for overlays. In Proceedings of the Internet System Workshop (pp. 1490-1499).
Porter, B., Taïani, F., & Coulson, G. (2006). Generalized repair for overlay networks. In Proceedings of the 25th IEEE Symposium on Reliable Distributed Systems (pp. 132-142).
Schmid, S., Hartung, F., & Kampmann, M. (2005).
SMART: Intelligent multimedia routing and adapta-
tion based on service specific overlay networks. In
Proceedings of the Eurescom Summit, Heidelberg,
Germany (pp. 69-77).
Stoica, I., Morris, R., Karger, D., Kaashoek, M. F., & Balakrishnan, H. (2001). Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (pp. 149-160).
New Entropy Based Distance for Training Set Selection in Debt Portfolio Valuation

Tomasz Kajdanowicz, Wroclaw University of Technology, Poland
Slawomir Plamowski, Wroclaw University of Technology, Poland
Przemyslaw Kazienko, Wroclaw University of Technology, Poland

DOI: 10.4018/jitwe.2012040105

ABSTRACT
Choosing a proper training set for machine learning tasks is of great importance in complex domain problems. In this paper a new distance measure for training set selection is presented and thoroughly discussed. The distance between two datasets is computed using the variance of entropy in groups obtained after clustering. The approach is validated using real domain datasets from the debt portfolio valuation process. Eventually, prediction performance is examined.

Keywords: Dataset Selection, Debt Valuation, Distance Measures, Intelligent Systems, Prediction Methods, Supervised Learning

INTRODUCTION
The task of supervised learning is based on the assumption that one is able to provide proper training data, which will be used for the further generalization and inference process. The quality of the data directly affects the performance of prediction and classification algorithms. Training data consist of a set of training examples, each composed of a pair of input features X and a desired output value Y.
The main aim is to analyze the training data and produce an inference function f that maps input to output, f: X → Y. The output can be twofold: if it is discrete, the function is called a classifier; in case of continuous output, it is called a regression. If the function maps to an interrelated set of more than one value, it is a structured prediction or structured output learning algorithm. All in all, the inferred function should be able to predict the correct output value for any valid input object. This requires the learning algorithm to be able to generalize based on the training data. Consequently, the quality of the data is of great importance, and the training set for a learning algorithm should be selected carefully in order to obtain an optimal dataset. The most straightforward and clear situation arises when learning concerns data from a particular domain and always describes the
same stationary object. In such a case, the properties of the data and the statistical dependencies between examples remain unchanged, and training may be performed using the same source of training and testing data. Such data, as long as they are of appropriate size, may deliver satisfactory generalization abilities. In order to generalize from data describing non-stationary objects, learning algorithms are expected to model the concept drift phenomenon (Kurlej & Woźniak, 2011), identified by changes in the data probability distributions. As concept drift may be caused by changes of the prior, conditional, or posterior probabilities of the data, appropriate methods must be incorporated to address the problem.
Another situation occurs when generalization needs to be performed for objects for which training data is not available or hardly accessible. In such a case, learning is performed using data describing other, similar objects. Examples of such situations are across-network classification, where learning performed on one network adjusts models used for generalization on another network (Lu & Getoor, 2003), and debt portfolio value prediction, where the appraisal of a particular portfolio is done using other, similar portfolios (Kajdanowicz & Kazienko, 2009).
The paper considers the latter problem of training set selection in the task of predicting future debt recovery. Intuitively, the greater the similarity (the smaller the distance) between the objects used in learning and those the inference is applied to, the better the performance of the inference methods. Similarity/distance identification between training and testing objects can be reduced to similarity/distance measurement between the datasets describing their input features, namely between X_train and X_test. Similarity and distance can be invoked interchangeably, as similarity can be measured by distance: two objects are similar if the distance is close to zero, and growing distance results in higher dissimilarity.
In general, distance is defined as a quantitative degree of how far apart two objects are (Cha, 2007). The choice of distance measure depends on the representation of the objects and the type of measurement. In supervised learning tasks, datasets are usually represented by matrices in which columns denote attributes and rows denote object instances. A single cell of such a matrix contains the value of a particular attribute for a given instance. Hence, the problem of training set selection based on measuring the distance between two datasets X_train and X_test is actually a matrix distance based selection.
Altogether, a training dataset can be obtained by performing one of the following scenarios (Figure 1):

• Example selection from available historical datasets, based on the distance between particular testing and training examples (e.g., Cano, Herrera, & Lozano, 2003; Son & Kim, 2006).
• Training set selection from a set of available historical datasets, based on the distance between particular testing and training sets.
This paper concerns the latter approach and introduces a method for training dataset selection. The method is designed to improve inference results using an entropy-based distance between the probability density functions of the training and testing datasets.
The rest of the paper is organized as follows. In section Related Work, various approaches and distance measures that may be utilized for training set selection are enumerated. In order to provide a better perspective on the problem, section Debt Portfolio Value Prediction presents a real-world training set selection problem in debt portfolio value prediction. In section Training Set Selection Using Entropy Based Distance, a new approach to measuring distance for the comparison and selection of training datasets is described. The evaluation of the proposed method's impact on prediction accuracy is discussed in section Experiments and Results. Finally, section Conclusion summarizes the work.
RELATED WORK
Training set selection from a set of historical datasets may be conducted using two distinct approaches. Both of them, however, are based on measuring the distance between the training and testing sets. The first approach relies on the comparison of matrices of non-equal size. The second is based on calculating a measure of goodness of fit between probability density functions. While the probability density of a discrete random variable is calculated with respect to the counting measure over the sample space, the density of a continuous random variable is given by the integral of that variable's density. This may cause problems, as the exact density is not known and only an empirical one can be obtained. In the literature, either the estimation of the probability density function (Rencher, 2002) or the consideration of a discrete, finite histogram of the random variable (Cha, 2007; Toussaint, 1974) has been proposed. The histogram can be considered as a vector, i.e., coordinates in some space, and the numerous distances proposed in the literature can be applied to compare two densities.
There exists a substantial number of distance measures derived from various fields such as computer science, information theory, mathematics, physics, and statistics, used for specialized domain tasks, e.g., the standard Euclidean distance and the Kullback-Leibler distance, to name just a few (Kajdanowicz, Plamowski, & Kazienko, 2012). For v and w, vector versions of the probability density functions of the matrices V and W, with the length of both vectors equal to d, the Euclidean and Kullback-Leibler distances are defined as in equations (1) and (2), respectively.
$$\mathrm{dist}(v,w)=\sqrt{\sum_{i=1}^{d}\left(v_i-w_i\right)^{2}} \qquad (1)$$
In general, the Euclidean distance measures the shortest distance between two points as the length of a line and belongs to the L_p (Minkowski) family of distance measures. Applying Shannon's concept of probabilistic uncertainty (entropy) to the Kullback-Leibler distance yields relative entropy, also called information deviation (Deza & Deza, 2006); see equation (2).

Figure 1. Training set construction for supervised learning using the closest example selection approach and the closest dataset selection approach
$$\mathrm{dist}(v,w)=\sum_{i=1}^{d} v_i \ln\frac{v_i}{w_i} \qquad (2)$$
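Both distances can be computed directly from the histogram vectors. The following is a small illustrative sketch, not code from the paper; it assumes v and w are normalized histograms of equal length with strictly positive entries, since the Kullback-Leibler term is undefined otherwise.

import math

def euclidean(v, w):
    # equation (1): square root of the summed squared coordinate differences
    return math.sqrt(sum((vi - wi) ** 2 for vi, wi in zip(v, w)))

def kullback_leibler(v, w):
    # equation (2): relative entropy of v with respect to w;
    # assumes all entries are positive and each vector sums to 1
    return sum(vi * math.log(vi / wi) for vi, wi in zip(v, w))

# usage: two toy histograms over d = 3 bins
v = [0.2, 0.5, 0.3]
w = [0.3, 0.4, 0.3]
print(euclidean(v, w), kullback_leibler(v, w))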
Obviously, the distance measures presented here are only examples. The proper choice of a representative distance measure depends on the type of measured data and the measurement itself. As the measure affects the inference properties of the classification system, the choice must be based on thorough analysis. For a further list of distance measures, refer to Cha (2007) and Ullah (1993).
The approach of calculating distances between vector versions of probability density functions tends to be reasonable, but it requires estimation of the probability density function, which can sometimes be troublesome.
Apart from the aforementioned approaches, the distance between two datasets might be computed by means of another concept: distance matrices. This, however, is limited to situations in which both datasets are of the same size (number of rows). Additionally, the mapping bijection that states the clear relation between corresponding data examples must be known. As the sizes of compared distinct datasets may differ and the mapping between data examples is not known, it is not always possible to compute distance matrices, which is a drawback of the approach that must be addressed. One of the solutions is the incorporation of matrix norms (Meyer, 2000). The norms are defined in terms of well-known vector norms (Zhou, Doyle, & Glover, 1996). There exists a variety of straightforward norms that can be utilized: for a given matrix A, the matrix 1-norm returns the maximum of A's absolute column sums, the matrix ∞-norm returns the maximum of A's absolute row sums, and the matrix 2-norm returns the square root of the largest eigenvalue of AᵀA.
The aforementioned norms may provide unsatisfactory results due to their simplicity. In order to describe a matrix more fully, more sophisticated solutions need to be applied (Meyer, 2000). One of them is the Frobenius norm, the square root of the sum of the squares of the Euclidean norms of the matrix columns (Meyer, 2000); it is thus able to model the variability of the data. Investigating the literature, it needs to be emphasized that norms are not a perfect solution for modeling distances between matrices.
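These norms are readily available in numpy; the following short sketch, with an arbitrary example matrix, illustrates the four norms discussed above.

import numpy as np

A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

one_norm  = np.linalg.norm(A, 1)        # maximum absolute column sum
inf_norm  = np.linalg.norm(A, np.inf)   # maximum absolute row sum
two_norm  = np.linalg.norm(A, 2)        # sqrt of the largest eigenvalue of A^T A
frobenius = np.linalg.norm(A, 'fro')    # sqrt of the sum of squared entries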
On the other hand, transfer learning (TL) is a concept that provides a learning system with the additional ability to recognize and utilize knowledge learned from previous prediction tasks in new tasks (Pan & Yang, 2010). Transfer learning aims to extract knowledge from several source tasks and apply it to a target task. In contrast to the aforementioned learning based on similarity or distance measures, instead of learning models individually on selected datasets, TL applies generalized knowledge of all known tasks obtained in a single learning run (Kajdanowicz, Plamowski, Kazienko, & Indyk, 2012).
Summarizing, to avoid the situation in which a vector version of the probability distribution function is required in order to calculate distances between two datasets, a notion of an entropy-based distance measure is introduced in the following sections. The distances are then used to select datasets for training prediction algorithms. The selection is based on two distinct approaches: in the first, we use a distance ranking to select the several closest datasets; in the second, the transfer learning approach is utilized to apply generalized knowledge in the prediction task.
DEBT PORTFOLIO VALUE PREDICTION

Determining the value of debt portfolios and choosing those with the greatest revenue potential is of great importance for debt traders. Economically crucial decisions are based on the amount of possible repayment of liabilities. As traders (both buyers and sellers) apply distinct collection processes, the amounts of receivables obtained may differ. This creates the room for trading and for establishing a transaction price, and it makes debt portfolio assessment a complex task. As far as machine learning is concerned, this problem may be understood as a prediction task that assesses the possible repayment value of all debt cases belonging to a particular portfolio. The repayment is calculated based on historical debt data.
The most common routine of debt portfolio trade starts when a seller, usually a bank, a telecommunication company, etc., offers a set of debts, called a debt package or portfolio, expecting purchase proposals from buyers. Purchasers, usually specialized debt recovery entities, offer prices, and the most suitable offer is chosen; Figure 2 shows the process of debt portfolio purchase with utilization of a prediction model. The price proposed by a particular buyer may be obtained in a variety of ways, among which the utilization of historical debt recovery data to build a prediction model seems the most reasonable. Such a model provides an estimation of the possible return from the package.
In the considered situation, the valuation of a debt portfolio is based on data about historical claims with their repayment profiles over time. A debt collection company usually assumes that the gathered repayment data reflects all important dependencies influencing repayment results, such as recovery procedures, cash flow plans, and other external conditions. This assumption simplifies the problem, as changes in the probabilities caused by an evolving business environment are ignored. The model trained on historical data is applied to predict the repayment amount of the offered portfolio. Based on the obtained results, bids are offered to the seller.
Summarizing, the most significant and sensitive part of debt trade is the repayment value prediction process. The accuracy of prediction for an offered portfolio relies mainly on the model's generalization capabilities and the quality of the training data. As it is very difficult to perform prediction using the whole, large body of historical data, some training dataset selection mechanism needs to be employed. In the further part of the paper we present the method for training set selection for model learning that is applied to the considered business scenario.
TRAINING SET SELECTION USING ENTROPY BASED DISTANCE

As previously stated, a training set can be treated as a matrix. Therefore the problem of training set selection is equivalent
Figure 2. The process of debt portfolio purchase with utilization of prediction model
to matrix selection using some notion of distance. Assuming that the environment remains stationary, generalization can be done based on historical datasets from the same domain, which describe similar objects with the same attributes. In the aforementioned debt prediction problem, the historical datasets consist of debt portfolios that have already been repaid. They are used to predict the repayment value of an unknown, new portfolio. The learning could be done using all historical datasets, but from the practical point of view this would not always be possible (e.g., massive training data) or of high quality (poor inference from complex and non-discriminative data). Therefore some method for training set selection needs to be applied. Here we propose a method based on the distance between entire datasets. Such treatment of training set selection can provide additional latent information on the origin of the data that is not visible when considering each data example individually.
The proposed method of training set selection is based on the assumption that there exists a set T of k training sets A_i, i ∈ {1, …, k}, k ∈ ℕ. The actual task is to select a single or several best training sets A_i from T in order to train an inference model and, using it, to predict the unknown output values of a dataset called the testing set B.
The method calculates the distances between the test set B and all training sets A_i ∈ T, namely dist(B, A_i). Based on these distances, a ranking of the shortest distances between B and all A_i sets is created. From this point, two routines are considered. The first is based on selecting a few of the closest datasets. The selection of the closest datasets may be performed in numerous ways, e.g., select the first one, select the top three, select the top five, etc. The pseudo code of the method is presented in the Algorithm 1 listing.
On the other hand, the transfer learning approach may be utilized. The method of learning based on similarity utilizes the closest A_i sets for the training procedure, whereas transfer learning generalizes all datasets and is then able to transfer knowledge to new tasks; see the differences between Algorithms 1 and 2. Each dataset is generalized by one learning model, and in order to balance the knowledge from multiple tasks, a weights vector is calculated: the distance dist(B, A_i) between each considered training set and the test set B is divided by the sum of all such distances, see equation (3). The trained predictors are then used to obtain results on the testing set B (Algorithm 2).
$$\mathrm{weights}_i=\frac{\mathrm{dist}(A_i,B)}{\sum_{k=1}^{M}\mathrm{dist}(A_k,B)} \qquad (3)$$
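Equation (3) is a plain normalization of the distance values; the following is a minimal sketch with a hypothetical helper name, assuming the distances dist(B, A_i) have already been computed.

def weights_from_distances(distances):
    """Normalize dist(A_i, B) values into a weights vector, following equation (3) as printed."""
    total = sum(distances)
    return [d / total for d in distances]

# usage with three hypothetical training sets
dists = [0.8, 0.1, 0.3]                 # dist(B, A_i) for i = 1..3
print(weights_from_distances(dists))    # the resulting weights sum to 1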
In the proposed method, a separate prediction algorithm is used for each training set A_i. Afterwards, we use these predictors to infer targets on the test set B. The results of inference from the different tasks are weighted by the aforementioned weights vector.
The proposed method allows calculation of a distance between two datasets according to any scheme for measuring distance between matrices. However, we propose a new distance measure for the distance calculation between test and training datasets. For the distance between two datasets B and A_i, it incorporates a clustering
Algorithm 1. The pseudo code of the learning phase of the method for training set selection based on similarity

Require: set T of k training sets A_i, i ∈ {1, …, k}, k ∈ ℕ; testing set B
1. for all training sets A_i ∈ T do
       calculate dist(B, A_i)
2. end for
3. build a distance ranking
4. select several training dataset(s) using the ranking
5. build a model on the selected dataset(s)
6. return inferred targets for B
method applied to the joint dataset B ∪ A_i. This is done in order to obtain a number of groups of similar data examples. Without loss of generality, it is assumed that each obtained group contains data examples originating from a similar probability distribution. Let M = {M_1, …, M_n}, n ∈ ℕ, be the set of groups indicated by the clustering method; any non-random grouping algorithm is acceptable. It can be noticed that each group may contain a different number of examples from the two input datasets B and A_i. However, if all groups gather data examples from a similar distribution, one can assume that the datasets are similar; in other words, the distance between them is relatively small. To quantify this, the method calculates, for each group, the entropy of the share of data examples coming from B and A_i, as follows:
$$S_l=-\sum_{x_j\in M_l} p(x_j)\log_2 p(x_j)$$
Having computed the entropy values for all groups, the method calculates the variance of the results, which is then used as the distance measure. Employing the variance brings some important properties: the desired variance of 0 denotes that all groups obtained after the clustering process share the same properties of the probability density distribution. This means that in every group there is the same probability of observing examples from both datasets and, consequently, the datasets are similar. The entropy-based distance calculation method is presented in the Algorithm 3 listing.
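A compact sketch of this entropy-variance distance is given below. It uses k-means as the (non-random) clustering step; the choice of k-means, the number of clusters, and the use of scikit-learn are illustrative assumptions, since the text allows any non-random grouping algorithm.

import numpy as np
from sklearn.cluster import KMeans

def entropy_distance(A, B, n_clusters=5):
    """Variance of the per-group entropy of the A/B membership share."""
    M = np.vstack([A, B])                            # joint dataset A U B
    source = np.array([0] * len(A) + [1] * len(B))   # 0 = example from A, 1 = from B
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(M)

    entropies = []
    for l in range(n_clusters):
        members = source[labels == l]
        if len(members) == 0:
            continue
        p = members.mean()                           # share of B examples in group l
        s = 0.0
        for q in (p, 1.0 - p):                       # S_l = -sum p(x) log2 p(x)
            if q > 0:
                s -= q * np.log2(q)
        entropies.append(s)
    return float(np.var(entropies))                  # dist(A, B) = var(S_1, ..., S_n)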
EXPERIMENTS AND RESULTS
The main objective of the performed experiments was to test and evaluate the proposed method in several scenarios using three basic regression algorithms: ridge, lasso, and partial least squares (Theodoridis & Koutroumbas, 2009). Mean-squared error was used as the standard performance measure.
Experiments were carried out on fifteen distinct, real datasets from the same application domain of debt portfolio pattern recognition (Kajdanowicz & Kazienko, 2009). The datasets represent the problem of aggregated prediction of sequential repayment values over time for a set of claims.
The experimental procedure works as follows: suppose there exists a debt portfolio B of unknown output, for which prediction needs to be conducted in order to determine the value of possible repayment. From among all known portfolios, those indicated by the proposed training set selection method are selected; the number of selected portfolios is described later in this section. Using the selected packages, the regression algorithms are trained, and eventually the tests for portfolio B are performed.
Based on the described procedure, seven distinct scenarios are created. They vary in the number of portfolios selected for training. In the first scenario the closest package is selected; in the second, the package from the middle of the ranking; in the third, the furthest package according to the distance measure (this package is used just for confirmation). The fourth scenario uses
Algorithm 2. The pseudo code of the learning phase of the transfer learning approach

Require: set T of k training sets A_i, i ∈ {1, …, k}, k ∈ ℕ; testing set B
1. for all training sets A_i ∈ T do
       calculate dist(B, A_i)
2. end for
3. build a distance ranking
4. create a weights vector based on the ranking
5. for all training sets A_i ∈ T do
       learn the model
6. end for
7. return inferred targets for B, using the weights vector to balance knowledge
8. return the selected training dataset(s)
the three closest packages; the fifth, the five closest ones. The sixth scenario uses all known packages at once. The last scenario utilizes transfer learning. From this point on, these scenarios are denoted C (closest package), M (middle one), F (furthest one), C3 (three closest ones), C5 (five closest ones), A (all packages), and TL (transfer learning), respectively. For the examined scenarios the Friedman test was run; the results are presented in Table 1.
For each prediction algorithm a statistical ranking is created to indicate which approach is the best (Table 1). We incorporated the Friedman statistical test as an intuitive and convenient procedure for comparing the different approaches. The mean rank position for each combination of method and scenario is shown in parentheses. The lower the rank value, the lower the mean-squared error yielded by the prediction process.
Furthermore, Table 2 presents the results of the Wilcoxon test for pairwise comparison of the different scenarios. Acceptance of the null hypothesis, stating that there is no statistically significant difference between a pair of compared scenarios, is marked with "=". On the other hand, "+" and "-" mark rejection of the null hypothesis: intuitively, "+" denotes that the scenario in the row performed better than the one in the column, and "-" vice versa. The "/" sign separates the results for each of the prediction algorithms: ridge, lasso, and partial least squares regression. We set the significance level for rejection of the null hypothesis to 5%.
The results in Table 1 can be read as follows: for each regression algorithm used in the experiments (ridge, lasso, and partial least squares), selecting the five closest datasets for training, denoted by C5, results in the smallest mean squared error. The Friedman test places this way of selecting training sets first in the ranking, which means that across all conducted experiments, using C5 is optimal. According to this interpretation, selecting the three closest training sets, denoted by C3, is slightly worse. Moreover, selecting the five closest packages yields a smaller error rate than using all other known
Algorithm 3. The pseudo code of distance calculation between two datasets based on entropy variance

Require: training set A, testing set B
1. construct a multiset M = A ∪ B
2. cluster M into n groups
3. calculate the entropy S_i for each group M_i, i ∈ {1, …, n}
4. calculate dist(A, B) = var(S_1, …, S_n)
5. return dist(A, B)
Table 1. Average rank positions determined in Friedman test for three distinct prediction algorithms
Rank Ridge Lasso PLS
1 C5 (2.27) C5 (2.40) C5 (2.13)
2 C3 (2.47) C3 (2.47) C3 (2.33)
3 TL (2.66) A (2.67) TL (2.67)
4 A (4.67) TL (4.40) A (4.80)
5 C (4.93) C (4.67) C (5.00)
6 M (5.33) M (5.47) M (5.33)
7 F (5.87) F (5.93) F (5.73)
packages, which confirms that the entropy-based distance is useful in reducing the amount of data needed for training and can be used successfully with noisy data in domains such as debt portfolio valuation.
As shown by the Friedman test, using multiple closest datasets (C3, C5) for training results in better performance than using just the one closest dataset (C) or just one dataset from the middle position in the ranking (M). Obviously, the results confirm that selecting the furthest (F) package generates the biggest error. The transfer learning approach (TL) is slightly worse than selecting multiple closest datasets (C3, C5); however, for one of the regression algorithms (lasso regression), selecting all available datasets for training (A) outperforms transfer learning. As can be noticed in the mean ranks, the approaches in the first three positions are far better than the remaining ones.
The results have been confirmed by the pairwise Wilcoxon test. As shown in Table 2, whenever there exists a statistical difference between approaches, using multiple closest datasets is far better than using any other approach, including transfer learning. What is more, the results show that for two of the three regression algorithms there is no significant statistical difference between approaches C, M, F, and TL, while significant differences exist between approaches C3, C5, A, and TL. Interestingly, transfer learning (TL) appears in both of the aforementioned groups, which indicates that this routine shares similar probability properties with the group of methods based on selecting just one training dataset and with the group of methods based on selecting multiple training sets.
The results obtained during the experiments indicate that the proposed entropy-based distance measures the difference between datasets correctly and yields training sets that can be used to train a variety of prediction algorithms fitted to a specific domain.
CONCLUSION
In this paper, the problem of training set selection for machine learning tasks has been discussed. A new method for selecting datasets for the training phase has been proposed and introduced. The method is based on comparing the variance of entropy in groups obtained by clustering the joint datasets. Such a routine has a significant advantage, as it does not require vectorization of the probability density functions of the datasets in order to compare them.
During the experiments, the method confirmed its usefulness and effectiveness in selecting proper training sets for the real domain problem of debt value prediction. The method's performance was compared with several other straightforward approaches and yielded decent results.
Further research in the field of training set selection will consider comparison of the method with other sophisticated algorithms that use vector versions of probability distribution functions. Subsequent experiments should focus on revealing other properties of the method and
Table 2. Results of the Wilcoxon signed rank test for the ridge / lasso / PLS regression algorithms; "=" marks acceptance of the null hypothesis, "+" and "-" its rejection

     C      M      F      C3     C5     A      TL
C    ·      =/=/=  =/=/+  -/-/-  -/-/-  -/-/-  =/=/=
M    =/=/=  ·      =/=/+  -/-/-  -/-/-  -/-/-  =/=/=
F    =/=/-  =/=/-  ·      -/-/-  -/-/-  -/-/-  =/=/=
C3   +/+/+  +/+/+  +/+/+  ·      =/=/=  =/=/=  +/+/+
C5   +/+/+  +/+/+  +/+/+  =/=/=  ·      =/=/=  +/+/+
A    +/+/+  +/+/+  +/+/+  =/=/=  =/=/=  ·      -/+/-
TL   =/=/=  =/=/=  =/=/=  -/-/-  -/-/-  +/-/+  ·
its usefulness for other specialized domain problems. Additionally, the method presented in this paper needs further attention: it will be verified whether the metric used in the clustering process influences the performance of the method and, eventually, whether it varies the goodness of fit between datasets.
ACKNOWLEDGMENT
This work was partially supported by the Polish Ministry of Science and Higher Education (research projects 2011-2012 and 2011-2014) and by a Fellowship co-financed by the European Union within the European Social Fund.
REFERENCES
Cano, J. R., Herrera, F., & Lozano, M. (2003). Us-
ing evolutionary algorithms as instance selection
for data reduction in KDD: An experimental study.
IEEE Transactions on Evolutionary Computation,
7(6), 561-575. doi:10.1109/TEVC.2003.819265
Cha, S. H. (2007). Comprehensive survey on distance/
similarity measures between probability density
functions. International Journal of Mathematical
Models and Methods in Applied Sciences, 300-307.
Deza, E., & Deza, M. M. (2006). Dictionary of
distances. Amsterdam, The Netherlands: Elsevier.
Kajdanowicz, T., & Kazienko, P. (2009). Prediction
of sequential values for debt recovery. In E. Bayro-
Corrochano & J.-O. Eklundh (Eds.), Proceedings
of the 14th Iberoamerican Congress on Pattern
Recognition (LNCS 5856, pp. 337-344).
Kajdanowicz, T., Plamowski, S., & Kazienko, P.
(2012). Distance measures in training set selection
for debt value prediction. In M. K. Kundu, S. Mitra,
D. Mazumdar, & S. K. Pal (Eds.), Proceedings of
the First Indo-Japan Conference on Perception and
Machine Intelligence (LNCS 7143, pp. 219-226).
Kajdanowicz, T., Plamowski, S., Kazienko, P., & Indyk, W. (2012). Transfer learning approach to debt portfolio appraisal. In E. Corchado, V. Snášel, A. Abraham, M. Woźniak, M. Graña, & S.-B. Cho (Eds.), Proceedings of the 7th International Conference on Hybrid Artificial Intelligence Systems (LNCS 7209, pp. 46-55).
Kurlej, B., & Woźniak, M. (2011). Active learning approach to concept drift problem. Logic Journal of the IGPL. doi:10.1093/jigpal/jzr011
Lu, Q., & Getoor, L. (2003). Link-based classification
using labeled and unlabeled data. In Proceedings of
the Workshop on the Continuum from Labeled to Un-
labeled Data in Machine Learning and Data Mining.
Meyer, C. D. (2000). Matrix analysis and ap-
plied linear algebra. Philadelphia, PA: Soci-
ety for Industrial and Applied Mathematics.
doi:10.1137/1.9780898719512
Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359. doi:10.1109/TKDE.2009.191
Rencher, A. (2002). Methods of multivariate
analysis. New York, NY: John Wiley & Sons.
doi:10.1002/0471271357
Son, S., & Kim, J. (2006). Data reduction for instance-based learning using entropy-based partitioning. In M. Gavrilova, O. Gervasi, V. Kumar, C. J. K. Tan, D. Taniar, A. Laganà, Y. Mun, & H. Choo (Eds.), Proceedings of the International Conference on Computational Science and its Applications (LNCS 3982, pp. 590-599).
Theodoridis, S., & Koutroumbas, K. (2009). Pattern recognition. Amsterdam, The Netherlands: Elsevier.
Toussaint, G. T. (1974). Bibliography on estimation of misclassification. IEEE Transactions on Information Theory, 20(4), 472-479. doi:10.1109/TIT.1974.1055260
Ullah, A. (1993). Entropy, divergence and distance
measures with econometric applications. Riverside,
CA: Department of Economics, University of Califor-
nia - Riverside. doi:10.1016/0378-3758(95)00034-8
Zhou, K., Doyle, J. C., & Glover, K. (1996). Robust and optimal control. Upper Saddle River, NJ: Prentice Hall.