1 PG Scholar, Department of Computer Science and Engineering, CMRCET College of Engineering and Technology, Hyderabad, A.P., India
2 Associate Professor, Department of Computer Science and Engineering, CMRCET College of Engineering and Technology, Hyderabad, A.P., India
Abstract: Record matching, which identifies records that refer to the same real-world entity, plays a key role in data integration. Most state-of-the-art methods share a major drawback: they are supervised, requiring the user to provide training data in advance. Such methods are not applicable to Web databases, where the records to match come from a query stream and the results are generated dynamically, on the fly. These records are query-dependent, so a matcher trained on examples drawn from previous query results may fail on the results of a new query. To address the record matching problem for Web databases, we present UDD, an unsupervised online record matching method that, for a given query, can effectively identify duplicates among the query results returned by multiple Web databases. After removal of same-source duplicates, the "presumed" nonduplicate records from the same source are used as training examples, which relieves users of the burden of manually labeling training data. Starting from this nonduplicate set, we use two cooperating classifiers, a weighted component similarity summing (WCSS) classifier and an SVM classifier, to iteratively identify duplicates in the query results from multiple Web databases. Experiments show that UDD works well in the Web database scenario, where existing supervised methods are not applicable.

I INTRODUCTION

Today's database systems work to answer user queries efficiently and with relevant results, and Web pages are generated dynamically in response to those queries. The Web databases behind such pages have a complex structure, with links that interrelate them with other databases or with external web pages.
These databases contain a large amount of high-quality data, and that data grows at a much faster rate than the static Web. Most Web databases respond only to user queries: after a user submits a query through the interface, the Web database connects to its back end and returns the matching records. To build a system that helps users integrate and, more importantly, compare the query results returned from multiple Web databases, a crucial task is to match the records from different sources that refer to the same real-world entity. As an illustration, Fig. 1 shows some of the results returned by two online bookstores, abebooks.com and booksamillion.com, in response to the same query, "Harry Potter", over the Title field. Record 3 in Fig. 1a and the third record in Fig. 1b refer to the same book: they share the same ISBN even though their author fields differ. In comparison, record 5 in Fig. 1a and the second record in Fig. 1b also refer to the same book if we are interested only in the book's title and author.

The problem of identifying duplicates, that is, two (or more) records that describe the same entity, has attracted much attention from many research fields, including Databases, Data Mining, Artificial Intelligence, and Natural Language Processing. Most earlier work relies either on predefined matching rules hand-coded by domain experts or on matching rules learned offline by a learning method from a set of training examples. Such approaches work well in a traditional database environment, where all instances of the target databases can be accessed in advance, so that high-quality representative records can be examined by experts or selected for labeling by users. In the Web database scenario, however, the records to match are highly query-dependent: they can only be obtained through online queries.
Moreover, they represent only a partial and biased portion of the data in the underlying Web databases. Hand-coded or offline-learned matching rules are inadequate here for two reasons. First, the full data set is not available
beforehand, so perfectly representative training data are hard to obtain. Second, and more importantly, even if representative data could be found and labeled for learning, rules learned from representatives of the full data set may not work well on a partial and biased part of that data set.
II PROBLEM STATEMENT

We focus on Web databases from the same domain, that is, Web databases that return the same type of records in response to user queries. Suppose data source A contains s records and data source B contains t records, with each record consisting of a set of attributes/fields. Each of the t records in data source B can potentially be a duplicate of each of the s records in data source A. The goal of duplicate detection is to determine the matching status, i.e., duplicate or nonduplicate, of these s × t record pairs.

III SYSTEM DEVELOPMENT

DUPLICATE DETECTION IN UDD

Different users may have different criteria for what constitutes a duplicate, even for records within the same domain. For example, in Fig. 1, if a user is interested only in the title and author of a book and does not care about the ISBN, record 5 in Fig. 1a and the second record in Fig. 1b are duplicates. Furthermore, records 5 and 6 in Fig. 1a are also duplicates under this criterion. In contrast, some users may care about the ISBN field besides the title and author fields; for these users, records 5 and 6 in Fig. 1a and the second record in Fig. 1b are not duplicates. This user-preference problem makes supervised duplicate detection methods fail, whereas UDD, being unsupervised, does not suffer from it.

Assumptions and Observations

In this section, we present the assumptions and observations on which UDD is based. First, we make the following two assumptions:

1. A global schema for the specific type of result records is predefined, and each database's individual query result schema has been matched to the global schema.
2. Record extractors, i.e., wrappers, are available for each source to extract the result data from HTML pages and insert it into a relational database according to the global schema.

Besides these two assumptions, we also make use of the following two observations:

1. Records from the same data source usually have the same format.

2. Most duplicates from the same data source can be identified and removed using an exact matching method.
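To make this setup concrete, here is a minimal Python sketch, under the two assumptions above, of the resulting preprocessing: exact-match removal of same-source duplicates (observation 2), followed by formation of the s × t candidate pairs from the problem statement. The record layout and helper names are illustrative assumptions, not the paper's code.

```python
from itertools import product

# Hypothetical layout: each record is a (title, author, isbn) tuple
# already mapped to the global schema by a wrapper (assumption 2).

def remove_same_source_duplicates(records):
    """Observation 2: records from one source share a format, so an
    exact match over all fields suffices to drop same-source duplicates."""
    seen, unique = set(), []
    for rec in records:
        key = tuple(f.strip().lower() for f in rec)  # light normalization
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def candidate_pairs(source_a, source_b):
    """Problem statement: each of the t records in B may duplicate each
    of the s records in A, yielding s * t pairs to classify."""
    return list(product(source_a, source_b))

# Hypothetical query results from two bookstores.
a = [("Harry Potter", "J. K. Rowling", "059035342X"),
     ("Harry Potter", "J. K. Rowling", "059035342X")]  # same-source duplicate
b = [("Harry Potter and the Sorcerer's Stone", "Rowling", "059035342X")]

a = remove_same_source_duplicates(a)  # s = 1 after exact-match removal
print(len(candidate_pairs(a, b)))     # s * t = 1 candidate pair remains
```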
Fig. 2. Duplicate vector identification algorithm.

IV RELATED WORK

Most record matching methods adopt a framework consisting of two major steps:

1. Identifying a similarity function. Using training examples and a set of predefined domain-dependent similarity functions/measures over numeric and string fields, a composite similarity function over a pair of records, typically a weighted combination of the basis functions, is either specified by domain experts or learned by a learning method such as a decision tree, Expectation-Maximization, SVM, or Bayesian network.

2. Matching records. The composite similarity function is used to compute the similarity of candidate record pairs, and the most similar pairs are identified as referring to the same entity.

An important issue in duplicate detection is reducing the number of record-pair comparisons. Many methods have been proposed for this purpose, including standard blocking, the sorted neighborhood method, record clustering, and Bigram Indexing. Although these methods differ in how they partition the data set into blocks, they all aim to reduce the number of comparisons by comparing only records from the same block. Such methods can be incorporated into UDD to reduce the number of record-pair comparisons, so we do not consider this issue further here.
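To illustrate the two-step framework, the following sketch builds a composite similarity as a weighted combination of per-field similarities and then matches the most similar pairs. The token-overlap measure, field weights, and threshold are hypothetical choices for illustration; any of the measures named above could be plugged in instead.

```python
def token_overlap(x, y):
    """A simple basis similarity over string fields: Jaccard overlap
    of word tokens. Any predefined field measure could replace this."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    return len(tx & ty) / len(tx | ty) if tx | ty else 0.0

def composite_similarity(rec1, rec2, weights):
    """Step 1: a weighted combination of the basis functions, one
    weight per field (set by an expert or learned from examples)."""
    return sum(w * token_overlap(f1, f2)
               for w, f1, f2 in zip(weights, rec1, rec2))

def match(pairs, weights, threshold=0.75):
    """Step 2: the most similar pairs are taken to refer to the same entity."""
    return [(r1, r2) for r1, r2 in pairs
            if composite_similarity(r1, r2, weights) >= threshold]
```

A blocking scheme such as standard blocking would simply restrict `pairs` to records that share a blocking key (for example, the first token of the title) before `match` is called, which is how such methods cut down the number of comparisons.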
Whereas most existing record matching work targets matching a single type of record, more recent work has addressed matching multiple types of records with rich associations among them. Although the complexity of matching grows rapidly with the number of record types, such work captures the matching dependencies among the different record types and exploits those dependencies to further improve the matching accuracy for each record type. Its drawback is that such dependencies among multiple record types do not exist in many domains. Compared with these previous works, UDD is designed specifically for the Web database scenario, where the records to match are of a single type with multiple string fields. Such records are highly query-dependent and form only a partial and biased portion of the full data, which makes previous work that relies on offline learning inappropriate. Moreover, our work focuses on investigating and determining the field weight assignment rather than the similarity measure itself; any similarity measure, or a combination of measures, can easily be incorporated into UDD.

V CONCLUSION

Duplicate detection is a key problem in data integration, and state-of-the-art methods rely on offline learning techniques that require training data. In the Web database scenario, where the records to match are highly query-dependent, a pretrained approach is not applicable, because the set of records in each query's results is a query-dependent, biased subset of the full data set. To overcome this problem, we presented an unsupervised online approach, UDD, for detecting duplicates over the query results of multiple Web databases. Two classifiers, WCSS and SVM, are used cooperatively in the convergence step of record matching to iteratively identify duplicate pairs from among all potential duplicate pairs. Experimental results show that our approach is comparable to previous work that requires training examples for identifying duplicates from the query results of multiple Web databases.
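As a rough sketch of this convergence step, the loop below alternates a WCSS-style weighted-sum classifier with an SVM retrained on the duplicates found so far, assuming per-pair similarity vectors have already been computed. The weights, threshold, and the use of scikit-learn's LinearSVC are illustrative assumptions rather than the paper's exact configuration, and the full method would also readjust the component weights on each pass.

```python
import numpy as np
from sklearn.svm import LinearSVC

def iterative_duplicate_detection(P, N, weights, threshold=0.85, max_iter=10):
    """P: list of 1D similarity vectors for potential duplicate pairs
    (one component per field); N: nonempty list of similarity vectors
    for the presumed nonduplicates, used as negative training examples."""
    duplicates = []
    P = list(P)
    for _ in range(max_iter):
        # Classifier 1 (WCSS-style): a weighted sum over the components.
        found = [v for v in P if np.dot(weights, v) >= threshold]
        if not found:
            break  # converged: no new duplicates were identified
        P = [v for v in P if np.dot(weights, v) < threshold]
        duplicates.extend(found)
        # Classifier 2 (SVM): retrain on duplicates vs. nonduplicates,
        # then rescan the remaining potential pairs.
        X = np.vstack(duplicates + list(N))
        y = np.array([1] * len(duplicates) + [0] * len(N))
        svm = LinearSVC().fit(X, y)
        if P:
            preds = svm.predict(np.vstack(P))
            duplicates.extend(v for v, p in zip(P, preds) if p == 1)
            P = [v for v, p in zip(P, preds) if p == 0]
    return duplicates
```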
REFERENCES

[1] R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses," Proc. 28th Int'l Conf. Very Large Data Bases, pp. 586-597, 2002.
[2] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. ACM Press, 1999.
[3] R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage," Proc. KDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pp. 25-27, 2003.
[4] O. Benjelloun, H. Garcia-Molina, D. Menestrina, Q. Su, S.E. Whang, and J. Widom, "Swoosh: A Generic Approach to Entity Resolution," The VLDB J., vol. 18, no. 1, pp. 255-276, 2009.
[5] M. Bilenko and R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," Proc. ACM SIGKDD, pp. 39-48, 2003.
[6] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani, "Robust and Efficient Fuzzy Match for Online Data Cleaning," Proc. ACM SIGMOD, pp. 313-324, 2003.
[7] S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates," Proc. 21st IEEE Int'l Conf. Data Eng., pp. 865-876, 2005.
[8] P. Christen, "Automatic Record Linkage Using Seeded Nearest Neighbour and Support Vector Machine Classification," Proc. ACM SIGKDD, pp. 151-159, 2008.