
International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)

Web Site: www.ijettcs.org Email: editor@ijettcs.org


Volume 3, Issue 5, September-October 2014

ISSN 2278-6856

Vector space model for deep web data retrieval and extraction

Dr. Poonam Yadav
D.A.V College of Engineering & Technology, India

Abstract
Deep web data extraction has recently become a challenging problem, since the structured data in deep web pages have an intricate structure. Consequently, the extraction of data from deep web pages has received much attention among researchers. In this research, a vector space model and content features are utilized for deep web data extraction. Initially, extracted deep web pages are taken as input for the proposed method and a Document Object Model (DOM) tree is constructed. Through the DOM tree, the information in each web page is split into blocks, and each block with its contents is given to the feature computation process. Here, frequency-level, title-level and numerical-level features are calculated after constructing a vector space model, which is a vector of words and their frequencies. Based on the feature score of every block, the important blocks are chosen as the final useful data for the given web page. The proposed approach to deep web data extraction is implemented using deep web pages collected from the Complete Planet web site, and the performance of the system is evaluated using precision and recall.

Keywords: Deep web data extraction, deep search engine, web data extraction, DOM tree, precision, recall

1. INTRODUCTION
Currently, deep web databases are usually accessed through deep search engines, which are systems utilized for extracting the underlying encoded database. To turn the encoded data units into a machine-processable form, deep web data extraction and collection has become an important task in recent years, assigning useful tags to the data units. The literature presents several algorithms for deep web data extraction, which has been a significant research topic for the past decade [6-9]. However, most systems need human intervention to find the desired information from the sample pages, which is a tedious process when the set of web pages retrieved for a user query is large. Because of this, automatic extraction of web data from web pages is needed to achieve high extraction accuracy. On the other hand, existing methods suffer from poor scalability and are not appropriate for applications that require extracting information from a large number of web sources [1-5]. In this paper, a vector space model is adapted to extract deep web data useful for indexing and retrieval of web pages. The underlying idea is to split web pages into a set of blocks and give each block to a feature identification (feature map) phase. Then, the contents of the web pages are analyzed block-wise, through three different feature formulae, to find whether important content is present. Based on the feature score, the important blocks (web data) are extracted for the further processes of indexing or retrieval matching. The organization of the paper is as follows: Section 2 discusses the existing system and Section 3 presents the proposed system for web data extraction using vector space modelling. Section 4 discusses the experimentation and results and, finally, the conclusion is given in Section 5.

2. EXISTING SYSTEM
To make the data records and data items presented in a web database machine processable, which is significant for applications such as deep web crawling and metasearching, a structured database should be constructed for those applications. To accomplish this task, a vision-based approach is presented in [1]: a web-page programming-language-independent method in which different visual features of deep web pages are utilized for web data extraction.
Disadvantages: After analyzing the work presented in [1], it can process only deep web pages having one data region, whereas most deep web pages have several data regions. Also, the visual features of web pages are computed by calling the programming APIs of IE, which is a time-consuming process. These two problems are solved in the proposed work. The first problem is solved by identifying blocks and performing the extraction on blocks of web pages. The second problem is solved using the vector space model, which reduces the time complexity significantly.

3. PROPOSED METHOD: VECTOR SPACE MODEL FOR DEEP WEB DATA RETRIEVAL AND EXTRACTION

This paper presents a method for deep web data extraction using vector space modelling and content-based feature score computation. Extraction and retrieval of deep web data can have significant applications in the indexing, storing and retrieval of deep web pages, which motivates developing a better method for web data extraction. Here, the main objective is achieved using two important phases: vector space modelling for every block, and selection of blocks using a content-based feature score. The entire procedure of the proposed method is given in Figure 1.
3.1 Block extraction using DOM tree
The deep web is generally defined as the content on the Web that is not available through a search on common search engines. This content is occasionally also referred to as the hidden or invisible web. The Web is a composite entity that contains information from an assortment of source types and includes an evolving mix of dissimilar file and media types; it is much more than static, self-contained web pages. Here, deep web pages are extracted from Complete Planet (www.completeplanet.com), which is currently the largest deep web repository, with more than 70,000 entries of web databases. First, the extracted input web pages W from the deep web, which are in HTML format, are taken as input for the proposed method. From the HTML data, a Document Object Model (DOM) tree is constructed based on the HTML tag information. Through the DOM tree, the web page is split into a set of blocks which together constitute the page, W = {b1, b2, ..., bn}. Every block is given to the feature word extraction process for building the vector space model.
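The block-splitting step can be sketched with the Python standard library's HTML parser. The paper does not specify which tags delimit a block, so the set of block-level tags below is an assumption for illustration:

```python
from html.parser import HTMLParser

# Tags treated as block-level containers (an assumption; the paper
# does not state which DOM nodes form a block).
BLOCK_TAGS = {"div", "table", "tr", "ul", "p", "section"}

class BlockSplitter(HTMLParser):
    """Collects the text under each top-level block tag as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished blocks b1..bn
        self.depth = 0     # nesting depth inside a block tag
        self.buffer = []   # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            if self.depth == 0:
                self.buffer = []
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join(self.buffer).strip()
                if text:
                    self.blocks.append(text)

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.buffer.append(data.strip())

def split_into_blocks(html: str) -> list:
    """Return W = [b1, b2, ..., bn] for one HTML page."""
    parser = BlockSplitter()
    parser.feed(html)
    return parser.blocks
```

Text outside any block-level tag is ignored in this sketch; a full implementation would walk the real DOM tree rather than a flat tag stream.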
3.2 Feature word extraction and vector space modelling
Each block bi extracted in the previous step is given to feature word identification using stop-word removal and stemming, which are very common pre-processing techniques in text mining. Once the feature words are extracted, a vector space model is constructed to represent web pages as vectors of keywords. The representation of the vector space model for every block is bi = (w1i, w2i, ..., wni). Here, each dimension represents an individual keyword, and the value of that element is the frequency of the word in the particular block.
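A minimal sketch of this pre-processing and vectorization step follows. The stop-word list and the crude suffix stemmer are placeholders for a full stop-word lexicon and a Porter-style stemmer:

```python
from collections import Counter

# Minimal stop-word list (assumption; real systems use a full lexicon).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "to", "and", "for"}

def stem(word: str) -> str:
    """Crude suffix stripping as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def block_vector(block_text: str) -> Counter:
    """Return the keyword-frequency vector (w1i, ..., wni) for a block."""
    words = [w.lower() for w in block_text.split() if w.isalpha()]
    features = [stem(w) for w in words if w not in STOP_WORDS]
    return Counter(features)
```

Each key of the returned counter is one dimension of the block's vector, and its value is the word's frequency in that block.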
3.3 Content-based feature score computation using VSM
The vector space model constructed in the previous step is used to build the content-based features for every block. The feature score for every block is computed based on three different formulae. In the first formula, a frequency-based score is computed from three parameters: the frequency of a word fi, the highest word frequency in the entire block Fi, and the total number of words in the block N. In the second formula, title word matching is included: when a title word is in the block, Ti is one; otherwise, it is zero. In the third formula, numerical word matching is included: when a numerical word is in the block, Ni is one; otherwise, it is zero.

F1 = (1/N) Σ_{i=1}^{k} fi / Fi

F2 = (1/N) Σ_{i=1}^{k} Ti

F3 = (1/N) Σ_{i=1}^{k} Ni
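The three features can be sketched as follows, operating on one block's keyword-frequency vector. The detection of "numerical words" via `str.isdigit()` is an assumption, since the paper does not define the test:

```python
def feature_scores(vector, title_words):
    """Compute the three content-based features for one block.

    vector      : dict mapping keyword -> frequency (the block's VSM row)
    title_words : set of keywords taken from the page title
    """
    if not vector:
        return 0.0, 0.0, 0.0
    n_words = sum(vector.values())   # N: total words in the block
    f_max = max(vector.values())     # Fi: highest frequency in the block
    # F1: frequency-based score
    f1 = sum(f / f_max for f in vector.values()) / n_words
    # F2: title word matching (Ti = 1 when the keyword is a title word)
    f2 = sum(1 for w in vector if w in title_words) / n_words
    # F3: numerical word matching (Ni = 1 when the keyword is numeric)
    f3 = sum(1 for w in vector if w.isdigit()) / n_words
    return f1, f2, f3
```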


Once the above three features are computed, the average of the three feature values is taken as the final feature score for every block, as per the following formula:

Fb = (1/3) (F1 + F2 + F3)
3.4 Identification of blocks
After computing the feature score of every block, the blocks whose feature score is greater than a threshold T are taken as the final output of this approach. These blocks are important and can be useful for indexing and retrieval.
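The final scoring and thresholding step can be sketched as below. The threshold value 0.4 is purely illustrative, since the paper does not state a value for T:

```python
def select_important_blocks(blocks_scores, threshold=0.4):
    """Keep blocks whose averaged feature score Fb exceeds the threshold T.

    blocks_scores : list of (block, (f1, f2, f3)) pairs
    threshold     : T (0.4 is an illustrative value, not from the paper)
    """
    selected = []
    for block, (f1, f2, f3) in blocks_scores:
        fb = (f1 + f2 + f3) / 3.0   # final feature score Fb
        if fb > threshold:
            selected.append(block)
    return selected
```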

4. RESULTS AND DISCUSSION
The experimental results of the proposed method of vector space model-based web data extraction are presented in this section.
4.1 Performance evaluation based on precision and recall
For experimentation, ten pages were collected from the Complete Planet web site (www.completeplanet.com). Complete Planet is presently the largest depository for the deep web, having collected the search entries of more than 70,000 web databases and search engines. For the ten web pages, blocks are extracted using the proposed method. Then, the extracted blocks are evaluated manually as to whether they are important or not. Precision is computed using the following formula:

Precision = |Relevant ∩ Retrieved| / |Retrieved|

Recall is computed using the following formula:

Recall = |Relevant ∩ Retrieved| / |Relevant|
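These two measures can be sketched as a set-based computation over the extracted blocks:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for block extraction.

    retrieved : set of blocks the method extracted
    relevant  : set of blocks judged important by manual evaluation
    """
    hits = len(retrieved & relevant)  # |Relevant ∩ Retrieved|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```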


Figure 2. Precision plot for the proposed web data extraction

Figure 3. Recall plot for the proposed web data extraction

Figure 2 plots the precision of the proposed web data extraction and of the existing system. The precision is plotted for different numbers of web pages taken as input. For 10 input pages, the proposed method achieved 72% and the existing system achieved 70%; the average precision of the proposed system is 69.2 against 64.6 for the existing system. Similarly, Figure 3 plots the recall of the proposed web data extraction and of the existing system. For 10 input pages, the proposed method achieved 78% and the existing system achieved 70%; the average recall of the proposed system is 70.8 against 64.6 for the existing system.

5. CONCLUSION
This paper presented a deep web data extraction approach adapted to extract deep web data useful for the indexing and retrieval of web pages. First, the extracted input web pages from the deep web are taken as input for the proposed method, and a DOM tree is constructed to split each web page into a set of blocks. Then, from every block, feature words are extracted to build a vector space model, which is in turn used to build the content-based features for every block. The feature score of every block is computed based on frequency, title matching and numerical matching, and finally the three features are combined into a final score that decides the importance of the block. For experimentation, deep web pages were collected from the Complete Planet web site, presently the largest depository for the deep web, and the performance was analysed with precision and recall. From the outcome, the average precision of the proposed system is 69.2 against 64.6 for the existing system, and the average recall of the proposed system is 70.8 against 64.6 for the existing system.

REFERENCES
[1] Wei Liu, Xiaofeng Meng, Weiyi Meng, "ViDE: A Vision-Based Approach for Deep Web Data Extraction", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 3, pp. 447-460, 2010.
[2] Junnila, V., Laihonen, T., "Codes for Information Retrieval With Small Uncertainty", IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 976-985, 2014.
[3] Bohm, T., Klas, C.-P., Hemmje, M., "ezDL: Collaborative Information Seeking and Retrieval in a Heterogeneous Environment", IEEE Computer, vol. 47, no. 3, pp. 32-37, 2014.
[4] Sumiya, K., Kitayama, D., Chandrasiri, N.P., "Inferred Information Retrieval with User Operations on Digital Maps", IEEE Internet Computing, vol. 18, no. 4, pp. 70-73, 2014.
[5] Xiaogang Han, Wei Wei, Chunyan Miao, Jian-Ping Mei, Hengjie Song, "Context-Aware Personal Information Retrieval From Multiple Social Networks", IEEE Computational Intelligence Magazine, vol. 9, no. 2, 2014.
[6] Jer Lang Hong, "Data Extraction for Deep Web Using WordNet", IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, vol. 41, no. 6, pp. 854-868, 2011.
[7] Kayed, M., Chia-Hui Chang, "FiVaTech: Page-Level Web Data Extraction from Template Pages", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 2, pp. 249-263, 2010.
[8] Caverlee, J., Ling Liu, "QA-Pagelet: data preparation techniques for large-scale data analysis of the deep Web", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 9, pp. 1247-1262, 2005.
[9] Geller, J., Soon Ae Chun, Yoo Jung, "Toward the Semantic Deep Web", IEEE Computer, vol. 41, no. 9, pp. 95-97, 2008.

BIOGRAPHY
Dr. Poonam Yadav obtained her B.Tech in Computer Science & Engg. from Kurukshetra University, Kurukshetra, and her M.Tech in Information Technology from Guru Govind Singh Indraprastha University, in 2002 and 2007 respectively. She was awarded a Ph.D in Computer Science & Engg. from NIMS University, Jaipur. She is currently working as Principal at D.A.V College of Engg. & Technology, Kanina (Mohindergarh). Her research interests include Information Retrieval, Web-based retrieval and the Semantic Web. Dr. Poonam Yadav is a lifetime member of the Indian Society for Technical Education.
