
Extracting Product from E-Commercial web site using Entropy Estimation

Mr. Vijay R. Sonawane, Prof. P. P. Halkarnikar

Department of Computer Science and Technology,
Shivaji University, Kolhapur, Maharashtra, India
Email: vijaysonawane11@gmail.com

Department of Computer Science and Engineering,
Dr. D. Y. Patil College of Engineering, Kolhapur, Maharashtra, India
Email: pp_halkarnikar@indiatimes.com

Abstract - The explosive growth of the Web has made information search and extraction on the Web harder. Mining product descriptions from e-commerce web sites is useful for integrating product information from multiple sources, so that users can easily locate a desired object among a huge number of web sites. In this paper, we propose a simple technique to extract product descriptions from an e-commerce web site. The technique takes advantage of the hierarchical structure of the HTML language. It first discovers the set of product descriptions based on a measure of entropy at each node of the HTML tag tree. Afterwards, a set of association rules based on heuristic features is employed to improve the accuracy of the product extraction.

Keywords - Entropy, representative value, association rule, filter, product description.

1. INTRODUCTION

A great amount of effort is often required for a user to manually locate and extract product descriptions from e-commerce web sites, since the task faces several challenges: how to effectively recognize the data regions in a web page, how to exactly split a data region into separate data descriptions, and how to get rid of noisy objects. The tools available for extracting data from web sites fall into the following categories: wrapper induction tools, NLP-based tools, ontology-based tools, and HTML-aware tools. Wrapper induction tools such as WIEN [9], SoftMealy [8], and STALKER [11] build wrappers from domain-specific sample pages to obtain extraction rules, which are then used to extract data objects from similar pages. However, these tools are time-consuming and labor-intensive because of the difficulty of writing and maintaining wrappers as well as annotating sample pages. NLP-based tools such as RAPIER [3], SRV [7], and WHISK [13] use traditional NLP techniques such as lexical analysis and part-of-speech tagging to learn rules for extracting relevant data from highly grammatical documents; these tools are not appropriate for less grammatical pages. Ontology-based tools [6] accomplish data extraction by relying directly on the data objects, but they encounter problems in writing and maintaining domain ontologies. HTML-aware tools such as W4F [12], XWRAP [10], and RoadRunner [5] rely on the inherent structural features of HTML documents to accomplish data extraction. Although these tools achieve a higher level of automation compared to wrapper induction, they still need user intervention, either for building extraction rules or for labeling sample pages. MDR [1], OMINI [2], and IEPAD [4] are also automatic HTML-aware tools. OMINI uses a set of extraction algorithms to locate the smallest subtree that contains all objects of interest; it then employs an object extraction algorithm to find the correct object separator tags that can separate the objects. IEPAD finds patterns in the HTML tag strings, using a PAT tree, and then uses those patterns to extract objects. However, both OMINI and IEPAD return many noisy objects. MDR employs the edit-distance string-matching algorithm to recognize data regions containing a set of generalized nodes; it then determines the data records in each generalized node using coarse-grained heuristic observations (such as the dollar sign). The first drawback of MDR is the computational complexity of the edit-distance algorithm. The second drawback is that its coarse-grained heuristic observations, such as the dollar sign, are not sufficient to identify true data records and therefore yield many noisy objects.

In this paper, we propose a new, simple, but efficient technique to automatically extract product descriptions from web collections with various presentation styles. Our method is based on the observation that product descriptions usually have a similar display format and are contained in similar HTML subtrees. To extract these product descriptions, the source HTML page is first parsed into a DOM-like HTML tag tree. Then entropy estimation is performed, using a proposed entropy measure, at each internal node of the tree to measure the similarity among its subtrees. Nodes with a high entropy value, i.e. high similarity among their subtrees, are likely to contain potential product descriptions in their descendant nodes. Finally, the entropy information is combined with a set of association rules based on heuristic features to identify the exact product descriptions.

The remainder of the paper is organized as follows. Section 2 presents an overview of the proposed system. Section 3 demonstrates the experimental results, and Section 4 concludes the paper.

tt   tag type, i.e. single or paired tag
tw   tag weight
tl   tag level
dp   depth of this subtree
rv   representative value of subtree
er   entropy ratio of the node
sr   score of the tree node
st   a list of pointers to its subtrees
sp   no. of adjacent subtrees

Table 1: Information associated with a tag node
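For concreteness, the per-node bookkeeping of Table 1 could be modeled as a small record type. This is a sketch of our own in Python, not the authors' implementation; the child-order field co (used later by the representative-value formula) is added alongside the Table 1 fields:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TagNode:
    """A node of the HTML tag tree, carrying the Table 1 fields."""
    tag: str                 # HTML tag name, e.g. "table" or "tr"
    tt: bool = True          # tag type: True for a paired tag, False for a single tag
    tw: float = 1.0          # tag weight, used to distinguish different HTML tags
    tl: int = 1              # tag level: distance from this node to the root
    co: int = 1              # child order among siblings (used by formula (1))
    dp: int = 1              # depth of the subtree rooted at this node
    rv: float = 0.0          # representative value of the subtree
    er: float = 0.0          # entropy ratio of the node
    sr: float = 0.0          # score of the tree node
    st: List["TagNode"] = field(default_factory=list)  # pointers to its subtrees

    @property
    def sp(self) -> int:
        """Number of adjacent subtrees (children)."""
        return len(self.st)
```

Under this sketch, sp is derived from st rather than stored, which keeps the two consistent.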

Figure 1. Architectural Diagram

2. SYSTEM OVERVIEW

As shown in Figure 1, the proposed approach includes the following main steps: crawling all the links of the web site, building the HTML tag tree, entropy measurement, and product description extraction.

2.1. Crawler Module

To target a whole web site given as input, the crawler is configured to crawl each link of the web site. Web crawlers work by the following process:
1. Download the web page.
2. Parse through the downloaded page and retrieve all the links.
3. For each link retrieved, repeat the process.
In the first step, a web crawler takes a URL and downloads the page from the Internet at the given URL. Oftentimes the downloaded page is saved to a file on disk. In the second step, the crawler parses through the downloaded page and retrieves the links to other pages; each link in the page is defined with an HTML anchor. After the crawler has retrieved the links from the page, each link is added to a list of links to be crawled. The third step repeats the process. All crawlers work in a recursive or loop fashion, but there are two different ways to handle it: links can be crawled in a depth-first or breadth-first manner.

2.2. HTML Tag Tree

This approach takes advantage of the hierarchical, nested tag structure of HTML documents. However, not all web pages tightly conform to the W3C HTML specification, so input web pages are first preprocessed to fix potential mistakes and thus conform to the specification. In the HTML tree, each node corresponds to a tag, and each tag node together with its descendants constitutes a subtree. Additionally, each node in our HTML tag tree carries the fields listed in Table 1, from the tag type tt down to the number of adjacent subtrees sp.

2.3. Entropy Measurement

Entropy estimation is necessary to identify product regions containing potential product descriptions. This idea originates from the observation that similar product descriptions are usually contained in similar subtrees of tags. The term similar, in this sense, means that these subtrees have analogous structures in both subtree skeleton and tag position. The essential problem is to measure the similarity among the subtrees. An edit-distance algorithm can be used to compare the similarity of two tag strings, but the computational complexity of this approach is high, especially when the HTML file is large. To measure the similarity of the subtrees at each tree node, we therefore perform the two following steps. First, the structure of each subtree is mapped to a representative value that should satisfy three properties: (1) two subtrees have close representative values if they are similar in structure, (2) two subtrees have divergent representative values if they are dissimilar in structure, and (3) the computational complexity of this mapping is as small as possible. Second, the entropy over the representative values of the subtrees is estimated. The higher the entropy, the larger the similarity among the subtrees.

2.3.1. Representative Value Calculation

The representative value of a tag tree T (T is also the root node), denoted T.rv, is calculated by the following formula:

T.rv = T.tw + ∑Ni∈N (Ni.tl × Ni.co × Ni.tw)    (1)

where N is the set of all descendant nodes of T, Ni.tl is the tag level, i.e. the distance from tag node Ni to the root node T, and Ni.tw is the tag weight of the tree node Ni. The main purpose of the Ni.tw value is to help distinguish among different HTML tags. Ni.co is the child order of the node Ni among its siblings. Figure 2 [14] shows an example of representative value calculation. In this figure, each oval contains the tag name and tag weight of a tree node, the number associated with a tree edge is the child order, and the tag level of the root node T starts from 1.
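To illustrate formula (1), here is a minimal sketch of our own, with a bare-bones node type whose fields tw, tl, and co follow the paper; the tag weights in the example are made up:

```python
class Node:
    """Bare-bones tag-tree node; tw = tag weight, tl = tag level, co = child order."""
    def __init__(self, tag, tw, tl, co=1):
        self.tag, self.tw, self.tl, self.co = tag, tw, tl, co
        self.children = []
        self.rv = 0.0

def representative_value(t):
    """Formula (1): T.rv = T.tw + sum over all descendants Ni of Ni.tl * Ni.co * Ni.tw."""
    def descendants(n):
        for c in n.children:
            yield c
            yield from descendants(c)
    t.rv = t.tw + sum(n.tl * n.co * n.tw for n in descendants(t))
    return t.rv

# A <table> (weight 4) with two <tr> children (weight 3) at tag level 2:
table = Node("table", tw=4, tl=1)
table.children = [Node("tr", tw=3, tl=2, co=1), Node("tr", tw=3, tl=2, co=2)]
print(representative_value(table))  # 4 + (2*1*3) + (2*2*3) = 22
```

Similar subtrees yield close values because the same tags appear at the same levels and child orders, which is exactly property (1) of the mapping.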

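Section 2.3 describes estimating entropy over the representative values of a node's subtrees and normalizing it by the maximum entropy. A small sketch of that estimate (our own illustration; rvs holds the representative values of a node's qualifying subtrees):

```python
import math

def entropy_ratio(rvs):
    """Shannon entropy of the subtree representative values, normalized to [0, 1].
    Each subtree contributes a portion p = rv / total; identical subtrees give
    the maximum entropy log(n), so dividing by log(n) yields a ratio of 1.0."""
    total = sum(rvs)
    if len(rvs) < 2 or total == 0:
        return 0.0
    entropy = 0.0
    for rv in rvs:
        p = rv / total
        if p > 0:
            entropy += -p * math.log(p)
    return entropy / math.log(len(rvs))

print(entropy_ratio([8.0, 8.0]))   # identical subtrees -> 1.0
print(entropy_ratio([1.0, 99.0]))  # dissimilar subtrees -> close to 0
```

A node whose subtrees score near 1.0 is a candidate product region; the actual thresholding is handled later by the extraction step.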
Figure 2. Representative Value Calculation

2.3.2. Entropy Calculation

Suppose a tree node T has a set of subtrees S = {T1, T2, ..., Tn} with the corresponding set of representative values R = {T1.rv, T2.rv, ..., Tn.rv}. Product descriptions usually reside in subtrees whose depth is greater than or equal to a minimum depth threshold DTh, so let P = {Ti | Ti ∈ S and Ti.dp ≥ DTh}. The Shannon entropy of T with respect to its children's representative values, denoted E(T), is calculated as

E(T) = -∑Ti∈P pi log(pi),  where pi = Ti.rv / ∑Tj∈P Tj.rv    (2)

E(T) in formula (2) tends to be high as the cardinality of P grows; when all representative values are equal, the Shannon entropy reaches its maximum value log|P|. We therefore normalize E(T) to a value between 0 and 1, the entropy ratio, denoted as

ER(T) = E(T) / log|P|    (3)

Algorithm 1 (Entropy Ratio Calculation) computes the entropy ratio of formula (3) for all nodes in the HTML tag tree. It takes T and the minimum depth threshold DTh as inputs; after it runs, the entropy ratio at each tree node has been updated.

Algorithm 1 - ERC: Entropy ratio calculation.
Input: Node T, DTh.
Output: Entropy ratio of all nodes.
1:  entropy = 0; er = 0; totalCount = 0; totalRV = 0;
2:  For Ti ∈ T do
3:    For Tj ∈ T do
4:      If Tj.dp < Ti.dp + 1 then
5:        If Ti.co < Tj.co then
6:          totalCount++;
7:          totalRV += Tj.rv;
8:        End If
9:      End If
10:   End For
11:   For Tj ∈ T do
12:     If Tj.dp < Ti.dp + 1 then
13:       If Ti.co < Tj.co then
14:         portion = Tj.rv / totalRV;
15:         entropy += -portion × log(portion);
16:       End If
17:     End If
18:   End For
19:   If totalCount ≥ 1 then
20:     Ti.er = entropy / log(totalCount);
21:   End If
22: End For

2.4. Product Description Extraction

Nodes with large entropy ratios have a high probability of containing regularly formatted objects. However, not all of these objects are actual product descriptions; a large portion are noisy objects, e.g. advertisements, navigation links, etc., and we need to remove them. Entropy ratio information alone is clearly not enough to do this. Thus, a scoring technique based on a set of association rules is combined with the entropy ratio to extract and filter the output. Extracted objects with a score greater than or equal to the minimum score threshold (STh) are considered product descriptions; the other objects are noise and are ignored. The proposed approach uses a set of association rules [15] to assign a score to each tree node. These association rules are based on the 19 features (items) listed in Table 2 [14]. The first 18 items, chosen based on heuristics, are useful for classification. For example, nPrices is the number of occurrences of strings like "List Price", "Our Price", "Sale Price", or "Outlet Price" that a node contains. The last item (isProduct) is the target attribute.

Feature     Description
er          Entropy ratio of tree node
dp          The depth of subtree
rv          Representative value of tree node
nLinks      No. of <a> tags
nImages     No. of <img> tags
nFonts      No. of <font> tags
nBolds      No. of <b> tags
nItalic     No. of <i> tags
nBLs        No. of <br> tags
nLists      No. of <li> tags
nDollars    No. of currency signs (e.g. $, £)
nPercents   No. of % signs
nDigits     No. of digit characters
nPrices     No. of "List Price", "Our Price"
nSeeMore    No. of "See More", "Read More"
nRatings    No. of "Rate", "Rating"
nSaves      No. of "Saving", "You Save"
nLengths    The length of non-tagged text
isProduct   True if product, False otherwise

Table 2: Heuristic features for filtering the output

The antecedent of each rule is a combination (subset) of the first 18 items in Table 2; the consequent of each rule is the target attribute, isProduct. Some examples of association rules, together with their support and confidence, are given in Table 3.

Association Rule                                              Sup.    Conf.
nLinks ≤ 1 ^ nImages ≤ 0 ^ 0 < nFonts ≤ 1 ^ nBolds ≤ 0
  → isProduct = False                                         2.9%    100%
2 ≤ nLinks ≤ 5 ^ 0 < nImages ≤ 2 ^ nBolds ≤ 3
  → isProduct = True                                          9.3%    98%
0 < nDollars ≤ 3 ^ 0 < nPrices ≤ 3 ^ nImages ≤ 2
  → isProduct = True                                          15.5%   99%

Table 3: Association rules for scoring
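The rules of Table 3 can be read as predicates over the per-node feature counts. The following sketch (our own encoding, with hypothetical counts) applies each rule and adjusts the node score by support × confidence / 100, adding for positive rules and subtracting for negative ones:

```python
# Each rule: (predicate over feature counts, isProduct consequent, support %, confidence %)
RULES = [
    (lambda f: f["nLinks"] <= 1 and f["nImages"] <= 0
               and 0 < f["nFonts"] <= 1 and f["nBolds"] <= 0, False, 2.9, 100.0),
    (lambda f: 2 <= f["nLinks"] <= 5 and 0 < f["nImages"] <= 2
               and f["nBolds"] <= 3, True, 9.3, 98.0),
    (lambda f: 0 < f["nDollars"] <= 3 and 0 < f["nPrices"] <= 3
               and f["nImages"] <= 2, True, 15.5, 99.0),
]

def score(features):
    """Sum of matching-rule adjustments: +sup*conf/100 for positive rules,
    -sup*conf/100 for negative rules."""
    s = 0.0
    for pred, is_product, sup, conf in RULES:
        if pred(features):
            adj = sup * conf / 100.0
            s += adj if is_product else -adj
    return s

# A hypothetical node with two links, one image, a dollar sign, and a price string:
node = {"nLinks": 2, "nImages": 1, "nFonts": 2, "nBolds": 1,
        "nDollars": 1, "nPrices": 1}
print(score(node))  # rules 2 and 3 match: 9.3*98/100 + 15.5*99/100
```

A node whose accumulated score reaches the minimum score threshold is then accepted as a product description.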

We use this set of association rules to assign a score to each tree node. If the consequent of a rule is "isProduct = False" (a negative rule), we decrease the score of the node by x points. If the consequent of a rule is "isProduct = True" (a positive rule), we increase the score of the node by y points. Both x and y depend on the confidence and support of each rule; in our system, we set x = y = support × confidence / 100. The higher the score of a tag node, the larger the possibility that the node is a product description. We use a minimum score threshold, STh, to determine whether a tag tree node is a product description.

Algorithm 2 - Product Description Extraction from the Web (PEWeb)
Require: T, ERTh, STh, DTh.
Ensure: Product descriptions in the tree T.
1:  if T.dp < DTh then
2:    Return;
3:  end if
4:  if T.er ≥ ERTh then
5:    for Ti ∈ T.st do
6:      if Ti.sr ≥ STh then
7:        Extract Ti;
8:      else
9:        PEWeb(Ti, ERTh, STh, DTh);
10:     end if
11:   end for
12: else
13:   for Ti ∈ T.st do
14:     PEWeb(Ti, ERTh, STh, DTh);
15:   end for
16: end if

Algorithm 2 [14] takes as inputs T (a tag tree node), ERTh (the minimum entropy ratio threshold), STh (the minimum score threshold), and DTh (the minimum depth threshold). It visits all nodes under T recursively. At each node, it compares the node's entropy ratio (T.er) with ERTh. If the entropy ratio is greater than or equal to the threshold, it considers all the node's children: at each child node Ti, if Ti.sr is greater than or equal to the minimum score threshold (STh), Ti is a product description; otherwise the algorithm is called recursively on Ti to search for potential product descriptions below it. If T.er is smaller than ERTh, the algorithm is likewise called recursively to search for potential product descriptions under node T. A whole web site can be targeted by applying this strategy recursively to every page. Afterward, our system lets the user filter the extracted output based on an extraction pattern (i.e. product name, price, colour, etc.) to obtain information about the desired product.

3. EXPERIMENTAL RESULTS

In this section we evaluate the experimental results. The proposed system is developed in the Java language. We used the product-based web site www.compusa.com as the data set for the evaluation. The crawler module takes a search string and the number of links of the web site to be crawled as input, and crawls all the links that match the search string.

Figure 3: C

All the crawled links are fetched and preprocessed to fix potential mistakes. Taking the balanced tag tree as input, the entropy ratio is calculated for each node of the tree. Table 4 shows sample output of the entropy ratio calculation algorithm.

isProduct   depth   Entropy            Line
0           975     0.53800805824690   6
0           974     0.63553739799325   7
1           844     0.63548066716732   229
1           924     0.51398288605239   339
1           834     1.0                445

Table 4: Entropy ratio

These are intermediate results. Next, combining the association rules with the entropy ratio yields the exact product descriptions. Afterward, by using filters based on an extraction pattern, the user can narrow the extracted output to only the information regarding the desired product from the web site.

4. CONCLUSION

In this paper, we proposed a novel and efficient method to automatically extract product descriptions from the Web. Our technique takes advantage of the hierarchical structure of the HTML language to extract product descriptions. It uses entropy estimation and association rules to exactly find the product region. Classifying the outputs with association rules over important heuristic features helps to effectively eliminate noisy objects, and finally an extraction-pattern-based filter helps to sort the output.

REFERENCES
[1] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. Proceedings of the ACM SIGKDD, 2003.
[2] D. Buttler, L. Liu, and C. Pu. A fully automated extraction system for the
World Wide Web. Proceedings of IEEE ICDCS- 21, 2001.
[3] M. E. Califf and R. J. Mooney. Relational learning of pattern-match rules for
information extraction. Proceedings of the Sixteenth National Conference on
Artificial Intelligence, pages 328–334, 1999.
[4] C.-H. Chang and S.-L. Lui. Iepad: Information extraction based on pattern
discovery. Proceedings of WWW-10, 2001.
[5] V. Crescenzi, G. Mecca, and P. Merialdo. Roadrunner: Towards automatic
data extraction from large web sites. Proceedings of the 26th VLDB, pages
109–118, 2001.
[6] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, Y.-K. Ng, D.
Quass, and R. D. Smith. Conceptual-model-based data extraction from
multiple-record web pages. Data and Knowledge Engineering, 31(3):227–251,
November 1999.
[7] D. Freitag. Machine learning for information extraction in informal
domains. Machine Learning, 39(2-3):169–202, 2000.
[8] C.-N. Hsu and M.-T. Dung. Generating finite-state transducers for semi-
structured data extraction from the web. Information Systems, 23(8):521–538,
1998.
[9] N. Kushmerick. Wrapper induction: Efficiency and expressiveness.
Artificial Intelligence, 118(1-2):15–68, 2000.
[10] L. Liu, C. Pu, and W. Han. Xwrap: An xml-enable wrapper construction
system for web information sources. Proceedings of the 16th IEEE
International Conference on Data Engineering, pages 611–621, 2000.
[11] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction
for semi-structured information sources. Autonomous Agents and Multi-Agent Systems, 2001.
[12] A. Sahuguet and F. Azavant. Building intelligent web applications using
lightweight wrappers. Data and Knowledge Engineering, 36(3):283–316, 2001.
[13] S. Soderland. Learning information extraction rules for semi-structured and
free text. Machine Learning, 34(1-3):233–272, 1999.
[14] H. X. Phan, S. Horiguchi, and B. T. Ho. PEWeb: Product
extraction from the web based on entropy estimation.
[15] R. Agrawal, T. Imielinski, and A. Swami. Mining association
rules between sets of items in large databases. Proceedings of
the ACM SIGMOD, pages 207–216, 1993.
