AIML 05 Conference, 19-21 December 2005, CICC, Cairo, Egypt

WEB MINING BASED ON GENETIC ALGORITHM

M. H. Marghny and A. F. Ali


Dept. of Computer Science, Faculty of Computers and Information,
Assiut University, Egypt,
Email: marghny@acc.aun.edu.eg.

Abstract
As the web continues to increase in size, the relative coverage of web search engines is decreasing, and search tools that combine the results of multiple search engines are becoming more valuable. We propose a framework for web mining, the application of data mining and knowledge discovery techniques to data collected on the World Wide Web (WWW), and a genetic search for search engines, by showing that an important relation exists between web statistical studies and the standard techniques used in optimization. It is straightforward to define an evaluation function that is a mathematical formulation of the user request, and to define a steady-state genetic algorithm (GA) that evolves a population of pages with binary tournament selection. Individuals are created by querying standard search engines. The crossover operator, applied with crossover probability Pc, selects two parent individuals (web pages) from the population, chooses one crossover position within the page at random, and exchanges the links after that position between the two individuals. We present a comparative evaluation performed with the same protocol as used in optimization. Our tool leads to pages of significantly better quality than those returned by the standard search engines.

Keywords: search engines, metasearch, crossover, genetic algorithm, web mining.

1. Introduction
The WWW is an information environment made of a very large distributed database of heterogeneous documents, using a wide area network (WAN) and a client-server protocol. The structure of this environment is that of a graph, where nodes (web pages) are connected by edges (hyperlinks). The typical strategy for accessing information on the WWW is to navigate across documents through hyperlinks, retrieving the information of interest along the way. A metasearch engine searches the web by making requests to multiple search engines such as AltaVista, Yahoo, etc. The results of the individual search engines are combined into a single result set. The advantages of metasearch engines include a consistent interface to multiple engines and improved coverage. Genetic search is characterized by the fact that a number N of potential solutions of an optimization problem simultaneously samples the search space. We assume that it is possible to perform additional computation on the results from standard search engines, a capability that the standard engines themselves lack. This may consist, for instance, in a "richer" request that downloads the pages in order to better analyze their content [1], in a textual clustering of the results [2], or in performing an additional search with a given strategy [3]. We deal in this paper with the last point, and we make use of the optimization abilities of genetic algorithms [4] to find the most interesting pages for the user. From this intuitive view, we show that GAs, and more generally evolutionary algorithms, can positively contribute to the problem of defining an efficient search strategy on the web.
Section (2) formalizes the problem. We treat it as an optimization problem by relating concepts used in optimization to concepts used in studies of web statistical properties and web search. Section (3) presents the principles of our GA, which evolves a population of web pages. Section (4) reports the experimental tests, comments and comparisons with metasearch. Section (5) contains conclusions.

2. Web Search As An Optimization Problem
We argue that web search can be seen as a standard optimization problem, and may thus benefit from knowledge gained in previous studies in optimization. We establish a parallel between web search and the general problem of function optimization. Recent statistical studies have modeled the web as a graph in which the nodes are web pages and the edges are the links that exist between these pages [5-6]. The search space S of our optimization problem is the set of web pages, and it is structured by a neighborhood relationship between pages given by the links going out of each page.
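This graph view (pages as nodes, outgoing links as the neighborhood) can be sketched with a toy in-memory graph and the simplest possible search strategy, a hill climber. The page names, the adjacency map, and the keyword-count scoring below are illustrative assumptions, not the paper's data; the paper's actual implementation was in C++.

```python
# Illustrative sketch: the web as a graph whose nodes are pages and whose
# neighborhood is the set of outgoing links; a simple hill climber walks it.
# All page names, links and texts below are made-up toy data.

TOY_WEB = {
    "pageA": {"links": ["pageB", "pageC"], "text": "web mining basics"},
    "pageB": {"links": ["pageC"], "text": "genetic algorithm for web mining"},
    "pageC": {"links": ["pageA"], "text": "cooking recipes"},
}

def fitness(page, keywords):
    """Count keyword occurrences in the page text (a stand-in for F)."""
    text = TOY_WEB[page]["text"].lower()
    return sum(text.count(k.lower()) for k in keywords)

def hill_climb(start, keywords, steps=10):
    """From a starting page, repeatedly move to the best-scoring neighbor."""
    current = start
    for _ in range(steps):
        neighbors = TOY_WEB[current]["links"]
        best = max(neighbors, key=lambda p: fitness(p, keywords))
        if fitness(best, keywords) <= fitness(current, keywords):
            break  # no neighbor improves the score: local optimum reached
        current = best
    return current

print(hill_climb("pageC", ["web", "mining", "genetic"]))
```

Starting from the irrelevant "pageC", the climber moves through "pageA" to "pageB", whose text matches all three toy keywords.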

Formally, this neighborhood is a relationship V: S → S^k, where k is the number of links going out of a page. We associate with S an evaluation or fitness function, which numerically evaluates web pages. A search engine tries to output pages which maximize this function, and thus tries to solve this optimization problem. To scan S, optimization algorithms and search engines both make use of the following similar search operators:

Creation operators, which initialize points from S. In optimization, random generation is a common creation operator. In the web context, randomly generating IP addresses has already been studied [1] for other purposes, but it yields a valid web server only with a low probability (about one in several hundred), so this kind of random creation operator does not seem suitable for web search. In optimization, another example of a creation operator is the use of a heuristic that builds a solution from the description of the problem. Many search engines, either based on metasearch or on agents, use such an operator for the web by querying one or more index-based search engines and outputting the obtained links. From the evolutionary computation point of view, this operator would be used for the initial generation of the population.

Operators that modify existing points in the population. Web robots, and more generally web agents [7-8], use such a strategy by exploring the links found in pages. From this point of view, a standard heuristic in optimization such as hill climbing can be directly adapted to the web: starting from a given page, explore its links and select the best one according to the fitness function F in order to define a new starting point.

3. The Proposed GA
1- Get the user request and define the evaluation function F.
2- t ← 0 (iteration number = 0, population size = 0).
3- Initialize P(t).
4- Evaluate P(t) (pages from standard search engines).
5- Generate an offspring page O.
6- t ← t + 1 (new population).
7- Select P(t) from P(t-1).
8- Crossover P(t).
9- Evaluate P(t).
10- Go to 5 while the termination condition (number of iterations) is not met.
11- Sort P(t) (sort the pages given to the user in descending order of their quality values).
12- Stop and give the outputs to the user.

The algorithm combines the concepts described in the previous section with those of a steady-state GA [9]. An individual in the population is a web page that can be numerically evaluated with a fitness function. Initially, the first individuals are mostly generated with a heuristic creation operator which queries standard search engines to obtain pages. Then the individuals can be selected or deleted according to their fitness, and can give birth to offspring through the selection and crossover operators. The crossover steps for parent pages are:

1- Generate a random number (float) r in the interval [0,1] for each page in the population.
2- If r < Pc, select the page for crossover.
3- If the number of selected pages is odd, remove one selected page (this choice is made randomly).

The link-selection steps for a pair of parent pages are:

1- Generate a random number (integer) d ranging from 1 to the total number of links per page, to determine the crossover point.
2- Exchange the links after the crossover point between the two selected parent pages.

From an intuitive point of view, the behavior of this search algorithm can range from a metasearch engine (Pc = 0), which only analyzes and evaluates the results of standard search engines, to a search engine which explores in parallel as many local links as possible (Pc = 1), with selective pressure guiding the search through the links. When Pc = 0, the selection of the GA decides the survival of a page in the population and its number of offspring; it controls the intensity with which pages are explored. (For example, Pc = 0.25 means that we select 25% of the pages for the crossover operator.) As far as we know, other applications of GAs to web-centered problems include [7-8,10-14]. For instance, [7] presented an adaptive search with a population of agents, where agents are selected according to the relevance of the documents returned to the user. Our approach models the problem at a level that is closer to the fitness landscape: the GA does not optimize the parameters of searching agents but rather deals directly with points in the search space.

3.1. Evaluation Function According To The User Request
The fitness function F that evaluates web pages is a mathematical formulation of the user query; numerous evaluation functions are possible. We have used a function F close to the evaluation functions used in standard search engines. First, let us define the following quantities in their simplest forms, for practical considerations.

1) Link quality F(L)

F(L) = Σ_{i=1}^{n} #(K_i)        (1)

where n is the total number of input keywords, #(K_i) is the number of occurrences of keyword K_i in the link, and K_1, K_2, K_3, … are the keywords given by the user.

2) Page quality F(P)

F(P) = Σ_{j=1}^{m} F_j(L)        (2)


where m is the total number of links per page.

3) Mean quality function Mq

Mq = (Fmax(P) + Fmin(P)) / 2        (3)

where Fmax(P) and Fmin(P) are the maximum and minimum values of the page qualities, respectively, after applying the GA. It should be noted that the upper value of Fmax(P) is m*n and the least value of Fmin(P) is zero; hence the upper limit of Mq is (m*n)/2. Application of the GA to web pages will increase the qualities of some pages and decrease those of others.

3.2. Crossover Operator And Other Search Engines
We use a heuristic creation operator that outputs a web page from the results given by four standard search engines (AltaVista, Google, Msn, Yahoo). It consists of querying each search engine with the keywords (K1, K2, …) and extracting the results. The links found are stored in a list sorted in the order given by the engines (1st link of the 1st engine, 1st link of the 2nd engine, …, then 2nd link of the 1st engine, and so on). Each time the creation operator is called, the next link on this list is given as output. When none of these engines can provide further links (the list is empty), the creation operator is no longer used and is replaced by the crossover operator. This creation operator allows the genetic search to start with points of good quality; as will be seen in the results, these heuristically generated individuals (pages) can be greatly improved by the crossover operator. From the selected parent pages, the crossover operator generates offspring by combining the parents (exchanging the links between the two pages after the crossover point). To speed up page evaluation, only mismatched links are considered, and links having maximum quality are transferred directly to the output list before applying the GA.

4. Results And Comments
4.1. Settings Of The Experiment
The proposed GA has been implemented in the C++ language. The program was tested on a PC with a Pentium IV 2.8 GHz processor, 256 MB RAM, and an 80 GB, 7200 rpm hard disk. For each problem, we used 300 pages downloaded from the four standard search engines (Yahoo, Google, AltaVista, Msn); they were stored on the hard disk for further operations. Tabulated results are averaged over 5 runs.

4.2. Results
Table 1 shows the results of ten queries with different keywords at different values of Pc (Pc = 0 means results from the standard search engines without applying the GA; the other values, Pc = 0.25, 0.5, 0.75 and 1, are obtained after applying our algorithm). These results show the population-averaged mean quality for different population sizes after 3000 iterations applied to 100 pages. Referring to Fig. 1 for query No. 1, we note that if the population size is too small (for example 20 pages), the GA decreases the quality of the results, while if the population size is large (for example 100 pages), the binary selection does not concentrate the search on important pages. We can also note that when Pc is large (Pc = 0.75 or 1), the search algorithm spends more time on unsuccessful link exchanges. To improve the results at these values of Pc, the number of iterations should be increased; as a result, the execution time increases rapidly, as shown in Fig. 5. Figures 2, 3, 4 and 5 illustrate these observations.

5. Conclusions
We have shown experimentally the relevance of our approach on the presented queries by comparing the qualities of the output pages with those of the originally downloaded pages. As the number of iterations increases, better results are obtained, still with reasonable execution time. A small population size limits the chances of improving the page qualities, while reducing the execution time at a specified number of iterations. It should be noted that the results depend on the method used to constitute the pages under test. Here, a page consists of links taken sequentially from Yahoo, AltaVista, Google and Msn (one link per search engine each time).
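The round-robin page construction just described (one link per engine in turn) can be sketched as follows. The engine names come from the paper, but the link identifiers and the page size are toy assumptions.

```python
from itertools import zip_longest

# Toy per-engine result lists; real lists would come from querying the engines.
results = {
    "Yahoo":     ["y1", "y2", "y3"],
    "AltaVista": ["a1", "a2"],
    "Google":    ["g1", "g2", "g3"],
    "Msn":       ["m1"],
}

def interleave(engine_results):
    """Merge result lists round-robin: 1st link of each engine, then 2nd, ..."""
    rounds = zip_longest(*engine_results.values())
    return [link for rnd in rounds for link in rnd if link is not None]

def make_pages(links, links_per_page):
    """Cut the merged list into fixed-size 'pages' (individuals) of links."""
    return [links[i:i + links_per_page]
            for i in range(0, len(links), links_per_page)]

merged = interleave(results)
print(merged)                 # round-robin order of the pooled links
print(make_pages(merged, 3))  # the links grouped into page individuals
```

Engines that run out of links early (here Msn and AltaVista) simply drop out of later rounds, matching the behavior of the creation operator when an engine can provide no further links.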

Table 1. Comparative results for Mq (averaged mean quality)

No  Keywords (K1, K2, …)                                            Pc=0   Pc=0.25  Pc=0.5  Pc=0.75  Pc=1
1   Web mining with genetic algorithm                               12.5   17.1     18      17.5     15.9
2   Low pass filters operational amplifier                          14     18.1     19.3    18.7     16.2
3   Network security notes                                          11     15.3     16.1    15.6     13.1
4   Improving search engine results with genetic algorithm          17.1   23.1     25.3    24.1     19.2
5   Egyptian football players                                       11.1   15.1     16.5    15.7     13.5
6   Implementing and supporting Microsoft Windows XP Professional   18.3   24.1     26.3    25.3     21.1
7   Microsoft Internet Security and Acceleration Server 2000 ISA    21     27.6     29.1    28.1     24.1
8   Implementing, managing and maintaining a Microsoft Windows      28     36.3     39.1    34.2     30.1
    Server 2003 network infrastructure
9   Planning, implementing and maintaining a Microsoft Windows      29     38.1     42.3    35.1     32
    Server 2003 Active Directory
10  Oracle internet application developer track                     14     19.1     21.1    20       16.1
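As a quick check on Table 1, the relative improvement of the GA over the standard engines (Pc = 0) can be computed per query. The helper below is illustrative; the two lists are the Pc = 0 and Pc = 0.5 columns transcribed from the table (Pc = 0.5 being the best-performing rate for every query shown).

```python
# Per-query averaged mean quality from Table 1: value at Pc=0 (standard
# engines) and at Pc=0.5 (the best-performing crossover rate in the table).
pc0  = [12.5, 14, 11, 17.1, 11.1, 18.3, 21, 28, 29, 14]
pc05 = [18, 19.3, 16.1, 25.3, 16.5, 26.3, 29.1, 39.1, 42.3, 21.1]

def improvement_pct(base, value):
    """Relative improvement of `value` over `base`, in percent."""
    return 100.0 * (value - base) / base

gains = [improvement_pct(b, v) for b, v in zip(pc0, pc05)]
print([round(g, 1) for g in gains])        # per-query gains in percent
print(round(sum(gains) / len(gains), 1))   # average gain across the ten queries
```

For query 1, for example, the gain is 100 * (18 - 12.5) / 12.5 = 44%, and the gain averaged over the ten queries is roughly 44%.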

Figure 1. Population averaged mean quality for different population sizes (downloaded pages, 0-200) at 3000 iterations; curves for Pc = 0, 0.25, 0.5, 0.75 and 1 (chart omitted).

Figure 2. Population averaged mean quality for different numbers of iterations at 20 pages; curves for Pc = 0.25, 0.5, 0.75 and 1 (chart omitted).


Figure 3. Population averaged mean quality for different numbers of iterations at 120 pages; curves for Pc = 0.25, 0.5, 0.75 and 1 (chart omitted).

Figure 4. Population averaged mean quality for different numbers of iterations at 250 pages; curves for Pc = 0.25, 0.5, 0.75 and 1 (chart omitted).

Figure 5. Variation of time (sec.) with number of iterations using 250 pages; curves for Pc = 0.25, 0.5, 0.75 and 1 (chart omitted).
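Putting the pieces together, the steady-state search of Section 3 (binary tournament selection plus single-point link crossover, with Eqs. 1-2 as fitness) can be sketched as below. The keyword-in-link scoring, the toy link pool, and all parameter values are illustrative assumptions, not the authors' implementation (which was in C++).

```python
import random

KEYWORDS = ["web", "mining"]          # user query (toy)
PC = 0.5                              # crossover probability
LINKS_PER_PAGE = 4                    # m in the paper's notation

def link_quality(link):
    """Eq. (1): number of query keyword occurrences in the link text."""
    return sum(link.lower().count(k) for k in KEYWORDS)

def page_quality(page):
    """Eq. (2): sum of link qualities over the m links of the page."""
    return sum(link_quality(l) for l in page)

def tournament(pop):
    """Binary tournament selection: the better of two random pages wins."""
    a, b = random.sample(pop, 2)
    return a if page_quality(a) >= page_quality(b) else b

def crossover(p1, p2):
    """Single-point crossover: exchange the links after a random position."""
    d = random.randint(1, LINKS_PER_PAGE - 1)
    return p1[:d] + p2[d:], p2[:d] + p1[d:]

def ga_step(pop):
    """One steady-state step: select parents, maybe cross them, keep best."""
    p1, p2 = tournament(pop), tournament(pop)
    if random.random() < PC:
        c1, c2 = crossover(p1, p2)
    else:
        c1, c2 = p1, p2
    pop = pop + [c1, c2]
    pop.sort(key=page_quality, reverse=True)
    return pop[:len(pop) - 2]          # drop the two worst pages

# Toy pool of links (stand-ins for URLs returned by the engines).
pool = ["web-mining-intro", "web-data", "mining-tools", "cooking",
        "sports-news", "web-mining-survey", "data-mining-web", "travel"]
random.seed(1)
population = [random.sample(pool, LINKS_PER_PAGE) for _ in range(6)]

for _ in range(200):
    population = ga_step(population)
print(page_quality(population[0]))     # quality of the best page found
```

Because the population is kept sorted and the two worst pages are discarded each step, the best page found so far is never lost, mirroring the steady-state scheme of [9].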


6. References
[1] Lawrence S. and Giles C.L. Text and image meta-search on the web. International Conference on Parallel and Distributed Processing Techniques and Applications, 1999.
[2] Zamir O. and Etzioni O. Grouper: a dynamic clustering interface to web search results. Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.
[3] Picarougne F., Monmarché N., Oliver A. and Venturini G. Search of information on the Internet by evolutionary algorithm, 2002.
[4] Holland J.H. Adaptation in Natural and Artificial Systems. Ann Arbor: University of Michigan Press, 1975.
[5] Albert R., Jeong H. and Barabasi A.-L. Diameter of the World Wide Web. Nature, 401:130-131, 1999.
[6] Broder A., Kumar R., Maghoul F., Raghavan P., Rajagopalan S., Stata R., Tomkins A. and Wiener J. Graph structure in the Web. Proceedings of the Ninth International World Wide Web Conference, Elsevier, 2000.
[7] Menczer F., Belew R.K. and Willuhn W. Artificial life applied to adaptive information agents. Spring Symposium on Information Gathering from Distributed, Heterogeneous Databases, AAAI Press, 1995.
[8] Moukas A. Amalthaea: information discovery and filtering using a multiagent evolving ecosystem. Applied Artificial Intelligence, 11(5):437-457, 1997.
[9] Whitley D. The GENITOR algorithm and selection pressure: why rank-based allocation of reproductive trials is best. Proceedings of the Third International Conference on Genetic Algorithms, J.D. Schaffer (Ed.), Morgan Kaufmann, pp. 116-124, 1989.
[10] Fan W., Gordon M.D. and Pathak P. Automatic generation of a matching function by genetic programming for effective information retrieval. Proceedings of the 1999 Americas Conference on Information Systems, pp. 49-51.
[11] Monmarché N., Nocent G., Slimane M. and Venturini G. Imagine: a tool for generating HTML style sheets with an interactive genetic algorithm based on genes frequencies. 1999 IEEE International Conference on Systems, Man, and Cybernetics (SMC'99), Interactive Evolutionary Computation session, October 12-15, 1999, Tokyo, Japan.
[12] Morgan J.J. and Kilgour A.C. Personalizing information retrieval using evolutionary modelling. Proceedings of PolyModel: Applications of Artificial Intelligence, ed. by A.O. Moscardini and P. Smith, pp. 142-149, 1996.
[14] Sheth B.D. A learning approach to personalized information filtering. Master's thesis, Department of Electrical Engineering and Computer Science, MIT, 1994.
[15] Vakali A. and Manolopoulos Y. Caching objects from heterogeneous information sources. Technical report TR99-03, Data Engineering Lab, Department of Informatics, Aristotle University, Greece, 1999.

