
Tadpole: A Meta search engine

Evaluation of Meta Search ranking strategies


www.stanford.edu/~pavan/tadpole.html

Mahathi S Mahabhashyam (mmahathi@stanford.edu)
Pavan Singitham (pavan@stanford.edu)

Abstract

In this write-up, we explain the design of Tadpole, a meta-search engine which obtains results from various search engines and aggregates them. We discuss three meta-search ranking strategies – two positional methods and a scaled footrule optimization method – and study the response-time/result-quality trade-offs involved.

1.Introduction

A meta-search engine transmits a user's search simultaneously to several individual search engines and their databases of web pages, and gets results from all the search engines queried. We can thus save a lot of time by initiating the search at a single point and sparing the need to use and learn several separate search engines. This can be even more helpful if we are looking for a broad range of results.

In our project, we have implemented a meta-search engine which queries the Google, Altavista and MSN databases. We have provided an interface for searching these search engines along with several advanced options for phrase search, conjunction, disjunction and negation of the keywords. In order to rank the results obtained, we have made use of three rank aggregation strategies and evaluated the results obtained. Out of these, two are positional methods, which make use of the result's rank in each of the separate search engines to obtain a new rank by simple aggregation. The third is a scaled footrule optimization technique.

2.Motivation

There are primarily two motivating factors behind our developing a meta-search engine. Firstly, the World Wide Web is a huge unstructured corpus of information. Various search engines crawl the WWW from time to time and index the web pages. However, it is virtually impossible for any search engine to have the entire web indexed. Most of the time a search engine can index only a small portion of the vast set of web pages existing on the Internet. Each search engine crawls the web separately and creates its own database of the content. Therefore, searching more than one search engine at a time enables us to cover a larger portion of the World Wide Web.

Secondly, crawling the web is a long process, which can take more than a month, whereas the content of many web pages keeps changing more frequently; it is therefore important to have the latest updated information, which could be present in any of the search engines.

Meta-search engines help us achieve the afore-mentioned objectives. However, we need good ranking strategies in order to aggregate the results obtained from the various search engines. Quite often, web sites successfully spam some of the search engines and obtain an unfair rank. By using appropriate rank aggregation strategies, we can prevent such results from appearing in the top results of a meta-search.

Our primary motivation was to develop a simple meta-search engine and study the response-time and performance trade-offs involved.

3.Previous Work

There are quite a few meta-search engines available on the Internet, which can be categorized as follows:

1. Meta-search engines for serious deep digging. Ex: Surfwax, Copernic Basic
2. Meta-search engines which aggregate the results obtained from various search engines. Ex: Vivisimo, Ixquick

3. Meta-search engines which present results without aggregating them. Ex: Dogpile

Meta-search engines of the first kind are not available as free software, so their benefits are not reaped by most users. Some of the other issues involved and drawbacks of meta-search engines are discussed in [3].

An aggregation of the results obtained would be more useful than simply presenting the raw results without aggregation. For such an aggregation, Ravi Kumar et al. [1] have suggested several rank aggregation methods for the web, broadly categorized as Borda's positional methods, footrule/scaled-footrule optimization methods, and Markov chain methods. They also suggest a local Kemenization technique, which brings the results that are ranked higher by the majority of the search engines to the top of the meta-search ranking list. This is effective in avoiding spam.

4.Organization

The organization of the report is as follows: Section 5 discusses the architecture and design of Tadpole, the meta-search engine developed by us. Section 6 gives a study of the trade-offs involved. In Section 7 we describe a few problems we encountered during the project. Section 8 gives the conclusion and future work.

5.Architecture of Tadpole

When a user issues a search request, multiple threads are created in order to fetch the results from the various search engines. Each of these threads is given a time limit of 3 seconds to return its results, failing which a time-out occurs and the thread is terminated.
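For concreteness, the parallel fetching with a per-engine time limit could look roughly like the sketch below. It is only an illustration of the idea: the query URL formats are placeholders rather than the exact strings Tadpole sends, and an ExecutorService is used here in place of the raw threads mentioned above.

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.concurrent.*;

// Sketch: one fetch task per search engine, each limited to 3 seconds.
public class ParallelFetch {

    // The request is sent via the java.net.URL object and the result page is
    // read back as raw HTML.
    static String fetchPage(String queryUrl) throws IOException {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(queryUrl).openStream(), StandardCharsets.UTF_8))) {
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
            return html.toString();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Illustrative, simplified query formats for the three engines.
        List<String> engineQueries = Arrays.asList(
                "http://www.google.com/search?q=rock+climbing",
                "http://www.altavista.com/web/results?q=rock+climbing",
                "http://search.msn.com/results.aspx?q=rock+climbing");

        ExecutorService pool = Executors.newFixedThreadPool(engineQueries.size());
        List<Future<String>> pages = new ArrayList<>();
        for (String url : engineQueries) {
            pages.add(pool.submit(() -> fetchPage(url)));
        }
        for (Future<String> page : pages) {
            try {
                // Time limit of 3 seconds, failing which the fetch is abandoned.
                String html = page.get(3, TimeUnit.SECONDS);
                System.out.println("fetched " + html.length() + " characters");
            } catch (TimeoutException | ExecutionException e) {
                page.cancel(true);
            }
        }
        pool.shutdownNow();
    }
}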

[Figure 1: Parallel processes query the different search engines and obtain the results; the resulting array of TreeMaps is passed to the ranking algorithm, which produces the aggregated results as a TreeMap sorted on rank.]
Each process converts the given query to the format specific to the search engine it is dealing with. This request is sent to the search engine via the Java URL object, and the results are obtained in the form of an HTML page. This HTML results page is parsed by the process and, for each result, the URL, title, description, rank and search source are stored, creating a Result object. These results are entered into a TreeMap data structure with the URL as the key and the Result object as the value.
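As a rough sketch of the data model just described (field names follow the description above; the HTML parsing itself is omitted), a Result object and the per-engine TreeMap keyed by URL might look like this:

import java.util.TreeMap;

// For each hit, the URL, title, description, rank and originating search
// engine are stored in a Result object.
class Result {
    String url;
    String title;
    String description;
    int rank;             // position within that engine's result list
    String searchSource;  // the engine this result came from

    Result(String url, String title, String description, int rank, String searchSource) {
        this.url = url;
        this.title = title;
        this.description = description;
        this.rank = rank;
        this.searchSource = searchSource;
    }
}

// Each fetching thread fills one TreeMap keyed by URL, so duplicate detection
// and merging during rank aggregation is an O(log n) lookup per result.
class EngineResults {
    final TreeMap<String, Result> byUrl = new TreeMap<>();

    void add(Result r) {
        byUrl.put(r.url, r);
    }
}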
The GUI also provides advanced search options for entering Boolean queries and phrase searches, selecting the number of results per search engine, and selecting the search engines to be queried.

5.1 Design Decisions

During the design of Tadpole, various design decisions were taken. Some of them are listed below:

Why TreeMap?

The TreeMap data structure combines the nice features of a tree (low search and retrieval time) and a map (easy association). By storing the results with the URL as the key, we can retrieve a result in O(log n) time while removing duplicates and merging them in the ranking algorithm. This gives a considerable speed-up when we have hundreds of results from each search engine.

The TreeMaps thus obtained from each of the threads are then inserted into an array and passed on to the ranking algorithm. The ranking algorithm then returns a TreeMap sorted on rank.

Why these three ranking strategies?

The positional methods are computationally more efficient. They give good precision when compared to simply aggregating results without any ranking. The scaled-footrule method is computationally more complex, but has been shown to give much better performance. It is also useful in reducing spam to an extent. As the basic idea of this project was to study the trade-offs involved, we wanted a gradation in the level of computational complexity and performance, and so we chose these three rank aggregation methods.

5.2 Ranking Aggregation Methods Implemented

Take the Best Rank

In this algorithm, we try to place a URL at the best rank it gets in any of the search engine rankings. That is,

MetaRank(x) = Min(Rank1(x), Rank2(x), ..., Rankn(x))

Clashes are avoided by an ordering of the search engines based on popularity. That is, if two results claim the same position in the meta-rank list, the result from a more popular search engine (say, Google) is preferred to the result from a less popular one.
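A minimal sketch of this strategy, reusing the Result and TreeMap sketches above and assuming the per-engine maps are supplied in order of decreasing popularity (so ties fall to the earlier, more popular engine):

import java.util.*;

// Take-the-best-rank aggregation: MetaRank(x) is the minimum rank x obtains
// in any engine; ties between URLs are broken by engine popularity.
class MinRanker {
    static List<Result> aggregate(List<TreeMap<String, Result>> perEngine) {
        Map<String, Result> best = new HashMap<>();
        Map<String, int[]> key = new HashMap<>();   // url -> {bestRank, engineIndex}
        for (int e = 0; e < perEngine.size(); e++) {
            for (Result r : perEngine.get(e).values()) {
                int[] k = key.get(r.url);
                if (k == null || r.rank < k[0]) {   // strictly better rank only
                    key.put(r.url, new int[]{r.rank, e});
                    best.put(r.url, r);
                }
            }
        }
        List<Result> merged = new ArrayList<>(best.values());
        merged.sort(Comparator
                .comparingInt((Result r) -> key.get(r.url)[0])   // MetaRank first
                .thenComparingInt(r -> key.get(r.url)[1]));      // then engine popularity
        return merged;
    }
}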
Borda's Positional Method

In this algorithm, the MetaRank of a URL is obtained by computing the Lp-norm of its ranks in the different search engines:

MetaRank(x) = [Rank1(x)^p + Rank2(x)^p + ... + Rankn(x)^p]^(1/p)

In our algorithm, we have considered the L1-norm, which is the sum of all the ranks in the different search engine result lists. Clashes are again avoided by search engine popularity.

The search source for a URL, which is displayed in the meta-search results, is set to the search engine in which the URL is ranked the best.
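A sketch of the L1 variant follows. One detail is an assumption made here purely for illustration: a URL that does not appear in some engine's list is charged a penalty rank of (list size + 1) in that engine, since the write-up does not spell out how missing entries are treated.

import java.util.*;

// Borda-style positional aggregation with the L1 norm: MetaRank(x) is the sum
// of x's ranks over all engines (lower is better).
class BordaRanker {
    static Map<String, Integer> metaRanks(List<TreeMap<String, Result>> perEngine) {
        Set<String> allUrls = new TreeSet<>();
        for (TreeMap<String, Result> engine : perEngine) {
            allUrls.addAll(engine.keySet());
        }
        Map<String, Integer> score = new HashMap<>();
        for (String url : allUrls) {
            int sum = 0;
            for (TreeMap<String, Result> engine : perEngine) {
                Result r = engine.get(url);
                // Assumed penalty for URLs absent from this engine's list.
                sum += (r != null) ? r.rank : engine.size() + 1;
            }
            score.put(url, sum);
        }
        return score;   // ties are again broken by engine popularity, as above
    }
}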
Scaled Footrule Optimization Method

In this algorithm, scaled footrule distances are used to rank the various results. Let T1, T2, ..., Tk be the partial lists obtained from the various search engines, and let their union be S. A weighted bipartite graph for scaled footrule optimization (C, P, W) is defined as follows:

C = the set of nodes (results) to be ranked
P = the set of positions available
W(c,p) = the scaled footrule distance (from the Ti's) of a ranking that places element c at position p, given by

W(c,p) = Σ(i=1..k) | Ti(c)/|Ti| - p/n |

where n is the number of results to be ranked and |Ti| is the cardinality of Ti.

Computing a footrule-optimal aggregation for partial lists is NP-hard [1]; hence the use of the scaled footrule distance measure. The problem can be converted to minimum-cost perfect matching in the bipartite graph described above. There are various algorithms for finding a minimum-cost perfect matching in bipartite graphs; we have used the Hungarian method.
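From the definition above, the cost matrix handed to the matching step can be built directly. In the sketch below, union is the set S of URLs, ranks holds one map per engine from URL to its (1-based) position in Ti, and lists that do not contain a URL simply contribute nothing to its cost, which is one common way of handling partial lists; the resulting square matrix is then passed to an assignment solver such as the Hungarian method described next.

import java.util.*;

// Builds the scaled-footrule cost matrix W(c, p) = sum over lists Ti that
// contain c of | Ti(c)/|Ti| - p/n |, for positions p = 1..n, n = |S|.
class FootruleCost {
    static double[][] costMatrix(List<String> union, List<Map<String, Integer>> ranks) {
        int n = union.size();
        double[][] w = new double[n][n];
        for (int c = 0; c < n; c++) {
            String url = union.get(c);
            for (int p = 1; p <= n; p++) {
                double cost = 0.0;
                for (Map<String, Integer> ti : ranks) {
                    Integer pos = ti.get(url);
                    if (pos != null) {   // only lists that actually contain c
                        cost += Math.abs((double) pos / ti.size() - (double) p / n);
                    }
                }
                w[c][p - 1] = cost;   // cost of placing url c at position p
            }
        }
        return w;
    }
}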
The Hungarian method proceeds as follows:
- Obtain the reduced cost matrix from the given cost matrix by subtracting the minimum of each row and of each column from all the elements of that row or column.
- Try to cover all the zeroes with the minimum number of horizontal and vertical lines.
- If the number of lines equals the size of the matrix, find the solution.
- If you have covered all of the zeroes with fewer lines than the size of the matrix, find the minimum uncovered value.
- Subtract it from all uncovered values and add it to any value(s) at the intersections of your lines.
- Repeat until a solution is obtained.

A detailed description of the algorithm is provided in [2].
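As a small illustration of the first step only (the covering and adjustment steps are left to the references), the reduction of the cost matrix can be written as:

// Reduce the cost matrix: subtract each row's minimum from that row, then each
// column's minimum from that column, so every row and column has a zero.
class HungarianReduction {
    static void reduce(double[][] cost) {
        int n = cost.length;
        for (int i = 0; i < n; i++) {                 // row reduction
            double min = Double.MAX_VALUE;
            for (int j = 0; j < n; j++) min = Math.min(min, cost[i][j]);
            for (int j = 0; j < n; j++) cost[i][j] -= min;
        }
        for (int j = 0; j < n; j++) {                 // column reduction
            double min = Double.MAX_VALUE;
            for (int i = 0; i < n; i++) min = Math.min(min, cost[i][j]);
            for (int i = 0; i < n; i++) cost[i][j] -= min;
        }
    }
}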
6.Evaluation of Ranking Strategies

6.1 Algorithmic Complexity

The first parameter for testing the three ranking strategies is the time complexity of the algorithms. The positional methods – MinRanker and Borda's positional method – take linear time, i.e., they have a complexity of O(n). Scaled footrule optimization is solved using the Hungarian algorithm for bipartite matching.

6.2 Rank Aggregation Time

The aggregation times of the various ranking strategies were measured with respect to each other and with the normal search engines. The evaluation was carried out with respect to the following set of 38 queries, which were previously used in other studies [1,4,5]:

affirmative action, alcoholism, amusement parks, architecture, bicycling, blues, cheese, citrus groves, classical guitar, computer vision, cruises, Death Valley, field hockey, gardening, graphic design, Gulf war, HIV, java, Lipari, lyme disease, mutual funds, National parks, parallel architecture, Penelope Fitzgerald, recycling cans, rock climbing, San Francisco, Shakespeare, stamp collecting, sushi, table tennis, telecommuting, Thailand tourism, vintage cars, volcano, zen buddhism, and Zener.

The results are summarized below:
[Chart: Rank aggregation time (in milliseconds) per query for Naïve ranking, Borda's ranking and Foot Rule ranking.]

Average Rank Aggregation Times


Naïve Ranking - 18.6 msec
Borda’s Ranking - 51.2 msec
FootRule Ranking - 161.5 msec

We observe that the rank aggregation times for the footrule ranking are, on average, about three times those for Borda's positional ranking.

6.3 Overlap across search engines – Relative Search Engine Performance

Among the top 10 results obtained for each query, we found the results that overlap across multiple search engines. An interesting observation is which search engine ranks these overlapping results better. The intuition behind such a measure is that a search engine which ranks the overlapping results better can be regarded as a better search engine, considering that the overlapping results are more relevant.

[Pie chart: Performance of the search engines (Google, Altavista, MSN) on overlapped results, with slices of 59%, 22% and 19%.]

6.4 Performance of the various rank aggregation methods

In evaluating the performance of the ranking strategies over all the queries, we have chosen precision as the measure of relative performance, because all the ranking strategies work on the same set of results and try to get the most relevant ones to the top. Hence, a strategy that has higher precision at the top can be rated better from the user's perspective.
We have plotted the precision of the ranking strategies with respect to both the number of search results and the recall.

In considering the recall, we have taken the total number of relevant documents based on user evaluation of all the top 10 results retrieved by each search engine. The recall is calculated as the number of relevant documents retrieved divided by the total number of relevant results thus judged.

We have taken relevance feedback from two different judges. The Kappa measure of this relevance feedback is 0.78. In the following graphs, we present the results for two of the 38 queries run, as well as the average of the results obtained over all 38 queries.
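With binary relevance judgments over the pooled top-10 results, the two quantities plotted below can be computed as in the following sketch (relevant is the set of URLs judged relevant for a query; ranked is a strategy's result list):

import java.util.*;

// Precision@k = relevant results in the top k / k.
// Recall@k    = relevant results in the top k / total relevant judged.
class PrecisionRecall {
    static int hitsInTopK(List<String> ranked, Set<String> relevant, int k) {
        int hits = 0;
        for (int i = 0; i < k && i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) hits++;
        }
        return hits;
    }

    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        return (double) hitsInTopK(ranked, relevant, k) / k;
    }

    static double recallAtK(List<String> ranked, Set<String> relevant, int k) {
        return relevant.isEmpty() ? 0.0
                : (double) hitsInTopK(ranked, relevant, k) / relevant.size();
    }
}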

6.4.1 Precision with respect to Number of Results returned

[Charts: Precision versus number of results returned for the Naïve, Borda and Foot rule rankings, for the queries "gardening" and "alcoholism", and the average precision over all 38 queries.]
It can be observed that, on average, the footrule-distance rank aggregation method gives better precision for the given set of results. Also, the easily computable Borda's method does a good job when compared to the Naïve ranking method.

6.4.2 Precision vs. Recall

[Charts: Precision versus recall for the Naïve, Borda's and Foot rule rankers, for the queries "alcoholism" and "gardening".]

A similar observation can be made with respect to the precision at a given recall for each of the
ranking strategies.

7.Problems encountered

During the design of the advanced search interface, we realized that not all the options that normal search engines provide could be made available, because each search engine provides a different set of advanced options. Some of the advanced search options implemented in the different search engines are tabulated below. There are other advanced search options, such as file format and language-specific search, which have not been explored as part of this project.

Another major issue we faced was finding an optimal algorithm for implementing minimum-cost bipartite matching. We chose to implement the Hungarian method, but in retrospect we think other, more efficient algorithms would have been better.
Feature                       Google             MSN    Altavista    Tadpole
Conjunction                   Yes                Yes    Yes          Yes
Disjunction                   Yes                Yes    Yes          Yes
Negation                      Yes                Yes    Yes          Yes
Phrase Search                 Yes                Yes    Yes          Yes
Number of results per page    No (for the API)   No     Yes          No

8.Conclusion and Future Work

In the context of our project, we have studied some of the trade-offs involved in the design of meta-search engines. We have observed that the computational complexity of the ranking algorithms used and the performance of the meta-search engine are conflicting parameters. A compromise must be reached between the two, based on the perceived applications and the environment in which the meta-search engine will be used.

Future work involves incorporating more search engines into the study, studying the performance on the most popular queries published by the various search engines, incorporating local Kemenization to reduce spam, and incorporating methods for avoiding mirrored search results.

Bibliography

[1] Cynthia Dwork, Ravi Kumar, Moni Naor, D. Sivakumar. Rank Aggregation Methods for the Web. In Proceedings of the Tenth World Wide Web Conference, 2001.
[2] Hungarian Method. http://www.math.nus.edu.sg/~matcgh/MA3252/lecture_notes/Hungarian.pdf and http://www.cob.sjsu.edu/anaya_j/HungMeth.htm
[3] http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/MetaSearch.html
[4] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. ACM SIGIR, pages 104-111, 1998.
[5] S. Chakrabarti, B. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Experiments in topic distillation. Proc. ACM SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.
[6] H. P. Young. An axiomatization of Borda's rule. Journal of Economic Theory, 9:43-52, 1974.
