
Proxy Server: integrating client side information and query

Sankararao Majji and Sanasam Ranbir Singh


Department of Computer Science and Engineering
Indian Institute of Technology Guwahati
Guwahati-781039, Assam, India
{sankara,ranbir}@iitg.ernet.in

ABSTRACT

Mining the user's search context promises major improvements in several key aspects of Web search. Usage information such as search click-through has been a valuable source of information for determining what the user wants. However, such server side information is limited to the query and the corresponding clicks. Such query logs often do not contain sufficient information to determine the user's search intent. Therefore, it is important to capture the user's other activities, such as the documents read or URLs opened before submitting a query. In this paper, we propose a framework to capture information about client side activities and share it with the Web search engine effectively through a proxy server. Extensive experiments based on real user activity show that the user's search need is indeed influenced by the user's Desktop and Web activities. From further analysis, we observe that with an overhead of 2.32ms the proxy server can extract and send the additional information to the Web search engine along with the user's search request.

Categories and Subject Descriptors

H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

Keywords

Client side activities, search request, proxy server, information integration
1. INTRODUCTION
Web search engines need information such as the user's search patterns and profiling information, like recent activities on the client's side, to improve the user's search experience. Query search logs and toolbars are effectively exploited by the engines to address this issue. Usage information such as search click-through has been a valuable source of information for determining what the user wants. However, server side information such as the user's click-through data is limited to the query and the corresponding clicks. As pointed out in [5], query logs often do not have sufficient information to recognize query intent. Moreover, the user's search needs are often influenced by the user's other Desktop and Web activities. The simplest way to determine the user's search context is to ask the user for explicit feedback at the time of search. As observed in the studies [2], users are not willing to provide explicit feedback. For search engines, add-on toolbars are a handy and effective resource for capturing information about other client side Web activities. However, toolbars raise certain issues of user concern: (i) Transparency: users are not fully aware of the kind of information that the toolbar collects; (ii) Scope: the collected information is most probably limited to Web activities alone.

Study of the importance of exploring client side information is not new. In a recent study, Guangyu Zhu et al. [8] explore the user's non-search Web browsing information to determine rich session context for improving Web search. In this paper, we propose a client side framework that collects information about the user's Desktop and Web activities implicitly, without any extra overhead to users, through a proxy server. The framework collects information about the user's client side activities (both Desktop and Web) implicitly and can share the aggregated information with the server at negligible execution cost. Such a client side tool is important in several aspects: (i) Transparency; (ii) Scope; (iii) Authority: the user can share the information at his own will; (iv) Flexibility: the user can share the information based on his needs. This is the motivation of this paper.

The rest of the paper is organized as follows. Section 2 discusses current methods of collecting user information. Section 3 presents the proposed framework. Section 4 discusses the definitions of the existing techniques used in this paper. The experimental setup and results are discussed in Section 5. The paper concludes in Section 6.

2. EXISTING METHOD OF COLLECTING USER'S INFORMATION
Presently, commercial Web search engines (WSEs) such as Google (http://toolbar.google.com/) and AOL (http://toolbar.aol.com/) collect information about the user's Web activities through opt-in toolbars. In general, a WSE provides a toolbar that comes with a user privacy protection policy, for instance Google's (www.google.com/toolbar/labsprivacypolicy.html). However, the majority of users are not aware of the privacy policy, or of the information that the toolbars collect. If a user wants to verify the information collected by the WSE, there is no way to check it.

Figure 1: Proposed Dataflow

On the other hand, the proposed framework provides complete transparency to the user. Users are aware of the information sent to the WSE. Most importantly, as this is done through a proxy server, the information sent to the server can be checked by the user. Moreover, the proxy server can share the information with negligible overhead.

3. PROPOSED FRAMEWORK
Almost all organizations give their users access to the Internet through a proxy server. A proxy server receives URL requests from users and in turn submits the requests to the actual target Web server. For a Web search engine, a proxy server therefore has the capacity to add the additional information required by the engine to optimize the search results. Figure 1 shows an abstract level data flow of the proposed framework. Broadly, the proposed framework has three modules, namely the proxy server, the information miner and the client side activities database.

The client side activities database stores the following information about the activities performed by individual users: the list of visited URLs and their contents, the list of non-Web documents read by the user on the Desktop and their contents, the queries requested and the corresponding clicks, and the time at which each activity is performed. We design and implement two customized tools, i.e., a modified proxy server and a time tracker, to gather the above information. We discuss these tools in Sections 3.1 and 3.2.

Figure 2: Design and implementation of the modified Proxy server and Time tracker

The information miner is the module which processes the database and extracts information that may be useful to WSEs for determining what the user wants and customizing the search results accordingly. In this paper, for experimental understanding, we focus on two kinds of information only: (a) the class distribution of the documents visited by the user in recent time and (b) the informative terms representing the documents visited by the user in recent time. However, the framework can be extended to share other detailed information as well. We discuss the procedure in Section 5.3.

The modified proxy server is the module which receives search requests from users and adds (if the user wants so) additional information such as the class distribution and the informative term list. This module interacts with the information miner module to get the required information. The details of the implementation are presented in Section 3.1.
3.1 Modified Proxy Server
The Proxy Server is designed based on the protocols defined in RFC 2616 and implemented using the socket programming concepts of Java. The Proxy Server can be configured to listen on any available port and can be used by different client machines at the same time. In order to use it as a proxy server, one has to change the proxy setting, i.e., the IP and Port of the Web browser, to the IP and Port of the machine where the proxy server is running. The Proxy Server uses persistent TCP connections between the proxy and the host & target, and to speed up the processing of requests it also takes multiple requests at a time and uses the pipelining concept. Later requests are served in the same order in which they are received.

All the inward and outward traffic of the Web browser goes through the proxy server; it stores the URL and referrer information along with the client machine's IP address and port at request time, and the Web page content, encoding type, etc. at response time. As there will be several unnecessary URLs dynamically generated by the requested page, such as ad URLs, the proposed proxy filters them out by checking the response content type. In the current implementation, we consider only text/html/pdf type Web content.
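The proxy implementation itself is not listed in the paper; the following is only a minimal sketch of the request-time bookkeeping described above: it accepts browser connections on a configurable port, serves clients from a thread pool, and records the request line, referrer and client IP/port with a timestamp. Relaying the request to the target server, capturing the response and the text/html/pdf content-type filter are only indicated in comments. The class name, pool size and log format are illustrative and not part of the original implementation.

import java.io.*;
import java.net.*;
import java.util.concurrent.*;

// Minimal sketch of the listening side of such a proxy.
public class MiniProxy {
    private static final ExecutorService POOL = Executors.newFixedThreadPool(16);

    public static void main(String[] args) throws IOException {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 8080;
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                Socket client = server.accept();      // one browser connection
                POOL.submit(() -> handle(client));    // serve clients concurrently
            }
        }
    }

    private static void handle(Socket client) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(client.getInputStream()))) {
            String requestLine = in.readLine();       // e.g. "GET http://host/page HTTP/1.1"
            String referer = null, line;
            while ((line = in.readLine()) != null && !line.isEmpty()) {
                if (line.toLowerCase().startsWith("referer:")) {
                    referer = line.substring(8).trim();
                }
            }
            // Request-time record: timestamp, URL, referrer, client IP and port.
            System.out.printf("%d %s %s %s:%d%n", System.currentTimeMillis(),
                    requestLine, referer, client.getInetAddress().getHostAddress(),
                    client.getPort());
            // A full proxy would now open a socket to the target host, relay the
            // request, and keep the response only if its Content-Type is
            // text/html or application/pdf (the filtering rule described above).
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try { client.close(); } catch (IOException ignored) { }
        }
    }
}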
3.1.1 Why Proxy Server
The user's Web activities can also be tracked from the Web browser's history database. This method has two drawbacks: 1) we found that the history database is updated with a delay of 1-4 minutes, and 2) the visited URLs need to be crawled once again for their Web content, which leads to bandwidth wastage. A customized proxy server overcomes both drawbacks.
3.2 Time Tracker
The Desktop activities time tracker is one of the most crucial modules. It not only captures the documents that the user opens on the Desktop, but also keeps track of each window's active life time. It enables us to monitor how long a user spends reading a particular document. It is built on top of the UNIX utility tool wmctrl. The Time Tracker continuously gets the name of the current active application and the title of its window from wmctrl; based on this information it keeps track of the time the user spends on a particular document. In our implementation we tracked the time the user spends on Web pages and on pdf and word documents, as these are the main sources of information. The model assumes that if a window is active for more than 30 seconds, it has been read by the user.
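As a rough illustration of the polling loop, the sketch below accumulates the active time of each window title and flags a document as read once it has been active for more than 30 seconds. The original tool queries wmctrl; here the active window title is obtained through xdotool purely as a stand-in, since the exact wmctrl invocation is not given in the paper.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

// Polls the window manager once a second for the title of the active window
// and accumulates how long each title stays active.
public class TimeTracker {
    private static final long POLL_MS = 1000;
    private static final long READ_THRESHOLD_MS = 30_000;

    public static void main(String[] args) throws Exception {
        Map<String, Long> activeTime = new HashMap<>();   // window title -> ms active
        while (true) {
            String title = activeWindowTitle();
            if (title != null && !title.isEmpty()) {
                long t = activeTime.merge(title, POLL_MS, Long::sum);
                if (t == READ_THRESHOLD_MS) {             // crossed the 30 s threshold
                    System.out.println("READ: " + title + " at " + System.currentTimeMillis());
                }
            }
            Thread.sleep(POLL_MS);
        }
    }

    // Stand-in for the wmctrl-based lookup used by the original tool.
    private static String activeWindowTitle() throws Exception {
        Process p = new ProcessBuilder("xdotool", "getactivewindow", "getwindowname")
                .redirectErrorStream(true).start();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            return r.readLine();
        }
    }
}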
4. MONITORING USER'S ACTIVITIES
What is the class label of the document that the user wants to visit? What are the terms that can represent the user's interest? Does the user submit a query while reading a document or writing/reading an e-mail or while exploring a social network? Such information can play an important role in determining the user's search goal. This paper focuses on extracting the above information using client side information. Before discussing this in detail, we first present a few definitions that are used in this study.

4.1 Definitions

4.1.1 Document Representation
We use the vector space model to represent documents [4]. In the vector space model, it is assumed that each document $d$ is represented by a term vector of the form $d = \{w_1, w_2, \ldots, w_n\}$, where $w_i$ is a weight associated with the term $f_i$. In the boolean representation, $w_i$ is either 0 or 1, representing the absence or presence of the term $f_i$ in a document. Otherwise, $w_i$ is defined by some weight. We use the boolean representation in this study.
4.1.2 Similarity between two documents
Given two documents $d_i$ and $d_j$, the cosine similarity between the two document vectors is defined as follows:

\[ \mathrm{cosine}(d_i, d_j) = \frac{d_i \cdot d_j}{|d_i|\,|d_j|} \qquad (1) \]

To estimate the correlation between Desktop activities and search activities, we compute the cosine similarity between the non-search documents and the search documents.
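Under the boolean representation, Equation (1) reduces to set operations: the dot product is the size of the term intersection and $|d|$ is the square root of the number of terms. The short sketch below illustrates this; the two term sets are hypothetical.

import java.util.Set;

// Cosine similarity under the boolean vector-space representation used above.
// The vocabulary filtering (top-KLD terms) is assumed to have been applied
// before the sets are built.
public final class Cosine {
    public static double similarity(Set<String> di, Set<String> dj) {
        if (di.isEmpty() || dj.isEmpty()) return 0.0;
        Set<String> smaller = di.size() <= dj.size() ? di : dj;
        Set<String> larger  = smaller == di ? dj : di;
        int common = 0;
        for (String t : smaller) if (larger.contains(t)) common++;   // d_i . d_j
        return common / (Math.sqrt(di.size()) * Math.sqrt(dj.size()));
    }

    public static void main(String[] args) {
        Set<String> desktopDoc = Set.of("jaguar", "engine", "price", "model");
        Set<String> clickedDoc = Set.of("jaguar", "car", "model");
        System.out.println(similarity(desktopDoc, clickedDoc));      // ~0.577
    }
}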

4.1.3 Kullback-Leibler divergence
Given two probability densities $p_i$ and $p_j$, the distance between $p_i$ and $p_j$ can be defined by the Kullback-Leibler divergence as follows:

\[ KLD(p_i \| p_j) = p_i \log\left(\frac{p_i}{p_j}\right) \qquad (2) \]

KLD is a non-symmetric measure of the difference between two probability distributions; it is also known as relative entropy in information theory. KLD between $p_i$ and $p_j$ is zero if $p_i = p_j$, while a large value indicates high cross-entropy (http://en.wikipedia.org/wiki/Cross_entropy) between $p_i$ and $p_j$. Considering a collection of documents, a high KLD between the probability distribution of a term in a local set of documents and the probability distribution of the term in the entire set of documents indicates that the term is relatively frequent in the local set in contrast to the entire collection. KLD is effectively used to determine popular terms in a local set of documents, which is often necessary for local analysis based query expansion [1]. Similarly, we also use KLD to extract popular terms.
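As a sketch of how the per-term KLD contribution of Equation (2) can rank the informative terms of a local document set against a background collection, consider the following; the relative-frequency estimates and the add-one smoothing are our own assumptions, since the paper does not spell out its estimator.

import java.util.*;
import java.util.stream.Collectors;

// Scores each term of a local document set (the documents read in the last
// n minutes) against a background collection and returns the top-k terms.
public final class KldTerms {
    public static List<String> topTerms(Map<String, Integer> localCounts,
                                        Map<String, Integer> globalCounts,
                                        int k) {
        double localTotal = localCounts.values().stream().mapToInt(Integer::intValue).sum();
        double globalTotal = globalCounts.values().stream().mapToInt(Integer::intValue).sum();
        Map<String, Double> score = new HashMap<>();
        for (Map.Entry<String, Integer> e : localCounts.entrySet()) {
            double pLocal = e.getValue() / localTotal;
            // Add-one smoothing so terms unseen in the background collection do not divide by zero.
            double pGlobal = (globalCounts.getOrDefault(e.getKey(), 0) + 1.0)
                    / (globalTotal + globalCounts.size() + 1.0);
            // Per-term contribution of Equation (2): p_local * log(p_local / p_global).
            score.put(e.getKey(), pLocal * Math.log(pLocal / pGlobal));
        }
        return score.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}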
4.1.4 $\chi^2$
The $\chi^2$ (CHI) statistic is defined by the following expression in [7]:

\[ \chi^2(t, c) = \frac{N \times (AD - CB)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)} \]

where $N$ is the number of documents, $A$ is the number of documents of class $c$ containing the term $t$, $B$ is the number of documents of other classes (not $c$) containing $t$, $C$ is the number of documents of class $c$ not containing the term $t$, and $D$ is the number of documents of other classes not containing $t$. It measures the lack of independence between $t$ and $c$ and is comparable to the $\chi^2$ distribution with one degree of freedom. The $\chi^2$ statistic is known to be unreliable for low-frequency terms [3]. The commonly used global goodness estimation functions are the maximum and mean functions, i.e.,

\[ \chi^2(f) = \max_{c_i} \chi^2(f, c_i) \]

or

\[ \chi^2(f) = \sum_i \Pr(c_i)\, \chi^2(f, c_i) \]

In the study [7], $\chi^2$ is found to be a robust feature selection mechanism for text classification. In this study, we also use $\chi^2$ as the feature selection mechanism for the text classifier that assigns class labels to the documents that the user reads.
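The statistic translates directly into code; the sketch below computes $\chi^2(t, c)$ from the four cell counts, with toy numbers in the usage example.

// Chi-square statistic for one (term, class) pair; a, b, c, d are the cell
// counts A, B, C, D defined in the text, and N = a + b + c + d.
public final class ChiSquare {
    public static double chi2(long a, long b, long c, long d) {
        long n = a + b + c + d;
        double diff = (double) a * d - (double) c * b;
        double num = (double) n * diff * diff;
        double den = (double) (a + c) * (b + d) * (a + b) * (c + d);
        return den == 0 ? 0.0 : num / den;
    }

    public static void main(String[] args) {
        // Toy counts: a term vs. one class in a 1,000-document sample.
        System.out.println(chi2(40, 10, 60, 890));
    }
}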
4.1.5 Naive Bayes
In this study, we use the Naive Bayes classifier for its simplicity and scalability for online classification. The design and implementation are discussed in Appendix A.

Assuming the naive condition, i.e., that features are conditionally independent, we define the naive Bayes classifier by

\[ \Pr(c_k \mid d_i) = \frac{\Pr(c_k) \prod_j \Pr(d_{ij} \mid c_k)}{\prod_j \Pr(d_{ij})} \]

where $c_k$ is a class label and $d_i$ is the document vector to be classified.
4.2 What does the proxy server add to the search request
4.2.1 Dominant document class
This component analyses the distribution of the categories of the documents or Web pages visited by the user in the last n minutes. Such category information is important for customizing search results, for instance, returning only the documents belonging to the dominant class.
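A minimal sketch of this dominant-class computation over a sliding time window is given below; the Visit record and the window parameter are illustrative, and the class labels are assumed to come from the Naive Bayes classifier of Section 4.1.5.

import java.util.*;

// Picks the most frequent class label among the documents visited in the
// last n minutes, using (timestamp, label) records from the client-side
// activities database.
public final class DominantClass {
    public record Visit(long timestampMs, String classLabel) { }

    public static Optional<String> dominant(List<Visit> visits, long nowMs, long windowMs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Visit v : visits) {
            if (nowMs - v.timestampMs() <= windowMs) {
                counts.merge(v.classLabel(), 1, Integer::sum);
            }
        }
        return counts.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}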
4.2.2 Informative terms
Assuming that the user has visited k documents in the last n minutes, this component extracts informative terms which can represent the user's search context. KL-divergence is a commonly used mechanism for determining informative terms in query expansion with local analysis [6]. We also use KL-divergence to determine informative terms from the k documents because of its conceptual similarity.

4.2.3 Statistical Analyzer
We can also analyze information such as how long the user spends on different activities, such as e-mail checking, social networking sites, Web search, browsing and reading non-Web documents. We can also analyze the correlation between the user's search activities and the user's other Desktop activities, for instance, the user's Web search requests issued while he is reading a document. However, we have not included this study in this paper.

5. EXPERIMENTAL SETUP
5.1 Dataset
In this section, we discuss the experimental setup. As we set up the experiments and integrate our tools with real user machines, finding volunteers was the biggest challenge. We installed the tools on the machines of thirteen volunteers and monitored their usage patterns for three weeks. For three weeks, we monitored every activity performed by the users on their machines. The volunteers were asked to perform their normal activities while the tools monitored the activities in the background. Table 1 shows the characteristics of the dataset. It has 774 query instances. Users read about 586 desktop documents and about 170 Web documents. The average query length is 4.24 words. The modified proxy server is installed on a centralized machine and the time tracker is installed on the individual machines. For the three weeks, the volunteers were asked to use our modified proxy server so that we could capture every Web activity and download the respective pages.

Table 1: Data Set Statistics
Parameter               Count
No. of Users            13
No. of Days             3 Weeks
No. of Search Queries   744
No. of Desktop Clicks   586
No. of Search Clicks    170
Avg. Words/Query        4.24

Table 2 further shows the average number of documents a user explores before submitting a query, over different time spans. It clearly shows that the number of documents converges at around 15 minutes. Surprisingly, users often open the same documents between queries, or ask different queries while reading a document, which results in repeated opening of the same document.

Table 2: Desktop & Search Doc Click Count
Time Span    #Desktop Clicks   #Search Clicks
5 Minutes    4.35              1.34
10 Minutes   8                 1.36
15 Minutes   10.52             1.44
20 Minutes   11.03             1.43
25 Minutes   12.45             1.41
30 Minutes   13.17             1.47
35 Minutes   13.24             1.45
40 Minutes   13.93             1.45
45 Minutes   14.38             1.45
50 Minutes   14.24             1.43
55 Minutes   13.72             1.53
60 Minutes   13.81             1.53
5.2 Web search and other activities
The idea of integrating client side information through a proxy server was motivated by our first hand observations reported in this section, i.e., is the client's search request influenced by the client's Desktop activities performed before submitting the query? To address this question, we first investigate the correlation between the non-search documents that the user read on the Desktop before submitting a query and the documents that the user clicked through the search engine. In the study [8], the importance of client side information has been noted. In our experiment, it is observed that many query instances are indeed influenced by the documents read by the users before submitting a query. We use cosine similarity and the categories of the documents that the user read as two different measures of the correlation between the user's non-search activities and search activities.

We use the boolean VSM (see Section 4.1.1) to represent the documents and the top KLD terms to build the vocabulary set. Figure 3 shows the average cosine similarity between the search documents and the documents read within a time span of 30 minutes. We observe that around 65.71% of the total query instances have non-zero cosine similarity. This clearly indicates that the number of query instances influenced by pre-search activities is significant. In Figure 4, we further investigate the distribution of non-zero cosine similarity over different time spans. It is observed that as the span increases, the average similarity increases and then stabilizes. This clearly indicates that a short span contains less information and a long span saturates.

Figure 3: Distribution of non-zero cosine similarity between client desktop activities and Web search activities

Figure 4: Average cosine similarity at different time instances

Unlike cosine similarity, the category distribution (refer to Figure 5) shows very high correlation. In almost 78.56% of the query instances, users explore documents of the same class label. This is very strong evidence that in many instances the user's search queries are influenced by the documents read by the user before submitting a query.

Figure 5: Percentage of matching dominant class over different time spans

5.3 Byte Overhead Analysis
It is up to the WSEs how they make the shared information useful. The proxy server can share information such as (a) the list of URLs that the user visited in the last t minutes, (b) the list of dominant class labels of the documents that the user read in the last t minutes, (c) the list of popular terms, (d) whether the user is checking mail or exploring a social network, and (e) whether the user is reading a document at the time of submitting the query, etc. In this paper, we investigate the effect of a simple method to integrate the above information, i.e., adding the information to the query request. Suppose a request to Google has the following structure: http://www.google.com/search?hl=en&q=jaguar. The proxy server can add the additional information to the query, i.e., dc=entertainment, pt=auto,car,model,..., lurl=url1,url2,.. and act=doc, where dc stands for the dominant class, pt for the popular terms, lurl for the list of URLs and act for the activity, such as reading a document at the time of search.

With the additional information, the resultant URL is now longer. Each request requires additional bandwidth to carry the extra information. In this section, we study the average number of bytes required by such additional information. Table 3 (3rd column to 5th column) shows the statistics of the byte overhead compared to the request length over the Google search engine.

To reduce the byte overhead, we further compress the data using the Java Deflater class (http://download.oracle.com/javase/1.4.2/docs/api/java/util/zip/Deflater.html). Table 3 (6th column to 7th column) shows the improvement after compression. On average, the request length is 0.1093KB. From our proxy log, we have investigated the average URL length of all the URL requests received by the proxy server; on average it is recorded as 0.1085KB. Compared to this, the resulting request length after adding the extra information is still comparable to the average URL length. Thus the average byte overhead is negligible.
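A sketch of the augmentation and compression steps is given below. The parameter names dc, pt, lurl and act follow the example above; the URL-encoding of the values and the Base64 wrapping of the Deflater output are our own assumptions, since the paper does not say how the compressed bytes are carried in the request.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.Deflater;

// Appends the extra parameters to the original search request and shrinks the
// combined payload with java.util.zip.Deflater.
public final class RequestAugmenter {
    public static String augment(String originalUrl, String dominantClass,
                                 String popularTerms, String recentUrls, String activity) {
        return originalUrl
                + "&dc=" + enc(dominantClass)
                + "&pt=" + enc(popularTerms)
                + "&lurl=" + enc(recentUrls)
                + "&act=" + enc(activity);
    }

    public static byte[] compress(String payload) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(payload.getBytes(StandardCharsets.UTF_8));
        deflater.finish();
        byte[] buf = new byte[payload.length() + 64];
        int len = deflater.deflate(buf);
        deflater.end();
        byte[] out = new byte[len];
        System.arraycopy(buf, 0, out, 0, len);
        return out;
    }

    public static void main(String[] args) {
        String url = augment("http://www.google.com/search?hl=en&q=jaguar",
                "entertainment", "auto,car,model", "url1,url2", "doc");
        System.out.println(url);
        System.out.println("compressed: "
                + Base64.getUrlEncoder().encodeToString(compress(url)));
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }
}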
5.4 Execution Overhead
What is the delay incurred by the proxy server in the process of adding the extra information to the search request? The delay is contributed by the following factors: (a) time to get the dominant class label, (b) time to get the URLs, (c) time to get the popular terms, (d) time to get the user's previous active window, and (e) compression time. We perform the above operations using background threads; as a result, the overhead is just the collection time. Table 4 shows the execution overhead incurred by the proxy server for adding the additional information. On average, it requires 2.23 ms to get the additional information and compress it. Considering the average latency of sending a request and receiving the response from the Web server (the average latency of our experimental proxy server is about 3.4 seconds), the average execution overhead is only about 0.07% of the proxy latency, which we consider an acceptable overhead.
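The background-thread arrangement can be sketched as follows: a scheduled task keeps a snapshot of the additional information fresh, so that the request path only reads the latest snapshot. The snapshot fields and the refresh period below are illustrative assumptions.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// A background task refreshes the snapshot periodically, so the per-query work
// of the proxy reduces to reading the current snapshot (the "collection time"
// measured in Table 4).
public class BackgroundCollector {
    public record Snapshot(String dominantClass, String popularTerms,
                           String recentUrls, String activity) { }

    private final AtomicReference<Snapshot> latest =
            new AtomicReference<>(new Snapshot("", "", "", ""));
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start() {
        scheduler.scheduleAtFixedRate(this::refresh, 0, 10, TimeUnit.SECONDS);
    }

    private void refresh() {
        // In the real system these values come from the information miner and
        // the time tracker; constants are used here to keep the sketch short.
        latest.set(new Snapshot("entertainment", "auto,car,model", "url1,url2", "doc"));
    }

    // Called on the request path: no mining happens here, just a read.
    public Snapshot current() {
        return latest.get();
    }
}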
Table 4: Time Overhead
      Time to get data
Min   2ms
Avg   2.23ms
Max   10ms

Table 5: Proxy Overhead (in ms)
Delay (ms)   % of Queries
=0ms         3.8%
≤1ms         35.38%
≤2ms         77.69%
≤3ms         83.84%
≤4ms         88.46%
≤5ms         95.38%
≤6ms         96.15%
≤7ms         97.69%
≤8ms         99.23%
≤9ms         99.23%
≤10ms        100%

Table 5 further shows the cumulative distribution of the number of queries against the delay time.

6. CONCLUSIONS
In this paper, we proposed a framework to integrate information about the client's activities along with the query request through a proxy server. We further investigated the correlation between the user's non-search activities and Web search activities. It is observed that the user's search requests are often influenced by the user's activities performed before submitting the query. We also verify that the overhead incurred by the proxy, in both bandwidth and execution time, is negligible.
Table 3: Effective byte overhead due to additional information on the URL

Time Span        Original   Additional   Overhead %   Compression   Comp. Ratio   Effective Overhead (%)
5 Minutes   Min       5         97        19.40%          91           6.18%        18.2%
            Avg   19.79     404.72        20.45%      219.20          45.83%        11.1%
            Max      51        995        19.50%         576          42.11%        11.3%
10 Minutes  Min       5         97        19.40%          91           6.18%        18.2%
            Avg   22.39     686.91        30.67%         293          57.34%        13.1%
            Max      61       2422        39.70%         647          3970%         10.6%
15 Minutes  Min       5         93        18.60%          83          10.75%         6.6%
            Avg   23.67     891.85        37.67%      344.59          61.36%        11.6%
            Max      61       2606        42.72%         683          73.79%        11.2%
20 Minutes  Min       4         93        23.25%          88          2325%         22%
            Avg   23.88      924.9        38.73%      360.49          3873%         15.1%
            Max      61       2606        42.72%         870          4272%         14.3%
25 Minutes  Min       4         93        23.25%          88           5.37%        22%
            Avg   23.62    1005.98        42.59%      376.21          62.6%         15.9%
            Max      61       3127        51.26%         996          68.14%        16.3%
30 Minutes  Min       4         93        23.25%          88           5.37%        22%
            Avg   23.27    1043.57        44.84%      379.87          63.76%        16.3%
            Max      61       3442        56.42%        1093          68.24%        17.9%
35 Minutes  Min       4         48        12.00%          58          17.24%        14.5%
            Avg   22.76    1040.68        45.72%      378.15          63.66%        16.6%
            Max      61       3683        60.37%        1163          72.42%        19.1%
40 Minutes  Min       4         48        12.00%          58          17.24%        14.5%
            Avg   22.76    1089.89        47.88%      394.22          63.82%        17.3%
            Max      61       4217        69.13%        1163          72.42%        19.1%
45 Minutes  Min       4         48        12.00%          58          17.24%        14.5%
            Avg   22.81    1147.32        50.29%      404.58          64.73%        17.3%
            Max      61       5351        87.72%        1226          77.08%        20.1%
50 Minutes  Min       4         48        12.00%          58          17.24%        14.5%
            Avg   22.98    1131.55        49.24%      401.15          64.54%        17.45%
            Max      61       5533        90.70%        1290          76.68%        21.14%
55 Minutes  Min       4         48        12.00%          58          17.24%        14.5%
            Avg   22.61    1085.86        48.02%      386.94          64.36%        17.1%
            Max      61       5667        92.90%        1305          76.97%        21.4%
60 Minutes  Min       4         48        12.50%          58          14.50%        14.5%
            Avg   22.61    1090.23        48.21%      388.44          64.37%        17.2%
            Max      61       5667        92.90%        1305          76.97%        21.4%

7. REFERENCES
[1] C. Carpineto, R. de Mori, G. Romano, and B. Bigi. An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems, 19(1):1–27, 2001.
[2] J. M. Carroll and M. B. Rosson. Paradox of the active user. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, pages 80–111, 1987.
[3] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.
[4] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.
[5] C. Grimes, D. Tang, and D. M. Russell. Query logs alone are not enough. In Proceedings of the Workshop on Query Log Analysis, 2007.
[6] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In SIGIR '96: Proceedings of the Nineteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 4–11, 1996.
[7] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, ICML '97, pages 412–420, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[8] G. Zhu and G. Mishne. Mining rich session context to improve web search. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '09, pages 1037–1046, 2009.

APPENDIX
A. CLASSIFIER
We design and implement a Naive Bayes text classifier as defined in Section 4.1.5. We build the classifier with the chi-square feature selection mechanism using the DMOZ dataset (http://www.dmoz.org/). We use NB for its efficiency and scalability, and chi-square is used because it is one of the most effective feature selection mechanisms for text classification [7]. In our estimation, we assume that all classes are equally likely; otherwise $\Pr(c_k \mid d_i)$ is often biased toward the class with more examples, i.e., when $\Pr(c_k) \gg \prod_j \Pr(d_{ij} \mid c_k)$. As the denominator is independent of the class, we ignore it.
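A sketch of the resulting decision rule with equal priors, working in log space over the chi-square-selected features, is given below; the per-class term probabilities are assumed to have been estimated offline from the DMOZ training data with smoothing.

import java.util.Map;
import java.util.Set;

// Equal-prior Naive Bayes decision rule: with Pr(c_k) uniform and the
// class-independent denominator dropped, classification reduces to summing
// log-likelihoods of the selected features present in the document.
public final class NaiveBayesSketch {
    // classLabel -> (term -> Pr(term | class)), restricted to selected terms.
    private final Map<String, Map<String, Double>> termGivenClass;
    private final double unseenProb;   // smoothed probability for terms absent from a class

    public NaiveBayesSketch(Map<String, Map<String, Double>> termGivenClass, double unseenProb) {
        this.termGivenClass = termGivenClass;
        this.unseenProb = unseenProb;
    }

    public String classify(Set<String> documentTerms) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Double>> cls : termGivenClass.entrySet()) {
            double logLik = 0.0;                   // equal priors: no log Pr(c_k) term
            for (String t : documentTerms) {
                logLik += Math.log(cls.getValue().getOrDefault(t, unseenProb));
            }
            if (logLik > bestScore) {
                bestScore = logLik;
                best = cls.getKey();
            }
        }
        return best;
    }
}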
