WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China
[Figure: panels Polar Transformation, Subspace Estimation, Subspace Pruning, Cluster Generation; axes labelled r: time and semantics.]
$$W(x_i) = \cdots \qquad (1)$$

where $s(NN_{x_i})$ is the variance of its neighbors along the subspace direction and $n(NN_{x_i})$ is the variance of its neighbors along the orthogonal direction of the subspace. When the data point $x_i$ lies in a cluster where data points spread along the direction of the subspace, both the ratio formed from $s(NN_{x_i})$ and $n(NN_{x_i})$ and the sum $s(NN_{x_i}) + n(NN_{x_i})$ are small. Hence, the weight $W(x_i)$ is close to 1. Otherwise, the ratio and/or the sum are large, which results in a small $W(x_i)$.

3. Subspace pruning. Not every subspace is interesting in the sense that it contains clusters corresponding to real events. Hence, we prune uninteresting subspaces in this step. Based on our polar transformation schemes, the temporal “burst” and the semantic “burst” of query sessions should be reflected by the certainty of the distribution of data points along the subspace direction and the orthogonal direction of the subspace, respectively. In order to measure the certainty of the distribution of data points along the two directions, we project the data points onto the two directions respectively and calculate the respective histograms of the distributions. Let $h_1, h_2, \cdots, h_m$ and $v_1, v_2, \cdots, v_n$, where $h_i$ and $v_i$ are individual bins, be the two corresponding histograms. We employ the entropy measure to define the interestingness of a subspace $s_i$ as follows.

$$I(s_i) = 1 - \Big[ -p \sum_{i=1}^{m} h_i \log h_i - (1 - p) \sum_{i=1}^{n} v_i \log v_i \Big] \qquad (2)$$

where $p \in [0, 1]$ is a weight which adjusts the importance of the entropy values in the two directions. The interestingness measure takes values from 0 to 1. The more certain the distributions in the two directions, the smaller the entropies in the brackets of equation (2) and the greater the value of interestingness. Given some threshold $\zeta$, subspace $s_i$ will be pruned if $I(s_i) < \zeta$.

4. Cluster generation. After pruning uninteresting subspaces, events can be detected from the remaining subspaces by clustering. In particular, we detect various events from interesting subspaces by employing a non-parametric clustering method called Mean Shift [2].

3. EXPERIMENTS & CONCLUSIONS
We conduct experiments on the real-life Web click-through data collected by AOL [5] from March 2006 through May 2006. We manually labelled a set of events from the data set. After filtering events which are represented by fewer than 50 query sessions, a

[Figure 2: Entropy comparison between algorithms. x-axis: No. of Query Sessions (5K, 10K, 20K, 50K, 100K); y-axis: Entropy.]

One of our performance evaluations, using the entropy measure, is shown in Figure 2. For each generated cluster $i$, we compute $p_{ij}$ as the fraction of query sessions (or query-page pairs for the existing approach [7]) representing the true event $j$. Then, the entropy of cluster $i$ is $E_i = -\sum_j p_{ij} \log p_{ij}$. The total entropy can be calculated as the sum of the entropies of each cluster weighted by the size of each cluster: $E = \sum_{i=1}^{m} \frac{n_i \times E_i}{n}$, where $m$ is the number of clusters, $n$ is the total number of query sessions (query-page pairs) and $n_i$ is the size of cluster $i$. As shown by the figure, our approach (denoted as DECK in the figure) works better than the existing approach (denoted as 2PClustering in the figure). The figure also reveals that our approach outperforms two of its alternative versions: DECK-GPCA (which does not improve the robustness of GPCA) and DECK-NP (which does not prune uninteresting subspaces).

In summary, we proposed a novel approach for detecting events from Web click-through data. Our approach, based on robust subspace analysis, considers the temporal feature and the semantic feature of query sessions simultaneously. Experiments on real-life Web click-through data [5] showed the effectiveness of the proposed approach.

4. REFERENCES
[1] Technical report. In http://www.l3s.de/~lchen/TR/deck.pdf.
[2] D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. In IEEE TPAMI, volume 24, 2002.
[3] W.-S. Li, K. S. Candan, Q. Vu, and D. Agrawal. Retrieving and organizing Web pages by “information unit”. In WWW, 2001.
[4] Z. Li, B. Wang, M. Li, and W.-Y. Ma. A probabilistic model for retrospective news event detection. In SIGIR, 2005.
[5] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In The First International Conference on Scalable Information Systems, 2006.
[6] R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis. In IEEE CVPR, 2003.
[7] Q. Zhao, T.-Y. Liu, S. S. Bhowmick, and W.-Y. Ma. Event detection from evolution of click-through data. In KDD, 2006.
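As a supplementary illustration, the interestingness measure of equation (2) and the pruning rule based on the threshold ζ can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes both histograms are normalized to sum to 1, and that each entropy is divided by the logarithm of its bin count so that the measure stays between 0 and 1 (the text states the range but not the normalization). The function names, the bin counts and the toy histograms are hypothetical.

```python
import math

def histogram(values, bins, lo, hi):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        k = min(int((v - lo) / width), bins - 1)  # clamp v == hi into the last bin
        counts[k] += 1
    return [c / len(values) for c in counts]

def normalized_entropy(hist):
    """Entropy of a normalized histogram, scaled to [0, 1] (assumed normalization)."""
    if len(hist) < 2:
        return 0.0
    h = -sum(q * math.log(q) for q in hist if q > 0)
    return h / math.log(len(hist))

def interestingness(h, v, p=0.5):
    """I = 1 - [p * H(h) + (1 - p) * H(v)], an expanded form of equation (2)."""
    return 1.0 - (p * normalized_entropy(h) + (1.0 - p) * normalized_entropy(v))

def prune(subspaces, zeta, p=0.5):
    """Keep the names of subspaces whose interestingness is at least zeta."""
    return [name for name, (h, v) in subspaces.items()
            if interestingness(h, v, p) >= zeta]

# Hypothetical histograms: one subspace with all mass in a single bin along
# both directions, one with mass spread uniformly over all bins.
bursty = ([0.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
flat = ([0.25] * 4, [0.25] * 4)
kept = prune({"bursty": bursty, "flat": flat}, zeta=0.5)
```

With these toy inputs the concentrated subspace is kept and the uniform one is pruned, which matches the intuition that more certain distributions give higher interestingness.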
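The entropy evaluation reported in Figure 2 can likewise be illustrated with a short sketch. It assumes each generated cluster is given as a list of ground-truth event labels, one per query session, and it uses the natural logarithm because the text does not state the base. The function names and the toy labels are hypothetical and do not come from the AOL data.

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """E_i = -sum_j p_ij * log(p_ij), with p_ij the share of event j in cluster i."""
    n_i = len(labels)
    return -sum((c / n_i) * math.log(c / n_i) for c in Counter(labels).values())

def total_entropy(clusters):
    """E = sum_i (n_i * E_i) / n, the size-weighted sum of cluster entropies."""
    n = sum(len(labels) for labels in clusters)
    return sum(len(labels) * cluster_entropy(labels) for labels in clusters) / n

# Hypothetical clusterings of six query sessions from two events, "a" and "b".
pure = [["a", "a", "a", "a"], ["b", "b"]]
mixed = [["a", "b", "a", "b"], ["b", "b"]]
pure_score = total_entropy(pure)
mixed_score = total_entropy(mixed)
```

A clustering whose clusters each contain a single event scores zero, while mixing events inside a cluster raises the total, so lower values indicate clusters that agree better with the labelled events.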