0 Voturi pozitive0 Voturi negative

16 (de) vizualizări2 paginiJul 26, 2009

© Attribution Non-Commercial (BY-NC)

PDF, TXT sau citiți online pe Scribd

Attribution Non-Commercial (BY-NC)

16 (de) vizualizări

Attribution Non-Commercial (BY-NC)

- Krishna Google Hacking
- GT-2
- Identifying and Filtering Near-Duplicate Documents (2008, 2009)
- Crawl Wave
- A Graph theoretical Approach of Three-Phase Power System using Short Circuit Analysis
- Modelling and Simulation of Search Engine
- IIJIT-2018-07-06-8
- blendedkit-blueprint-week 1
- Basic Minimal Dominating Functions of Euler Totient Cayley Graphs
- Multimedia Answer Generation from Web Information
- 06-CPM
- pxc3896972.pdf
- EN010 301B Engineering Mathematics II(1)
- some_MBA_1
- Core_13
- Eulerian Graph PDF
- Integrating With Google Apps GSA
- Development of Internet Parties
- BAMS
- TSL 5345 M&M Graph Lesson Plan

Sunteți pe pagina 1din 2

Stanford University Yahoo! Inc. Stanford University

Stanford, CA 94305, USA Sunnyvale, CA 94089, USA Stanford, CA 94305, USA

ppapadim@stanford.edu dasdan@yahoo-inc.com hector@cs.stanford.edu

Web graphs are approximate snapshots of the web, created These anomalies refer either to failures of web hosts that do

by search engines. Their creation is an error-prone proce- not allow the crawler to access their content or to hard-

dure that relies on the availability of Internet nodes and ware/software problems in the search engine infrastructure

the faultless operation of multiple software and hardware that can corrupt parts of the crawled data. The detection

units. Checking the validity of a web graph requires a no- of such anomalies is critical since they can have a significant

tion of graph similarity. Web graph similarity helps measure impact on the search results returned to user queries, e.g.,

the amount and significance of changes in consecutive web they can affect ranking.

graphs. These measurements validate how well search en- We address the anomaly detection problem for the web

gines acquire content from the web. In this paper we study graph component through the calculation of similarities or

five similarity schemes: three of them adapted from existing differences between consecutive snapshots. To make this

graph similarity measures and two adapted from well-known possible, we propose different web graph similarity metrics

document and vector similarity methods. We compare and and we check experimentally which of them yield similarity

evaluate all five schemes using a sequence of web graphs for values that differentiate a web graph from its version with

Yahoo! and study if the schemes can identify anomalies that injected anomalies.

may occur due to hardware or other problems.

2. POTENTIAL ANOMALIES

Categories and Subject Descriptors In this section we give some examples of the types of

H.3.3 [Information Search and Retrieval]: Search Pro- anomalies that we are interested in detecting. We can clas-

cess sify these anomalies into the following categories:

Missing Connected Subgraph: A web graph may miss

a connected subgraph of the web because of public infras-

General Terms tructure failures or network problems. For example, a web

Algorithms, Design, Experimentation host that is unavailable at crawl time may cause us to miss

its content, as well as the content reachable from that host.

Keywords In this case the resulting web graph misses a connected sub-

graph that may consist of all the hosts of a specific domain.

anomaly detection, graph similarity, web graph LSH

Missing Vertices: A web graph may also miss a sub-

graph that consists of vertices that are not necessarily con-

1. INTRODUCTION nected. We can have such corrupted web graphs either be-

Web graphs represent the graph structure of the web and cause of failures during crawl time or because of machine

constitute a significant offline component of a search engine. failures in the cluster where we store the web graph. In the

Web graphs are useful in many ways but their main purpose latter case, although we are usually aware of the anomaly

is to compute properties that need a global view of the web. we would like to be able to estimate its extent.

PageRank is one such property, but there may be hundreds Random Topological Changes: The anomalies of this

of other properties that need a global view. These proper- category are results of data corruption or software bugs in

ties are used during crawling, indexing, and ranking of web the crawler or the graph data management code. For exam-

pages. ple, if we use column orientation to store the graph vertex

Search engines crawl the Web on a regular basis and recre- attributes, buggy software may map outlinks to source nodes

ate web graphs according to the data of the latest crawl. incorrectly. Anomalies of this type are usually the most dif-

The generated web graphs consist of tens of billions of ver- ficult to detect.

tices and hundreds of billion of edges. Since the graph In all the anomalies above, the problems get more serious

data is massive, it is spread over multiple machines and if the anomalies affect important vertices that have very high

files. The quality of the web graphs that we obtain from a PageRank (our quality score) or are major authorities or

hubs. If graph similarity is sensitive to the importance of

vertices, we hope to detect the problem if we were unaware

Copyright is held by the author/owner(s).

WWW 2008, April 21–25, 2008, Beijing, China. of it, or we hope to quantify the loss if we were already aware

ACM 978-1-60558-085-2/08/04. of it.

1167

WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China

3. ANOMALY DETECTION features and then applying SimHash [2]. The set of fea-

It is very hard to detect certain problems of a web graph tures for each graph consists of its vertices and edges,

simply by examining a single snapshot or instance. For ex- which are appropriately weighted using properties like

ample, how can one tell that an important part of the web PageRank (used in Signature Similarity).

is missing? Or that IP addresses were not properly grouped

into hosts? 5. EXPERIMENTAL RESULTS

Because of the difficulty of identifying anomalies in a sin- We performed experiments to evaluate our approach to

gle data snapshot, it is more practical to identify anoma- anomaly detection using the similarity metrics that are briefly

lies based on “differences” with previous snapshots. Our presented in Section 4. For our experiments we used real web

approach to detecting anomalies in web graphs can be de- graphs provided the Yahoo! search engine over a month long

scribed as follows. We have a sequence of web graphs, period in 2007. We give here a summary of our experiments

G1 , . . . , Gn built consecutively. We want to quantify the and their results, while a detailed description is available in

changes from one web graph to the next. We do this by the extended version of the paper [3].

computing one or more similarity scores between two con- We found experimentally how similarity varies over time

secutive web graphs, Gi and Gi+1 . Similarity scores, when for the different similarity metrics. This analysis helped us

viewed along the time axis, create a time series. Anomalies determine the thresholds that identify “significant” changes

can be detected by comparing the similarity score of two in a graph and indicate possible anomalies. Then we simu-

graphs against some threshold, or by looking for unusual lated anomalies like the ones presented in Section 2 to cer-

patterns in the time series. tain graphs. For each graph we checked whether the sim-

ulated anomaly affected its similarity score relative to its

4. SIMILARITY COMPUTATION predecessor graphs so that the anomaly can be detected.

The challenge in this approach to anomaly detection is in The experimental analysis on real web graphs showed that

developing similarity metrics that (a) can be computed in a it is feasible to detect anomalies through similarity compu-

reasonable time and in an automated way, and (b) are useful tations among consecutive web graphs. The effectiveness of

for detecting the types of anomalies experienced in practice. the different similarity metrics varies for anomalies of differ-

A metric that is too sensitive or sensitive to differences that ent types. For example, the use of the Vertex/Edge Overlap

do not impact the quality of search results, will yield too allowed us to detect a missing connected graph, but was

many false positives. Similarly, a metric that is not sensitive unsuccessful in detecting arbitrary topological changes in a

enough will yield too many false negatives. There is of course web graph.

also the challenge of selecting the proper thresholds that tell Of the five schemes explored, Signature Similarity was the

us when similarities are “too high” or “too low.” best at detecting what we consider significant anomalies,

In the extended version of this paper [3], we present some while not yielding false positives when the changes were in-

candidate metrics and discuss how they can be implemented significant. Vector Similarity also gave promising results in

efficiently. Here we only give a brief description of the main all the studied cases, while the other three schemes proved

similarity or dissimilarity measures that we use in the defi- to be unsuccessful at detecting anomalies of certain types.

nition of each metric. These measures are: The common feature of the two successful schemes is the

similarity calculation based on appropriately weighted web

1. The edit distance between graphs, which is equal to graph features. The proposed algorithms are independent

the minimum number of edit operations required to from the weighting schema that is used and, hence, we be-

transform one graph to another (used in Vertex/Edge lieve that they can be effective in anomalies that are not

Overlap). studied here.

vectors of the graph adjacency matrices (Vector Simi-

6. CONCLUSIONS

larity). We defined the web graph similarity problem and pro-

posed various metrics to its solution. We have experimen-

3. The Spearman’s ρ applied to calculate the rank corre- tally shown that they work well on real web graphs in de-

lation between sorted lists of vertices of the two graphs. tecting anomalies. Our future work will include research on

Vertices are sorted based on the quality or some other feature selection and weighting schemas.

properties (used in Vertex Ranking).

4. The similarity of vertex sequences [1] of the graphs

7. REFERENCES

that are obtained through a graph serialization algo- [1] A. Broder, S. Glassman, M. Manasse, and G. Zweig.

rithm [3]. Such an algorithm creates a serial ordering Syntactic clustering of the web. In Proc. of Int. World

of graph vertices which is maximally edge connected. Wide Web Conf. (WWW), pages 393–404, Apr 1997.

That means that it maximizes the number of vertices [2] M. Charikar. Similarity estimation techniques from

that are consecutive in the ordering and are edge con- rounding algorithms. In Proc. of Symp. on Theory of

nected in the graph (used in Sequence Similarity). Comput. (STOC), pages 380–388. ACM, May 2002.

[3] P. Papadimitriou, A. Dasdan, and H. Garcia-Molina.

5. The Hamming distance between appropriate finger- Web graph similarity for anomaly detection. Technical

prints of two graphs. To get such fingerprints we apply Report 2008-1, Stanford University, 2008. URL:

a similarity hash function to the compared graphs. We http://dbpubs.stanford.edu/pub/2008-1.

developed an effective hash function with the required

properties by converting a graph to a set of weighted

1168

- Krishna Google HackingÎncărcat deDurga Reddy
- GT-2Încărcat deMahi On D Rockzz
- Identifying and Filtering Near-Duplicate Documents (2008, 2009)Încărcat de.xml
- Crawl WaveÎncărcat desalmansaleem
- A Graph theoretical Approach of Three-Phase Power System using Short Circuit AnalysisÎncărcat deInternational Journal for Scientific Research and Development - IJSRD
- Modelling and Simulation of Search EngineÎncărcat deKaroko Samba
- IIJIT-2018-07-06-8Încărcat deAnonymous vQrJlEN
- blendedkit-blueprint-week 1Încărcat deapi-354498922
- Basic Minimal Dominating Functions of Euler Totient Cayley GraphsÎncărcat deIOSRjournal
- Multimedia Answer Generation from Web InformationÎncărcat deIJIRAE- International Journal of Innovative Research in Advanced Engineering
- 06-CPMÎncărcat deSyed Shehzad
- pxc3896972.pdfÎncărcat dereshma mohan
- EN010 301B Engineering Mathematics II(1)Încărcat deAby Panthalanickal
- some_MBA_1Încărcat deSurya Singh
- Core_13Încărcat deIqbal Ahmed
- Eulerian Graph PDFÎncărcat deLaura
- Integrating With Google Apps GSAÎncărcat dethoradeon
- Development of Internet PartiesÎncărcat de4gen_7
- BAMSÎncărcat deAnkur Kabra
- TSL 5345 M&M Graph Lesson PlanÎncărcat deddougl18
- A Study on Effectiveness of Digital Marketing Services in Domain2host-1[1]Încărcat dechandrasekhar
- 171[1]Încărcat deapi-3808925
- Math(1st)May2018Încărcat deImran
- Black HoleÎncărcat deGajju Chautala
- http---www.sciencedirect.com-scienceÎncărcat delupustar
- j 0121317Încărcat deInternational Organization of Scientific Research (IOSR)
- The Kernighan_Lin AlgorithmÎncărcat derounakanchal
- graph1Încărcat dePrasad Bathe
- Sequential_and_Parallel_Approaches_in_context_to_Graph_Coloring_Problems_A_Survey.pdfÎncărcat deeditorinchiefijcs
- Graph4Scala-JSON-UserGuide.pdfÎncărcat dejp149

- Lecture 04Încărcat demachinelearner
- Base Flood Elevation for Houston County LakeÎncărcat demachinelearner
- Lecture 02Încărcat demachinelearner
- 03Încărcat demachinelearner
- Lecture_04_Image_SearchÎncărcat demachinelearner
- Lecture 05Încărcat demachinelearner
- Lecture 06Încărcat demachinelearner
- Lecture 07Încărcat demachinelearner
- Lecture 08Încărcat demachinelearner
- Lecture 09Încărcat demachinelearner
- Lecture 11(2)Încărcat demachinelearner
- Lecture 12Încărcat demachinelearner
- Lecture 10Încărcat demachinelearner
- Lecture 13Încărcat demachinelearner
- Lecture 18Încărcat demachinelearner
- Lecture 19Încărcat demachinelearner
- lecture-21Încărcat demachinelearner
- Lecture 25Încărcat demachinelearner
- Lecture 28Încărcat demachinelearner
- Lecture 27Încărcat demachinelearner
- Lecture 29Încărcat demachinelearner
- Causal InferenceÎncărcat degschiro
- 07www-Duplicates Main Paper for My ProjectÎncărcat deXain Oddin
- Language IdentificationÎncărcat demriley6736
- wiki_linguisticsÎncărcat dematt5test
- Eﬃcient Passage Ranking for Document DatabasesÎncărcat demachinelearner
- Pruning Policies for Two-Tiered Inverted Index With Correctness GuaranteeÎncărcat demachinelearner
- Three-Level Caching for Efﬁcient Query Processing in Large Web Search EnginesÎncărcat demachinelearner
- The Past, Present and Future of Web Information RetrievalÎncărcat demachinelearner

- Bell Delaware Method English 2 (1)Încărcat dealdair
- Presentation2(1)Încărcat deamanda
- PrintedManual-RGowland-OMV.pdfÎncărcat deAndrei Horhoianu
- How to Repair a Broken Anchor BoltÎncărcat defostbarr
- Fee Guide SpsÎncărcat deEjaz
- Mckinsey s 7s Framework.pdf GudÎncărcat deDeepak Mehrotra
- Inmateh i - 2009Încărcat deIonut Velescu
- Global chronostratigraphical correlation table (v. 2011)Încărcat depeixe80
- v46i07.pdfÎncărcat deIsmael Neu
- CRM AirtelÎncărcat deNiti Modi
- SP-PumpsÎncărcat dePepes Hiuu
- RP AbhishekÎncărcat deMilind_Sharma_3421
- Wave NetÎncărcat deJeroen van Zutphen
- Mechanical Nonlin 13.0 WS 05A PlasticityÎncărcat deShaheen S. Ratnani
- Chromatography FinalÎncărcat deRachelle Victorio
- Logistics Leverage for Sustainable Competitive AdvantagesÎncărcat deragataus
- Bolnica u FirenciÎncărcat deMilaJoksimović
- Alberta at the 2006 Smithsonian Folklife FestivalÎncărcat dechonpipe
- Engineering MechanicsÎncărcat deUgender Singarapu
- Service MarketingÎncărcat deShriHemaRaja
- 1st PUC Maths Model QP 3Încărcat dePrasad C M
- Application of Thermodynamics to Flow ProcessesÎncărcat deAldren RebaLde
- ESP Assignment NOVAÎncărcat deNova Novita Sihombing
- DESIGN AND ANALYSIS OF.pptxÎncărcat deSomnath Bhunia
- Cad SyllabusÎncărcat deSachi Dhanandam
- BC Housing Olympic Tickets Distribution FOIÎncărcat deBob Mackin
- Student Employability Guide for Philosophy SaundersÎncărcat detragedyandcomedy
- staying-relevant-in-the-workplace.pdfÎncărcat depanuwat
- Ganag PresentationÎncărcat dePriscilla Ruiz De Vergara
- balanza cs200 ohausÎncărcat deVictorManuelFelix

## Mult mai mult decât documente.

Descoperiți tot ce are Scribd de oferit, inclusiv cărți și cărți audio de la editori majori.

Anulați oricând.