Outline
Hadoop Basics
Case Study
  Word Count
  Pairwise Similarity
  PageRank
  K-Means Clustering
  Matrix Factorization
  Cluster Coefficient
Introduction to Hadoop
Hadoop Map/Reduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
[Figure: Hadoop architecture. A Client talks to the JobTracker and the NameNode; each of the three slave nodes runs a TaskTracker and a DataNode.]
Hadoop HDFS
[Figure: the input is partitioned across workers w1, w2 and w3; their partial results r1, r2 and r3 are combined into the final result.]
Word Count
[Figure: word count example with a Map phase and a Reduce phase over terms such as "two", "fish", "cat", "blue" and "hat". From Jimmy Lin's slides.]
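The Map and Reduce phases of word count can be sketched in plain Python. This is a minimal single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the toy corpus are assumptions made for the example, and `shuffle` only imitates the grouping that the framework performs between the two phases.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Mapper: emit (word, 1) for every token in one document."""
    for word in text.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reducer: sum the counts collected for one word."""
    return word, sum(counts)

def word_count(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())

# Hypothetical toy corpus, chosen only for illustration.
docs = {
    "d1": "one fish two fish",
    "d2": "red fish blue fish",
    "d3": "cat in the hat",
}
counts = word_count(docs)
```

Each mapper sees only its own document, so the map calls are independent and could run on different nodes; only the grouping step needs to move data between them.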
Pairwise Similarity
Load each vector O(N) times.
Load each term O(df_t^2) times.
Goal
Better Solution
Each term contributes only if it appears in both documents.
Load weights for each term once.
Each term contributes O(df_t^2) partial scores.
Decomposition
Each term contributes only if it appears in both documents.
map: load weights for each term once; each term contributes O(df_t^2) partial scores.
reduce: sum the partial scores of each document pair.
From Jimmy Lin's slides
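The decomposition can be sketched in plain Python as follows. This is an illustration under assumptions, not the slides' implementation: similarity is taken to be the dot product of term-weight vectors, the weights in `weights` are made up, and the function names are hypothetical. The mapper walks each posting list once and emits one partial score per document pair, which is where the O(df_t^2) partial scores per term come from.

```python
from collections import defaultdict
from itertools import combinations

def build_postings(weights):
    """Invert {doc: {term: weight}} into {term: [(doc, weight), ...]}."""
    postings = defaultdict(list)
    for doc, vec in weights.items():
        for term, w in vec.items():
            postings[term].append((doc, w))
    return postings

def map_pairs(postings):
    """Mapper: for each term, emit one partial score per pair of
    documents in its posting list."""
    for term, plist in postings.items():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2

def reduce_sum(partials):
    """Reducer: sum the partial scores of each document pair."""
    sims = defaultdict(float)
    for pair, score in partials:
        sims[pair] += score
    return dict(sims)

# Hypothetical term weights, for illustration only.
weights = {
    "d1": {"clinton": 1, "obama": 2},
    "d2": {"clinton": 1, "cheney": 1},
    "d3": {"clinton": 1, "barack": 1, "obama": 1},
}
sims = reduce_sum(map_pairs(build_postings(weights)))
```

Terms that occur in a single document ("cheney", "barack" here) generate no pairs at all, so they never touch the reducer.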
Standard Indexing
[Figure: (a) Map: each doc is tokenized and the output is combined locally. (b) Shuffle: values are grouped by term. (c) Reduce: one posting list is produced per term.]
[Figure: indexing example with a Map phase and a Reduce phase producing posting lists for terms such as "two", "fish", "cat", "blue" and "hat". From Jimmy Lin's slides.]
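A small Python sketch of the three indexing steps (tokenize and combine in the mapper, group by term, build posting lists in the reducer). The corpus, the function names and the `(doc_id, term frequency)` posting format are assumptions made for this example.

```python
from collections import Counter, defaultdict

def map_doc(doc_id, text):
    """Mapper: tokenize one document; the Counter acts as the local
    combine step, so each term is emitted once with its frequency."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_doc(doc_id, text):
            index[term].append(posting)  # shuffle: group values by term
    # Reducer: sort each posting list by document id.
    return {term: sorted(plist) for term, plist in index.items()}

# Hypothetical toy corpus, for illustration only.
docs = {1: "one fish two fish", 2: "red fish blue fish", 3: "cat in the hat"}
index = build_index(docs)
```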
Indexing
[Figure: posting lists for example terms including Cheney, Barack and Obama.]
Pairwise Similarity
[Figure: (a) Generate pairs from the posting lists of the terms Clinton, Cheney, Barack and Obama. From Jimmy Lin's slides.]
PageRank
PageRank: an information propagation model
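[Figure: PageRank in MapReduce. Map: each node emits contributions keyed by its out-neighbors. Reduce: contributions are grouped by destination node n1 to n5. Adjacency lists of the example graph: n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3].]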
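One PageRank iteration over the example graph can be sketched in Python as a map step that spreads each node's rank over its out-links and a reduce step that sums what each node receives. The damping factor of 0.85, the uniform starting ranks and the function names are assumptions for this sketch, and it relies on every node having at least one out-link, which holds for this graph.

```python
from collections import defaultdict

def map_phase(graph, ranks):
    """Mapper: each node sends an equal share of its rank to every out-neighbor."""
    for node, out_links in graph.items():
        share = ranks[node] / len(out_links)
        for dest in out_links:
            yield dest, share

def reduce_phase(graph, contributions, damping):
    """Reducer: sum the shares received by each node and apply damping."""
    received = defaultdict(float)
    for dest, share in contributions:
        received[dest] += share
    n = len(graph)
    return {node: (1 - damping) / n + damping * received[node] for node in graph}

def pagerank_iteration(graph, ranks, damping=0.85):
    return reduce_phase(graph, map_phase(graph, ranks), damping)

# Adjacency lists of the example graph.
graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}
uniform = {node: 1 / len(graph) for node in graph}

ranks = dict(uniform)
for _ in range(20):  # fixed number of iterations, chosen arbitrarily
    ranks = pagerank_iteration(graph, ranks)
```

Each iteration corresponds to one MapReduce job, so the graph structure has to be passed along from one job to the next together with the updated ranks.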
K-Means Clustering
[Figure: K-means in MapReduce. Mappers (Mapper_i, Mapper_i+1) assign points to clusters 1 to 4; reducers (Reducer_i-1, Reducer_i, Reducer_i+1) each handle one cluster.]
Each mapper needs to keep a copy of the centroids.
The choice of initial centroids is very important! Usually we set them using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
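A single K-means iteration in this style can be sketched in Python: the map step assigns each point to its nearest centroid, and the reduce step averages the points of each cluster. The points and starting centroids are invented for the example, the starting centroids are simply given by hand rather than produced by Canopy Clustering, and keeping an empty cluster's centroid unchanged is a choice made for this sketch.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )

def kmeans_iteration(points, centroids):
    # Map: every mapper holds a copy of the centroids and emits (cluster, point).
    assigned = defaultdict(list)
    for point in points:
        assigned[nearest(point, centroids)].append(point)
    # Reduce: one reducer per cluster averages its points into a new centroid.
    new_centroids = []
    for c, old in enumerate(centroids):
        members = assigned.get(c)
        if not members:
            new_centroids.append(old)  # empty cluster: keep the old centroid
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

# Hypothetical data and hand-picked starting centroids.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]

for _ in range(10):  # iteration cap, chosen arbitrarily
    updated = kmeans_iteration(points, centroids)
    if updated == centroids:
        break
    centroids = updated
```

Only the centroids travel between iterations, which is why each mapper can afford to hold a full copy of them.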
Matrix Factorization
Stage 2
Mapper_i: group the rating data in X for user i.
Reducer_i: align the ratings and the features for item j, and make a copy of Vj for each observed xij.
[Figure: copies of Vj aligned with i-1, i and i+1.]
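The alignment step, in which each observed rating xij receives its own copy of the item features Vj, is essentially a join keyed by item. The Python sketch below is one possible reading of that step, not the slides' implementation: the function names, the data layout and the two-dimensional feature vectors are all assumptions.

```python
from collections import defaultdict

def join_ratings_with_item_features(ratings, V):
    """ratings: iterable of (user i, item j, x_ij).
    V: {item j: feature vector V_j}."""
    # Map: key every rating by its item j.
    by_item = defaultdict(list)
    for i, j, x_ij in ratings:
        by_item[j].append((i, x_ij))
    # Reduce (one call per item j): attach a copy of V_j to each observed x_ij
    # and re-key the record by user i.
    for j, observed in by_item.items():
        for i, x_ij in observed:
            yield i, (j, x_ij, list(V[j]))

def group_by_user(joined):
    """Group the joined records by user i."""
    grouped = defaultdict(list)
    for i, record in joined:
        grouped[i].append(record)
    return dict(grouped)

# Hypothetical ratings and item features (k = 2), for illustration only.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0)]
V = {0: [0.1, 0.2], 1: [0.3, 0.4]}
per_user = group_by_user(join_ratings_with_item_features(ratings, V))
```

After the join, every user's group contains both the ratings and the matching item features, so the update for that user needs no further lookups.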
Cluster Coefficient
In graph mining, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and it is used to determine whether a graph is a small-world network.
[D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440-442]
The BFS-based method needs three stages, but actually we only need two!
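For reference, the local clustering coefficient itself can be computed directly from its definition: for a vertex with k neighbors, it is the number of edges among those neighbors divided by k(k-1)/2. The Python sketch below is a single-machine illustration of the definition, not the two-stage MapReduce method; the example graph is invented, and assigning 0 to vertices with fewer than two neighbors is a convention assumed here.

```python
from itertools import combinations

def local_clustering(adj):
    """adj: {vertex: set of neighbors} for an undirected graph."""
    coeff = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeff[v] = 0.0  # convention assumed for degree 0 or 1
            continue
        # Count the edges that exist between pairs of neighbors of v.
        links = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        coeff[v] = 2.0 * links / (k * (k - 1))
    return coeff

# Hypothetical graph: a triangle a-b-c plus a pendant vertex d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
coeff = local_clustering(adj)
```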
Jimmy Lin's Lab, iSchool at the University of Maryland
Jimeng Sun & Yan Rong's Collections, IBM TJ Watson Research Center
Edward Chang & Yi Wang, Google Beijing
Others
Q&A
Why not MPI?
Hadoop is cheap in everything: D.P.T.H