
Learning with Hadoop: A Case Study on MapReduce-Based Data Mining

Evan Xiang, HKUST

Outline
- Hadoop Basics
- Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient
- Resource Entries to ML labs
- Advanced Topics
- Q&A

Introduction to Hadoop
Hadoop MapReduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Hadoop Cluster Architecture

[Diagram: a Client submits jobs through the job submission node, which runs the JobTracker; the HDFS master runs the NameNode. Each slave node runs a TaskTracker paired with a DataNode.]

From Jimmy Lin's slides

Hadoop HDFS

Hadoop Cluster Rack Awareness

Hadoop Development Cycle

1. scp data to the cluster
2. Move the data into HDFS
3. Develop code locally
4. Submit the MapReduce job to the Hadoop cluster (4a. go back to Step 3)
5. Move the data out of HDFS
6. scp the data from the cluster

From Jimmy Lin's slides

Divide and Conquer

[Diagram: the work is partitioned into parts w1, w2, w3; a worker processes each part to produce partial results r1, r2, r3, which are combined into the final result.]

From Jimmy Lin's slides

High-level MapReduce pipeline

Detailed Hadoop MapReduce data flow



Word Count with MapReduce

[Example: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". Map emits a (word, count) pair for each word in a document, e.g. (fish, 2) from Doc 1; Shuffle and Sort aggregates the values by key; Reduce sums the counts for each word, e.g. fish → 4.]

From Jimmy Lin's slides
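The word-count flow can be sketched in plain Python (a simulation of the map, shuffle, and reduce phases, not actual Hadoop code; in Hadoop these would be Java Mapper and Reducer classes):

```python
from collections import defaultdict

def mapper(doc_id, text):
    # Emit a (word, 1) pair for every word occurrence.
    for word in text.replace(",", "").split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework's shuffle-and-sort phase does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reducer(word, counts):
    # Sum the counts for one word.
    return word, sum(counts)

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
result = dict(reducer(w, c) for w, c in shuffle(pairs))
print(result["fish"])  # fish appears four times across the collection -> 4
```

A combiner would run the same summation locally on each mapper's output to cut shuffle traffic.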


Calculating Document Pairwise Similarity

Trivial solution: compare every pair of document vectors directly. This loads each vector O(N) times and each term O(df_t^2) times.

Goal: a scalable and efficient solution for large collections.

From Jimmy Lin's slides

Better Solution
A term contributes to the similarity of a document pair only if it appears in both documents. So: load the weights for each term once, and let each term contribute its O(df_t^2) partial scores.

From Jimmy Lin's slides

Decomposition
The inner products decompose term by term: the map step emits, for each term, partial scores for every pair of documents containing that term; the reduce step sums the partial scores for each document pair.

From Jimmy Lin's slides

Standard Indexing

[Diagram: (a) Map: each doc is tokenized and its term counts are locally combined; (b) Shuffle: values are grouped by term; (c) Reduce: a posting list is written for each term.]

From Jimmy Lin's slides

Inverted Indexing with MapReduce

[Example: the same three documents ("one fish, two fish", "red fish, blue fish", "cat in the hat"). Map emits a (term, (doc_id, tf)) posting for each term, e.g. (fish, (1, 2)) from Doc 1; Shuffle and Sort aggregates the postings by term; Reduce concatenates them into a posting list per term, e.g. fish → (1, 2), (2, 2).]

From Jimmy Lin's slides

Indexing (3-doc toy collection)

[Example: a toy collection over the terms Clinton, Barack, Cheney, Obama. Indexing produces a posting list of (doc_id, tf) entries per term, e.g. Clinton, which appears in all three documents, gets the list (1, 2), (2, 1), (3, 1).]

From Jimmy Lin's slides

Pairwise Similarity

[Diagram: (a) generate pairs: for each term's posting list, emit a partial score for every pair of documents the term appears in; (b) group the partial scores by document pair; (c) sum the partial scores for each pair.]

How to deal with the long list?

From Jimmy Lin's slides
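The pair-generation and summing stages can be sketched in plain Python (a simulation of the MapReduce logic, not Hadoop code; the toy postings below are illustrative, not the slides' exact collection):

```python
from collections import defaultdict
from itertools import combinations

# Toy postings: term -> list of (doc_id, term_weight), as produced by
# an inverted-indexing job.
postings = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama":   [(1, 1), (2, 1)],
    "barack":  [(2, 1)],
    "cheney":  [(3, 1)],
}

def mapper(term, posting_list):
    # Each term contributes a partial score to every pair of documents
    # it appears in: O(df_t^2) pairs per term.
    for (d1, w1), (d2, w2) in combinations(sorted(posting_list), 2):
        yield (d1, d2), w1 * w2

def reducer(pair, partial_scores):
    return pair, sum(partial_scores)

grouped = defaultdict(list)
for term, plist in postings.items():
    for pair, score in mapper(term, plist):
        grouped[pair].append(score)
sims = dict(reducer(p, s) for p, s in grouped.items())
print(sims[(1, 2)])  # 2*1 (clinton) + 1*1 (obama) = 3
```

The "long list" problem above shows up here too: a very frequent term yields a quadratic number of pairs, which is why common terms are often pruned first.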


PageRank
PageRank is an information propagation model. Computing it requires intensive access to each node's neighborhood list.

PageRank with MapReduce

[Example: adjacency lists n1 → [n2, n4], n2 → [n3, n5], n3 → [n4], n4 → [n5], n5 → [n1, n2, n3]. Map divides each node's rank mass among its out-neighbors and emits a (neighbor, partial mass) pair for each; Shuffle groups the partial masses by destination node; Reduce sums the incoming mass for each node and rebuilds its adjacency list for the next iteration.]

How to maintain the graph structure?

From Jimmy Lin's slides
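A plain-Python sketch of one way to iterate on the example graph (a simulation, not Hadoop code; the damping factor and iteration count are assumptions, since the slides do not give them). One common answer to maintaining the graph structure is to have the mapper emit the adjacency list alongside the rank mass:

```python
from collections import defaultdict

adjacency = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
             "n4": ["n5"], "n5": ["n1", "n2", "n3"]}

def mapper(node, neighbors, rank):
    # Pass the graph structure through the shuffle along with the rank
    # mass, so the reducer can rebuild each adjacency list.
    yield node, ("STRUCTURE", neighbors)
    for nbr in neighbors:
        yield nbr, ("MASS", rank / len(neighbors))

def reducer(node, values, n, damping=0.85):
    mass = sum(v for tag, v in values if tag == "MASS")
    neighbors = next((v for tag, v in values if tag == "STRUCTURE"), [])
    return node, neighbors, (1 - damping) / n + damping * mass

n = len(adjacency)
ranks = {node: 1.0 / n for node in adjacency}
for _ in range(20):  # one MapReduce job per iteration
    grouped = defaultdict(list)
    for node, neighbors in adjacency.items():
        for key, value in mapper(node, neighbors, ranks[node]):
            grouped[key].append(value)
    new = {}
    for key, values in grouped.items():
        node, neighbors, rank = reducer(key, values, n)
        adjacency[node], new[node] = neighbors, rank
    ranks = new
print(round(sum(ranks.values()), 6))  # total rank mass is conserved -> 1.0
```

This graph has no dangling nodes; handling those (redistributing lost mass) needs an extra step in a real job.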


K-Means Clustering


K-Means Clustering with MapReduce

Each Mapper loads a subset of the data samples and assigns each sample to the nearest centroid; every Mapper needs to keep a copy of all the centroids. Each Reducer collects the samples assigned to a centroid and recomputes that centroid as their mean.

How the initial centroids are set is very important! Usually the centroids are seeded using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
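A plain-Python sketch of the iteration (a simulation of the MapReduce flow, not Hadoop code; the data and seeds are illustrative assumptions, with fixed seeds standing in for Canopy Clustering):

```python
import random

def mapper(points, centroids):
    # Each mapper holds a copy of all centroids and assigns each of its
    # points to the nearest one, emitting (centroid_id, point).
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda c: sum((a - b) ** 2
                                    for a, b in zip(p, centroids[c])))
        yield cid, p

def reducer(cid, points):
    # Recompute one centroid as the mean of its assigned points.
    dims = len(points[0])
    return cid, tuple(sum(p[d] for p in points) / len(points)
                      for d in range(dims))

random.seed(0)
data = ([(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)] +
        [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(50)])
centroids = [(0.5, 0.5), (4.5, 4.5)]  # illustrative seeds
for _ in range(5):  # one MapReduce job per iteration
    grouped = {}
    for cid, p in mapper(data, centroids):
        grouped.setdefault(cid, []).append(p)
    for cid, points in grouped.items():
        centroids[cid] = reducer(cid, points)[1]
print(centroids)  # near (0, 0) and (5, 5)
```

In a real job the updated centroids are written to HDFS after each iteration and re-read by every mapper of the next one.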


Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X ∈ R^(m×n) with entries x_ij. Let R = {(i, j, r) : r = x_ij, x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^(k×m) and an item factor matrix V ∈ R^(k×n). The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learnt by minimizing the regularized squared error over the observed entries:

min_{U,V} Σ_{(i,j)∈R} (x_ij − u_i^T v_j)^2 + λ(‖U‖_F^2 + ‖V‖_F^2)

Solving Matrix Factorization via Alternating Least Squares

Given X and V, each user factor u_i ∈ R^k is updated by solving a k×k regularized least-squares system built from the factors v_j of the items user i has rated. Similarly, given X and U, we can alternately update V.

MapReduce for ALS

Stage 1
Mapper: group the rating data in X by item j; group the features in V by item j.
Reducer: align the ratings and the features for item j, and make a copy of v_j for each observed x_ij.

Stage 2
Mapper: group the rating data in X (each rating now carrying its copy of v_j) by user i.
Reducer: standard ALS: calculate A and b, and update u_i.
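A plain-Python sketch of the alternating updates (a simulation, not Hadoop code; for brevity the factors have rank k = 1, so each regularized least-squares update collapses to a scalar division instead of a k×k solve; the ratings and λ are illustrative assumptions):

```python
from collections import defaultdict

# Observed links: (user i, item j) -> x_ij
X = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}
lam = 0.1  # regularization weight lambda

def als_update(ratings_by_row, other_factors, lam):
    # One MapReduce stage: ratings arrive grouped by row (the shuffle's
    # job), then each row factor is the closed-form regularized
    # least-squares solution. With k = 1, A and b are scalars.
    new = {}
    for i, entries in ratings_by_row.items():
        b = sum(x * other_factors[j] for j, x in entries)
        A = lam + sum(other_factors[j] ** 2 for j, x in entries)
        new[i] = b / A
    return new

U = {i: 1.0 for i in {i for i, j in X}}
V = {j: 1.0 for j in {j for i, j in X}}
for _ in range(20):
    by_user, by_item = defaultdict(list), defaultdict(list)
    for (i, j), x in X.items():
        by_user[i].append((j, x))
        by_item[j].append((i, x))
    U = als_update(by_user, V, lam)  # Stage 2: given V, update U
    V = als_update(by_item, U, lam)  # symmetric pass: given U, update V
print(round(U[0] * V[0], 2))  # reconstructed estimate of the observed x_00
```

The prediction for an unobserved link (i, j) is simply u_i * v_j after convergence.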


Cluster Coefficient
In graph mining, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and is used to determine whether a graph is a small-world network.

[D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440-442]

How to maintain the Tier-2 neighbors?

Cluster Coefficient with MapReduce

[Diagram: a two-stage MapReduce job; the second stage's Reducer calculates the cluster coefficient.]

A BFS-based method needs three stages, but actually we only need two!
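A plain-Python sketch of one two-stage scheme (a simulation, not Hadoop code; the toy graph is an illustrative assumption): stage 1 ships each node's adjacency list to its neighbors, so every node learns its Tier-2 neighborhood without a BFS; stage 2 counts the links among each node's neighbors.

```python
from collections import defaultdict
from itertools import combinations

# Undirected toy graph as adjacency sets.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

# Stage 1 (map): each node sends its neighbor list to every neighbor;
# the reducer for a node thus collects its Tier-2 neighborhood.
tier2 = defaultdict(dict)
for node, neighbors in graph.items():
    for nbr in neighbors:
        tier2[nbr][node] = neighbors

# Stage 2 (reduce): count links among a node's neighbors and compute
# the local clustering coefficient 2 * links / (k * (k - 1)).
def clustering_coefficient(node):
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2)
                if v in tier2[node][u])
    return 2.0 * links / (k * (k - 1))

print(clustering_coefficient("a"))  # 1 of a's 3 neighbor pairs is linked: 1/3
```

Shipping whole adjacency lists through the shuffle is the price paid for skipping the third BFS stage; for high-degree nodes this traffic dominates.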

Resource Entries to ML labs

Mahout: Apache's scalable machine learning libraries
Jimmy Lin's Lab: iSchool at the University of Maryland
Jimeng Sun & Yan Rong's collections: IBM T. J. Watson Research Center
Edward Chang & Yi Wang: Google Beijing

Advanced Topics in Machine Learning with MapReduce

Probabilistic graphical models
Gradient-based optimization methods
Graph mining
Others

Some Advanced Tips

Design your algorithm in a divide-and-conquer manner
Make your functional units loosely dependent
Carefully manage your memory and disk storage

Discussions


Q&A

Why not MPI?
Hadoop is cheap in everything (D.P.T.H)

What are the advantages of Hadoop?
Scalability!

How do you guarantee the model equivalence?
Guarantee equivalent/comparable function logics

How can you beat a large-memory solution?
Clever use of sequential disk access
