
Learning with Hadoop: A Case Study on MapReduce-Based Data Mining

Evan Xiang, HKUST

Outline
- Hadoop Basics
- Case Study: Word Count, Pairwise Similarity, PageRank, K-Means Clustering, Matrix Factorization, Cluster Coefficient
- Resource Entries to ML labs
- Advanced Topics
- Q&A

Introduction to Hadoop
Hadoop MapReduce is a Java-based software framework for easily writing applications that process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

Hadoop Cluster Architecture

[Diagram: a Client submits jobs through the job submission node, which runs the JobTracker; the HDFS master runs the NameNode. Each slave node runs a TaskTracker paired with a DataNode.]

From Jimmy Lin's slides

Hadoop HDFS

Hadoop Cluster Rack Awareness

Hadoop Development Cycle

1. scp data to the cluster
2. Move the data into HDFS
3. Develop code locally
4. Submit the MapReduce job to the Hadoop cluster (4a. go back to Step 3)
5. Move the data out of HDFS
6. scp the data from the cluster

From Jimmy Lin's slides

Divide and Conquer

[Diagram: the work is partitioned into parts w1, w2, w3; a worker processes each part to produce partial results r1, r2, r3, which are combined into the final result.]

From Jimmy Lin's slides

High-level MapReduce pipeline

Detailed Hadoop MapReduce data flow



Word Count with MapReduce

[Example: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat". Map emits a (word, count) pair for each word in a document, e.g. (fish, 2) from Doc 1; Shuffle and Sort aggregates the values by key; Reduce sums the counts for each word, e.g. fish → 4.]

From Jimmy Lin's slides
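The word-count flow can be sketched in plain Python (a simulation of the map, shuffle, and reduce phases, not actual Hadoop code; in Hadoop these would be Java Mapper and Reducer classes):

```python
from collections import defaultdict

def mapper(doc_id, text):
    # Emit a (word, 1) pair for every word occurrence.
    for word in text.replace(",", "").split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework's shuffle-and-sort phase does.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped.items()

def reducer(word, counts):
    # Sum the counts for one word.
    return word, sum(counts)

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
pairs = [kv for doc_id, text in docs.items() for kv in mapper(doc_id, text)]
result = dict(reducer(w, c) for w, c in shuffle(pairs))
print(result["fish"])  # fish appears four times across the collection -> 4
```

A combiner would run the same summation locally on each mapper's output to cut shuffle traffic.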


Calculating Document Pairwise Similarity

Trivial solution: compare every pair of document vectors directly. This loads each vector O(N) times and each term O(df_t^2) times.

Goal: a scalable and efficient solution for large collections.

From Jimmy Lin's slides

Better Solution
A term contributes to the similarity of a document pair only if it appears in both documents. So: load the weights for each term once, and let each term contribute its O(df_t^2) partial scores.

From Jimmy Lin's slides

Decomposition
The inner products decompose term by term: the map step emits, for each term, partial scores for every pair of documents containing that term; the reduce step sums the partial scores for each document pair.

From Jimmy Lin's slides

Standard Indexing

[Diagram: (a) Map: each doc is tokenized and its term counts are locally combined; (b) Shuffle: values are grouped by term; (c) Reduce: a posting list is written for each term.]

From Jimmy Lin's slides

Inverted Indexing with MapReduce

[Example: the same three documents ("one fish, two fish", "red fish, blue fish", "cat in the hat"). Map emits a (term, (doc_id, tf)) posting for each term, e.g. (fish, (1, 2)) from Doc 1; Shuffle and Sort aggregates the postings by term; Reduce concatenates them into a posting list per term, e.g. fish → (1, 2), (2, 2).]

From Jimmy Lin's slides

Indexing (3-doc toy collection)

[Example: a toy collection over the terms Clinton, Barack, Cheney, Obama. Indexing produces a posting list of (doc_id, tf) entries per term, e.g. Clinton, which appears in all three documents, gets the list (1, 2), (2, 1), (3, 1).]

From Jimmy Lin's slides

Pairwise Similarity

[Diagram: (a) generate pairs: for each term's posting list, emit a partial score for every pair of documents the term appears in; (b) group the partial scores by document pair; (c) sum the partial scores for each pair.]

How to deal with the long list?

From Jimmy Lin's slides
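The pair-generation and summing stages can be sketched in plain Python (a simulation of the MapReduce logic, not Hadoop code; the toy postings below are illustrative, not the slides' exact collection):

```python
from collections import defaultdict
from itertools import combinations

# Toy postings: term -> list of (doc_id, term_weight), as produced by
# an inverted-indexing job.
postings = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama":   [(1, 1), (2, 1)],
    "barack":  [(2, 1)],
    "cheney":  [(3, 1)],
}

def mapper(term, posting_list):
    # Each term contributes a partial score to every pair of documents
    # it appears in: O(df_t^2) pairs per term.
    for (d1, w1), (d2, w2) in combinations(sorted(posting_list), 2):
        yield (d1, d2), w1 * w2

def reducer(pair, partial_scores):
    return pair, sum(partial_scores)

grouped = defaultdict(list)
for term, plist in postings.items():
    for pair, score in mapper(term, plist):
        grouped[pair].append(score)
sims = dict(reducer(p, s) for p, s in grouped.items())
print(sims[(1, 2)])  # 2*1 (clinton) + 1*1 (obama) = 3
```

The "long list" problem above shows up here too: a very frequent term yields a quadratic number of pairs, which is why common terms are often pruned first.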


PageRank
PageRank is an information propagation model. Computing it requires intensive access to each node's neighborhood list.

PageRank with MapReduce

[Example: adjacency lists n1 → [n2, n4], n2 → [n3, n5], n3 → [n4], n4 → [n5], n5 → [n1, n2, n3]. Map divides each node's rank mass among its out-neighbors and emits a (neighbor, partial mass) pair for each; Shuffle groups the partial masses by destination node; Reduce sums the incoming mass for each node and rebuilds its adjacency list for the next iteration.]

How to maintain the graph structure?

From Jimmy Lin's slides
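A plain-Python sketch of one way to iterate on the example graph (a simulation, not Hadoop code; the damping factor and iteration count are assumptions, since the slides do not give them). One common answer to maintaining the graph structure is to have the mapper emit the adjacency list alongside the rank mass:

```python
from collections import defaultdict

adjacency = {"n1": ["n2", "n4"], "n2": ["n3", "n5"], "n3": ["n4"],
             "n4": ["n5"], "n5": ["n1", "n2", "n3"]}

def mapper(node, neighbors, rank):
    # Pass the graph structure through the shuffle along with the rank
    # mass, so the reducer can rebuild each adjacency list.
    yield node, ("STRUCTURE", neighbors)
    for nbr in neighbors:
        yield nbr, ("MASS", rank / len(neighbors))

def reducer(node, values, n, damping=0.85):
    mass = sum(v for tag, v in values if tag == "MASS")
    neighbors = next((v for tag, v in values if tag == "STRUCTURE"), [])
    return node, neighbors, (1 - damping) / n + damping * mass

n = len(adjacency)
ranks = {node: 1.0 / n for node in adjacency}
for _ in range(20):  # one MapReduce job per iteration
    grouped = defaultdict(list)
    for node, neighbors in adjacency.items():
        for key, value in mapper(node, neighbors, ranks[node]):
            grouped[key].append(value)
    new = {}
    for key, values in grouped.items():
        node, neighbors, rank = reducer(key, values, n)
        adjacency[node], new[node] = neighbors, rank
    ranks = new
print(round(sum(ranks.values()), 6))  # total rank mass is conserved -> 1.0
```

This graph has no dangling nodes; handling those (redistributing lost mass) needs an extra step in a real job.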


K-Means Clustering


K-Means Clustering with MapReduce

Each Mapper loads a subset of the data samples and assigns each sample to the nearest centroid; every Mapper needs to keep a copy of all the centroids. Each Reducer collects the samples assigned to a centroid and recomputes that centroid as their mean.

How the initial centroids are set is very important! Usually the centroids are seeded using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", SIGKDD 2000]
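A plain-Python sketch of the iteration (a simulation of the MapReduce flow, not Hadoop code; the data and seeds are illustrative assumptions, with fixed seeds standing in for Canopy Clustering):

```python
import random

def mapper(points, centroids):
    # Each mapper holds a copy of all centroids and assigns each of its
    # points to the nearest one, emitting (centroid_id, point).
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda c: sum((a - b) ** 2
                                    for a, b in zip(p, centroids[c])))
        yield cid, p

def reducer(cid, points):
    # Recompute one centroid as the mean of its assigned points.
    dims = len(points[0])
    return cid, tuple(sum(p[d] for p in points) / len(points)
                      for d in range(dims))

random.seed(0)
data = ([(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)] +
        [(random.gauss(5, 0.1), random.gauss(5, 0.1)) for _ in range(50)])
centroids = [(0.5, 0.5), (4.5, 4.5)]  # illustrative seeds
for _ in range(5):  # one MapReduce job per iteration
    grouped = {}
    for cid, p in mapper(data, centroids):
        grouped.setdefault(cid, []).append(p)
    for cid, points in grouped.items():
        centroids[cid] = reducer(cid, points)[1]
print(centroids)  # near (0, 0) and (5, 5)
```

In a real job the updated centroids are written to HDFS after each iteration and re-read by every mapper of the next one.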


Matrix Factorization for Link Prediction

In this task, we observe a sparse matrix X ∈ R^(m×n) with entries x_ij. Let R = {(i, j, r) : r = x_ij, x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^(k×m) and an item factor matrix V ∈ R^(k×n). The goal is to approximate the link matrix X by the product of the factor matrices U and V, which can be learnt by minimizing the regularized squared error over the observed entries:

min_{U,V} Σ_{(i,j)∈R} (x_ij − u_i^T v_j)^2 + λ(‖U‖_F^2 + ‖V‖_F^2)

Solving Matrix Factorization via Alternating Least Squares

Given X and V, each user factor u_i ∈ R^k is updated by solving a k×k regularized least-squares system built from the factors v_j of the items user i has rated. Similarly, given X and U, we can alternately update V.

MapReduce for ALS

Stage 1
Mapper: group the rating data in X by item j; group the features in V by item j.
Reducer: align the ratings and the features for item j, and make a copy of v_j for each observed x_ij.

Stage 2
Mapper: group the rating data in X (each rating now carrying its copy of v_j) by user i.
Reducer: standard ALS: calculate A and b, and update u_i.
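A plain-Python sketch of the alternating updates (a simulation, not Hadoop code; for brevity the factors have rank k = 1, so each regularized least-squares update collapses to a scalar division instead of a k×k solve; the ratings and λ are illustrative assumptions):

```python
from collections import defaultdict

# Observed links: (user i, item j) -> x_ij
X = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}
lam = 0.1  # regularization weight lambda

def als_update(ratings_by_row, other_factors, lam):
    # One MapReduce stage: ratings arrive grouped by row (the shuffle's
    # job), then each row factor is the closed-form regularized
    # least-squares solution. With k = 1, A and b are scalars.
    new = {}
    for i, entries in ratings_by_row.items():
        b = sum(x * other_factors[j] for j, x in entries)
        A = lam + sum(other_factors[j] ** 2 for j, x in entries)
        new[i] = b / A
    return new

U = {i: 1.0 for i in {i for i, j in X}}
V = {j: 1.0 for j in {j for i, j in X}}
for _ in range(20):
    by_user, by_item = defaultdict(list), defaultdict(list)
    for (i, j), x in X.items():
        by_user[i].append((j, x))
        by_item[j].append((i, x))
    U = als_update(by_user, V, lam)  # Stage 2: given V, update U
    V = als_update(by_item, U, lam)  # symmetric pass: given U, update V
print(round(U[0] * V[0], 2))  # reconstructed estimate of the observed x_00
```

The prediction for an unobserved link (i, j) is simply u_i * v_j after convergence.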


Cluster Coefficient
In graph mining, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. The local clustering coefficient of a vertex quantifies how close its neighbors are to being a clique (complete graph), and is used to determine whether a graph is a small-world network.

[D. J. Watts and Steven Strogatz (June 1998). "Collective dynamics of 'small-world' networks". Nature 393 (6684): 440-442]

How to maintain the Tier-2 neighbors?

Cluster Coefficient with MapReduce

[Diagram: a two-stage MapReduce job; the second stage's Reducer calculates the cluster coefficient.]

A BFS-based method needs three stages, but actually we only need two!
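A plain-Python sketch of one two-stage scheme (a simulation, not Hadoop code; the toy graph is an illustrative assumption): stage 1 ships each node's adjacency list to its neighbors, so every node learns its Tier-2 neighborhood without a BFS; stage 2 counts the links among each node's neighbors.

```python
from collections import defaultdict
from itertools import combinations

# Undirected toy graph as adjacency sets.
graph = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}

# Stage 1 (map): each node sends its neighbor list to every neighbor;
# the reducer for a node thus collects its Tier-2 neighborhood.
tier2 = defaultdict(dict)
for node, neighbors in graph.items():
    for nbr in neighbors:
        tier2[nbr][node] = neighbors

# Stage 2 (reduce): count links among a node's neighbors and compute
# the local clustering coefficient 2 * links / (k * (k - 1)).
def clustering_coefficient(node):
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbors, 2)
                if v in tier2[node][u])
    return 2.0 * links / (k * (k - 1))

print(clustering_coefficient("a"))  # 1 of a's 3 neighbor pairs is linked: 1/3
```

Shipping whole adjacency lists through the shuffle is the price paid for skipping the third BFS stage; for high-degree nodes this traffic dominates.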

Resource Entries to ML labs

Mahout: Apache's scalable machine learning libraries
Jimmy Lin's Lab: iSchool at the University of Maryland
Jimeng Sun & Yan Rong's collections: IBM T. J. Watson Research Center
Edward Chang & Yi Wang: Google Beijing

Advanced Topics in Machine Learning with MapReduce

Probabilistic graphical models
Gradient-based optimization methods
Graph mining
Others

Some Advanced Tips

Design your algorithm in a divide-and-conquer manner
Make your functional units loosely dependent
Carefully manage your memory and disk storage

Discussions


Q&A

Why not MPI?
Hadoop is cheap in everything (D.P.T.H)

What are the advantages of Hadoop?
Scalability!

How do you guarantee the model equivalence?
Guarantee equivalent/comparable function logics

How can you beat a large-memory solution?
Clever use of sequential disk access
