
Performance Analysis of Hadoop Link Prediction

Yuxiao Dong ydong1@nd.edu

Casey Robinson crobins9@nd.edu

Jian Xu jxu5@nd.edu

Introduction

[Figure: motivating examples from Facebook and Twitter — for a given user, which of the candidate links (marked "?") will form, and which (marked "X") will not?]

Problem Statement
In a network G = (V, E, X), for a particular user v_s and a set of candidates C to which v_s may create a link, find a predictive function f : (V, E, X, v_s, C) → Y, where Y = {y_1, y_2, ..., y_|C|} is the set of inferred results for whether user v_s would create links with the users in C.
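As one concrete reading of this definition, the sketch below frames f as a small Java interface. The Network and LinkPredictor types, their method names, and the Boolean labels are illustrative assumptions only; the slides do not define an API.

import java.util.List;
import java.util.Set;

// Illustrative types only, not from the slides.
interface Network<V> {                      // G = (V, E, X)
    Set<V> vertices();                      // V
    Set<V> neighbors(V v);                  // E, stored as adjacency lists
    double[] attributes(V v);               // X, per-vertex attributes
}

// f : (V, E, X, v_s, C) -> Y, one inferred result per candidate in C.
interface LinkPredictor<V> {
    List<Boolean> predict(Network<V> g, V source, List<V> candidates);
}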

Challenges
Real networks are large:
- > 1 billion users on Facebook (Oct. 2012)
- > 500 million users on Twitter (Jul. 2012)
- > 175 million users on LinkedIn (Jun. 2012)
Big data makes prediction even slower.

Our Solution
- Divide the problem into smaller problems
- Represent the graph as adjacency lists suitable for MapReduce
- Use distributed computing: Hadoop on the Data Intensive Science Cluster

[Figure: standard MapReduce data flow — input splits → map → sort → merge → reduce → output parts]

Link Prediction Framework


[Figure: link prediction framework — Prepare (vertex count, edge count, split data to build the probe set, degree statistics) → AdjList → LP Score → Probe Score and Non-Existent Score → AUC]
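To make the pipeline concrete, here is a minimal Hadoop driver sketch that chains the stages as separate MapReduce jobs, each reading the previous stage's output directory. The stage names, paths, and the use of identity map/reduce are placeholder assumptions; the framework's actual job classes are not shown in the slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LinkPredictionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder stage names mirroring the framework diagram.
        String[] stages = {"adjlist", "lp-score", "probe-score", "auc"};
        Path input = new Path(args[0]);              // raw edge list
        for (String stage : stages) {
            Job job = Job.getInstance(conf, "link prediction: " + stage);
            job.setJarByClass(LinkPredictionDriver.class);
            // Each real stage would set its own Mapper/Reducer classes here;
            // this sketch leaves the default identity map/reduce in place.
            Path output = new Path(args[1] + "/" + stage);
            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (!job.waitForCompletion(true)) {
                System.exit(1);                      // stop the chain on failure
            }
            input = output;                          // next stage reads this output
        }
    }
}

Running the stages as independent jobs also lets each stage choose its own number of reducers.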

Algorithm Design
[Figure: worked example on a 7-vertex graph — the first Mapper/Reducer pass turns the edge list into adjacency lists; the second Mapper/Reducer pass emits each candidate pair together with the shared neighbor that generated it, and the reducer merges these records into common-neighbor lists (e.g. pair 3,4 shares neighbors 1 and 2)]
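Below is a minimal Hadoop sketch of the second stage (not the authors' code). It assumes the first job has already written one tab-separated adjacency list per line, e.g. "1<TAB>2,3,4", and that a candidate pair is scored by its number of common neighbors. Class names and I/O paths are illustrative.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CommonNeighborScore {

    // Input line: "v<TAB>n1,n2,..." — v is a shared neighbor of every
    // pair (n_i, n_j) drawn from its adjacency list.
    public static class PairMapper extends Mapper<Object, Text, Text, Text> {
        private final Text pair = new Text();
        private final Text shared = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length < 2) return;
            shared.set(parts[0]);
            String[] nbrs = parts[1].split(",");
            for (int i = 0; i < nbrs.length; i++) {
                for (int j = i + 1; j < nbrs.length; j++) {
                    String a = nbrs[i].trim(), b = nbrs[j].trim();
                    if (a.isEmpty() || b.isEmpty()) continue;
                    // Normalize so (u,w) and (w,u) reach the same reducer.
                    pair.set(Long.parseLong(a) < Long.parseLong(b)
                            ? a + "," + b : b + "," + a);
                    context.write(pair, shared);
                }
            }
        }
    }

    // Merge the shared-neighbor records for a pair and count them.
    public static class ScoreReducer extends Reducer<Text, Text, Text, IntWritable> {
        private final IntWritable score = new IntWritable();

        @Override
        protected void reduce(Text pair, Iterable<Text> sharedNeighbors, Context context)
                throws IOException, InterruptedException {
            int count = 0;
            for (Text ignored : sharedNeighbors) count++;   // common-neighbor count
            score.set(count);
            context.write(pair, score);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "common neighbor scores");
        job.setJarByClass(CommonNeighborScore.class);
        job.setMapperClass(PairMapper.class);
        job.setReducerClass(ScoreReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Normalizing the pair key mirrors the merge step in the worked example, where the records (3,4,1) and (3,4,2) end up at the same reducer.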

Data Sets
Name           Nodes       Edges        Relative Size
HepPh          12,008      237,010      1x
ND Web         325,729     1,497,134    7.14x
Live Journal   4,847,571   68,993,773   357.78x

Approach

- Black Box: vary the number of reducers and the data size
- Time Breakdown: which step(s) dominate the running time?

[Figure: per-step running time as a percentage of the total for HEP Ph, ND Web, and Live Journal]

Resource Monitoring

Where are the bottlenecks?

Machine Specifications

- 26 nodes
- 32 GB RAM
- 12x 2 TB SATA disks (4 dedicated to Hadoop storage)
- 2x 8-core Intel Xeon E5620 CPUs @ 2.40 GHz
- Gigabit Ethernet

Monitoring Tools
Resource    Command
CPU         iostat -c 1
Disk        iostat -d 1
Network     netstat -c -I

Monitoring Implementation
# Start iostat on every cluster node before submitting the Hadoop job.
for q in $(seq -w 1 26); do
    ./ssh.exp disc$q.crc.nd.edu crobins9 $p
    date >> /tmp/cpu.out
    (iostat -c 1 >> /tmp/cpu.out) &
done

# Submit the link prediction job and wait for it to finish.

# Stop the monitors on every node.
for q in $(seq -w 1 26); do
    ./ssh.exp disc$q.crc.nd.edu crobins9 $p
    ps aux | grep iostat | awk '{print $2}' | xargs kill -9
done

# Collect the logs from every node.
for q in $(seq -w 1 26); do
    ./scp.exp disc$q.crc.nd.edu crobins9 $p
done

CPU

[Figure: cluster CPU usage (%) over the roughly 7000 s run]

Disk

[Figure: disk blocks read and written (in units of 1k blocks) over the roughly 7000 s run, with the LP Score and AUC phases marked]

Network

[Figure: network data received and sent (Mb/s) over the roughly 7000 s run, with the LP Score and AUC phases marked]

Conclusions and Future Improvements

Sampling-based AUC estimation:

    // Estimate AUC by sampling: compare the score of a randomly chosen link
    // from one set against the score of a randomly chosen link from the other.
    int n = 13000000;
    double[] left = new double[n];     // scores of one comparison set (e.g. probe links)
    double[] right = new double[n];    // scores of the other set (e.g. non-existent links)
    Random rand = new Random();
    Random rand1 = new Random();
    int n1 = 0, n2 = 0;
    int m = 3 * n;
    for (int i = 0; i < m; i++) {
        int index1 = rand.nextInt(n);
        int index2 = rand1.nextInt(n);
        double leftScore = left[index1];
        double rightScore = right[index2];
        if (leftScore > rightScore) {
            n1++;                      // left sample scored strictly higher
        } else if (Math.abs(leftScore - rightScore) < 1E-6) {
            n2++;                      // tie
        }
    }
    double AUC = (n1 + 0.5 * n2) / m;

Some Conclusions

- Hadoop is useful once the data grows beyond roughly 1 GB
- About 6 reducers worked well; prefer multiple jobs with fewer reducers each
