Jian Xu jxu5@nd.edu
Introduction
[Figure: motivating example — suggesting new connections ("who to follow") on Twitter]
Problem Statement
In a network G = (V, E, X), for a particular user v_s and a set of candidates C to which v_s may create a link, find a predictive function f: (V, E, X, v_s, C) -> Y, where Y = {y_1, y_2, ..., y_|C|} is the set of inferred results for whether user v_s would create links with the users in C.
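A minimal sketch of such a function f, scoring each candidate by its common-neighbor count (a standard link-prediction heuristic; the class and method names here are illustrative, not from the original system):

```java
import java.util.*;

// Illustrative sketch: score each candidate c in C by the number of
// neighbors it shares with the source user vs (common-neighbors heuristic).
public class CommonNeighborPredictor {
    // adj: node id -> set of neighbor ids (the E of G)
    public static Map<Integer, Integer> score(Map<Integer, Set<Integer>> adj,
                                              int vs, Set<Integer> candidates) {
        Map<Integer, Integer> result = new HashMap<>();
        Set<Integer> srcNbrs = adj.getOrDefault(vs, Collections.<Integer>emptySet());
        for (int c : candidates) {
            Set<Integer> common = new HashSet<>(srcNbrs);
            common.retainAll(adj.getOrDefault(c, Collections.<Integer>emptySet()));
            result.put(c, common.size()); // higher score -> more likely link
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        adj.put(1, new HashSet<>(Arrays.asList(2, 3)));
        adj.put(2, new HashSet<>(Arrays.asList(1, 3, 4)));
        adj.put(3, new HashSet<>(Arrays.asList(1, 2, 4)));
        adj.put(4, new HashSet<>(Arrays.asList(2, 3)));
        System.out.println(score(adj, 1, new HashSet<>(Arrays.asList(4)))); // prints {4=2}
    }
}
```

A threshold on this score then yields the binary decisions y_i.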
Challenges
Real networks are large:
- > 1 billion users on Facebook (Oct. 2012)
- > 500 million users on Twitter (Jul. 2012)
- > 175 million users on LinkedIn (Jun. 2012)
Big data makes prediction even slower.
Our Solution
- Divide: adjacency list
- Distributed computing: Hadoop
[Figure: Hadoop MapReduce data flow — input splits -> map -> sort/merge -> reduce -> output parts]
[Figure: scoring pipeline — AdjList -> LP Score -> Probe Score / Non-Exist Score -> AUC]
Algorithm Design
[Figure: worked example on a 7-node graph — a first Mapper/Reducer pair turns the edge list into adjacency lists and emits candidate node pairs keyed by common neighbor; a second Mapper/Reducer pair merges them into per-pair common-neighbor lists, e.g. pair (3,4) with common neighbors 1 and 2]
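The map/reduce flow sketched above can be illustrated without Hadoop; a sketch, assuming the score of interest is the common-neighbor count per candidate pair (class and method names are illustrative):

```java
import java.util.*;

// Sketch of the two phases in plain Java: the "map" step emits every pair of
// neighbors that co-occur in one node's adjacency list with value 1; the
// "reduce" step sums the 1s, giving each pair's common-neighbor count.
public class PairCountSketch {
    public static Map<String, Integer> run(Map<Integer, List<Integer>> adjList) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<Integer> nbrs : adjList.values()) {          // map phase
            for (int i = 0; i < nbrs.size(); i++) {
                for (int j = i + 1; j < nbrs.size(); j++) {
                    int a = Math.min(nbrs.get(i), nbrs.get(j));
                    int b = Math.max(nbrs.get(i), nbrs.get(j));
                    String key = a + "," + b;                  // emit (a,b) -> 1
                    counts.merge(key, 1, Integer::sum);        // reduce phase
                }
            }
        }
        return counts;
    }
}
```

On Hadoop, the emit becomes `context.write` in the Mapper and the summation happens per key in the Reducer; the logic is the same.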
Data Sets
Name          Nodes       Edges        Relative Size
HepPh         12,008      237,010      1x
ND Web        325,729     1,497,134    7.14x
Live Journal  4,847,571   68,993,773   357.78x
Approach
Black Box
Time Breakdown
Which step(s)?
[Chart: time spent in each step (% of total) for HEP Ph, ND Web, and Live Journal]
Resource Monitoring
Bottlenecks
Machine Specifications
- 26 nodes
- 32 GB RAM
- 12 x 2 TB SATA disks (4 dedicated to Hadoop storage)
- 2 x 8-core Intel Xeon E5620 CPUs @ 2.40 GHz
- Gigabit Ethernet
Monitoring Tools
Resource   Command
CPU        iostat -c 1
Disk       iostat -d 1
Network    netstat -c -I
Monitoring Implementation
for q in $(seq -w 1 26); do
    ./ssh.exp disc$q.crc.nd.edu crobins9 $p
    date >> /tmp/cpu.out
    (iostat -c 1 >> /tmp/cpu.out) &
done

# submit and wait for link prediction

for q in $(seq -w 1 26); do
    ./ssh.exp disc$q.crc.nd.edu crobins9 $p
    ps aux | grep iostat | awk '{print $2}' | xargs kill -9
done

for q in $(seq -w 1 26); do
    ./scp.exp disc$q.crc.nd.edu crobins9 $p
done
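Once the per-node /tmp/cpu.out files are collected, they can be summarized offline; a sketch, assuming the usual Linux iostat -c layout in which each sample line under an "avg-cpu:" header ends with the %idle column (the class name is illustrative):

```java
import java.util.*;

// Sketch: average CPU utilization from collected "iostat -c" output,
// assuming each sample line following "avg-cpu:" ends with %idle.
public class CpuSummary {
    public static double avgUtilization(List<String> lines) {
        double total = 0;
        int samples = 0;
        boolean expectSample = false;
        for (String line : lines) {
            if (line.startsWith("avg-cpu")) { expectSample = true; continue; }
            if (expectSample && !line.trim().isEmpty()) {
                String[] cols = line.trim().split("\\s+");
                double idle = Double.parseDouble(cols[cols.length - 1]);
                total += 100.0 - idle;   // utilization = 100 - %idle
                samples++;
                expectSample = false;
            }
        }
        return samples == 0 ? 0 : total / samples;
    }
}
```

The date lines appended by the script above let the samples be aligned with the job's start time.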
CPU
[Chart: CPU utilization over time, with the LP Score and AUC phases marked]
Disk
[Chart: disk throughput over time, with the LP Score and AUC phases marked]
Network
[Chart: network throughput over time, with the LP Score and AUC phases marked]
int n = 13000000;
double[] left = new double[n];
double[] right = new double[n];
int n1 = 0, n2 = 0;
int m = 3 * n;
for (int i = 0; i < m; i++) {
    int index1 = rand.nextInt(n);
    int index2 = rand1.nextInt(n);
    double leftScore = left[index1];
    double rightScore = right[index2];
    if (leftScore > rightScore) {
        n1++;
    } else if (Math.abs(leftScore - rightScore) < 1E-6) {
        n2++;
    }
}
double AUC = (n1 + 0.5 * n2) / m;
Some Conclusions
- Hadoop becomes useful once the data exceeds about 1 GB
- With 6 reducers available, running multiple jobs with fewer reducers each works better