using Graphs
Dissertation
by:
Vandana Bhatia
( Roll No. : 901403001 )
From Big Data to Data Mining
Data mining, also known as Knowledge Discovery in
Databases (KDD), is the process of extracting useful hidden
information from very large databases in a supervised or
unsupervised manner.
For certain kinds of data, the connections between entities
are more important than the data itself.
The relationships between entities can be structured as a graph.
Mining graph data is one of the important research areas in
the field of data mining.
It involves analyzing graph structure
and mining patterns from graphs.
Graph: A Simple Model
Entities
Set of vertices
Pairwise relations among vertices
Set of edges
Can add directions, weights, ...
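The model above (a set of vertices, a set of pairwise edges, optional directions and weights) can be sketched as an adjacency list. This is only an illustrative sketch; the class name `Graph` and its methods are hypothetical and do not come from the dissertation.

```python
from collections import defaultdict


class Graph:
    """Minimal adjacency-list graph with optional direction and edge weights.

    Hypothetical illustration of the vertex/edge model, not the author's code.
    """

    def __init__(self, directed=False):
        self.directed = directed
        self.adj = defaultdict(dict)  # vertex -> {neighbour: weight}

    def add_edge(self, u, v, weight=1.0):
        self.adj[u][v] = weight
        if self.directed:
            self.adj.setdefault(v, {})  # make sure v is registered as a vertex
        else:
            self.adj[v][u] = weight

    def vertices(self):
        return set(self.adj)

    def edges(self):
        """Return a set of (u, v, weight); undirected edges are reported once."""
        out = set()
        for u, nbrs in self.adj.items():
            for v, w in nbrs.items():
                if self.directed:
                    out.add((u, v, w))
                else:
                    a, b = sorted((u, v))
                    out.add((a, b, w))
        return out


g = Graph()
g.add_edge("a", "b")
g.add_edge("b", "c", weight=2.0)
print(sorted(g.vertices()))
print(sorted(g.edges()))
```

A dictionary of dictionaries keeps neighbour lookup and weight lookup in one structure, which is convenient for the small examples used in this deck.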
Diversity of Graphs
Properties of Real-World Graphs
There are different patterns in diverse collections of graphs arising from
different phenomena:
Static networks
heavy tails
clustering coefficients
communities
small diameters
Time-evolving networks
densification
shrinking diameters
Web graph
bow-tie structure
bipartite cliques
Graph Applications
Social networks:
Friendship and collaboration networks
Phone-call networks
Technological networks:
The internet
Power grids
Transportation networks
Knowledge and Information networks:
The World Wide Web
Blog networks
Software Graphs
Bluetooth Networks
Biological networks:
Gene co-expression networks
Gene regulation Networks
Protein-Protein Interaction Networks
Graph Based data mining
Graph Mining is essentially the problem of discovering
repetitive subgraphs occurring in the input graphs.
Motivation
Identifying conceptually interesting patterns.
Identifying clusters by grouping the vertices of a
given input graph.
Classifying the unstructured graph data.
Finding subgraphs capable of compressing the data
by abstracting instances of the substructures.
Achieving parallelism by partitioning the data to get an
efficient “divide and conquer” solution.
Approaches for Mining Graphs
Vertices within the cluster are more connected with the vertices
same/similar hobbies.
o A set of graphs representing chemical compounds can be grouped into clusters
Frequent Subgraph Mining (FSM)
Finding frequent subgraphs within a single graph or a
set of graphs
Applications:
o Mining biochemical structures
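To make the notion of "frequent" concrete, the sketch below counts the support of the smallest possible subgraphs (single labelled edges) over a set of graphs and keeps those that meet a support threshold. It is an assumed, simplified illustration of support counting only; it is not PaGro, Ap-FSM, or any other algorithm from this work, and the function name and input layout are hypothetical.

```python
from collections import Counter


def frequent_edge_patterns(graphs, min_support):
    """Return single-edge label patterns that occur in at least min_support graphs.

    graphs: list of (labels, edges) where labels maps vertex -> label and
    edges is a list of undirected (u, v) pairs. Hypothetical input layout.
    """
    support = Counter()
    for labels, edges in graphs:
        # A pattern is the unordered pair of endpoint labels.
        patterns = {tuple(sorted((labels[u], labels[v]))) for u, v in edges}
        support.update(patterns)  # each graph contributes at most 1 per pattern
    return {p: s for p, s in support.items() if s >= min_support}


# Toy "chemical" graphs: vertex labels are atom symbols.
graphs = [
    ({1: "C", 2: "O", 3: "C"}, [(1, 2), (1, 3)]),
    ({1: "C", 2: "O"}, [(1, 2)]),
    ({1: "N", 2: "C"}, [(1, 2)]),
]
print(frequent_edge_patterns(graphs, min_support=2))
```

Full FSM algorithms grow such frequent patterns edge by edge and must also solve subgraph isomorphism, which this sketch deliberately avoids.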
Comparative study of existing FSM algorithms
Research Gaps
Most of the existing algorithms do not provide scalability, in any case
There exist only a few parallel structural clustering algorithms that can
performing FSM.
graph.
Research Methodology
Graph Processing Framework: Pregel
Scalable and fault-tolerant platform
API with the flexibility to express arbitrary
algorithms
Vertex-centric computation (“think like a
vertex”).
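The vertex-centric model can be illustrated with a single-process simulation of supersteps. The sketch below propagates the maximum vertex value through a graph: in each superstep a vertex reads its incoming messages, updates its value, and sends messages only if its value changed. This is an assumed toy simulation, not the Pregel API; the function name `pregel_max` is hypothetical.

```python
def pregel_max(adj, values, max_supersteps=50):
    """Simulate vertex-centric supersteps that spread the maximum value.

    adj: dict vertex -> list of neighbours (every vertex must be a key).
    values: dict vertex -> initial numeric value.
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to all neighbours.
    inbox = {v: [] for v in adj}
    for v in adj:
        for n in adj[v]:
            inbox[n].append(values[v])
    supersteps = 1
    # Keep going while any vertex still has pending messages.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in adj}
        for v, msgs in inbox.items():
            if not msgs:
                continue  # no messages: the vertex stays halted
            best = max(msgs)
            if best > values[v]:
                values[v] = best
                for n in adj[v]:
                    outbox[n].append(best)
            # otherwise the vertex votes to halt and sends nothing
        inbox = outbox  # messages are delivered at the superstep boundary
        supersteps += 1
    return values, supersteps


adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final, steps = pregel_max(adj, {"a": 3, "b": 6, "c": 2, "d": 1})
print(final, steps)
```

Only messages move between vertices; the graph structure itself stays in place, which is the property the comparison with MapReduce relies on.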
Pregel: Master-Slave Architecture
MapReduce vs. Pregel
MapReduce
No online query support
The MapReduce data model is not a native graph model
Graph algorithms cannot be expressed intuitively.
Graph processing is inefficient on MapReduce
Intermediate results of each iteration need to be materialized
The entire graph structure needs to be sent over the network iteration after iteration,
which incurs huge unnecessary data movement
Pregel
Exploits fine-grained parallelism at node level
Pregel doesn’t move graph partitions over the network; only messages among
nodes are passed at the end of each iteration
Not many graph algorithms can be implemented elegantly using the vertex-based
computation model.
Proposed Clustering Algorithms
PGFC : Parallel Graph Fuzzy Clustering
Evaluation Metrics
F-Measure
o Harmonic Mean of Recall and Precision values
$$F\text{-Measure} = \frac{2\,(\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}$$
$$\text{Precision} = \frac{|S \cap T|}{|S|}, \qquad \text{Recall} = \frac{|S \cap T|}{|T|}$$
S = set of vertex pairs (i, j) that belong to the same cluster as detected by the proposed algorithm
T = set of vertex pairs (i, j) that belong to the same cluster in the ground-truth data
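A small sketch of the pairwise F-Measure, built directly from the sets S and T defined above. It assumes crisp (non-overlapping) cluster labels for simplicity, which is an assumption of this example and not a statement about how the fuzzy results were evaluated; the function names are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of unordered vertex pairs that share a cluster label."""
    return {
        frozenset((u, v))
        for u, v in combinations(labels, 2)
        if labels[u] == labels[v]
    }


def pairwise_f_measure(detected, truth):
    """Return (precision, recall, f_measure) over same-cluster vertex pairs."""
    S = same_cluster_pairs(detected)  # pairs grouped by the algorithm
    T = same_cluster_pairs(truth)     # pairs grouped in the ground truth
    common = len(S & T)
    precision = common / len(S) if S else 0.0
    recall = common / len(T) if T else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


detected = {1: "A", 2: "A", 3: "A", 4: "B"}
truth = {1: "X", 2: "X", 3: "Y", 4: "Y"}
print(pairwise_f_measure(detected, truth))
```

In this toy case S has three pairs, T has two, and they share one pair, so Precision is 1/3, Recall is 1/2, and the F-Measure is 0.4.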
Modularity
o Measure of the number of dense connections inside a cluster and the number of
sparse connections with other clusters:
$$Q = \frac{1}{2m} \sum_{(i,j) \in V^2} \left[ M_{ij} - \frac{d(i)\, d(j)}{2m} \right] \delta(c_i, c_j)$$
where $\delta(c_i, c_j) = 1$ if $c_i = c_j$, and 0 otherwise.
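A direct, unoptimized sketch of modularity: for every ordered pair of vertices in the same cluster, add the adjacency entry minus the expected value d(i)·d(j)/2m, then divide by 2m. It assumes an undirected, unweighted graph without self-loops and crisp cluster labels; these are assumptions of the example, and the function name is hypothetical.

```python
def modularity(adj, labels):
    """Modularity Q of a crisp clustering.

    adj: dict vertex -> set of neighbours (undirected, no self-loops).
    labels: dict vertex -> cluster id.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    two_m = sum(degree.values())  # 2m: every edge is counted from both ends
    q = 0.0
    for i in adj:
        for j in adj:
            if labels[i] != labels[j]:
                continue  # delta(c_i, c_j) = 0
            m_ij = 1.0 if j in adj[i] else 0.0
            q += m_ij - degree[i] * degree[j] / two_m
    return q / two_m


# Two triangles joined by the single edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
labels = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
print(modularity(adj, labels))
```

For this graph m = 7 and each triangle contributes 6 - 49/14 = 2.5, so Q = 5/14, roughly 0.357. The double loop is quadratic in the number of vertices and is meant only to mirror the formula.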
Evaluation Metrics
Partition Coefficient:
o It measures the amount of overlap between clusters and is
given as:
$$PC = \frac{1}{N} \sum_{i=1}^{k} \sum_{j=1}^{N} \mu_{ij}^{2}$$
Conductance:
o It can be defined as the ratio of the number of inter-cluster edges to the
minimum number of edges incident on either the cluster $C_k$ or its complement $\bar{C}_k$:
$$\mathrm{Cond}(C_k) = \frac{\sum_{i \in C_k,\, j \in \bar{C}_k} A_{ij}}{\min\left(A(C_k),\, A(\bar{C}_k)\right)}$$
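Both metrics can be sketched in a few lines. The example assumes a membership matrix indexed as cluster by vertex for the partition coefficient, and it interprets "edges incident on a cluster" as the sum of vertex degrees in that cluster for conductance; that interpretation and the function names are assumptions of this sketch.

```python
def partition_coefficient(membership):
    """PC = (1/N) * sum of squared memberships.

    membership[i][j] is the membership of vertex j in cluster i
    (k rows, N columns). Values near 1 indicate little overlap.
    """
    n = len(membership[0])
    return sum(mu * mu for row in membership for mu in row) / n


def conductance(adj, cluster):
    """Conductance of a cluster in an undirected graph.

    adj: dict vertex -> set of neighbours.
    Assumption: A(C) is taken as the sum of degrees of the vertices in C.
    """
    cluster = set(cluster)
    rest = set(adj) - cluster
    cut = sum(1 for u in cluster for v in adj[u] if v not in cluster)
    vol_in = sum(len(adj[u]) for u in cluster)
    vol_out = sum(len(adj[u]) for u in rest)
    denom = min(vol_in, vol_out)
    return cut / denom if denom else 0.0


crisp = [[1.0, 0.0], [0.0, 1.0]]   # two vertices, no overlap
fuzzy = [[0.5, 0.5], [0.5, 0.5]]   # two vertices, maximal overlap
print(partition_coefficient(crisp), partition_coefficient(fuzzy))

adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(conductance(adj, {0, 1, 2}))
```

A crisp partition gives PC = 1, while equal memberships across two clusters give PC = 0.5. For the two-triangle graph, one cut edge over a volume of 7 gives a conductance of 1/7; lower values indicate better-separated clusters.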
Pseudo-code: PGFC
Performance Evaluation: PGFC
Scalability: PGFC
Limitations of PGFC
Requires the number of clusters to be pre-defined.
Flow Chart of proposed PFCA
o It selects the candidate cluster heads based on their influence in the network.
o It determines the number of clusters by analyzing the graph structure using
the Personalized PageRank algorithm and modularity.
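For readers unfamiliar with Personalized PageRank, the sketch below shows the generic power-iteration form: a random walk that, with probability 1 - alpha, restarts at a chosen seed vertex. It illustrates the building block only and is not the PFCA procedure; the function name, the damping value 0.85, and the handling of dangling vertices are assumptions of this example.

```python
def personalized_pagerank(adj, seed, alpha=0.85, iters=100, tol=1e-10):
    """Power iteration for Personalized PageRank with a single seed vertex.

    adj: dict vertex -> list of out-neighbours (every vertex must be a key).
    Returns dict vertex -> score; the scores sum to 1.
    """
    rank = {v: 0.0 for v in adj}
    rank[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            if nbrs:
                share = alpha * rank[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            else:
                # Assumption: a dangling vertex returns its mass to the seed.
                nxt[seed] += alpha * rank[u]
        nxt[seed] += 1.0 - alpha  # restart probability goes to the seed
        delta = sum(abs(nxt[v] - rank[v]) for v in adj)
        rank = nxt
        if delta < tol:
            break
    return rank


adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
scores = personalized_pagerank(adj, seed="a")
print(scores)
```

Vertices close to the seed receive higher scores than distant ones, which is why such scores can serve as a measure of local influence around a candidate cluster head.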
Proposed Algorithm: PFCA
Cluster Formation & Optimization
Proposed Algorithm: PFCA
Filtering: Identifying fuzzy and crisp vertices
Performance Evaluation: PFCA
Dataset Characteristics
Performance
Performance Evaluation: PFCA
Run Time vs No. of Supersteps
Performance Evaluation: PFCA
Accuracy in terms of F1 and F2 measures
Scalability: PFCA
Limitations of PFCA
Information Loss:
Deep Learning based Autoencoders
Performance Evaluation: DFuzzy
Dataset and Effect of multiple layers
Performance Evaluation: DFuzzy
Run Time vs No. of Supersteps
Performance Evaluation: DFuzzy
Scalability: DFuzzy
Performance Comparison: Coverage
The graph coverage indicates how many vertices are assigned to clusters
Proposed Frequent Subgraph Mining Algorithms
Exact Subgraph Mining: PaGro
o Exact subgraph mining algorithm is proposed by leveraging the
operative communication primitives for better scalability.
o A two-step hybrid approach is developed for optimization of
subgraph pruning task at both local and global levels to avoid the
excess communication overhead.
Approximate Subgraph Mining: Ap-FSM
o An approximate frequent subgraph mining algorithm which
exploits sampling for faster processing.
o A novel sampling approach named G-Samp is proposed for the
selection of an approximate subgraph while capturing the original
graph properties for convenient and relatively easy analysis.
Proposed PaGro
Proposed Algorithm: PaGro
Dataset Used in PaGro
Performance Evaluation: PaGro Optimizations
[Figure panels: Baseline Algorithm, Isomorphism Discovery, Communication Cost]
Scalability: PaGro
Working of Ap-FSM
Performance Metrics for G-Samp
Degree Distribution
o In-degree is the number of links directed to the vertex, and
out-degree is the number of links that the vertex directs to others
Clustering Coefficient
o It is a measure of the degree to which the vertices of a graph tend to
form a cluster
o There are two versions of this measure: Local (LCC) and Global
(GCC). The LCC of a vertex v measures how close v and its
neighbors are to forming a clique, while the GCC gives an overall
indication of the clustering in the graph
Edge Betweenness
o Edge betweenness can be defined as the number of shortest paths σij in
the graph G that pass through a given edge (u, v)
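The first two metrics are simple enough to sketch directly. The example below counts in-degree and out-degree from a directed edge list and computes the local clustering coefficient as the fraction of a vertex's neighbour pairs that are themselves linked. Function names and input layouts are hypothetical, and edge betweenness is left out because it needs an all-pairs shortest-path computation.

```python
from collections import Counter


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for directed (u, v) edges."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1  # u directs a link to another vertex
        in_deg[v] += 1   # v receives a link
    return in_deg, out_deg


def local_clustering(adj, v):
    """Local clustering coefficient of v in an undirected graph.

    adj: dict vertex -> set of neighbours.
    Returns 0.0 for vertices with fewer than two neighbours.
    """
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in adj[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))


in_deg, out_deg = degree_counts([("a", "b"), ("a", "c"), ("b", "c")])
print(dict(in_deg), dict(out_deg))

# Triangle 0-1-2 with a pendant vertex 3 attached to 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(adj, 0), local_clustering(adj, 1))
```

Comparing these values between a sample and the original graph is one way to check whether the sample preserves the original graph's properties.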
Performance Evaluation: G-Samp
Performance Evaluation: Ap-FSM
Accuracy Evaluation of Ap-FSM
[Figure: F-Measure (0.5 to 1) versus support threshold τ (2k to 10k) for the US Patents, LiveJournal and Twitter datasets]
Conclusion
A parallel fuzzy clustering algorithm named PGFC is proposed which finds fuzzy clusters by amending the
structure of the classical Fuzzy C-Means algorithm for large graph data. The degree measure is used for the
initialization of cluster centers. The results show that PGFC performs better than state-of-the-art clustering algorithms
in terms of partition coefficient and conductance.
A parallel fuzzy clustering algorithm named PFCA is proposed for large graphs where the number of
clusters is not pre-defined. PFCA outperforms the state-of-the-art clustering algorithms in terms of run time,
modularity and conductance.
A deep learning based fuzzy clustering algorithm named DFuzzy is proposed which performs clustering by
leveraging ideas from deep learning pipelines. The results show that the use of autoencoders for fuzzy clustering
helps in obtaining good-quality clusters.
An exact subgraph mining algorithm named PaGro is proposed by leveraging the operative communication
primitives for better scalability. A two-step hybrid approach is developed for optimization of the subgraph
pruning task at both local and global levels to avoid excess communication overhead. PaGro performs
better than state-of-the-art FSM algorithms in terms of processing time and memory overhead.
An approximate frequent subgraph mining algorithm named Ap-FSM is proposed which exploits sampling
for faster processing. A novel sampling approach named G-Samp is proposed for the selection of an
approximate subgraph while capturing the original graph properties for convenient and relatively easy
analysis. The results show that it outperforms the competing algorithms and is time efficient.
Future Scope
A suitable partitioning technique could be applied to the proposed
List of Publications
Vandana Bhatia and Rinkle Rani, “A Parallel Fuzzy Clustering algorithm for Large Graphs using Pregel”,
Expert Systems with Applications, Vol-78, pp. 135-144, 2017. [SCIE Indexed, Impact Factor-3.928] doi-
10.1016/j.eswa.2017.02.005.
Vandana Bhatia and Rinkle Rani, “DFuzzy: A Deep learning based Fuzzy Clustering Model for Large
Graphs”, Knowledge and Information Systems [SCIE Indexed, Impact Factor-2.004] doi-
10.1007/s10115-018-1156-3.
Vandana Bhatia and Rinkle Rani, “Ap-FSM: A Parallel algorithm for Approximate Frequent Subgraph
Mining using Pregel”, Expert Systems with Applications,Vol-106, pp. 217-232, 2018. [SCIE Indexed,
Impact Factor-3.928] doi- 10.1016/j.eswa.2018.04.010.
Vandana Bhatia and Rinkle Rani, “PFCA: An Influence based Parallel Fuzzy Clustering algorithm for Large
Complex Networks”, Expert Systems-SCIE Indexed, Impact factor: 1.18. [In Press]
Vandana Bhatia and Rinkle Rani, “PaGro: A Distributed Pattern Growth based Frequent Subgraph Mining
algorithm for Large Graphs”, IEEE Transactions on Parallel and Distributed Computing-SCIE Indexed,
Impact factor: 4.181. [Under Review]
Vandana Bhatia and Rinkle Rani, “PSGC: Parallel Structural Graph Clustering algorithm based on
Subgraph Similarity”, The Journal of Supercomputing -SCIE Indexed, Impact Factor-1.326
[Communicated].
Vandana Bhatia and Rinkle Rani, “An Efficient Influence based Label Propagation algorithm for Clustering
large graphs”, in the proceedings of IEEE International Conference on Infocom Technologies and
Unmanned Systems (ICTUS'2017), pp. 480-486, 18-20 December 2017.
Vandana Bhatia and Rinkle Rani, “An Efficient algorithm for Sampling of Single Large Graph”, in the
proceedings of IEEE 10th International Conference on Contemporary Computing (IC3’2017), 10-12
August 2017.
Vandana Bhatia, Bharti Saneja and Rinkle Rani, “INGC: Graph Clustering & Outlier Detection algorithm
using Label Propagation”, in the proceedings of IEEE International Conference on Machine Learning and
Data Science, 13-15 December 2017.
Thank You