Gspan 11

gSpan: Graph-Based Substructure Pattern Mining
Authors: Xifeng Yan and Jiawei Han
Presented by: Ahmed R. Nabhan

University of Vermont
Copyright note:
This presentation was originally provided by Prof. Xifeng Yan at a student's request.
Citation:
Xifeng Yan and Jiawei Han. gSpan: graph-based
substructure pattern mining. In IEEE
International Conference on Data Mining (ICDM),
2002
Outline
Background
Problem Definition
Authors' Contribution
Concepts behind gSpan
Experimental Results
Conclusion
Background
Frequent subgraph mining is an extension of existing frequent pattern mining algorithms
A major challenge is to count how many instances of a pattern are in the dataset
Counting instances might be easy for sets, but it is subtle for graphs
Recall the graph isomorphism problem
Background
[Figure: (a) graph G1 = (V1, E1, L1); (b) graph G2 = (V2, E2, L2); (c) the mapping function f between their vertices]
(c) f(V1.1) = V2.2
    f(V1.2) = V2.5
    f(V1.3) = V2.3
    f(V1.4) = V2.4
    f(V1.5) = V2.1
Two isomorphic graphs (a) and (b) with their mapping function (c)
Two graphs are isomorphic if one can find a mapping of nodes of the
first graph to the second graph such that labels on nodes and edges
are preserved.
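The definition above can be checked directly on small graphs. The following is a minimal brute-force sketch, not the method used by gSpan: it tries every vertex mapping, so it is only practical for a handful of vertices. The representation (a dict of vertex labels plus a dict of edge labels keyed by vertex pairs) and the name find_isomorphism are assumptions made for illustration.

```python
from itertools import permutations


def find_isomorphism(vlabels1, elabels1, vlabels2, elabels2):
    """Return a label-preserving vertex mapping f: G1 -> G2, or None.

    vlabels*: dict vertex -> label; elabels*: dict (u, v) -> label
    for undirected simple graphs. Brute force over all permutations.
    """
    if len(vlabels1) != len(vlabels2) or len(elabels1) != len(elabels2):
        return None
    # Store G2's edges as unordered pairs so (u, v) and (v, u) match.
    edges2 = {frozenset(pair): lab for pair, lab in elabels2.items()}
    nodes1 = list(vlabels1)
    for image in permutations(vlabels2):
        f = dict(zip(nodes1, image))
        # Vertex labels must be preserved ...
        if any(vlabels1[u] != vlabels2[f[u]] for u in nodes1):
            continue
        # ... and every edge of G1 must map onto an equally labeled edge of G2.
        if all(edges2.get(frozenset((f[u], f[v]))) == lab
               for (u, v), lab in elabels1.items()):
            return f
    return None
```

Because both graphs have the same number of edges and the mapping is one-to-one, mapping every edge of G1 onto an edge of G2 is enough to make the correspondence hold in both directions.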
Problem: Finding Frequent Subgraphs
Problem setting: similar to finding frequent itemsets for association rule discovery
Input: Database of graph transactions
Undirected simple graphs (no loops (?), no multiple edges)
Each graph transaction has labeled edges and vertices.
Transactions need not be connected
Minimum support threshold
Output: Frequent subgraphs that satisfy the support threshold, where each frequent subgraph is connected.
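To make the problem setting concrete, here is a small sketch of support counting over a database of graph transactions. It assumes that support means the number of transactions containing the pattern at least once, by analogy with itemset mining, and it uses brute-force matching rather than anything from gSpan. The names contains, support, and is_frequent are hypothetical.

```python
from itertools import permutations


def contains(pattern, graph):
    """True if `pattern` occurs in `graph` as a label-preserving subgraph.

    Each graph is a pair (vertex_labels, edge_labels), where edge_labels
    maps a vertex pair (u, v) of an undirected simple graph to its label.
    """
    p_vlabels, p_elabels = pattern
    g_vlabels, g_elabels = graph
    g_edges = {frozenset(pair): lab for pair, lab in g_elabels.items()}
    p_nodes = list(p_vlabels)
    # Try every one-to-one placement of the pattern's vertices in the graph.
    for image in permutations(g_vlabels, len(p_nodes)):
        f = dict(zip(p_nodes, image))
        if all(p_vlabels[u] == g_vlabels[f[u]] for u in p_nodes) and all(
                g_edges.get(frozenset((f[u], f[v]))) == lab
                for (u, v), lab in p_elabels.items()):
            return True
    return False


def support(pattern, database):
    """Number of graph transactions that contain the pattern."""
    return sum(contains(pattern, g) for g in database)


def is_frequent(pattern, database, min_support):
    return support(pattern, database) >= min_support
```

With a three-graph database in which two graphs contain an X-Y edge labeled a, that one-edge pattern has support 2 and is frequent at a threshold of 2, while any larger pattern built on it can only have equal or lower support.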
Finding Frequent Subgraphs
Xifeng Yan
Authors' Contribution
Representing graphs as strings (like TreeMiner)
No candidate generation!
It combines the growing and checking of frequent subgraphs into one procedure, thus accelerating the mining process.
Fast in practice, and still a standard baseline that most competing systems are compared against.
Concepts behind gSpan
The idea is to produce Depth-First Search (DFS) codes for the edges of each graph
Edges are sorted according to the lexicographic order of their codes
Yan and Han proved that isomorphism of two graphs can be tested by comparing their DFS codes
Starting with small graph patterns containing one edge, patterns are expanded systematically by depth-first search
Employs the anti-monotonic property of graph frequency
Anti-Monotonicity of Graph Frequency
The frequency of a super-pattern is less than or equal to the frequency of a sub-pattern.
Copyright SIGMOD08
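Written as a formula, the property and the pruning rule it justifies look as follows. The symbols g, g', support, and min_sup are notation assumed here for illustration; they do not appear on the slide.

```latex
\begin{align*}
g \subseteq g' \;&\Longrightarrow\; \mathrm{support}(g') \le \mathrm{support}(g) \\
\mathrm{support}(g) < \mathit{min\_sup} \;&\Longrightarrow\; \mathrm{support}(g') < \mathit{min\_sup}
\quad \text{for every } g' \supseteq g
\end{align*}
```

The second line is why an infrequent pattern never needs to be extended: none of its super-patterns can be frequent.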
Lexicographic Ordering in Graph
A lexicographic ordering tells us the order of any two graphs.
The design can help us build a similar hierarchy of patterns.
The design should make it easy to grow from one level to the next lower level and to roll up from a lower level to a higher level.
For graphs, it may be difficult to find a design in which no two nodes of this tree are the same.
The ordering can tell us whether a graph has already been discovered.
Most importantly, if a graph has already been discovered, all of its child nodes in the hierarchy must have been discovered as well.
Lexicographic Ordering in Graph

[Figure: hierarchy of graph patterns arranged by level: 1-edge, 2-edge, 3-edge, ...]
DFS code and Minimum DFS code
Depth First Tree and Forward/Backward Edge Set

DFS code and Minimum DFS code
We use a 5-tuple (vi, vj, l(vi), l(vj), l(vi,vj)) to represent an edge. (It may be redundant, but it is much easier to understand.)
Turn a graph into a sequence whose basic element is a 5-tuple. Form the sequence in this order:
1. To extend to one new node, add the forward edge that connects one node in the old graph with this new node.
2. Add all backward edges that connect this new node to other nodes in the old graph.
3. Repeat this procedure.
DFS code
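[Figure: a labeled graph and one depth-first traversal of it. Vertices v0 to v4 carry the labels X, Y, X, Z, Z; edges carry the labels a, b, c, d.]
e0: (0,1,x,y,a)
e1: (1,2,y,x,b)
e2: (2,0,x,x,a)
e3: (2,3,x,z,c)
e4: (3,1,z,y,b)
e5: (1,4,y,z,d)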
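The construction procedure can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it follows one particular depth-first traversal (neighbors visited in sorted order, an arbitrary choice) and emits 5-tuples in the slide's layout (vi, vj, l(vi), l(vj), l(vi,vj)). The function name dfs_code and the dict-based graph representation are assumptions.

```python
def dfs_code(vlabels, elabels, root):
    """Build one DFS code of a connected, undirected, labeled graph.

    vlabels: dict vertex -> label; elabels: dict (u, v) -> label.
    Neighbors are visited in sorted order, so the result is just one
    of the graph's many DFS codes.
    """
    adj = {v: {} for v in vlabels}
    for (u, v), lab in elabels.items():
        adj[u][v] = lab
        adj[v][u] = lab

    index = {root: 0}  # vertex -> discovery number
    code = []

    def visit(u):
        for w in sorted(adj[u]):
            if w in index:
                continue
            index[w] = len(index)
            # Forward edge: connects the old graph to the new vertex w.
            code.append((index[u], index[w], vlabels[u], vlabels[w], adj[u][w]))
            # Backward edges: from w to already discovered vertices other
            # than its parent, in order of discovery number.
            seen = [x for x in adj[w] if x in index and x != u]
            for x in sorted(seen, key=index.get):
                code.append((index[w], index[x], vlabels[w], vlabels[x], adj[w][x]))
            visit(w)

    visit(root)
    return code
```

Run on a graph with vertex labels x, y, x, z, z for vertices 0 to 4 and edges 0-1 (a), 1-2 (b), 2-0 (a), 2-3 (c), 3-1 (b), 1-4 (d), starting at vertex 0, the sketch yields a six-edge code that begins (0,1,x,y,a), (1,2,y,x,b), (2,0,x,x,a).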
Minimum DFS code

Each graph may have many DFS codes (why?); the lexicographically smallest one is its minimum DFS code.
Edge no.   (B)            (C)            (D)
0          (0,1,x,y,a)    (0,1,y,x,a)    (0,1,x,x,a)
1          (1,2,y,x,b)    (1,2,x,x,a)    (1,2,x,y,b)
2          (2,0,x,x,a)    (2,0,x,y,b)    (2,0,y,x,a)
3          (2,3,x,z,c)    (2,3,x,z,c)    (2,3,y,z,b)
4          (3,1,z,y,b)    (3,0,z,y,b)    (3,1,z,x,c)
5          (1,4,y,z,d)    (0,4,y,z,d)    (2,4,y,z,d)
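One way to see why a graph has many DFS codes, and how a smallest one can be picked, is to enumerate them all. The sketch below does this by brute force (every start vertex and every neighbor order), which is exponential and is not how gSpan works internally. The comparison rule in edge_key is an assumption modeled on the DFS lexicographic order: at the first position where two codes differ, backward edges precede forward edges, forward edges from a deeper vertex come first, and ties are broken by labels in the slide's tuple order. The function names are hypothetical.

```python
def all_dfs_codes(vlabels, elabels):
    """Yield every DFS code of a connected, undirected, labeled graph."""
    adj = {v: {} for v in vlabels}
    for (u, v), lab in elabels.items():
        adj[u][v] = lab
        adj[v][u] = lab

    def extend(code, index, path):
        # Backtrack to the deepest vertex with an undiscovered neighbor.
        path = list(path)
        while path and all(w in index for w in adj[path[-1]]):
            path.pop()
        if not path:
            yield tuple(code)
            return
        u = path[-1]
        for w in adj[u]:
            if w in index:
                continue
            j = len(index)
            # Forward edge to the new vertex, then its backward edges.
            new_code = code + [(index[u], j, vlabels[u], vlabels[w], adj[u][w])]
            back = sorted((index[a], a) for a in adj[w] if a in index and a != u)
            new_code += [(j, i, vlabels[w], vlabels[a], adj[w][a]) for i, a in back]
            yield from extend(new_code, {**index, w: j}, path + [w])

    for root in vlabels:
        yield from extend([], {root: 0}, [root])


def edge_key(edge):
    """Sort key for one 5-tuple (assumed ordering, see lead-in)."""
    i, j, li, lj, le = edge
    if j < i:                      # backward edge
        return (0, j, li, lj, le)
    return (1, -i, li, lj, le)     # forward edge: deeper source first


def min_dfs_code(vlabels, elabels):
    """Smallest DFS code of the graph under the assumed ordering."""
    return min(all_dfs_codes(vlabels, elabels),
               key=lambda code: [edge_key(e) for e in code])
```

Because the set of DFS codes depends only on the structure and labels of the graph, two isomorphic graphs always produce the same minimum, whatever their vertex names.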
Graph Parent and its Children

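[Figure not preserved: a labeled graph (vertex labels X, Y, Z; edge labels a, b, c) with its possible extensions]
Given a DFS code c0 = (e0, e1, ..., en)
and c1 = (e0, e1, ..., en, ex):
if c0 < c1, then
c0 is c1's parent,
c1 is c0's child.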
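The parent/child relation is a prefix relation on edge sequences, which is easy to express in code. This is a small sketch with hypothetical helper names; it checks only the prefix structure and assumes, as in the DFS lexicographic order, that a code is never larger than a code it is a prefix of.

```python
def parent(code):
    """Drop the last edge of a DFS code to obtain its parent's code."""
    return tuple(code[:-1])


def is_child(c0, c1):
    """True if c1 extends c0 by exactly one edge, i.e. c0 is c1's parent."""
    return len(c1) == len(c0) + 1 and parent(c1) == tuple(c0)
```

For example, the two-edge code ((0,1,x,y,a), (1,2,y,x,b)) is the parent of the three-edge code obtained by appending (2,0,x,x,a).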
DFS Code Tree

[Figure: DFS Code Tree with levels for 1-edge, 2-edge, and 3-edge codes, ...]
Theorem
1. Given two graphs G0 and G1, G0 is isomorphic to G1 iff min_dfs_code(G0) = min_dfs_code(G1).
2. The DFS Code Tree covers all graphs, although some tree nodes may represent the same graph. (Covering)
3. Given a node in the DFS Code Tree, if its DFS code is not its minimum DFS code, pruning this node and all of its descendants won't change covering.
Algorithm
[Two slides; content not preserved in this transcript]
Experimental Results
[Two slides; content not preserved in this transcript]
Conclusion
No candidate generation and no false-positive testing
Space saving from depth-first search (DFS)
Good performance: with a memory pool and one major counting improvement, performance appears to improve by a further factor of about 5 (but this needs more testing).
Questions?
Exam Questions
Q1) Compare gSpan to Apriori-based algorithms.
Answer:
Unlike Apriori-based algorithms, gSpan neither generates candidate patterns nor tests them to prune false positives. This makes gSpan efficient in both time and space. Apriori-based algorithms must generate a candidate and then test it for isomorphism against the graph dataset to calculate support. This test is costly. gSpan, on the other hand, avoids this separate isomorphism test by combining pattern growth and frequency checking into one procedure.
Q2) What are the main concepts behind gSpan?
Answer:
- Using Depth-First Search (DFS) codes to label graph edges
- Employing the anti-monotonic property of subgraph frequency
- Pattern growth and pruning
Exam Questions (cont.)
Q3) Please list the similar and different features of gSpan and TreeMiner.
Answer:
- Both algorithms employ a string representation of graphs
- TreeMiner generates candidate patterns and then finds their support, while gSpan expands frequent patterns directly
- gSpan is more generally applicable (it can handle both trees and graphs)
