
Algorithms for Data Science

CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Thursday, September 24, 2015

Outline

1 Recap

2 Applications of DFS

Cycle detection
Topological sorting
Strongly connected components


Review of the last lecture

1. Applications of BFS
   - Connected components in undirected graphs
   - Testing bipartiteness

2. DFS
   - Classification of graph edges in directed graphs: back, forward, cross
   - Time intervals of vertices; identifying the type of an edge from the
     time intervals of its endpoints

Finding your way in a maze

Depth-first search (DFS): starting from a vertex s, explore the graph as
deeply as possible, then backtrack.
1. Try the first edge out of s, towards some node v.
2. Continue from v until you reach a dead end, that is, a node whose
   neighbors have all been explored.
3. Backtrack to the first node you encounter that still has an unexplored
   neighbor and repeat 2.
Remark: DFS answers s-t connectivity.

Directed graphs: classification of edges

Graph edges that do not belong to the DFS tree(s) may be
1. forward: from a vertex to a descendant (other than a child)
2. back: from a vertex to an ancestor
3. cross: from right to left (no ancestral relation), that is
   - from tree to tree
   - between nodes in the same tree but on different branches

On the time intervals of vertices u, v

If we use an explicit stack, then
- start(u) is the time when u is pushed onto the stack
- finish(u) is the time when u is popped from the stack (that is, all of
  its neighbors have been explored).

Intervals [start(u), finish(u)] and [start(v), finish(v)] either
- are nested, one containing the other (u is an ancestor of v or vice
  versa); or
- are disjoint.
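
The explicit-stack view of start and finish times can be sketched as follows. This is a minimal illustration, not code from the lecture: the names `dfs_intervals` and `nested_or_disjoint` are hypothetical, and the graph is assumed to be a dict mapping every vertex to a list of its out-neighbors.

```python
def dfs_intervals(graph):
    """Iterative DFS; start[u] is set on push, finish[u] on pop.

    Assumption: every vertex appears as a key of `graph`.
    """
    start, finish = {}, {}
    clock = 0
    for s in graph:
        if s in start:
            continue
        clock += 1
        start[s] = clock
        # Each stack frame keeps an iterator over the unexplored neighbors.
        stack = [(s, iter(graph[s]))]
        while stack:
            u, neighbors = stack[-1]
            for v in neighbors:
                if v not in start:
                    clock += 1
                    start[v] = clock
                    stack.append((v, iter(graph[v])))
                    break
            else:
                # All neighbors of u explored: pop u and record finish time.
                stack.pop()
                clock += 1
                finish[u] = clock
    return start, finish


def nested_or_disjoint(start, finish, u, v):
    """Check that the intervals of u and v are nested or disjoint."""
    nested = (start[u] < start[v] and finish[v] < finish[u]) or \
             (start[v] < start[u] and finish[u] < finish[v])
    disjoint = finish[u] < start[v] or finish[v] < start[u]
    return nested or disjoint


example = {1: [2, 3, 4], 2: [3], 3: [1], 4: [3]}
start, finish = dfs_intervals(example)
```

On this small example graph every pair of distinct vertices passes the `nested_or_disjoint` check, as the slide predicts.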

Classifying edges using time

1. Edge (u, v) ∈ E is a back edge in a DFS tree if and only if
   start(v) < start(u) < finish(u) < finish(v).
2. Edge (u, v) ∈ E is a forward edge if
   start(u) < start(v) < finish(v) < finish(u).
3. Edge (u, v) ∈ E is a cross edge if
   start(v) < finish(v) < start(u) < finish(u).
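
The three interval tests above translate fairly directly into code. The sketch below is one possible reading, under stated assumptions: the graph is a dict of adjacency lists with every vertex as a key, `classify_edges` is a hypothetical name, tree edges are told apart from forward edges by a parent pointer, and a self-loop is counted as a back edge (the slide's strict inequalities do not cover that case).

```python
def classify_edges(graph):
    """Label every edge of a directed graph as tree/forward/back/cross."""
    start, finish, parent = {}, {}, {}
    clock = 0

    def visit(u):
        nonlocal clock
        clock += 1
        start[u] = clock
        for v in graph[u]:
            if v not in start:
                parent[v] = u
                visit(v)
        clock += 1
        finish[u] = clock

    for s in graph:
        if s not in start:
            parent[s] = None
            visit(s)

    kinds = {}
    for u in graph:
        for v in graph[u]:
            if start[v] <= start[u] and finish[u] <= finish[v]:
                kinds[(u, v)] = "back"      # v is an ancestor of u (or u itself)
            elif parent[v] == u:
                kinds[(u, v)] = "tree"      # v was discovered through this edge
            elif start[u] < start[v] and finish[v] < finish[u]:
                kinds[(u, v)] = "forward"   # v is a non-child descendant of u
            else:
                kinds[(u, v)] = "cross"     # v finished before u started
    return kinds


kinds = classify_edges({1: [2, 3, 4], 2: [3], 3: [1], 4: [3]})
```

The recursive `visit` is fine for small inputs; for very deep graphs Python's recursion limit would call for an explicit stack instead.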


Application I: Cycle detection


Claim 1.
G = (V, E) has a cycle if and only if DFS(G) yields a back edge.

Proof.
If (u, v) is a back edge, then together with the path in the DFS tree
from v to u it forms a cycle.
Conversely, suppose G has a cycle. Let v be the first vertex of the
cycle discovered by DFS(G), and let (u, v) be the edge of the cycle that
enters v. Since there is a path from v to every vertex on the cycle, all
vertices on the cycle are discovered and fully explored before v is
popped from the stack. Hence the interval of u is contained in the
interval of v. By the time-interval characterization of back edges,
(u, v) is a back edge.
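
Claim 1 suggests a simple cycle test: run DFS and report a cycle as soon as an edge leads to a vertex that is still on the stack. The sketch below assumes a dict-of-lists graph with every vertex as a key; `has_cycle` and the three state names are illustrative choices, not part of the lecture.

```python
def has_cycle(graph):
    """Return True if the directed graph contains a cycle (a back edge)."""
    UNSEEN, ON_STACK, DONE = 0, 1, 2
    state = {u: UNSEEN for u in graph}

    def visit(u):
        state[u] = ON_STACK
        for v in graph[u]:
            if state[v] == ON_STACK:
                return True            # (u, v) is a back edge
            if state[v] == UNSEEN and visit(v):
                return True
        state[u] = DONE                # u is popped: fully explored
        return False

    return any(state[s] == UNSEEN and visit(s) for s in graph)
```

A vertex in state `ON_STACK` has been started but not finished, so an edge into it satisfies the back-edge interval condition.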

Application II: Topological sorting in DAGs

- An undirected acyclic graph has an extremely simple structure: it is a
  forest (a tree if it is connected), hence a sparse graph (O(n) edges).
- A directed acyclic graph (DAG) may be dense (Θ(n^2) edges):
  e.g., V = {1, . . . , n}, E = {(i, j) : i < j}.

Topological sorting: motivation


Input:
- a set of tasks {1, 2, . . . , n} that need to be performed
- a set of dependencies, each of the form (i, j), indicating that task i
  must be performed before task j.

Output: a valid order in which the tasks may be performed, so that all
dependencies are respected.
Example: tasks are courses and certain courses must be taken
before others.
How can we model this problem using a graph? What kind of
graph must arise and why?

Topological ordering: definition

Definition 1.
A topological ordering of G is an ordering of its nodes as
1, 2, . . . , n such that for every edge (i, j), we have i < j.

- All edges point forward in the topological ordering.
- It provides an order in which all tasks can be safely performed: when
  we try to perform task j, all tasks required to precede it have
  already been done.

Example of DAG and its topological sorting


[Figure: a DAG (top left), its topological sort (top right) and a
drawing emphasizing the topological sort (bottom).]

Topological sorting in DAGs

Claim 2.
If G has a topological ordering, then G is a DAG.
Proof: By contradiction (exercise).
A visualization of the proof is provided by the linearized graph
of the previous slide: vertices appear in increasing order, edges
go from left to right, hence no cycles.
Is the converse true: does every DAG have a topological
ordering? And how can we find it?

Structural properties of DAGs


In a DAG, can every vertex have
- an outgoing edge?
- an incoming edge?

Definition 2 (source and sink).


A source is a node with no incoming edges.
A sink is a node with no outgoing edges.

Fact 3.
Every DAG has at least one source and at least one sink.

How can we use Fact 3 to find a topological order?


The node that we label first in the topological sorting must have
no incoming edges. Fact 3 guarantees that such a node exists.

Fact 4.
Let G′ be the graph after a source node and its incident edges have
been removed. Then G′ is a DAG.
Proof: removing edges from G cannot create a cycle!
This gives rise to a recursive algorithm for finding the topological
order of a DAG. Its correctness can be shown by induction (use Facts 3
and 4 to show the induction step).

Algorithm for topological sorting

TopologicalOrder(G)
1. Find a source vertex s and order it first.
2. Delete s and its incident edges from G; let G′ be the new graph.
3. TopologicalOrder(G′)
4. Append the order found after s.

Running time: O(n^2). Can be improved to O(n + m).
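
One common way to reach O(n + m) for the source-removal idea is to keep in-degree counters instead of physically deleting vertices; this variant is usually known as Kahn's algorithm. The slides do not spell out which improvement they have in mind, so the sketch below is an assumption. The name `topological_order_by_sources` is hypothetical, and the graph is assumed to be a dict of adjacency lists with every vertex as a key.

```python
from collections import deque


def topological_order_by_sources(graph):
    """Repeatedly output a source and 'remove' it by lowering in-degrees."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] += 1

    sources = deque(u for u in graph if indegree[u] == 0)
    order = []
    while sources:
        s = sources.popleft()
        order.append(s)
        for v in graph[s]:          # deleting s removes its outgoing edges
            indegree[v] -= 1
            if indegree[v] == 0:    # v has become a source of the new graph
                sources.append(v)

    if len(order) != len(graph):
        # Some vertices never became sources, so the graph is not a DAG.
        raise ValueError("graph has a cycle; no topological order exists")
    return order
```

Each vertex enters the queue once and each edge is examined once, which gives the O(n + m) bound.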

Topological sorting via DFS

Let G = (V, E) be a DAG.
- Run DFS(G); compute finish times.
- Process the tasks in decreasing order of finish times.

Running time: O(m + n)
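
A minimal sketch of the DFS-based procedure: appending each vertex to a list when it finishes yields increasing finish order, so reversing the list gives the processing order. The function name `topological_order_dfs` is an illustrative choice, the graph is assumed to be a dict of adjacency lists with every vertex as a key, and the input is assumed to be a DAG (the sketch does not check for cycles).

```python
def topological_order_dfs(graph):
    """Return the vertices of a DAG in decreasing order of DFS finish time."""
    visited = set()
    by_finish = []          # vertices in increasing order of finish time

    def visit(u):
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                visit(v)
        by_finish.append(u)  # u finishes after all of its descendants

    for s in graph:
        if s not in visited:
            visit(s)
    return by_finish[::-1]


dag = {1: [2, 3], 2: [4], 3: [4], 4: [], 5: [1]}
order = topological_order_dfs(dag)
```

For this example the result places 5 first and 4 last, and every edge points from an earlier to a later position.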

Intuition behind this algorithm

- The task v with the largest finish has no incoming edges (if it had an
  incoming edge from some other task u, then u would have a larger
  finish than v). Hence v does not depend on any other task and it is
  safe to perform it first.
- The same reasoning shows that the task w with the second largest
  finish has no incoming edges from any other task except (maybe) task
  v. Hence it is safe to perform w second.
- And so on and so forth.

Formal proof of correctness


By Claim 1 there are no back edges in the DFS forest of a DAG. Thus
every edge (u, v) ∈ E is either
1. a forward/tree edge: start(u) < start(v) < finish(v) < finish(u)
2. or a cross edge: finish(v) < start(u) < finish(u)

Proof of correctness (contd)

Hence for every (u, v) ∈ E, finish(v) < finish(u).
Consider a task v. All tasks u upon which v depends, that is, all tasks
u such that there is an edge (u, v) ∈ E, satisfy finish(v) < finish(u).
Since we are processing tasks in decreasing order of finish times,
all tasks u upon which v depends have already been processed
before we start processing v.

Exploring the connectivity of a graph

- Undirected graphs: find all connected components
- Directed graphs: find all strongly connected components (SCCs)
  - SCC(u) = set of nodes that are reachable from u and have a path back
    to u
- SCCs provide a hierarchical view of the connectivity of the graph:
  - on a top level, the meta-graph of SCCs has a useful and simple
    structure (coming up);
  - each meta-vertex of this graph is a strongly connected subgraph that
    we can further explore.

How can we find SCC(u) using BFS?

1. Run BFS(u); the resulting tree T consists of the set of nodes to
   which there is a path from u.
2. Define G^r as the reverse graph, where edge (i, j) becomes edge
   (j, i).
3. Run BFS(u) in G^r; the resulting BFS tree T′ consists of the set of
   nodes that have a path to u.
4. The vertices common to T and T′ compose the strongly connected
   component of u.
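
The four steps above can be sketched with two breadth-first searches and a set intersection. The helper names `reachable`, `reverse_graph` and `scc_of` are hypothetical, and the graph is assumed to be a dict of adjacency lists with every vertex as a key.

```python
from collections import deque


def reachable(graph, source):
    """Set of vertices reachable from `source` (the vertices of the BFS tree)."""
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def reverse_graph(graph):
    """Reverse graph: every edge (i, j) becomes (j, i)."""
    rev = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)
    return rev


def scc_of(graph, u):
    """Vertices that are reachable from u and also have a path back to u."""
    return reachable(graph, u) & reachable(reverse_graph(graph), u)
```

Each call costs O(n + m), so repeating it from every vertex to obtain all SCCs would take O(n(n + m)) in the worst case.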
What if we want all the SCCs of the graph?

The meta-graph of SCCs of a directed graph

Consider the meta-graph of all SCCs of G.
- Make a (super)vertex for every SCC.
- Add a (super)edge from SCC Ci to SCC Cj if there is an edge from some
  vertex u of Ci to some vertex v of Cj.

What kind of graph is the meta-graph of SCCs?

[Figure: an example directed graph and its SCCs C1, C2, C3.]

The meta-graph of SCCs is a DAG: if it contained a cycle, all SCCs on
that cycle would be mutually reachable and would therefore form a single
SCC.
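
Given the SCCs, the meta-graph construction described above is a single pass over the edges. In this sketch the name `meta_graph` is hypothetical, the graph is a dict of adjacency lists, and `components` is assumed to be a list of vertex lists that partitions the vertex set; super-vertices are identified by their index in that list.

```python
def meta_graph(graph, components):
    """Build the meta-graph: one super-vertex per SCC, one super-edge per
    pair of distinct SCCs joined by at least one edge of the graph."""
    comp_of = {v: i for i, comp in enumerate(components) for v in comp}
    meta = {i: set() for i in range(len(components))}
    for u in graph:
        for v in graph[u]:
            if comp_of[u] != comp_of[v]:     # skip edges inside an SCC
                meta[comp_of[u]].add(comp_of[v])
    return meta


g = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [4]}
meta = meta_graph(g, [[1, 2, 3], [4, 5], [6]])
```

In this example the component {4, 5} has no outgoing super-edges, so it is a sink of the meta-graph.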

Is there an SCC we could process first?

Suppose we had a sink SCC of G, that is, an SCC with no outgoing edges.
1. What will DFS discover starting at a node of a sink SCC?
2. How do we find a node that for sure lies in a sink SCC?
3. How do we continue to find all other SCCs?

Easier to find a node in a source SCC!


Fact 5.
The node assigned the largest finish time when we run DFS(G) belongs to
a source SCC in G.
Example: v5 belongs to source SCC C2.

Proof.
We will use Lemma 6 below. Let G be a directed graph. The meta-graph of
its SCCs is a DAG. For an SCC C, let
    finish(C) = max_{v ∈ C} finish(v).
Example: finish(C1) = finish(v1) = 8.

Lemma 6.
Let Ci, Cj be distinct SCCs in G. Suppose there is an edge (u, v) ∈ E
such that u ∈ Ci and v ∈ Cj. Then finish(Ci) > finish(Cj).

By Lemma 6, the SCC that contains the node with the largest finish time
cannot have an incoming edge from another SCC, so it is a source SCC.

G^r is useful again

- Fact 5 provides a direct way to find a node in a source SCC of G: pick
  the node with the largest finish.
- But we want a node in a sink SCC of G!
- Consider G^r, the graph where the edges of G are reversed. How do the
  SCCs of G and G^r compare?
- Run DFS on G^r: the node with the largest finish comes from a source
  SCC of G^r (Fact 5). This is a sink SCC of G!

Using this observation to find all SCCs

We now know how to find a sink SCC in G.
1. Run DFS(G^r); compute finish times.
2. Run DFS(G) starting from the node with the largest finish: the nodes
   in the resulting tree T form a sink SCC in G.
How do we find all remaining SCCs?
- Remove T from G; let G′ be the resulting graph.
- The meta-graph of SCCs of G′ is a DAG, hence it has at least one sink
  SCC.
- Apply the procedure above recursively on G′.

Algorithm for finding SCCs in directed graphs


SCC(G = (V, E))
1. Compute G^r.
2. Run DFS(G^r); compute finish(u) for all u.
3. Run DFS(G), picking start vertices in decreasing order of finish(u).
4. Output the vertices of each tree in the DFS forest of line 3 as an
   SCC.

Remark 1.
1. Running time: O(n + m). Why?
2. Equivalently, we can (i) run DFS(G) and compute finish times;
   (ii) run DFS(G^r), picking start vertices in decreasing order of
   finish. Why?
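
The four lines of SCC(G) can be sketched as two DFS passes; this two-pass scheme is commonly known as Kosaraju's algorithm. The sketch below follows the order used on the slide (first DFS on the reverse graph, second on G). The name `strongly_connected_components` is hypothetical, the graph is assumed to be a dict of adjacency lists with every vertex as a key, and the recursion is only suitable for small inputs.

```python
def strongly_connected_components(graph):
    """Return the SCCs of a directed graph as a list of vertex lists."""
    # Line 1: compute the reverse graph.
    rev = {u: [] for u in graph}
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)

    # Line 2: DFS on the reverse graph, recording vertices as they finish.
    visited = set()
    by_finish = []

    def visit_rev(u):
        visited.add(u)
        for v in rev[u]:
            if v not in visited:
                visit_rev(v)
        by_finish.append(u)

    for s in graph:
        if s not in visited:
            visit_rev(s)

    # Lines 3-4: DFS on G in decreasing finish order; each tree is one SCC.
    assigned = set()
    components = []

    def collect(u, comp):
        assigned.add(u)
        comp.append(u)
        for v in graph[u]:
            if v not in assigned:
                collect(v, comp)

    for s in reversed(by_finish):
        if s not in assigned:
            comp = []
            collect(s, comp)
            components.append(comp)
    return components


g = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [4]}
sccs = strongly_connected_components(g)
```

Both passes touch every vertex and edge a constant number of times, which is where the O(n + m) bound in Remark 1 comes from.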

A directed graph and its DFS forest with time intervals

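[Figure: a directed graph on vertices 1-7 and its DFS forest. Time
intervals (start, finish) recovered from the figure:]

Vertex   (start, finish)
1        (1, 8)
2        (2, 5)
3        (3, 4)
4        (6, 7)
5        (9, 14)
6        (10, 13)
7        (11, 12)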

DFS forest of G^r; nodes are considered by decreasing finish times

[Figure: DFS forest of G^r; each vertex is labeled with its finish time
(14, 13, 12, 8, 7, 5, 4).]

Still need to prove Lemma 6

Let G be a directed graph. The meta-graph of its SCCs is a DAG.
For an SCC C, let
    finish(C) = max_{v ∈ C} finish(v).

Lemma 6 (restated).
Let Ci, Cj be distinct SCCs in G. Suppose there is an edge (u, v) ∈ E
such that u ∈ Ci and v ∈ Cj. Then finish(Ci) > finish(Cj).

Proof of Lemma 6

There are two cases to consider:
1. start(u) < start(v) (DFS starts at Ci)
   - Before leaving u, DFS will explore edge (u, v).
   - Since v ∈ Cj, all of Cj will now be explored.
   - Since there is no path from Cj back to Ci (DAG!), all vertices in
     Cj will be assigned finish times before DFS backtracks to u and
     assigns a finish time to u. Thus
         finish(Cj) < finish(u) ≤ finish(Ci).

Proof of Lemma 6 (contd)

2. start(u) > start(v) (DFS starts at Cj)
   Since there is no path from Cj to Ci, DFS will finish exploring Cj
   before it restarts from some vertex that will result in discovery of
   Ci. Thus
       finish(Cj) < start(u) < finish(u) ≤ finish(Ci),
   and hence finish(Cj) < finish(Ci).
