Algorithms for Data Science

CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Tuesday, October 8, 2015

Outline

1 Recap
2 Segmented least squares
  - An exponential recursive algorithm
3 A Dynamic Programming (DP) solution
  - A quadratic iterative algorithm
  - Applying the DP principle
4 Sequence alignment

Review of the last lecture

Weighted graphs G = (V, E, w)

- Single-source (origin) shortest paths in graphs with non-negative edge weights
- Dijkstra's algorithm
  - Correctness
  - Implementation

Linear least squares fitting


A foundational problem in statistics: find a line of best fit
through some data points.

Linear least squares fitting

Input: a set P of n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$; we assume $x_1 < x_2 < \ldots < x_n$.

Output: the line L defined as y = ax + b that minimizes the error

$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \tag{1}$$

Linear least squares fitting: solution


Given a set P of data points, we can use calculus to show that the line L given by y = ax + b that minimizes

$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \tag{2}$$

satisfies

$$a = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \tag{3}$$

$$b = \frac{\sum_i y_i - a \sum_i x_i}{n} \tag{4}$$

How fast can we compute a, b?
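As a rough illustration of equations (3) and (4), here is a minimal Python sketch. The function name `fit_line` and the sample points are made up for this example, and it assumes at least two points with distinct x-coordinates so that the denominator in (3) is nonzero. Each sum is a single pass over the data, which suggests that a and b can be computed in O(n) time.

```python
def fit_line(points):
    """Return (a, b, err) for the least-squares line y = a*x + b.

    points: list of (x, y) pairs with at least two distinct x values
    (an assumption of this sketch, not checked here).
    """
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)

    # Equation (3): slope.
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # Equation (4): intercept.
    b = (sum_y - a * sum_x) / n
    # Equation (2): squared error of the fitted line.
    err = sum((y - a * x - b) ** 2 for x, y in points)
    return a, b, err


# Hypothetical sample data lying exactly on y = 2x + 1.
a, b, err = fit_line([(1, 3), (2, 5), (3, 7)])
print(a, b, err)
```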

What if the data changes direction?

What if the data changes direction more than once?

How to detect change in the data

- Any single line would have large error.
- Idea 1: hardcode number of lines to 2 (or some fixed m).
  - Fails for the dataset on the previous slide.
- Idea 2: pass an arbitrary set of lines through the points and seek the set of lines that minimizes the error.
  - Trivial solution: have a different line pass through each pair of consecutive points in P.
- Idea 3: fit the points well, using as few lines as possible.
  - Trade-off between complexity and error of the model

Formalizing the problem


- Input: data set $P = \{p_1, \ldots, p_n\}$ of points on the plane.
- A segment $S = \{p_i, p_{i+1}, \ldots, p_j\}$ is a contiguous subset of the input.
- Partition P into m segments $S_1, S_2, \ldots, S_m$.
- For every segment $S_k$, use (2), (3), (4) to compute a line $L_k$ that minimizes $\mathrm{err}(L_k, S_k)$.
- Let C > 0 be a fixed multiplier. The cost of the partition is

$$\sum_{S_k} \mathrm{err}(L_k, S_k) + m \cdot C$$

Segmented least squares

This problem is an instance of change detection in data mining and statistics.

Input: A set P of n data points $p_i = (x_i, y_i)$ as before.

Output: A segmentation $\{S_1, S_2, \ldots, S_m\}$ of P whose cost

$$\sum_{S_k} \mathrm{err}(L_k, S_k) + m \cdot C$$

is minimum.

A brute force approach

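- We can find the optimal segmentation (that is, the one incurring the minimum cost) by exhaustive search.
  - Enumerate every possible segmentation and compute its cost.
  - Output the one that incurs the minimum cost.
- Drawback: there are $2^{n-1} = \Theta(2^n)$ segmentations to enumerate, since each of the n − 1 gaps between consecutive points either is or is not a segment boundary.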

A crucial observation regarding the last data point

- Consider the last point $p_n$ in the data set.
  - $p_n$ belongs to a single segment in the optimal partition.
  - That segment starts at an earlier point $p_i$, for some $1 \le i \le n$.
- This suggests a recursive solution: if we knew where the last segment starts, then we could remove it and recursively solve the problem on the remaining points $\{p_1, \ldots, p_{i-1}\}$.

A recursive approach

- Let OPT(j) denote the cost of the optimal segmentation for points $p_1, \ldots, p_j$.
- Then, if the last segment of the optimal segmentation is $\{p_i, \ldots, p_n\}$, the cost of the optimal solution is

$$\mathrm{OPT}(n) = \mathrm{err}(L, \{p_i, \ldots, p_n\}) + C + \mathrm{OPT}(i-1).$$

- But we don't know where the last segment starts! How do we find the point $p_i$?
- Set

$$\mathrm{OPT}(n) = \min_{1 \le i \le n} \left\{ \mathrm{err}(L, \{p_i, \ldots, p_n\}) + C + \mathrm{OPT}(i-1) \right\}.$$

A recurrence for the optimal solution


Notation: let $e_{i,j} = \mathrm{err}(L, \{p_i, \ldots, p_j\})$, for $1 \le i \le j \le n$. Then

$$\mathrm{OPT}(n) = \min_{1 \le i \le n} \left\{ e_{i,n} + C + \mathrm{OPT}(i-1) \right\}.$$

If we apply the above expression recursively to remove the last segment, we obtain the recurrence

$$\mathrm{OPT}(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + \mathrm{OPT}(i-1) \right\} \tag{5}$$

Remark 1.
1. We can precompute and store all $e_{i,j}$ using equations (2), (3), (4) in $O(n^3)$ time. This can be improved to $O(n^2)$.
2. The natural recursive algorithm arising from recurrence (5) is not efficient (think about its recursion tree!).

Exponential-time recursion

Notation: T(n) = time to compute the optimal segmentation of n points. Then

$$T(n) \ge T(n-1) + T(n-2).$$

- Can show that $T(n) \ge F_n$, the n-th Fibonacci number (by strong induction on n).
- From Problem 5a in Homework 1, $F_n = \Omega(2^{n/2})$.
- Hence $T(n) = \Omega(2^{n/2})$.

The recursive algorithm requires $\Omega(2^{n/2})$ time.
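To make the blow-up concrete, the sketch below evaluates recurrence (5) by plain recursion and counts how many calls are made. It is only an illustration: the name `naive_opt`, the call counter, and the all-zero error table are assumptions of this example (the error values do not affect the number of calls). In this setup the count doubles with every additional point, which is in line with the exponential lower bound above.

```python
def naive_opt(j, e, C, counter):
    """Evaluate recurrence (5) recursively, with no memoization.

    e[i][j] is the error of the best line through p_i..p_j (1-indexed);
    counter[0] is incremented once per call.
    """
    counter[0] += 1
    if j == 0:
        return 0.0  # boundary condition OPT(0) = 0
    return min(e[i][j] + C + naive_opt(i - 1, e, C, counter)
               for i in range(1, j + 1))


n = 12
# Placeholder errors: only the shape of the recursion tree matters here.
e = [[0.0] * (n + 1) for _ in range(n + 1)]
for j in range(1, n + 1):
    counter = [0]
    naive_opt(j, e, 1.0, counter)
    print(j, counter[0])
```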

Are we really that far from an efficient solution?


Recall the Fibonacci problem from HW1: exponential recursive algorithm, polynomial iterative solution. How?
1. Overlapping subproblems: spectacular redundancy in computations of recursion tree
2. Easy-to-compute recurrence for combining the smaller subproblems: $F_n = F_{n-1} + F_{n-2}$
3. Iterative, bottom-up computations: we computed the subproblems from smallest ($F_0$, $F_1$) to largest ($F_n$), iteratively.
4. Small number of subproblems: only solved n − 1 subproblems.
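The four points above can be seen in a few lines of Python. This is a minimal sketch, assuming the convention $F_0 = 0$, $F_1 = 1$ (the homework may use a different indexing); the function name `fib` is chosen for this example.

```python
def fib(n):
    """Bottom-up Fibonacci, assuming F_0 = 0 and F_1 = 1."""
    if n < 2:
        return n
    prev, cur = 0, 1  # F_0 and F_1, the smallest subproblems
    # The loop body runs n - 1 times: one step per subproblem F_2, ..., F_n.
    for _ in range(2, n + 1):
        prev, cur = cur, prev + cur  # F_k = F_{k-1} + F_{k-2}
    return cur


print([fib(k) for k in range(11)])
```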

Elements of DP in segmented least squares

1. Overlapping subproblems
2. An easy-to-compute recurrence (5) for combining solutions
to the smaller subproblems into a solution to a larger
subproblem in O(n) time (once smaller subproblems have
been solved).
3. Iterative, bottom-up computations: compute the
subproblems from smallest (0 points) to largest (n points),
iteratively.
4. Small number of subproblems: we only need to solve n
subproblems.

A dynamic programming approach

$$\mathrm{OPT}(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + \mathrm{OPT}(i-1) \right\}$$

- The optimal solution to the subproblem on $p_1, \ldots, p_j$ contains optimal solutions to smaller subproblems.
- Recurrence (5) provides an ordering of the subproblems from smaller to larger, with the subproblem of size 0 being the smallest and the subproblem of size n the largest.
- There are n + 1 subproblems in total. Solving the j-th subproblem requires $\Theta(j) = O(n)$ time. The overall running time is $O(n^2)$.
- Boundary conditions: OPT(0) = 0.
- Segment $p_k, \ldots, p_j$ appears in the optimal solution only if the minimum in the expression above is achieved for i = k.

An iterative algorithm for segmented least squares


Let M be an array of n + 1 entries, indexed from 0 to n. M[i] stores the cost of the optimal segmentation of the first i data points.

SegmentedLS(n, P)
  M[0] = 0
  for all pairs i ≤ j do
    Compute e_{i,j} for segment p_i, ..., p_j using (2), (3), (4)
  end for
  for j = 1 to n do
    M[j] = min_{1 ≤ i ≤ j} { e_{i,j} + C + M[i − 1] }
  end for
  return M[n]

Running time: computing all e_{i,j} takes O(n^3) time and filling in the dynamic programming array M takes O(n^2) time, for O(n^3) + O(n^2) in total. Can be brought down to O(n^2).

Reconstructing an optimal segmentation


- Suppose we want the optimal solution in addition to its value, that is, the actual segmentation that achieves the minimum cost M[n].
- We can trace back through the dynamic programming array M to compute the optimal segmentation.

Initial call: OPTSegmentation(n)

OPTSegmentation(j)
  if j == 0 then return
  else
    Find 1 ≤ i ≤ j such that M[j] = e_{i,j} + C + M[i − 1]
    OPTSegmentation(i − 1)
    Output segment {p_i, ..., p_j}
  end if
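SegmentedLS and OPTSegmentation can be put together roughly as follows. This is a sketch under stated assumptions, not the course's reference implementation: the names `segment_error` and `segmented_least_squares` are invented here, a single-point segment is given error 0, and the errors are computed the simple O(n^3) way. One deliberate departure from the pseudocode: the minimizing index i is stored for each j instead of being found again by an equality test, which avoids comparing floating-point numbers for equality.

```python
def segment_error(xs, ys):
    """Squared error of the least-squares line through the given points."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = n * sxx - sx * sx
    if denom == 0:  # single point: any line through it has zero error
        return 0.0
    a = (n * sxy - sx * sy) / denom  # equation (3)
    b = (sy - a * sx) / n            # equation (4)
    return sum((y - a * x - b) ** 2 for x, y in zip(xs, ys))


def segmented_least_squares(points, C):
    """Return (optimal cost, list of segments as 1-indexed (i, j) pairs)."""
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]

    # e[i][j] = error of the best line through p_i..p_j, for 1 <= i <= j <= n.
    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            e[i][j] = segment_error(xs[i - 1:j], ys[i - 1:j])

    M = [0.0] * (n + 1)    # M[j] = OPT(j), with M[0] = 0
    start = [0] * (n + 1)  # start[j] = index i achieving the minimum for M[j]
    for j in range(1, n + 1):
        best_i = min(range(1, j + 1), key=lambda i: e[i][j] + C + M[i - 1])
        M[j] = e[best_i][j] + C + M[best_i - 1]
        start[j] = best_i

    # Trace back from j = n, collecting the last segment each time.
    segments = []
    j = n
    while j > 0:
        segments.append((start[j], j))
        j = start[j] - 1
    segments.reverse()
    return M[n], segments


# Hypothetical data: two parallel line pieces with a jump between x = 4 and x = 5.
pts = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 0), (6, 1), (7, 2), (8, 3)]
print(segmented_least_squares(pts, C=0.5))
```

With a larger C the same data would be covered by fewer segments, which is the complexity versus error trade-off that the multiplier controls.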

Obtaining efficient algorithms using DP

1. Optimal substructure: the optimal solution to the problem contains optimal solutions to the subproblems.
2. A recurrence for the overall optimal solution in terms of
optimal solutions to appropriate subproblems. The
recurrence should provide a natural ordering of the
subproblems from smaller to larger and require polynomial
work for combining solutions to the subproblems.
3. Iterative, bottom-up computation of subproblems, from
smaller to larger.
4. Small number of subproblems (polynomial in n).

Dynamic programming vs Divide & Conquer

- They both combine solutions to subproblems to generate the overall solution.
- However, divide and conquer starts with a large problem and divides it into small pieces, while dynamic programming works from the bottom up, solving the smallest subproblems first and building optimal solutions to steadily larger problems.

String similarity

This problem arises when comparing strings.

Example: consider an online dictionary.
- Input: a word, e.g., "ocurrance"
- Output: did you mean "occurrence"?

Similarity: intuitively, two words are similar if we can almost line them up by using gaps and mismatches.

Aligning strings using gaps and mismatches

We can align "ocurrance" and "occurrence" using

- one gap and one mismatch

  o c - u r r a n c e
  o c c u r r e n c e

- or, three gaps

  o c - u r r a - n c e
  o c c u r r - e n c e

Strings in biology

- Similarity of English words is rather intuitive.
- Determining similarity of biological strings is a central computational problem for molecular biologists.
  - Chromosomes again: an organism's genome consists of chromosomes (giant linear DNA molecules).
  - We may think of a chromosome as an enormous linear tape containing a string over the alphabet {A, C, G, T}.
  - The string encodes instructions for building protein molecules.

Why similarity?

- Why are we interested in similarity of biological strings?
  - Roughly speaking, the sequence of symbols in an organism's genome determines the properties of the organism.
  - So similarity can guide decisions about biological experiments.
- How do we define similarity between two strings?

Similarity based on the notion of lining up two strings

Informally, an alignment between two strings tells us which pairs of positions will be lined up with one another.

Example: X = GCAT, Y = CATG

  x1  x2  x3  x4
  G   C   A   T   -
  -   C   A   T   G
      y1  y2  y3  y4

Then {(2, 1), (3, 2), (4, 3)} is an alignment of X and Y: these are the pairs of positions in X, Y that are aligned (matched).

Definition of alignment of two strings


An alignment L of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$ is a set of ordered pairs of indices (i, j) with $i \in [1, m]$, $j \in [1, n]$ such that the following two properties hold:
P1. every $i \in [1, m]$ and every $j \in [1, n]$ appears at most once;
P2. pairs do not cross: if $(i, j), (i', j') \in L$ and $i < i'$, then $j < j'$.

Example: X = GCAT, Y = CATG

  x1  x2  x3  x4
  G   C   A   T   -
  -   C   A   T   G
      y1  y2  y3  y4

1. {(2, 1), (3, 2), (4, 3)} is an alignment; but
2. {(2, 1), (3, 2), (4, 3), (1, 4)} is not an alignment (violates P2).
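Properties P1 and P2 can be checked mechanically. The following Python sketch is only an illustration; the function name `is_alignment` is hypothetical, and positions are 1-indexed as in the definition.

```python
def is_alignment(pairs, m, n):
    """Check P1 and P2 for a set of 1-indexed pairs (i, j), i in [1, m], j in [1, n]."""
    pairs = sorted(pairs)  # sort by i (then by j)
    if any(not (1 <= i <= m and 1 <= j <= n) for i, j in pairs):
        return False
    i_values = [i for i, _ in pairs]
    j_values = [j for _, j in pairs]
    # P1: every position of X and every position of Y appears at most once.
    if len(set(i_values)) != len(i_values) or len(set(j_values)) != len(j_values):
        return False
    # P2: with the pairs sorted by i, the j values must be strictly increasing.
    return all(j_values[k] < j_values[k + 1] for k in range(len(j_values) - 1))


# The two candidate sets from the example with X = GCAT, Y = CATG.
print(is_alignment({(2, 1), (3, 2), (4, 3)}, 4, 4))
print(is_alignment({(2, 1), (3, 2), (4, 3), (1, 4)}, 4, 4))
```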

Cost of an alignment

Let L be an alignment of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$.
1. Gap penalty δ: there is a cost δ for every position of X that is not matched in Y; and vice versa.
2. Mismatch cost: there is a cost $\alpha_{pq}$ for every pair of alphabet symbols p, q that are matched in L.
   - So every pair $(i, j) \in L$ incurs a cost of $\alpha_{x_i y_j}$.
   - Assumption: $\alpha_{pp} = 0$ (matching a symbol with itself incurs no cost).

The cost of alignment L is the sum of all the gap and the mismatch costs.

Cost of alignment in symbols

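In symbols, given alignment L, let
- $X_i^L = 1$ if position i of X is not matched (and 0 otherwise),
- $Y_j^L = 1$ if position j of Y is not matched (and 0 otherwise).

Then the cost of alignment L is given by

$$\mathrm{cost}(L) = \sum_{1 \le i \le m} \delta X_i^L + \sum_{1 \le j \le n} \delta Y_j^L + \sum_{(i,j) \in L} \alpha_{x_i y_j}$$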
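The cost formula translates almost directly into code. In the sketch below, the function names and the concrete values (gap penalty 2, mismatch cost 3 for any two distinct symbols) are assumptions made for illustration only; the slides leave the gap penalty and the mismatch costs as parameters.

```python
def alignment_cost(X, Y, pairs, delta, alpha):
    """Cost of an alignment given as a set of 1-indexed pairs (i, j).

    delta: gap penalty; alpha(p, q): mismatch cost, with alpha(p, p) == 0.
    """
    matched_x = {i for i, _ in pairs}
    matched_y = {j for _, j in pairs}
    # Unmatched positions of X and Y each pay the gap penalty.
    gaps = (len(X) - len(matched_x)) + (len(Y) - len(matched_y))
    # Every matched pair pays the mismatch cost of its two symbols.
    mismatches = sum(alpha(X[i - 1], Y[j - 1]) for i, j in pairs)
    return delta * gaps + mismatches


def example_alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


X, Y = "ocurrance", "occurrence"
one_gap_one_mismatch = {(1, 1), (2, 2), (3, 4), (4, 5), (5, 6),
                        (6, 7), (7, 8), (8, 9), (9, 10)}
three_gaps = {(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)}
print(alignment_cost(X, Y, one_gap_one_mismatch, 2, example_alpha))
print(alignment_cost(X, Y, three_gaps, 2, example_alpha))
```

Which of the two alignments is cheaper depends entirely on how the gap penalty compares with the mismatch cost for "a" and "e".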

Examples

Example 1.
Let $L_1$ be the alignment shown below.

  x1  x2      x3  x4  x5  x6  x7  x8  x9
  o   c   -   u   r   r   a   n   c   e
  o   c   c   u   r   r   e   n   c   e
  y1  y2  y3  y4  y5  y6  y7  y8  y9  y10

$L_1$ = {(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)}

$\mathrm{cost}(L_1) = \delta + \alpha_{ae}$

(This is $\delta Y_3^{L_1} + \alpha_{x_6 y_7}$.)

Examples

Example 2.
Let $L_2$ be the alignment shown below.

  x1      x2  x3  x4  x5  x6      x7  x8  x9
  o   -   c   u   r   r   a   -   n   c   e
  o   c   c   u   r   r   -   e   n   c   e
  y1  y2  y3  y4  y5  y6      y7  y8  y9  y10

$L_2$ = {(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)}

$\mathrm{cost}(L_2) = 3\delta$

(This is $\delta X_6^{L_2} + \delta Y_2^{L_2} + \delta Y_7^{L_2}$.)

Examples

Example 3.
Let $L_3$, $L_4$ be the alignments shown below.

  x1  x2  x3  x4
  G   C   A   T
  C   A   T   G
  y1  y2  y3  y4

  x1  x2  x3  x4
  G   C   A   T   -
  -   C   A   T   G
      y1  y2  y3  y4

$L_3$ = {(1, 1), (2, 2), (3, 3), (4, 4)}

$L_4$ = {(2, 1), (3, 2), (4, 3)}

$\mathrm{cost}(L_3) = \alpha_{GC} + \alpha_{CA} + \alpha_{AT} + \alpha_{TG}$

$\mathrm{cost}(L_4) = 2\delta$

The sequence alignment problem

Input:
- two strings X, Y consisting of m, n symbols respectively; each symbol is from some alphabet Σ
- the gap penalty δ
- the mismatch costs $\{\alpha_{pq}\}$ for every pair $(p, q) \in \Sigma^2$

Output: the alignment L of minimum cost.

Towards a recursive solution

Claim 1.
Let L be the optimal alignment. Then
1. either the last two symbols $x_m$, $y_n$ of X, Y are matched to each other in L, hence the pair $(m, n) \in L$; or,
2. $x_m$, $y_n$ are not matched to each other in L, hence $(m, n) \notin L$. In this case, at least one of $x_m$, $y_n$ is not matched in L, hence at least one of m, n does not appear in L.

Proof of Claim 1

By contradiction.
Suppose $(m, n) \notin L$ but $x_m$ and $y_n$ are both matched in L. That is,
1. $x_m$ is matched with $y_j$ for some j < n, hence $(m, j) \in L$;
2. $y_n$ is matched with $x_i$ for some i < m, hence $(i, n) \in L$.
Since i < m but n > j, the pairs (i, n) and (m, j) cross, so L violates P2 and is not an alignment.

Rewriting Claim 1

The following equivalent way of stating Claim 1 will allow us to easily derive a recurrence.

Fact 4.
In an optimal alignment L, at least one of the following is true:
1. $(m, n) \in L$; or
2. $x_m$ is not matched; or
3. $y_n$ is not matched.

The subproblems for sequence alignment

Let

OPT(i, j) = minimum cost of an alignment between $x_1 \ldots x_i$ and $y_1 \ldots y_j$.

We want OPT(m, n). From Fact 4,
1. If $(m, n) \in L$, we pay $\alpha_{x_m y_n} + \mathrm{OPT}(m-1, n-1)$.
2. If $x_m$ is not matched, we pay $\delta + \mathrm{OPT}(m-1, n)$.
3. If $y_n$ is not matched, we pay $\delta + \mathrm{OPT}(m, n-1)$.

How do we decide which of the three to use for OPT(m, n)?

The recurrence for the sequence alignment problem

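$$\mathrm{OPT}(i, j) = \begin{cases} j\delta, & \text{if } i = 0 \\ \min \begin{cases} \alpha_{x_i y_j} + \mathrm{OPT}(i-1, j-1) \\ \delta + \mathrm{OPT}(i-1, j) \\ \delta + \mathrm{OPT}(i, j-1) \end{cases}, & \text{if } i, j \ge 1 \\ i\delta, & \text{if } j = 0 \end{cases}$$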

Remarks
- Boundary cases: OPT(0, j) = jδ and OPT(i, 0) = iδ.
- Pair (i, j) appears in the optimal alignment for subproblem $x_1 \ldots x_i$, $y_1 \ldots y_j$ if and only if the minimum is achieved by the first of the three values inside the min computation.

Computing the cost of the optimal alignment


- M is an (m + 1) × (n + 1) dynamic programming table.
- Fill in M so that all subproblems needed for entry M[i, j] have already been computed when we compute M[i, j] (e.g., column-by-column).

[Figure: the table M, with rows labeled 0, i − 1, i, m and columns labeled 0, j − 1, j.]

Pseudocode

SequenceAlignment(X, Y)
  Initialize M[i, 0] to iδ
  Initialize M[0, j] to jδ
  for j = 1 to n do
    for i = 1 to m do
      M[i, j] = min { α_{x_i y_j} + M[i − 1, j − 1], δ + M[i − 1, j], δ + M[i, j − 1] }
    end for
  end for
  return M[m, n]

Running time?

Reconstructing the optimal alignment


Given M, we can reconstruct the optimal alignment as follows.

TraceAlignment(i, j)
  if i == 0 or j == 0 then return
  else
    if M[i, j] == α_{x_i y_j} + M[i − 1, j − 1] then
      TraceAlignment(i − 1, j − 1)
      Output (i, j)
    else
      if M[i, j] == δ + M[i − 1, j] then TraceAlignment(i − 1, j)
      else TraceAlignment(i, j − 1)
      end if
    end if
  end if

Initial call: TraceAlignment(m, n)

Running time?
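SequenceAlignment and TraceAlignment can be combined into one short Python sketch. The names `sequence_alignment` and `example_alpha` and the concrete costs (gap penalty 1, mismatch cost 3) are assumptions of this example. Integer costs are used so that the equality tests in the traceback are exact, and the traceback is written as a loop rather than a recursion. The two nested loops do constant work per table entry, and the traceback decreases i or j at every step, so it takes at most m + n steps.

```python
def sequence_alignment(X, Y, delta, alpha):
    """Return (minimum cost, list of matched 1-indexed pairs (i, j))."""
    m, n = len(X), len(Y)
    # M[i][j] = OPT(i, j); boundary cases M[i][0] = i*delta, M[0][j] = j*delta.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta

    # Fill column by column, as suggested on the slide.
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            M[i][j] = min(alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j],
                          delta + M[i][j - 1])

    # Iterative version of TraceAlignment(m, n).
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if M[i][j] == alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1]:
            pairs.append((i, j))  # x_i is matched with y_j
            i, j = i - 1, j - 1
        elif M[i][j] == delta + M[i - 1][j]:
            i -= 1                # x_i is not matched
        else:
            j -= 1                # y_j is not matched
    pairs.reverse()
    return M[m][n], pairs


def example_alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


print(sequence_alignment("GCAT", "CATG", 1, example_alpha))
print(sequence_alignment("ocurrance", "occurrence", 1, example_alpha))
```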

Resources used by dynamic programming algorithm

- Time: O(mn)
- Space: O(mn)
  - English words: m, n ≤ 10
  - Computational biology: m = n = 100000
    - Time: 10 billion ops
    - Space: 10GB table!
- Can we avoid using quadratic space while maintaining quadratic running time?

Using only O(m + n) space

1. First, suppose we are only interested in the cost of the optimal alignment.
   Easy: keep a table M with 2 columns, hence 2(m + 1) entries.
2. What if we want the optimal alignment too?
   - No longer possible with just the two-column table, since tracing back the alignment uses all of M.
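The two-column idea from item 1 might look like the following in Python. This is a sketch that returns only the cost; the function name and the example costs (gap penalty 1, mismatch cost 3) are hypothetical. Only the previous and the current column of M are kept, so the extra space is proportional to m while the running time stays O(mn).

```python
def alignment_cost_two_columns(X, Y, delta, alpha):
    """Minimum alignment cost, keeping only two columns of the table M."""
    m, n = len(X), len(Y)
    prev = [i * delta for i in range(m + 1)]  # column 0: M[i][0] = i*delta
    for j in range(1, n + 1):
        cur = [j * delta] + [0] * m           # M[0][j] = j*delta
        for i in range(1, m + 1):
            cur[i] = min(alpha(X[i - 1], Y[j - 1]) + prev[i - 1],  # match x_i, y_j
                         delta + cur[i - 1],                       # x_i not matched
                         delta + prev[i])                          # y_j not matched
        prev = cur  # column j becomes the "previous" column
    return prev[m]


def example_alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


print(alignment_cost_two_columns("GCAT", "CATG", 1, example_alpha))
print(alignment_cost_two_columns("ocurrance", "occurrence", 1, example_alpha))
```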
