Slides10 6

Algorithms for Data Science
CSOR W4246
Eleni Drinea
Computer Science Department
Columbia University
Tuesday, October 8, 2015
Outline
1 Recap
2 Segmented least squares
    An exponential recursive algorithm
3 A Dynamic Programming (DP) solution
    A quadratic iterative algorithm
    Applying the DP principle
4 Sequence alignment
Today
1 Recap


Review of the last lecture
- Weighted graphs G = (V, E, w)
- Single-source (origin) shortest paths in graphs with non-negative edge weights
- Dijkstra's algorithm
  - Correctness
  - Implementation
Today
2 Segmented least squares

Linear least squares fitting
A foundational problem in statistics: find a line of best fit through some data points.
Input: a set P of n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$; we assume $x_1 < x_2 < \ldots < x_n$.
Output: the line L defined as $y = ax + b$ that minimizes the error
$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \qquad (1)$$
Linear least squares fitting: solution
Given a set P of data points, we can use calculus to show that the line L given by $y = ax + b$ that minimizes
$$\mathrm{err}(L, P) = \sum_{i=1}^{n} (y_i - a x_i - b)^2 \qquad (2)$$
satisfies
$$a = \frac{n \sum_i x_i y_i - \left(\sum_i x_i\right)\left(\sum_i y_i\right)}{n \sum_i x_i^2 - \left(\sum_i x_i\right)^2} \qquad (3)$$
$$b = \frac{\sum_i y_i - a \sum_i x_i}{n} \qquad (4)$$
How fast can we compute a, b?
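Equations (3) and (4) need only four running sums, so a and b can be computed in a single O(n) pass. The following is a minimal sketch of that computation; the names `fit_line` and `squared_error` are illustrative and do not come from the slides.

```python
def fit_line(points):
    """Return (a, b) of the least-squares line y = a*x + b, using (3) and (4)."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    denom = n * sum_xx - sum_x ** 2
    if denom == 0:
        # Happens for fewer than two distinct x-coordinates.
        raise ValueError("need at least two distinct x-coordinates")
    a = (n * sum_xy - sum_x * sum_y) / denom   # equation (3)
    b = (sum_y - a * sum_x) / n                # equation (4)
    return a, b


def squared_error(a, b, points):
    """err(L, P) from equation (2) for the line y = a*x + b."""
    return sum((y - a * x - b) ** 2 for x, y in points)


a, b = fit_line([(1, 3), (2, 5), (3, 7)])   # points on y = 2x + 1
print(a, b, squared_error(a, b, [(1, 3), (2, 5), (3, 7)]))
```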
What if the data changes direction?
What if the data changes direction more than once?
How to detect change in the data
Any single line would have large error.
- Idea 1: hardcode number of lines to 2 (or some fixed m).
  - Fails for the dataset on the previous slide.
- Idea 2: pass an arbitrary set of lines through the points and seek the set of lines that minimizes the error.
  - Trivial solution: have a different line pass through each pair of consecutive points in P.
- Idea 3: fit the points well, using as few lines as possible.
  - Trade-off between complexity and error of the model
Formalizing the problem
- Input: data set $P = \{p_1, \ldots, p_n\}$ of points on the plane.
- A segment $S = \{p_i, p_{i+1}, \ldots, p_j\}$ is a contiguous subset of the input.
- Consider a partition of P into m segments $S_1, S_2, \ldots, S_m$.
- For every segment $S_k$, use (2), (3), (4) to compute a line $L_k$ that minimizes $\mathrm{err}(L_k, S_k)$.
- Let C > 0 be a fixed multiplier. The cost of the partition is
$$\sum_{k=1}^{m} \mathrm{err}(L_k, S_k) + m \cdot C$$
Segmented least squares
This problem is an instance of change detection in data mining and statistics.
Input: A set P of n data points $p_i = (x_i, y_i)$ as before.
Output: A segmentation $\{S_1, S_2, \ldots, S_m\}$ of P whose cost
$$\sum_{k=1}^{m} \mathrm{err}(L_k, S_k) + m \cdot C$$
is minimum.
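A brute force approach
We can find the optimal segmentation (that is, the one incurring the minimum penalty) by exhaustive search.
- Enumerate every possible segmentation and compute its penalty.
- Output the one that incurs the minimum penalty.
Caveat: there are $O(2^n)$ partitions (exactly $2^{n-1}$, since each of the n - 1 gaps between consecutive points either is or is not a segment boundary).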
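To make the exhaustive search concrete, here is a small sketch that enumerates all $2^{n-1}$ segmentations by deciding, for each gap between consecutive points, whether to cut there. It assumes n ≥ 1 and distinct x-coordinates; the helper names `line_error` and `brute_force_segmentation` are hypothetical, and the approach is only usable for very small n.

```python
from itertools import product


def line_error(points):
    """Squared error of the least-squares line through points (0 for one point)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    denom = n * sxx - sx * sx
    if denom == 0:          # a single point: any line through it has error 0
        return 0.0
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return sum((y - a * x - b) ** 2 for x, y in points)


def brute_force_segmentation(points, C):
    """Try every segmentation of points (n >= 1); return (min cost, segments)."""
    n = len(points)
    best_cost, best_segments = float("inf"), None
    # cuts[k] is True when a segment ends right after points[k].
    for cuts in product([False, True], repeat=n - 1):
        segments, start = [], 0
        for k, cut in enumerate(cuts):
            if cut:
                segments.append(points[start:k + 1])
                start = k + 1
        segments.append(points[start:])
        cost = sum(line_error(s) for s in segments) + C * len(segments)
        if cost < best_cost:
            best_cost, best_segments = cost, segments
    return best_cost, best_segments


tent = [(1, 1), (2, 2), (3, 3), (4, 3), (5, 2), (6, 1)]
print(brute_force_segmentation(tent, 0.5))
```

On the tent-shaped data above, the two-segment split (rising points, then falling points) has zero fitting error, so the cost is just 2C.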
A crucial observation regarding the last data point
Consider the last point $p_n$ in the data set.
- $p_n$ belongs to a single segment in the optimal partition.
- That segment starts at an earlier point $p_i$, for some $1 \le i \le n$.
This suggests a recursive solution: if we knew where the last segment starts, then we could remove it and recursively solve the problem on the remaining points $\{p_1, \ldots, p_{i-1}\}$.
A recursive approach
- Let $OPT(j)$ denote the cost of the optimal segmentation for points $p_1, \ldots, p_j$.
- Then, if the last segment of the optimal segmentation is $\{p_i, \ldots, p_n\}$, the cost of the optimal solution is
$$OPT(n) = \mathrm{err}(L, \{p_i, \ldots, p_n\}) + C + OPT(i-1).$$
- But we don't know where the last segment starts! How do we find the point $p_i$?
- Set
$$OPT(n) = \min_{1 \le i \le n} \left\{ \mathrm{err}(L, \{p_i, \ldots, p_n\}) + C + OPT(i-1) \right\}.$$
A recurrence for the optimal solution
Notation: let $e_{i,j} = \mathrm{err}(L, \{p_i, \ldots, p_j\})$, for $1 \le i \le j \le n$.
Then
$$OPT(n) = \min_{1 \le i \le n} \left\{ e_{i,n} + C + OPT(i-1) \right\}.$$
If we apply the above expression recursively to remove the last segment, we obtain the recurrence
$$OPT(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + OPT(i-1) \right\} \qquad (5)$$
Remark 1.
1. We can precompute and store all $e_{i,j}$ using equations (2), (3), (4) in $O(n^3)$ time. This can be improved to $O(n^2)$.
2. The natural recursive algorithm arising from recurrence (5) is not efficient (think about its recursion tree!).
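One way to reach the $O(n^2)$ bound mentioned in Remark 1 is to keep prefix sums of $x$, $y$, $xy$, $x^2$ and $y^2$, so that each $e_{i,j}$ takes constant time. This is a sketch of that idea, not necessarily the method intended in the lecture; it assumes distinct x-coordinates, and `segment_errors` is an illustrative name.

```python
from itertools import accumulate


def segment_errors(points):
    """Return e with e[i][j] = e_{i,j} for 1 <= i <= j <= n (1-indexed)."""
    n = len(points)

    def prefix(values):
        # prefix(values)[k] is the sum of the first k values
        return [0.0] + list(accumulate(values))

    sx = prefix(x for x, _ in points)
    sy = prefix(y for _, y in points)
    sxy = prefix(x * y for x, y in points)
    sxx = prefix(x * x for x, _ in points)
    syy = prefix(y * y for _, y in points)

    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):      # a single point has error 0
            cnt = j - i + 1
            X, Y = sx[j] - sx[i - 1], sy[j] - sy[i - 1]
            XY, XX, YY = sxy[j] - sxy[i - 1], sxx[j] - sxx[i - 1], syy[j] - syy[i - 1]
            denom = cnt * XX - X * X       # positive when the x_i are distinct
            a = (cnt * XY - X * Y) / denom # equation (3) on the segment
            b = (Y - a * X) / cnt          # equation (4) on the segment
            # Expansion of sum (y - a*x - b)^2 in terms of the five sums.
            err = YY - 2 * a * XY - 2 * b * Y + a * a * XX + 2 * a * b * X + cnt * b * b
            e[i][j] = max(err, 0.0)        # clamp tiny negative round-off
    return e


e = segment_errors([(1, 1), (2, 2), (3, 3), (4, 5)])
print(e[1][3], e[1][4])
```

The expanded error formula can lose precision through cancellation on large inputs; the direct sum in (2) is numerically safer when the extra factor of n is affordable.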
Exponential-time recursion
Notation: T(n) = time to compute the optimal segmentation of n points.
Then, since computing $OPT(n)$ recursively calls (among others) $OPT(n-1)$ and $OPT(n-2)$,
$$T(n) \ge T(n-1) + T(n-2).$$
- Can show that $T(n) \ge F_n$, the n-th Fibonacci number (by strong induction on n).
- From Problem 5a in Homework 1, $F_n = \Omega(2^{n/2})$.
- Hence $T(n) = \Omega(2^{n/2})$.
The recursive algorithm requires $\Omega(2^{n/2})$ time.
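The blow-up can be observed directly by counting the calls that the plain recursion for recurrence (5) would make. The sketch below only counts calls (it does not compute any errors); `count_calls` is a hypothetical helper. The count doubles with each extra point, which is consistent with the $\Omega(2^{n/2})$ lower bound.

```python
def count_calls(j):
    """Number of calls made when OPT(j) is evaluated recursively, no memoization."""
    calls = 1                         # the call for OPT(j) itself
    for i in range(1, j + 1):         # OPT(j) recurses on OPT(i - 1) for every i
        calls += count_calls(i - 1)
    return calls


print([count_calls(j) for j in range(8)])
```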
Today
3 A Dynamic Programming (DP) solution

Are we really that far from an efficient solution?
Recall the Fibonacci problem from HW1: exponential recursive algorithm, polynomial iterative solution.
How?
1. Overlapping subproblems: spectacular redundancy in computations of the recursion tree
2. Easy-to-compute recurrence for combining the smaller subproblems: $F_n = F_{n-1} + F_{n-2}$
3. Iterative, bottom-up computations: we computed the subproblems from smallest ($F_0$, $F_1$) to largest ($F_n$), iteratively.
4. Small number of subproblems: only solved n - 1 subproblems.
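As a reminder of what the bottom-up computation looks like, here is a minimal sketch. It assumes the convention $F_0 = 0$, $F_1 = 1$ (the homework's convention is not shown in these slides); the loop body runs n - 1 times, once per subproblem $F_2, \ldots, F_n$.

```python
def fib(n):
    """n-th Fibonacci number, computed bottom-up from F_0 = 0 and F_1 = 1."""
    if n < 2:
        return n
    prev, curr = 0, 1                 # F_0, F_1
    for _ in range(2, n + 1):         # n - 1 iterations: F_2, ..., F_n
        prev, curr = curr, prev + curr
    return curr


print([fib(k) for k in range(10)])
```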
Elements of DP in segmented least squares
1. Overlapping subproblems
2. An easy-to-compute recurrence (5) for combining solutions
to the smaller subproblems into a solution to a larger
subproblem in O(n) time (once smaller subproblems have
been solved).
3. Iterative, bottom-up computations: compute the
subproblems from smallest (0 points) to largest (n points),
iteratively.
4. Small number of subproblems: we only need to solve n
subproblems.
A dynamic programming approach
$$OPT(j) = \min_{1 \le i \le j} \left\{ e_{i,j} + C + OPT(i-1) \right\}$$
- The optimal solution to the subproblem on $p_1, \ldots, p_j$ contains optimal solutions to smaller subproblems.
- Recurrence (5) provides an ordering of the subproblems from smaller to larger, with the subproblem of size 0 being the smallest and the subproblem of size n the largest.
- There are n + 1 subproblems in total. Solving the j-th subproblem requires $\Theta(j) = O(n)$ time.
- The overall running time is $O(n^2)$, once the $e_{i,j}$ are available.
- Boundary conditions: $OPT(0) = 0$.
- Segment $p_k, \ldots, p_j$ appears in the optimal solution only if the minimum in the expression above is achieved for i = k.
An iterative algorithm for segmented least squares
Let M be an array of n + 1 entries, indexed 0, ..., n. M[i] stores the cost of the optimal segmentation of the first i data points.
SegmentedLS(n, P)
  M[0] = 0
  for all pairs i ≤ j do
    Compute e_{i,j} for segment p_i, ..., p_j using (2), (3), (4)
  end for
  for j = 1 to n do
    M[j] = min over 1 ≤ i ≤ j of { e_{i,j} + C + M[i - 1] }
  end for
  Return M[n]
Running time: the time required to fill in the dynamic programming array M is $O(n^3) + O(n^2)$. It can be brought down to $O(n^2)$.
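The pseudocode above translates almost line by line into Python. The sketch below uses the straightforward $O(n^3)$ computation of the $e_{i,j}$; the function names `line_error` and `segmented_ls` are illustrative, and distinct x-coordinates are assumed.

```python
def line_error(points):
    """Squared error of the least-squares line through points (0 for one point)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    denom = n * sxx - sx * sx
    if denom == 0:                     # single point
        return 0.0
    a = (n * sxy - sx * sy) / denom    # equation (3)
    b = (sy - a * sx) / n              # equation (4)
    return sum((y - a * x - b) ** 2 for x, y in points)


def segmented_ls(points, C):
    """Return the array M, where M[j] is the optimal cost for the first j points."""
    n = len(points)
    # e[i][j] = e_{i,j}, 1-indexed; O(n) work per pair, O(n^3) in total.
    e = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(i, n + 1):
            e[i][j] = line_error(points[i - 1:j])
    M = [0.0] * (n + 1)                # M[0] = 0 is the boundary condition
    for j in range(1, n + 1):
        M[j] = min(e[i][j] + C + M[i - 1] for i in range(1, j + 1))
    return M


tent = [(1, 1), (2, 2), (3, 3), (4, 3), (5, 2), (6, 1)]
print(segmented_ls(tent, 0.5))
```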
Reconstructing an optimal segmentation
- Suppose we want the optimal solution in addition to its value, that is, the actual segmentation that achieves the minimum cost M[n].
- We can trace back through the dynamic programming array M to compute the optimal segmentation.
Initial call: OPTSegmentation(n)
OPTSegmentation(j)
  if (j == 0) then return
  else
    Find 1 ≤ i ≤ j such that M[j] = e_{i,j} + C + M[i - 1]
    OPTSegmentation(i - 1)
    Output segment {p_i, ..., p_j}
  end if
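The traceback can be sketched as follows. For brevity this version takes a precomputed, 1-indexed error table `e` as input and walks backwards with a loop instead of recursion (which avoids Python's recursion limit for large n); `fill_table` and `opt_segmentation` are hypothetical names. Segments are reported as index pairs (i, j).

```python
import math


def fill_table(e, C):
    """M[j] = optimal cost for the first j points, given e[i][j] = e_{i,j}."""
    n = len(e) - 1
    M = [0.0] * (n + 1)
    for j in range(1, n + 1):
        M[j] = min(e[i][j] + C + M[i - 1] for i in range(1, j + 1))
    return M


def opt_segmentation(e, C):
    """Return the optimal segments as a list of (i, j) pairs, left to right."""
    M = fill_table(e, C)
    segments = []
    j = len(M) - 1
    while j > 0:
        # Find a start index i that achieves the minimum for M[j].
        i = next(i for i in range(1, j + 1)
                 if math.isclose(M[j], e[i][j] + C + M[i - 1], abs_tol=1e-12))
        segments.append((i, j))
        j = i - 1                      # continue with the points before p_i
    segments.reverse()
    return segments


# Four points: p1, p2 fit one line and p3, p4 another; longer segments fit badly.
e = [[0] * 5 for _ in range(5)]
e[1][3], e[2][4], e[1][4] = 10, 10, 20
print(opt_segmentation(e, 1))
```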
Obtaining efficient algorithms using DP
1. Optimal substructure: the optimal solution to the problem

contains optimal solutions to the subproblems.
2. A recurrence for the overall optimal solution in terms of
optimal solutions to appropriate subproblems. The
recurrence should provide a natural ordering of the
subproblems from smaller to larger and require polynomial
work for combining solutions to the subproblems.
3. Iterative, bottom-up computation of subproblems, from
smaller to larger.
4. Small number of subproblems (polynomial in n).
Dynamic programming vs Divide & Conquer
- They both combine solutions to subproblems to generate the overall solution.
- However, divide and conquer starts with a large problem and divides it into small pieces,
- while dynamic programming works from the bottom up, solving the smallest subproblems first and building optimal solutions to steadily larger problems.
Today
4 Sequence alignment

String similarity
This problem arises when comparing strings.
Example: consider an online dictionary.
- Input: a word, e.g., "ocurrance"
- Output: did you mean "occurrence"?
Similarity: intuitively, two words are similar if we can almost line them up by using gaps and mismatches.
Aligning strings using gaps and mismatches
We can align "ocurrance" and "occurrence" using
- one gap and one mismatch

    o c - u r r a n c e
    o c c u r r e n c e

- or, three gaps

    o - c u r r a - n c e
    o c c u r r - e n c e
Strings in biology
- Similarity of English words is rather intuitive.
- Determining similarity of biological strings is a central computational problem for molecular biologists.
  - Chromosomes again: an organism's genome consists of chromosomes (giant linear DNA molecules).
  - We may think of a chromosome as an enormous linear tape containing a string over the alphabet {A, C, G, T}.
  - The string encodes instructions for building protein molecules.
Why similarity?
Why are we interested in similarity of biological strings?
- Roughly speaking, the sequence of symbols in an organism's genome determines the properties of the organism.
- So similarity can guide decisions about biological experiments.
How do we define similarity between two strings?
Similarity based on the notion of lining up two strings
Informally, an alignment between two strings tells us which pairs of positions will be lined up with one another.
Example: X = GCAT, Y = CATG

    x1  x2  x3  x4
    G   C   A   T   -
    -   C   A   T   G
        y1  y2  y3  y4

Then {(2, 1), (3, 2), (4, 3)} is an alignment of X and Y: these are the pairs of positions in X, Y that are aligned (matched).
Definition of alignment of two strings
An alignment L of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$ is a set of ordered pairs of indices (i, j) with $i \in [1, m]$, $j \in [1, n]$ such that the following two properties hold:
P1. every $i \in [1, m]$ and every $j \in [1, n]$ appears at most once;
P2. pairs do not cross: if $(i, j), (i', j') \in L$ and $i < i'$, then $j < j'$.
Example: X = GCAT, Y = CATG

    x1  x2  x3  x4
    G   C   A   T   -
    -   C   A   T   G
        y1  y2  y3  y4

1. {(2, 1), (3, 2), (4, 3)} is an alignment; but
2. {(2, 1), (3, 2), (4, 3), (1, 4)} is not an alignment (violates P2).
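Properties P1 and P2 can be checked together: after sorting the pairs by i, both coordinates must strictly increase from one pair to the next. The following sketch applies this to the two sets in the example; `is_alignment` is an illustrative name, and the input is assumed to be a set of index pairs.

```python
def is_alignment(pairs, m, n):
    """Check whether a set of 1-indexed pairs is an alignment of strings of lengths m, n."""
    ordered = sorted(pairs)
    # Every index must lie in range.
    if any(not (1 <= i <= m and 1 <= j <= n) for i, j in ordered):
        return False
    # Equal i or equal j in consecutive pairs violates P1;
    # increasing i with decreasing j violates P2.
    return all(i1 < i2 and j1 < j2
               for (i1, j1), (i2, j2) in zip(ordered, ordered[1:]))


print(is_alignment({(2, 1), (3, 2), (4, 3)}, 4, 4))
print(is_alignment({(2, 1), (3, 2), (4, 3), (1, 4)}, 4, 4))
```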
Cost of an alignment
Let L be an alignment of $X = x_1 \ldots x_m$, $Y = y_1 \ldots y_n$.
1. Gap penalty $\delta$: there is a cost $\delta$ for every position of X that is not matched in Y; and vice versa.
2. Mismatch cost: there is a cost $\alpha_{pq}$ for every pair of alphabet symbols p, q that are matched in L.
   - So every pair $(i, j) \in L$ incurs a cost of $\alpha_{x_i y_j}$.
   - Assumption: $\alpha_{pp} = 0$ (matching a symbol with itself incurs no cost).
The cost of alignment L is the sum of all the gap and the mismatch costs.
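Cost of alignment in symbols
In symbols, given alignment L, let
- $X_i^L = 1$ if position i of X is not matched (and 0 otherwise),
- $Y_j^L = 1$ if position j of Y is not matched (and 0 otherwise).
Then the cost of alignment L, with gap penalty $\delta$ and mismatch costs $\alpha$, is given by
$$\mathrm{cost}(L) = \sum_{1 \le i \le m} \delta X_i^L + \sum_{1 \le j \le n} \delta Y_j^L + \sum_{(i,j) \in L} \alpha_{x_i y_j}$$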
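The cost formula can be evaluated directly from the set of matched pairs. In this sketch the gap penalty is passed as a number `delta` and the mismatch costs as a function `alpha(p, q)`; both parameter names, and the particular cost values in the usage lines, are assumptions made for illustration.

```python
def alignment_cost(X, Y, L, delta, alpha):
    """Cost of alignment L (a set of 1-indexed pairs) between strings X and Y."""
    matched_x = {i for i, _ in L}
    matched_y = {j for _, j in L}
    # Unmatched positions of X and of Y each pay the gap penalty.
    gaps = (len(X) - len(matched_x)) + (len(Y) - len(matched_y))
    # Every matched pair pays the mismatch cost of its two symbols.
    mismatches = sum(alpha(X[i - 1], Y[j - 1]) for i, j in L)
    return delta * gaps + mismatches


def alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


# X = GCAT, Y = CATG with pairs {(2,1), (3,2), (4,3)}: two gaps, no mismatches.
print(alignment_cost("GCAT", "CATG", {(2, 1), (3, 2), (4, 3)}, 2, alpha))
```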
Examples
Example 1.
Let $L_1$ be the alignment shown below.

    x1  x2      x3  x4  x5  x6  x7  x8  x9
    o   c   -   u   r   r   a   n   c   e
    o   c   c   u   r   r   e   n   c   e
    y1  y2  y3  y4  y5  y6  y7  y8  y9  y10

$L_1 = \{(1, 1), (2, 2), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9), (9, 10)\}$
$\mathrm{cost}(L_1) = \delta + \alpha_{ae}$
(This is $\delta Y_3^{L_1} + \alpha_{x_6 y_7}$.)
Examples
Example 2.
Let $L_2$ be the alignment shown below.

    x1      x2  x3  x4  x5  x6      x7  x8  x9
    o   -   c   u   r   r   a   -   n   c   e
    o   c   c   u   r   r   -   e   n   c   e
    y1  y2  y3  y4  y5  y6      y7  y8  y9  y10

$L_2 = \{(1, 1), (2, 3), (3, 4), (4, 5), (5, 6), (7, 8), (8, 9), (9, 10)\}$
$\mathrm{cost}(L_2) = 3\delta$
(This is $\delta X_6^{L_2} + \delta Y_2^{L_2} + \delta Y_7^{L_2}$.)
Examples
Example 3.
Let $L_3$, $L_4$ be the alignments shown below.

    x1  x2  x3  x4
    G   C   A   T
    C   A   T   G
    y1  y2  y3  y4

    x1  x2  x3  x4
    G   C   A   T   -
    -   C   A   T   G
        y1  y2  y3  y4

$L_3 = \{(1, 1), (2, 2), (3, 3), (4, 4)\}$
$L_4 = \{(2, 1), (3, 2), (4, 3)\}$
$\mathrm{cost}(L_3) = \alpha_{GC} + \alpha_{CA} + \alpha_{AT} + \alpha_{TG}$
$\mathrm{cost}(L_4) = 2\delta$
The sequence alignment problem
Input:
- two strings X, Y consisting of m, n symbols respectively; each symbol is from some alphabet $\Sigma$
- the gap penalty $\delta$
- the mismatch costs $\{\alpha_{pq}\}$ for every pair $(p, q) \in \Sigma^2$
Output: the alignment L of minimum cost.
Towards a recursive solution
Claim 1.
Let L be the optimal alignment. Then
1. either the last two symbols $x_m$, $y_n$ of X, Y are matched to each other in L, hence the pair $(m, n) \in L$; or,
2. $x_m$, $y_n$ are not matched to each other in L, hence $(m, n) \notin L$. In this case, at least one of $x_m$, $y_n$ is not matched at all in L, hence at least one of m, n does not appear in L.
Proof of Claim 1
By contradiction.
Suppose $(m, n) \notin L$ but $x_m$ and $y_n$ are both matched in L. That is,
1. $x_m$ is matched with $y_j$ for some $j < n$, hence $(m, j) \in L$;
2. $y_n$ is matched with $x_i$ for some $i < m$, hence $(i, n) \in L$.
Since $i < m$ but $n > j$, the pairs (i, n) and (m, j) cross, so L is not an alignment: a contradiction.
Rewriting Claim 1
The following equivalent way of stating Claim 1 will allow us to easily derive a recurrence.
Fact 4.
In an optimal alignment L, at least one of the following is true:
1. $(m, n) \in L$; or
2. $x_m$ is not matched; or
3. $y_n$ is not matched.
The subproblems for sequence alignment
Let
$$OPT(i, j) = \text{minimum cost of an alignment between } x_1 \ldots x_i \text{ and } y_1 \ldots y_j$$
We want $OPT(m, n)$. From Fact 4,
1. If $(m, n) \in L$, we pay $\alpha_{x_m y_n} + OPT(m-1, n-1)$.
2. If $x_m$ is not matched, we pay $\delta + OPT(m-1, n)$.
3. If $y_n$ is not matched, we pay $\delta + OPT(m, n-1)$.
How do we decide which of the three to use for $OPT(m, n)$?
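The recurrence for the sequence alignment problem
$$OPT(i, j) = \begin{cases} j\delta, & \text{if } i = 0 \\ \min\left\{ \alpha_{x_i y_j} + OPT(i-1, j-1),\ \delta + OPT(i-1, j),\ \delta + OPT(i, j-1) \right\}, & \text{if } i, j \ge 1 \\ i\delta, & \text{if } j = 0 \end{cases}$$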
Remarks
- Boundary cases: $OPT(0, j) = j\delta$ and $OPT(i, 0) = i\delta$.
- Pair (i, j) appears in the optimal alignment for subproblem $x_1 \ldots x_i$, $y_1 \ldots y_j$ if and only if the minimum is achieved by the first of the three values inside the min computation.
Computing the cost of the optimal alignment
- M is an (m + 1) × (n + 1) dynamic programming table.
- Fill in M so that all subproblems needed for entry M[i, j] have already been computed when we compute M[i, j] (e.g., column-by-column).
[Figure: the table M, with rows labeled 0, ..., i-1, i, ..., m and columns labeled 0, ..., j-1, j, ...]
Pseudocode
SequenceAlignment(X, Y)
  Initialize M[i, 0] to iδ
  Initialize M[0, j] to jδ
  for j = 1 to n do
    for i = 1 to m do
      M[i, j] = min { α_{x_i y_j} + M[i - 1, j - 1], δ + M[i - 1, j], δ + M[i, j - 1] }
    end for
  end for
  return M[m, n]
Running time?
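A direct Python rendering of the pseudocode is sketched below. The gap penalty is the number `delta` and the mismatch costs are given by a function `alpha(p, q)`; these names and the cost values in the usage lines are assumptions for illustration. Each of the (m + 1)(n + 1) entries takes constant time to fill.

```python
def alignment_table(X, Y, delta, alpha):
    """Return the (m+1) x (n+1) table M with M[i][j] = OPT(i, j)."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta            # x_1..x_i against the empty string
    for j in range(n + 1):
        M[0][j] = j * delta            # empty string against y_1..y_j
    for j in range(1, n + 1):          # column by column
        for i in range(1, m + 1):
            M[i][j] = min(alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j],
                          delta + M[i][j - 1])
    return M


def alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


print(alignment_table("GCAT", "CATG", 1, alpha)[4][4])
print(alignment_table("ocurrance", "occurrence", 2, alpha)[9][10])
```

With these example costs, "ocurrance" and "occurrence" are best aligned with one gap and one mismatch (cost 2 + 3), which beats the three-gap alignment (cost 6).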
Reconstructing the optimal alignment
Given M, we can reconstruct the optimal alignment as follows.
TraceAlignment(i, j)
  if i == 0 or j == 0 then return
  else
    if M[i, j] == α_{x_i y_j} + M[i - 1, j - 1] then
      TraceAlignment(i - 1, j - 1)
      Output (i, j)
    else
      if M[i, j] == δ + M[i - 1, j] then TraceAlignment(i - 1, j)
      else TraceAlignment(i, j - 1)
      end if
    end if
  end if
Initial call: TraceAlignment(m, n)
Running time?
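The traceback can be sketched in Python as below. This version fills the table and then walks back from (m, n) with a loop rather than recursion; each step decreases i or j (or both), so the walk takes O(m + n) steps. The names `optimal_alignment`, `delta` and `alpha` are illustrative, and integer costs are used in the example so that the equality tests are exact.

```python
def optimal_alignment(X, Y, delta, alpha):
    """Return (minimum cost, list of matched 1-indexed pairs in increasing order)."""
    m, n = len(X), len(Y)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        M[i][0] = i * delta
    for j in range(n + 1):
        M[0][j] = j * delta
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            M[i][j] = min(alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1],
                          delta + M[i - 1][j],
                          delta + M[i][j - 1])
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if M[i][j] == alpha(X[i - 1], Y[j - 1]) + M[i - 1][j - 1]:
            pairs.append((i, j))       # x_i is matched with y_j
            i, j = i - 1, j - 1
        elif M[i][j] == delta + M[i - 1][j]:
            i -= 1                     # x_i is left unmatched
        else:
            j -= 1                     # y_j is left unmatched
    pairs.reverse()
    return M[m][n], pairs


def alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


print(optimal_alignment("GCAT", "CATG", 1, alpha))
```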
Resources used by dynamic programming algorithm
- Time: O(mn)
- Space: O(mn)
  - English words: m, n ≈ 10
  - Computational biology: m = n = 100000
    - Time: 10 billion ops
    - Space: 10GB table!
- Can we avoid using quadratic space while maintaining quadratic running time?
Using only O(m + n) space
1. First, suppose we are only interested in the cost of the optimal alignment.
   - Easy: keep a table M with 2 columns, hence 2(m + 1) entries.
2. What if we want the optimal alignment too?
   - No longer possible in O(m + n) space with this trick alone: the traceback needs entries of M that have been discarded.
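The two-column idea from item 1 can be sketched as follows: column j of the table depends only on column j - 1, so only two arrays of m + 1 entries are kept. The function and parameter names are illustrative; the function returns the optimal cost only, not the alignment.

```python
def alignment_cost_two_columns(X, Y, delta, alpha):
    """Optimal alignment cost using two columns of m + 1 entries each."""
    m = len(X)
    prev = [i * delta for i in range(m + 1)]       # column 0: M[i][0] = i * delta
    for j in range(1, len(Y) + 1):
        curr = [j * delta] + [0] * m               # M[0][j] = j * delta
        for i in range(1, m + 1):
            curr[i] = min(alpha(X[i - 1], Y[j - 1]) + prev[i - 1],  # match x_i, y_j
                          delta + curr[i - 1],                      # x_i unmatched
                          delta + prev[i])                          # y_j unmatched
        prev = curr                                # column j becomes the old column
    return prev[m]


def alpha(p, q):
    """Hypothetical mismatch cost: 0 for equal symbols, 3 otherwise."""
    return 0 if p == q else 3


print(alignment_cost_two_columns("GCAT", "CATG", 1, alpha))
```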
