
Computability, Complexity, and Algorithms

Charles Brubaker and Lance Fortnow

Introduction
Welcome to Computability, Complexity and Algorithms, the introductory Theoretical Computer Science course for the Georgia
Tech Masters and PhD programs. In this course, we will ask the big questions: What is a computer? What are the limits of
computation? Are there problems that no computer will ever solve? Are there problems that can't be solved quickly? What kinds
of problems can we solve efficiently, and how do we go about developing these algorithms? Understanding the power and
limitations of algorithms helps us develop the tools to make real-world computers smarter, faster and safer.
We step away from the particulars of programming languages and computer architectures and instead take an abstract view of
computing. This viewpoint allows us to understand whether a computational problem is inherently easy or hard to solve,
independent of the specific implementation or machine we plan to use. These results have stood the test of time, being as
valuable to us today as they were when they were developed in the 1970s.
Though the topic of the course is theory, understanding the material can have very practical benefits. You will learn a wealth of
tools and techniques that will help you recognize when problems you encounter in the real world are intractable and when there
is an efficient solution. This can save you countless hours that otherwise would have been spent on a fruitless endeavor or in reinventing the wheel.
We hope that you find such direct applications, and certainly, by having followed the rigorous arguments made in this course,
you will have given yourself a kind of training. The athlete doesn't just practice on the field or court where the competition is
held; he goes to the gym, and when he returns he finds he is better prepared for the game. In a similar way, by taking this course,
you will improve your thinking. As you make engineering or even business decisions, you will find that you have better reasons
to prefer one strategy over another, and will be able to articulate more rigorous arguments in support of your ideas.
We'll start our journey by going back to when we all first learned about functions.

Languages and Computability - Udacity


Functions and Computable Functions - (Udacity,Youtube )
In school, it is common to develop a certain mental conception of a function. This conception is often of the form:
A function is a procedure you do to an input (often called x) in order to find an output (often called f(x)).

However, this notion of a function is much different from the mathematical definition of a function:
A function is a set of ordered pairs such that the first element of each pair is from a set X (called the domain), the second
element of each pair is from a set Y (called the codomain or range), and each element of the domain is paired with exactly
one element of the range.

The first conception (the one many of us develop in school) is that of a function as an algorithm: a finite number of steps to
produce an output from an input. The second conception (the mathematical definition) is described in the abstract
language of set theory. For many practical purposes, the two definitions coincide. However, they are quite different: not every
function can be described as an algorithm, since not every function can be computed in a finite number of steps on every input.
In other words, there are functions that computers cannot compute.

Rules of the Game - (Udacity, Youtube)


When you hear the word computation, the image that should come to mind is a machine taking some input, performing a
sequence of operations, and after some time (hopefully) giving some output.

In this lesson, we will focus on the forms of the inputs and output and leave the definition of the machine for later.
The inputs read by the machine must be in the form of strings of characters from some finite set, called the machine's alphabet.
For example, the machines alphabet might be binary (0s and 1s), it might be based on the genetic code (with symbols A, C, G,
T), or it might be the ASCII character set.
Any finite sequence of symbols is called a string. Thus, 0110 is a string over the binary alphabet. Strings are the only input data
type that we allow in our model.
Sometimes, we will talk about machines having string outputs just like the inputs, but more often than not, the output will just be
binary: an up-or-down decision about some property of the input. You might imagine the machine just turning on one of two
lights, one for accept or one for reject, once the machine is finished computing.

With these rules, an important type of object becomes a collection of strings. Maybe it's the set of strings that some particular machine
accepts, or maybe we are trying to design a machine so that it accepts strings in a certain set and no others, or maybe we're
asking if it's even possible to design a machine that accepts everything in some particular set and no others. In all these cases,
it's a set of strings that we are talking about, so it makes sense to give this type its own name.
We call a set of strings a language. For example, a language could be a list of names; it could be the set of binary strings that
represent even numbers (notice that this set is infinite); or it could be the empty set. Any set of strings over an alphabet is a
language.

Operations on Languages - (Udacity, Youtube)


The concept of a language is fundamental to this course, so we'll take a moment to describe common operations used to
manipulate languages. For examples, we'll use the languages A = {0, 10} and B = {0, 11} over the zero-one alphabet.
Since languages are sets, we have the usual operations of union, intersection and complement defined for them. For example, A
union B consists of the three strings 0, 10, and 11. The string 0 comes from both A and B, 10 from A, and 11 from B. The
intersection contains only those strings in both languages, here just the string 0.
To define the complement of a language, we need to make clear what it is that we are complementing against. It's not sufficient just to say
that it is everything not in A. For A, that would include strings with characters besides 0 and 1, or maybe even infinite sequences
of 0s and 1s, which we don't want. The complement, therefore, is defined relative to the set of all strings over the
relevant alphabet, in this case the binary alphabet. The alphabet over which the language is defined is almost always clear from
context. In this case, the complement of A will be infinite.

In addition to these standard set operations, we also define an operation for concatenating two languages. The concatenation
of A and B is just all strings you can form by taking a string from A and appending a string from B to it. In our example, this set
would include 00, with the first 0 coming from A and the second from B; the string 011, with the 0 coming from A and the 11 coming
from B; and so forth. Of course, we can also concatenate a language with itself. Instead of writing AA, we often write A^2. In
general, when we want to concatenate a language with itself k times, we write A^k. Note that for k = 0, this is
defined as the language containing exactly the empty string.
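As a sanity check, these operations on finite languages are easy to sketch in Python. The helper names `concat` and `power` are my own choices, not course notation; the sets A and B are the example languages from the text.

```python
# A sketch of language operations on finite Python sets of strings.
A = {"0", "10"}
B = {"0", "11"}

def concat(X, Y):
    """Concatenation XY: each string from X followed by each string from Y."""
    return {x + y for x in X for y in Y}

def power(X, k):
    """X^k: X concatenated with itself k times; X^0 is {empty string}."""
    result = {""}  # the language containing exactly the empty string
    for _ in range(k):
        result = concat(result, X)
    return result

print(sorted(A | B))         # union: ['0', '10', '11']
print(sorted(A & B))         # intersection: ['0']
print(sorted(concat(A, B)))  # ['00', '011', '100', '1011']
print(power(A, 0))           # {''}
```

Note that the concatenation has only four strings, not |A| times |B| many in general, since different pairs can produce the same string.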

When we want to concatenate any number of strings from a language together to form a new language, we use an operator
known as the Kleene star. A* can be thought of as the union of all possible powers of the language. When we want to exclude the
empty string, we use the plus operator A+, which insists that at least one string from A be used. Notice the difference in starting
indices: the union for A* starts at A^0, while the union for A+ starts at A^1. So, for example, the string 01001010 is in A*. There is a way to break it up so that each part is in the language
A, and so as a whole the string can be thought of as a concatenation of strings from A. Note that even A* doesn't include infinite
sequences of symbols. Each individual string from A* must be of finite length, and you are only allowed to concatenate a finite
number together.
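Membership in A* can be checked by a short dynamic program over prefixes. This is only a sketch (the helper name `in_star` is mine): A* itself is infinite, so we test individual strings rather than enumerate the language.

```python
# Membership test for A*, where A is a finite language of binary strings.
def in_star(s, A):
    """True if s can be split into zero or more pieces, each a string in A."""
    ok = [False] * (len(s) + 1)
    ok[0] = True  # the empty string is always in A*
    for i in range(1, len(s) + 1):
        # position i is reachable if some word w in A ends a valid split here
        ok[i] = any(ok[i - len(w)] and s[i - len(w):i] == w
                    for w in A if 0 < len(w) <= i)
    return ok[len(s)]

A = {"0", "10"}
print(in_star("01001010", A))  # True: 0|10|0|10|10
print(in_star("11", A))        # False: no way to build 11 from {0, 10}
```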
For those who have studied regular expressions, this should seem quite familiar. In fact, one gets the notation of regular
expressions by treating individual symbols as languages. For example, 0* is the set of all strings consisting entirely of zeros.
Here we are treating the symbol 0 as a language unto itself. We will also commonly refer to Σ*, meaning all possible strings over
the alphabet Σ. Here we are treating the individual symbols of the alphabet Σ as strings in the language Σ.

Language Operations Exercise - (Udacity)

Countability (Part 1) - (Udacity, Youtube)


We need one more piece of mathematical background before we can more formally prove our claim that not all functions are
computable. Intuitively, the proof will show that there are more languages than there are possible machines. The set of possible
computer programs is countably infinite, but the set of languages over an alphabet is uncountably infinite.
If you aren't familiar with the distinction between countable and uncountable sets already, you may be thinking to yourself:
infinity is a strange enough idea by itself; now I have to deal with two of them! Well, it isn't as bad as all that. A countably
infinite set is one that you can enumerate: that is, you can say this is the first element, this is the second element, and so forth,
and (this is the really important part) eventually give a number to every element of the set. For some sets, an enumeration is
straightforward. Take the even numbers. We can say that 2 is the first one, 4 the second, 6 the third, and so forth. For some sets,
like the rationals, you have to be a little clever to find an enumeration. We'll see the trick for enumerating them in a little bit. And
for some sets, like the real numbers, it doesn't matter how clever you are; there simply is no enumeration. These are the
uncountable sets.
Let us make the following definition:
A set S is countable if it is finite or if it is in one-to-one correspondence with the natural numbers.

A one-to-one correspondence, by the way, is a function that is one-to-one, meaning that no two elements of the domain get
mapped to the same element of the range, and also onto, meaning that every element of the range is mapped to by an element
of the domain. For example, here is a one-to-one correspondence between the integers 1 through 6 and the set of permutations
of three elements.

Now, in general, the existence of a one-to-one correspondence implies that the two sets have the same size, that is, the same
number of elements. And this actually holds even for infinite sets. This is why we say that there are as many even natural
numbers as there are natural numbers: because it's easy to establish a one-to-one correspondence between the two sets,
f(n) = 2n for example. Some examples of countably infinite sets are:
The set of nonnegative even numbers (a one-to-one correspondence to the natural numbers is f(x) = x/2)
The set of positive odd numbers (with correspondence f(x) = (x - 1)/2)
The set of all integers (with correspondence f(x) = 2x if x is nonnegative and f(x) = -2x - 1 if x is negative)
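These correspondences can be checked numerically on a finite range; a quick Python sketch (countability itself, of course, concerns the whole infinite sets):

```python
# Checking the three one-to-one correspondences on a finite range.
def f_even(x):   # nonnegative even numbers -> natural numbers
    return x // 2

def f_odd(x):    # positive odd numbers -> natural numbers
    return (x - 1) // 2

def f_int(x):    # all integers -> natural numbers
    return 2 * x if x >= 0 else -2 * x - 1

evens = [f_even(x) for x in range(0, 20, 2)]   # 0, 2, ..., 18
odds  = [f_odd(x) for x in range(1, 20, 2)]    # 1, 3, ..., 19
ints  = sorted(f_int(x) for x in range(-5, 5)) # -5, -4, ..., 4
print(evens)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(odds)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(ints)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9] -- each natural hit exactly once
```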

For our purposes, we want to show that the set of all strings over a finite alphabet is countable. Since computer programs are
always represented as finite strings, this will tell us that the set of computer programs is countable. The proof is relatively
straightforward. Recall that Σ* is the union of the strings of size 0 with those of size 1 with those of size 2, etc. Our strategy will just
be to assign the first number to the strings in Σ^0, the next numbers to those in Σ^1, the next to those in Σ^2, etc. Here is the enumeration for the binary alphabet.

The key is that every string gets a positive integer mapped to it under this scheme. Therefore,
The set of all strings over any alphabet is countable.
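The length-by-length enumeration can be sketched as a Python generator (the function name `all_strings` is mine):

```python
from itertools import count, product

# Enumerate all strings over an alphabet: size 0, then size 1, then size 2, ...
# Within each size, strings come out in the alphabet's order.
def all_strings(alphabet):
    for n in count(0):                      # n = 0, 1, 2, ...
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

gen = all_strings("01")
first = [next(gen) for _ in range(7)]
print(first)  # ['', '0', '1', '00', '01', '10', '11']
```

Every string over the alphabet appears at a finite position in this stream, which is exactly what the enumeration argument requires.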

Countability (Part 2) - (Udacity, Youtube)


This same argument shows that
A countable union of finite sets is countable.

Suppose that our set of sets is S_0, S_1, etc. Without loss of generality, we'll suppose that they are disjoint. (If they happen not
to be disjoint, we can always make them so by subtracting out from S_k all the elements it shares with the sets S_0 ... S_(k-1).)
Then the argument proceeds just as before. We assign the first numbers to S_0, the next to S_1, etc. Every element in the union
must have a first set S_k that it belongs to, and thus it will be counted in the enumeration.

It turns out that we can actually prove something even stronger than this statement here.
We can replace this word finite with the word countable, and say that
A countable union of countable sets is countable.

Notice that our current proof doesn't work. If we tried to count all of the elements of S_0 before any of the elements of S_1, we
might never get to the elements of S_1, or any other set besides S_0. Nevertheless, this theorem is true. For convenience of
notation, we let the elements of S_k be {x_k0, x_k1, ...}, and then we can make each set S_k a row in a grid.

Again, we can't enumerate row-by-row here because we would never finish the first row. On the other hand, we can go diagonal-by-diagonal, since each diagonal is finite. The union of all the sets S_k is the union of all the rows, but that is the same as the
union of all the diagonals. Each diagonal being finite, we can then apply the original version of the theorem to prove that a
countable union of countable sets is countable.
Note that this idea proves that the rationals are countable. Imagine putting all fractions with a 1 in the numerator in the first row,
all those with a 2 in the numerator in the second row, etc.
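The diagonal walk over that grid can be sketched as a generator for the positive rationals. This is an illustrative sketch (the function name and the duplicate-skipping detail are mine): row n holds the fractions with numerator n, and diagonal d collects the pairs whose numerator and denominator sum to d, so every diagonal is finite.

```python
from fractions import Fraction

def rationals():
    """Yield every positive rational exactly once, diagonal by diagonal."""
    seen = set()
    d = 2
    while True:
        for num in range(1, d):       # walk one finite diagonal
            den = d - num
            q = Fraction(num, den)    # Fraction reduces 2/2 to 1, 2/4 to 1/2, ...
            if q not in seen:         # skip values already enumerated
                seen.add(q)
                yield q
        d += 1

gen = rationals()
first = [next(gen) for _ in range(6)]
print(first)  # [Fraction(1, 1), Fraction(1, 2), Fraction(2, 1),
              #  Fraction(1, 3), Fraction(3, 1), Fraction(1, 4)]
```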

A False Proof - (Udacity)

Languages are Uncountable - (Udacity, Youtube)


So far, we've seen that the set of strings over an alphabet is countable. But what about all subsets of these strings? What about
the set of all languages? It turns out that this set is uncountable.
The set of all languages over an alphabet is uncountable.

For the proof, we'll suppose not. That is, suppose there is an enumeration of the languages L1, L2, ... over an alphabet Σ. Also,
let x1, x2, ... be the strings in Σ*. We are then going to build a table, where the columns correspond to the strings from Σ*
and the rows correspond to the languages. In each entry in the table, we'll put a 1 if the string is in the language and a 0 if it is
not.

Now, we are going to consider a very sneaky language defined as follows: it consists of those strings xi for which xi is not in
the language Li. In effect, we've taken the diagonal in this table and just reversed it. Since we are assuming that the set of
languages is countable, this language must be Lk for some k. But is xk in Lk or not? From the table, the row for Lk is such that in
every column the entry is the opposite of what is on the diagonal. But the diagonal entry can't be the opposite of itself.
If xk is in Lk, then according to the definition of Lk, it should not be in Lk. On the other hand, if xk is not in Lk, then it should be
in Lk. Another way to think about this argument is to say that this opposite-of-the-diagonal language must be different from
every row in the table because it is different from the diagonal element. In any case, we have a contradiction and can conclude
that this enumeration of the languages was invalid, since the rest of our reasoning was sound. This argument is known as
the diagonalization trick, and we'll see it come up again later, when we discuss undecidability.
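A finite table makes the trick concrete. This is only a sketch with made-up data: rows stand for (hypothetical) languages, columns for strings, and entry [i][j] records whether string j is in language i.

```python
# A finite illustration of diagonalization: flip the diagonal of a 0/1 table.
table = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
]

# The "opposite of the diagonal" language: string k is in it exactly
# when string k is NOT in language k.
diagonal_language = [1 - table[i][i] for i in range(len(table))]
print(diagonal_language)  # [0, 1, 0, 0]

# It disagrees with row k in column k, so it matches no row of the table.
for k, row in enumerate(table):
    assert diagonal_language[k] != row[k]
    assert diagonal_language != row
```

The same disagreement-in-column-k argument works no matter how many rows the (infinite) table has, which is why no enumeration of languages can be complete.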

Consequences - (Udacity, Youtube)


Although they may not be immediately apparent, the consequences of the uncountability of languages are rather profound.
We'll let Σ be the set of ASCII characters (these are all the ones you would need to program) and observe that the set of
strings that represent valid Python programs is a subset of Σ*. Each program is a finite string, after
all. (The choice of Python is arbitrary; any programming language or set of languages works.) Since this set is a subset of the countable set Σ*,
it is countable. Thus, there are a countable number of Python programs.
On the other hand, consider this fact. For any language L, we can define F_L to be the function that is 1 if x is in L and 0
otherwise. All such functions are distinct, so the set of these functions must be uncountable, just like the set of all languages.
Here is the profound point: since the set of valid Python programs is countable but the set of functions is not, it follows that
there must be some functions that we just can't write programs for. In fact, there are uncountably many of them!
So going back to our picture of the high-school classroom, we can see that the teacher, perhaps without realizing it, was talking about
something much more general than what the student ended up thinking. There are only countably many computer programs that
can follow a finite number of steps, as the student was thinking, but there are uncountably many functions that fit the teacher's
definition.

Turing Machines - (Udacity)


Motivation - (Udacity, Youtube)
In the last lesson we talked informally about what it means to be computable. But what even is a computer? What kinds of
problems can we solve on a computer? What problems can't be solved?
To answer these questions, we need a model of computation. There are many, many ways we could define the notion of
computability: ideally something simple and easy to describe, yet powerful enough that this model can capture everything any
computer can do, now or in the future.

Luckily, one of the very first mathematical models of a computer serves us quite well, a model developed by Alan Turing in the
1930s. Decades before we had digital computers, Turing developed a simple model to capture the thinking process of a
mathematician. This model, which we now call the Turing machine, is an extremely simple device and yet completely captures
our notion of computability.
In this lesson we'll define the Turing machine and what it means for the machine to compute and either accept or reject a given
input. In future lessons we'll use this model to give specific problems that we cannot solve.

Introduction - (Udacity, Youtube)


In the last lesson, we began to define what computation is, with the goal of eventually being precise about what it can and
cannot do. We said that the input to any computation can be expressed as a string, and we assumed that, whatever instructions
there were for turning the input into output, these too could be expressed as a string. Using a counting argument, we were
able to show that there are some functions that are not computable.
In this lesson, we are going to look more closely at how input gets turned into output. Specifically, we are going to study the
Turing machine, the classical model for computation. As we'll see in a later lesson, Turing machines can do everything that we
consider as computation, and because of their simplicity, they are a terrific tool for studying computation and its limitations.
Massively parallel machines, quantum computers: they can't do anything that a Turing machine can't also do.
Turing machines were never intended to be practical, but nevertheless several have been built for illustrative purposes, including
this one from Mike Davey.

The input to the machine is a tape onto which the input string has been written. Using a read/write head, the machine turns input
into output through a series of steps. At each step, a decision is made about whether and what to write to the tape and whether
to move it right or left. This decision is based on exactly two things:
the current symbol under the read/write head, and
something called the machine's state, which also gets updated as the symbol is written.

That's it. The machine stops when it reaches one of two halting states, named accept and reject. Usually, we are interested in
which of these two states the machine halts in, though when we want to compute functions from strings to strings, we pay
attention to the tape contents instead.
It's a very interesting historical note that in Alan Turing's 1936 paper, in which he first proposed this model, the inspiration does
not seem to come from any thought of an electromechanical device but rather from the experience of doing computations on
paper. In section 9, he starts from the idea of a person, whom he calls the computer, working with pen and paper, and then argues
that his proposed machine can do what this person does.
Let's follow his logic by considering the computation of a very simple number: Alan Turing's age when he wrote the paper.

1936 - 1912 = 24.


Turing argues that any calculation like this can be done on a grid. Like a child's arithmetic book, he says. By this he means
something like wide-ruled graph paper. He argues that all symbols can be made to fit inside one of these squares.

Then he argues that the fact that the grid is two-dimensional is just a convenience, so he takes away the paper and says that
computation can be done on a tape consisting of a one-dimensional sequence of squares. This isn't convenient for me, but it doesn't
limit the computations I can do.
Then he points out that there are limits to the width of human perception. Imagine I am reading a very long mathematical paper,
where the phrase hence by theorem this big number we have is used. When I look back, I probably wouldn't be sure at a
glance that I had found the theorem number. I would have to check, maybe four digits at a time, crossing off the ones that I had
matched so as not to lose my place. Eventually, I will have matched them all and can re-read the theorem.

Since Turing was going for the simplest machine possible, he takes this idea to the extreme and only lets one symbol
be read at a time, and limits movement to only one square at a time, trusting to the strategy of making marks on the tape to
record my place and my state of mind to accomplish the same things as I would under normal operation with pen and paper.
And with those rules, I have become a Turing machine.
So that's the inspiration: not a futuristic vision of the digital age, but probably Alan Turing's own everyday experience of
computing with pen and paper.

Notation - (Udacity, Youtube)


Now that we have some intuition for the Turing machine, we turn to the task of establishing some notation for our mathematical
model. Here, I've used a diagram to represent the Turing machine and its configuration.

We have the tape and the read/write head, which is connected to the state transition logic and a little display that will indicate the
halt state, that is, the internal state of the Turing machine when it stops.
Mathematically, a Turing machine consists of:
1. A finite set of states Q. (Everything used to specify a Turing machine is finite. That is important.)
2. An input alphabet of allowed input symbols. (This must NOT include the blank symbol, which we will notate with a square cup
most of the time. For some of the quizzes, where we need you to be able to type the character, we will use b. We can't allow the
input alphabet to include the blank symbol, or we wouldn't be able to tell where the input string ended.)
3. A tape alphabet of symbols that the read/write head can use. (This WILL include the blank symbol.)
4. It also includes a transition function from a (state, tape symbol) pair to a (state, tape symbol, direction) triple. This, of course, tells the
machine what to do. For every possible current state and symbol that could be read, we have the appropriate response: the new
state to move to, the symbol to write to the tape (make this the same as the read symbol to leave it alone), and the direction to move
the head relative to the tape. Note that we can always move the head to the right, but if the head is currently over the first position
on the tape, then we can't actually move left. When the transition function says that the machine should move left in this situation, we have it stay in
the same position by convention.
5. We also have a start state. The machine always starts in the first position on the tape and in this state.

6. Finally, we have an accept state,
7. and a reject state. When these are reached, the machine halts its execution and displays the final state.

At first, all of this notation may seem overwhelming (it's a seven-tuple, after all). Remember, however, that all the machine ever
does is respond to the current symbol it sees based on its current state. Thus, it's the transition function that is at the heart of
the machine, and almost all of the important information, like the set of states and the tape alphabet, is implicit in it.

Testing Oddness - (Udacity, Youtube)


For our first Turing machine example, consider one that tests the oddness of the binary representation of a natural number. Note
that I've cheated here in the transition function by including only the (state, symbol) pairs in the domain that we would actually
encounter during computation. By convention, if no transition is specified for the current (state, symbol) pair, then the
program just halts in the reject state.

One convenient way to represent the transition function, by the way, is with a state diagram, similar to what is often used for
finite automata, for those familiar with that model of computation. Each state gets its own vertex in a multigraph, and every row of the
transition table is represented as an edge. The edge gets labeled with the remaining information besides the two states: the
symbol that is read, the one that is written, and the direction in which the head is moved.
See if you can trace through the operation of the Turing machine for the input shown. If you are unsure, watch the video.
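If you prefer to trace the machine in code, here is a minimal sketch of a Turing machine simulator running an oddness tester. The state names (`scan`, `check`), the dictionary layout, and using `b` for the blank are my own choices, not the course's exact notation; the no-transition-means-reject convention follows the text.

```python
# A sketch of a Turing machine simulator.
# delta maps (state, symbol) -> (new state, symbol to write, move L/R).
BLANK = "b"

def run(delta, start, tape_str):
    tape = dict(enumerate(tape_str))  # sparse tape: position -> symbol
    state, head = start, 0
    while state not in ("accept", "reject"):
        symbol = tape.get(head, BLANK)
        if (state, symbol) not in delta:
            return "reject"           # convention: missing transition rejects
        state, write, move = delta[(state, symbol)]
        tape[head] = write
        # moving left off the first square leaves the head in place
        head = max(0, head + (1 if move == "R" else -1))
    return state

# Oddness of a binary number: scan right to the first blank, then check
# whether the last symbol (the least significant bit) is a 1.
delta = {
    ("scan", "0"):   ("scan", "0", "R"),
    ("scan", "1"):   ("scan", "1", "R"),
    ("scan", BLANK): ("check", BLANK, "L"),
    ("check", "1"):  ("accept", "1", "R"),
    # ("check", "0") is deliberately missing, so even numbers are rejected.
}

print(run(delta, "scan", "1011"))  # accept (11 is odd)
print(run(delta, "scan", "110"))   # reject (6 is even)
```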

Configuration Sequences - (Udacity, Youtube)


Recall that a Turing machine always starts in the initial state and with the head at the first position on the tape. As it computes,
its internal state, the tape contents, and the position of the head will change, but everything else will stay the same. We call this
triple of state, tape content, and head position a configuration, and any given computation can be thought of as a sequence of
configurations. It starts with the initial state, the input string and with the head position on the first location, and it proceeds from
there.

Now, it isn't very practical to draw a picture like this one every time we want to refer to the configuration of a Turing
machine, so we develop some notation that captures the idea.

We'll do the same computation again, but this time we'll write down the configuration using this notation. We write the start
configuration as q0 1011. The part to the left of the state represents the contents of the tape to the left of the head. It's just the
empty string in this case. Then we have the state of the machine and then the rest of the tape contents.
After the first step, 1 is to the left of the head, we are still in state q0, and 011 is the string to the right. In the next configuration, 10
is to the left, we are still in state q0, and 11 is to the right, and so on and so forth.
This notation is a little awkward, but it's convenient for typesetting. It's also very much in the spirit of Turing machines, where
all structured data must ultimately be represented as strings. If a Turing machine can handle working with data like this, then so
can we.
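Generating this notation is mechanical. A quick sketch (the helper name `format_config` is mine): the configuration string is the tape to the left of the head, then the state, then the rest of the tape.

```python
# Format a configuration in the "left-of-head, state, rest-of-tape" notation.
def format_config(state, tape, head):
    return (tape[:head] + " " + state + " " + tape[head:]).strip()

# The configurations from the text's trace on input 1011:
print(format_config("q0", "1011", 0))  # q0 1011
print(format_config("q0", "1011", 1))  # 1 q0 011
print(format_config("q0", "1011", 2))  # 10 q0 11
```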
At a slightly higher level, a whole sequence of configurations like this captures everything that a Turing machine did on a
particular input, and so we will sometimes call such a sequence a computation. And actually, this representation of a
computation will be central when we discuss the Cook-Levin theorem in the section on complexity.

The Value of Practice - (Udacity, Youtube)


Next, we are going to get some practice tracing through Turing machine computations and programming Turing machines. The
point of these exercises is not so that you can put on your resume that you are good at programming Turing machines. And if
someone asks you in an interview to write a program to test whether an input number is prime, I wouldn't recommend trying to
draw a Turing machine state diagram. Somehow, I doubt that will land you the job.
Rather, the point is to help you convince yourself that Turing machines can do everything that we mean by computation, and that if
you really had to, you could program a Turing machine to do anything that you could write a procedure to do in your favorite
programming language. There are two ways to convince yourself of this. One is just to practice so that you build up enough
facility with programming them, that is to say, it becomes easy enough for you that it just seems intuitive that you could do
anything. Another way is to show that a Turing machine can simulate the action of models that are closer to real-world
computers, like the Random Access Model. We'll do that in a later lesson. But to be able to understand these simulation
arguments, you need a pretty good facility with how Turing machines work anyway, so make the most of these examples and
exercises.

Equality Testing - (Udacity, Youtube)


To illustrate some of the key challenges in computing with Turing machines and how to overcome them, we'll examine this task
here, where we are given an input string and want to tell if it is of the form w#w, where w is a binary string. In other words, we
want to test whether the string to the left of the first hash is the same as the string to the right of it.
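To pin down the specification, here is the same language test written as ordinary Python. This is a sketch of what the machine decides, not of how it works (the function name is mine):

```python
# Decide the language { w#w : w a binary string } directly in Python.
def is_w_hash_w(s):
    left, sep, right = s.partition("#")   # split at the first hash
    return (sep == "#"                    # a hash must be present
            and "#" not in right          # and it must be the only one
            and left == right             # the two halves must match
            and all(c in "01" for c in left))

print(is_w_hash_w("101#101"))  # True
print(is_w_hash_w("101#100"))  # False
print(is_w_hash_w("0110"))     # False: no hash at all
```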

(Watch the video for an illustration on this example.)


As Turing machines go, this is a pretty simple program, but as you can see here, the state diagram gets a little messy. Like the
Sipser textbook, I've used a little shorthand here in the diagram. When two symbols appear to the left of the arrow, I mean
either one of them. It's easier than writing out a whole other edge. Also, sometimes I will only give a direction on the right.
Interpret that to mean that the tape should be left alone.

(Watch the video for an illustration of the state transitions in the diagram.)

Configuration Exercise - (Udacity)

Right Shift - (Udacity)


Now, I want you to actually build a Turing machine. The goal is to right-shift the input, place a dollar sign symbol in front of
the input, and accept. Unlike our previous examples, accept vs. reject isn't important here. This is more like a subroutine within a
larger Turing machine that you might want to build, though you could think of it as a Turing machine that computes a
function from all strings over the input alphabet to strings over the tape alphabet, if you like.

Balanced Strings - (Udacity)

Language Deciders - (Udacity, Youtube)


If you completed all those exercises, then you are well on your way towards understanding how Turing machines compute, and
hopefully, you are also ready to be convinced that they can compute as well as any other possible machine.
Before we move on to that argument, however, there is some terminology about Turing machines and the languages that they
accept, reject, or might loop on that we should set straight. Some of the terms may seem very similar, but the
distinctions are important ones, and we will use them freely throughout the rest of the course, so pay close attention. First we
define what it means for a machine to decide a language.
A Turing machine decides a language L if and only if it accepts every x in L and rejects every x not in L.

For example, the Turing machine we just described decided the language L consisting of strings of the form w#w, where w is a
binary string. We might also say the Turing machine computed the function that is 1 if x is in L and 0 otherwise, or even just
that the Turing machine computed L.

Contains a One - (Udacity)


Now for a question that is a little tricky. Consider the language L that consists of all binary strings that contain the symbol 1. Does
this Turing machine decide L?

Language Recognizers - (Udacity, Youtube)


The possibility of Turing machines looping forever leads us to define the notion of a language recognizer.
We say that a Turing machine recognizes a language L if and only if it accepts every string in the language and does not accept
any string not in the language. Thus, we can say that the Turing machine from the quiz does indeed recognize the language
consisting of strings that contain a 1. It accepts those containing a 1, and it doesn't accept the others; it loops on them.

Contrast this definition with what it takes for a Turing machine to decide a language. Then it needs not only to accept everything
in the language but it must reject everything else. It cant loop like this Turing machine.
If we wanted to build a decider for this language we would need to modify the Turing machine so that it detects the end of the
string and move into the reject state.
At this point, it also makes sense to define the language of a machine, which is just the language that the machine recognizes.
After all, every machine recognizes some language, even if it is the empty one.
Formally, we define L(M) to be the set of strings accepted by M , and we call this the language of the machine M .

Conclusion - (Udacity, Youtube)


In this lesson, we examined the workings of Turing machines, and if you completed all the exercises, you should have a strong sense of how to use them to compute functions and decide languages. We've also seen how, unlike some simpler models of computation, Turing machines don't necessarily halt on all inputs. This forced us to distinguish between language deciders and language recognizers. Eventually, we will see how this problem of halting or not halting is the key to understanding the limits of computation.
We've shown Turing machines can test equality of strings, something that cannot be computed by simpler models like the finite or push-down automata that you might have seen in your undergraduate courses. But equality is a rather simple problem. Can Turing machines solve the complex tasks we have our computers do? Can a Turing machine do your taxes or play a great game of chess? Next lesson, we'll see that Turing machines can indeed solve any problem that our computers can solve and truly do capture the idea of computation.

Church-Turing Thesis - (Udacity)


Introduction - (Udacity, Youtube)
In this lecture, we will give strong evidence for the statement that
everything computable is computable by a Turing machine.

This statement is called the Church-Turing thesis, named for Alan Turing, whom we met in the previous lesson, and Alonzo Church, who had an alternative model of computation known as the lambda calculus, which turns out to be exactly as powerful as a Turing machine. We call the Church-Turing thesis a thesis because it isn't a statement that we can prove or disprove. In this lecture we'll give a strong argument that our simple Turing machine can do anything today's computers can do, or anything a computer could ever do.
To convince you of the Church-Turing thesis, we'll start from the basic Turing machine and then branch out, showing that it is equivalent to machines as powerful as the most advanced machines today or in any conceivable future. We'll begin by looking at multi-tape Turing machines, which in many cases are much easier to work with. And we'll show that anything a multi-tape Turing machine can do, a regular Turing machine can do too.
Then we'll consider the Random Access Model: a model capturing all the important capabilities of a modern computer, and we will show that it is equivalent to a multitape Turing machine. Therefore it must also be equivalent to a regular Turing machine. This means that a simple Turing machine can compute anything your Intel i7 (or whatever chip you may happen to have in your computer) can.

Hopefully, by the end of the lesson, you will have understood all of these connections and you'll be convinced that the Church-Turing thesis really is true. Formally, we can state the thesis as:
a language is computable if and only if it can be implemented on a Turing machine.

Simulating Machines - (Udacity, Youtube)


Before going into how single-tape machines can simulate multitape ones, we will warm up with a very simple example to illustrate what is meant when we say that one machine can simulate another.
Let's consider what I'll call stay-put machines. These have the capability of not moving their heads in a computation step (which Turing machines, as we've defined them, are not allowed to do). So the transition function now includes S, which makes the head stay put. Now, this doesn't add any additional computational capability to the machine, because I can accomplish the same things with a normal Turing machine. For every transition where the head stays put, we can introduce a new state and just have the head move one step right and then one step left. (Gamma here means match everything in the tape alphabet.)

This puts the tape head back in the spot it started without affecting the tape's contents. Except for occasionally taking an extra movement step, this Turing machine will operate in the same way as the stay-put machine.
More precisely, we say that two machines are equivalent if they accept the same inputs, reject the same inputs, and loop on the same inputs. Considering the tape to be part of the output, equivalent machines also halt with the same tape contents.

Note that other properties of the machines (such as the number of states, the tape alphabet, or the number of steps in any given computation) do not need to be the same. Just the relationship between the input and the output matters.
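The construction above can be carried out mechanically. The following Python function (my own illustrative formulation, using dict-based transition tables rather than the course's diagrams) rewrites every stay-put transition into a move right followed by a move left through a fresh intermediate state:

```python
def remove_stay_put(delta, gamma):
    """Convert a stay-put transition table into one using only L/R moves.

    delta maps (state, symbol) -> (new_state, write_symbol, move), where
    move is 'L', 'R', or 'S'.  Each 'S' transition is replaced by a move
    right into a fresh state, which then moves left on every symbol of the
    tape alphabet gamma ("match everything in the tape alphabet").
    """
    new_delta = {}
    fresh = 0
    for (q, a), (q2, b, move) in delta.items():
        if move != 'S':
            new_delta[(q, a)] = (q2, b, move)
        else:
            bounce = ('bounce', fresh)      # fresh intermediate state
            fresh += 1
            new_delta[(q, a)] = (bounce, b, 'R')
            for g in gamma:                 # return left without changing the tape
                new_delta[(bounce, g)] = (q2, g, 'L')
    return new_delta
```

The rewritten machine takes one extra step per stay-put transition but is otherwise equivalent in the sense just defined.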

Multitape Turing Machines - (Udacity, Youtube)

Since having multiple tapes makes programming with Turing machines more convenient, and since it provides a nice intermediate step for getting into more complicated models, we'll look at this Turing machine variant in detail. As shown in the figure here, each tape has its own tape head.

What the Turing machine does at each step is determined solely by its current state and the symbols under these heads. At each step, it can change the symbol under each head, and it moves each head right or left, or just keeps it where it is. (With a one-tape machine, we always forced the head to move, but if we required that condition for multitape machines, the differences in tape head positions would always be even, which leads to awkwardness in programming. It's better to allow the heads to stay put.)
Except for those differences, multitape Turing machines are the same as single-tape ones. We'll only need to redefine the transition function. For a Turing machine with k tapes, the new transition function is

δ : Q × Γ^k → Q × Γ^k × {L, R, S}^k

Everything else stays the same.

Duplicate the Input - (Udacity, Youtube)


Let's see a multitape Turing machine in action. Input always comes in on the first tape, and all the heads start at the left end of the tapes. Our task will be to duplicate this input, separated by a hash mark.

(See video for animation of the computation).
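In lieu of the animation, the two-tape computation can be sketched in Python. The three-phase structure is my own paraphrase of what the machine does; a real multitape machine would encode each phase in its states, and every line here corresponds to single-cell head moves.

```python
BLANK = '_'

def duplicate_input(w):
    """Step-by-step sketch of the two-tape duplication machine.

    Tape 1 starts holding the input; tape 2 is blank.  The machine copies
    the input to tape 2, writes '#' after the input on tape 1, rewinds
    tape 2's head, then copies tape 2 back onto tape 1.
    """
    tape1 = list(w) + [BLANK] * (len(w) + 2)
    tape2 = [BLANK] * (2 * len(w) + 3)
    h1 = h2 = 0
    # Phase 1: copy the input to tape 2, both heads moving right in lockstep.
    while tape1[h1] != BLANK:
        tape2[h2] = tape1[h1]
        h1 += 1
        h2 += 1
    # Phase 2: write the separator on tape 1; rewind tape 2's head.
    tape1[h1] = '#'
    h1 += 1
    h2 = 0
    # Phase 3: copy tape 2 back onto tape 1.
    while tape2[h2] != BLANK:
        tape1[h1] = tape2[h2]
        h1 += 1
        h2 += 1
    return ''.join(tape1[:h1])
```

For example, `duplicate_input('101')` leaves `101#101` on the first tape.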

Substring Search - (Udacity)

Next, we'll do a little exercise to practice using multitape Turing machines. Again, the point here is not so that you can put experience programming multitape Turing machines on your resume. The idea is to get you familiar with the model so that you can really convince yourself of the Church-Turing thesis and understand how Turing machines can interpret their own description in a later lesson.
With that in mind, your task is to build a two-tape TM that decides the language of strings of the form x#y, where x is a substring of y. So for example, the string 101#01010 is in the language: the second through fourth characters of y match x. But on the other hand, 001#01010 is not in the language. Even though two 0s and a 1 appear in the string 01010, 001 is not a substring because those symbols do not appear consecutively.

Multitape and Single-Tape Equivalence - (Udacity, Youtube)


Now, I will argue that these enhanced multitape Turing machines have the same computational power as regular Turing machines. Multitape machines certainly don't have less power: by ignoring all but the input tape, we obtain a regular Turing machine.
Let's see why multitape machines don't have any more power than regular machines.
On the left, we have a multitape Turing machine in some configuration, and on the right, we have created a corresponding configuration for a single-tape Turing machine.

On the single tape, we have the contents of the multiple tapes, with each tape's contents separated by a hash. Also, note these dots here. We are using a trick that we haven't used before: expanding the size of the alphabet. For every symbol in the tape alphabet of the multitape machine, we have two on the single-tape machine, one that is marked by a dot and one that is unmarked. We use the marked symbols to indicate that a head of the multitape machine would be over this position on the tape.
Simulating a step of the multitape machine with the single-tape version happens in two phases: one for reading and one for writing. First, the single-tape machine simulates the simultaneous reading of the heads of the multitape machine by scanning over the tape and noting which symbols have marks. That completes the first phase, where we read the symbols.

Now we need to update the tape contents and head positions (or markers) as part of the writing phase. This is done in a second, leftward pass across the tape. (See video for an example.)
Note that it is possible that one of these strings will need to increase its length when the multitape machine reaches a position it hasn't reached before. In that case, we just right-shift the tape contents to allow room for the new symbol to be written.
Once all that work is done, we return the head back to the beginning to prepare for the next pass.
So, all the information about the configuration of a multitape machine can be captured on a single tape. It shouldn't be too hard to convince yourself that the logic of reading and keeping track of the multiple dotted symbols and taking the right action can be captured as well. In fact, this would be a good exercise for you to do on your own.
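To make the encoding concrete, here is a sketch of how a multitape configuration could be flattened onto a single tape. The dot-marking and hash-separator conventions follow the description above; the exact string syntax is an illustrative choice of mine.

```python
def encode_configuration(tapes, heads):
    """Flatten k tapes onto one single-tape string.

    tapes is a list of strings (each tape's contents); heads gives the
    head position on each tape.  The symbol under a head is marked by a
    following dot, and the tapes' contents are separated by hashes.
    """
    pieces = []
    for tape, h in zip(tapes, heads):
        cells = [c + ('.' if i == h else '') for i, c in enumerate(tape)]
        pieces.append(''.join(cells))
    return '#' + '#'.join(pieces) + '#'
```

For instance, two tapes holding `101` (head on the first cell) and `01` (head on the second cell) become `#1.01#01.#`, and the single-tape machine finds the "read" symbols by scanning for the dots.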

Analysis of Multitape Simulation - (Udacity)


Now for an exercise. No more programming Turing machines. Instead, I want you to try to figure out how long this simulation process takes. Let M be a multitape Turing machine and let S be its single-tape equivalent as we've defined it. If on input x, M halts after t steps, then S halts after how many steps? Give the most appropriate bound. Note that we are treating the number of tapes k as a constant here.
This question isn't easy, so just spend a few minutes thinking about it and take a guess before watching the video answer.

RAM Model - (Udacity, Youtube)


There are other curious variants of the basic Turing machine: we can restrict them so that a symbol on a square can only be changed once, we can let them have two-way infinite tapes, or we can even let them be nondeterministic (we'll examine this idea when we get to complexity). All of these variants are equivalent to Turing machines in the sense we have been talking about, and it's good to know that they are equivalent.
Ultimately, however, I doubt that the equivalence of those models does much to convince anyone that Turing machines capture the common notion of computation. To make that argument, we will show that a Turing machine is equivalent to the Random Access model, which very closely resembles the basic CPU/register/memory paradigm behind the design of modern computers.
Here is a representation of the RAM model.

Instead of operating with a finite alphabet like a Turing machine, the RAM model operates with non-negative integers, which can be arbitrarily large. It has registers, useful for storing operands for the basic operations, and an infinite storage device analogous to the tape of a regular Turing machine. I'll call this memory for obvious reasons. There are two key differences between this memory and the tape of a regular Turing machine:
1. each position on this device stores a number, and
2. any element can be read with a single instruction, instead of moving a head over the tape to the right spot.

In addition to this storage, the machine also contains the program itself, expressed as a sequence of instructions, and a special register called the program counter, which keeps track of which instruction should be executed next. Every instruction is one of a finite set that closely resembles the instructions of assembly code. For instance, we have the instruction read j, which reads the contents of the jth address in memory and places it in register 0. Register 0, by the way, has a special status and is involved in almost every operation. We also have a write operation, which writes to the jth address in memory. For moving data between the registers, we have load, which writes to R0, and store, which writes from it, as well as add, which increases the number in R0 by the amount in Rj. All of these operations cause the program counter to be incremented by 1 after they are finished.
To jump around the list of instructions (as one needs to do for conditionals) we have a series of jump instructions that change the program counter, sometimes depending on the value in R0.
And finally, of course, we have the halt instruction to end the program. The final value in R0 determines whether the machine accepts or rejects. Note that in our definition here there is no multiplication. We can achieve that through repeated addition.
We won't have much use for the notation surrounding the RAM model, but nevertheless it's good to write things down mathematically, as this sometimes sharpens our understanding. In this spirit, we can say that a Random Access Turing machine consists of:
- a natural number k indicating the number of registers
- and a sequence of instructions.
The configuration of a Random Access machine is defined by
- the counter value, which is 0 for the halting state and indicates the next instruction to be executed otherwise,
- and the register values and the values in the memory, which can be expressed as a function.
(Note that only a finite number of the addresses will contain a nonzero value, so this function always has a finite representation. We'll use 1-based indexing, hence the domain for the tape is the natural numbers starting from one.)
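The instruction set described above is small enough to interpret directly. Here is a minimal Python interpreter for it; since the lesson only names the instructions, the concrete tuple syntax (`('read', j)`, `('jzero', k)`, and so on) is my own guess at a reasonable encoding.

```python
def run_ram(program, memory=None, max_steps=10_000):
    """Minimal interpreter for the RAM model sketched above.

    Instructions: ('read', j), ('write', j), ('load', j), ('store', j),
    ('add', j), ('jump', k), ('jzero', k), ('halt',).  R0 is the special
    register involved in almost every operation; memory maps addresses to
    numbers, with unwritten addresses holding zero.
    """
    regs, mem, pc = {}, dict(memory or {}), 0
    for _ in range(max_steps):
        instr = program[pc]
        op = instr[0]
        arg = instr[1] if len(instr) > 1 else None
        pc += 1                               # default: fall through to next line
        if op == 'read':                      # R0 <- memory[j]
            regs[0] = mem.get(arg, 0)
        elif op == 'write':                   # memory[j] <- R0
            mem[arg] = regs.get(0, 0)
        elif op == 'load':                    # R0 <- Rj
            regs[0] = regs.get(arg, 0)
        elif op == 'store':                   # Rj <- R0
            regs[arg] = regs.get(0, 0)
        elif op == 'add':                     # R0 <- R0 + Rj
            regs[0] = regs.get(0, 0) + regs.get(arg, 0)
        elif op == 'jump':                    # unconditional jump
            pc = arg
        elif op == 'jzero':                   # conditional jump on R0 == 0
            if regs.get(0, 0) == 0:
                pc = arg
        elif op == 'halt':                    # final R0 decides accept/reject
            return regs.get(0, 0), mem
    raise RuntimeError("step limit reached")
```

For example, a five-instruction program can add memory cells 1 and 2 into cell 3: read cell 1 into R0, stash it in R1, read cell 2, add R1, and write the sum out.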

Equivalence of RAM and Turing Machines - (Udacity, Youtube)


Now we are ready to argue for the equivalence of our Random Access Model and the traditional Turing machine. To translate between the symbol representation of the Turing machine and the numbers of the RAM, we'll use a one-to-one correspondence

E : Γ → {0, …, |Γ| − 1}.

The blank symbol is mapped to zero (E(⊔) = 0) so that the default value of the tape corresponds to the default value for memory.
First, we argue that a RAM can simulate a single-tape Turing machine. The role the tape played in the Turing machine will be played by the memory. We'll keep track of the head position in a fixed register, say R1. And the program and the program counter will implement the transition function. Here, I've written out in pseudocode what this might look like for the simple Turing machine shown over here, which just replaces all the ones with zeros and then halts.

Being in state q0 corresponds to having the program counter point to the top line of the program, so the RAM will execute a sequence of tests for what the symbol under the head would be, adjust the values on the tape or memory accordingly, and then jump to the appropriate line for the next state.

Now we argue the other way: that a traditional Turing machine can simulate a RAM. Actually, we'll create a multitape Turing machine that implements a RAM, since that is a little easier to conceptualize. As we've seen, anything that can be done on a multitape Turing machine can be done with a single tape.

We will have one tape per register, and each tape will represent the number stored in the corresponding register.
We also have another tape that is useful for scratch work in some of the instructions that involve constants, like add 55.
Then we have two tapes corresponding to the random access device. One is for input and output, and the other is for simulating the contents of the memory device during execution. Storing the contents of the random access device is the more interesting part. This is done just by concatenating the (index, value) pairs using some standard syntax like parentheses and commas.

The program of the RAM must be simulated by the state transitions of the Turing machine. This can be accomplished by having a subroutine, or sub-Turing machine, for each instruction in the program. The most interesting of these instructions are the ones involving memory. We simulate those by searching the tape that stores the contents of the RAM for one of these pairs that has the proper index and then reading or writing the value as appropriate. If no such pair is found, then the value on the memory device must be zero.
After the work of the instruction is completed, the effect of incrementing the program counter is achieved by transitioning to the state corresponding to the start of the next instruction. That is, unless the instruction was a jump, in which case that transition is effected instead. Once the halt instruction is executed, the contents of the tape simulating the random access device are copied out onto the I/O tape.
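The search through the (index, value) pairs can be sketched as follows; a Python list of pairs stands in for scanning along the memory tape.

```python
def mem_lookup(pairs, addr):
    """Scan the simulated memory tape for a pair with the given index.
    If no such pair is found, the simulated memory cell holds zero."""
    for index, value in pairs:
        if index == addr:
            return value
    return 0

def mem_store(pairs, addr, value):
    """Overwrite the pair with the given index, or append a new pair
    (the tape-level analogue of right-shifting to make room)."""
    for k, (index, _) in enumerate(pairs):
        if index == addr:
            pairs[k] = (addr, value)
            return
    pairs.append((addr, value))
```

Each lookup is a linear scan of the whole memory tape, which is one reason the simulation is slower than the RAM it imitates.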

RAM Simulation Running Time - (Udacity)


Given this description of how a traditional Turing machine can simulate a Random Access Model, I want you to think about how long this simulation takes. Let R be a random access machine and let M be its multi-tape equivalent. We'll let n be the length of the binary encoding of the input to R and let t be the number of steps taken by R. Then how long does M take to simulate R?
This is a tough question, so I'll give you a hint: the length of the representation of a number increases by at most a constant in each step of the random access machine R.

Conclusion - (Udacity, Youtube)


Once you know that a Turing machine can simulate a RAM, then you know it can simulate a standard CPU. Once you can simulate a CPU, you can simulate any interpreter or compiler and thus any programming language. So anything you can run on your desktop computer can be simulated by a Turing machine.
What about multi-core, cloud computing, probabilistic, quantum and DNA computing? We won't do it here, but you can prove that Turing machines can simulate all those models as well. The Church-Turing thesis has truly stood the test of time. Models of computation have come and gone, but none have been any match for the Turing machine.
Why should we care about the Church-Turing thesis? Because there are problems that Turing machines can't solve. We argued this with counting arguments in the first lecture and will give specific examples in future lectures. If these problems can't be solved by Turing machines, they can't be solved by any other computing device.
To help us describe specific problems that one cannot compute, in the next lecture we discuss two of Turing's critical insights: that a computer program can be viewed as data (as part of the input to another program) and that one can have a universal Turing machine that can simulate the code of any other computer: one machine to rule them all.

Universality - (Udacity)

Introduction - (Udacity, Youtube)


In 1936, when Alan Turing wrote his famous paper On Computable Numbers, he not only created the Turing machine but had a number of other major insights on the nature of computation. Turing realized that the computer program itself could also be considered as part of the input. There really is no difference between program and data. We all take that as a given today: we create computer code in a file that gets stored on the computer no differently than any other type of file. And data files often have computer instructions embedded in them. Even modern computer fonts are basically small programs that generate readable characters at arbitrary sizes.
Once he took the view of program code as data, Turing had the beautiful idea that a Turing machine could simulate that code. There is some fixed Turing machine, a universal Turing machine, that can simulate the code of any other Turing machine. Again, this is an idea that we take for granted today, as we have interpreters and compilers that can run the code of any programming language. But back in Turing's day, this idea of a universal machine was the major breakthrough that allowed Turing, and will allow us, to develop problems that Turing machines, or any other computer, cannot solve.

Encoding a Turing Machine - (Udacity, Youtube)


Before we can simulate or interpret a Turing machine, we first have to represent it using a string. Notice that this presents an immediate challenge: our universal Turing machine must use a fixed alphabet for its input and have a fixed number of states, but it must be able to simulate other Turing machines with arbitrarily large alphabets and arbitrary numbers of states. As we'll see, one solution is essentially to enumerate all the symbols and states and represent them in binary. There are lots of ways to do this; the way we're going to do it is a compromise between readability and efficiency.

Let M be a Turing machine with states Q = {q0, …, q(n−1)} and tape alphabet Γ = {a1, …, am}. Define i and j so that 2^i is at least the number of states and 2^j is at least the number of tape symbols. Then we can encode a state qk as the concatenation of the symbol q with the string w, where w is the i-bit binary representation of k. For example, if there are 6 states, then we need three bits to encode all the states. The state q3 would be encoded as the string q011. By convention, we make
the initial state q0,
the accept state q1,
the reject state q2.

We use an analogous strategy for the symbols, encoding ak as the symbol a followed by the string w, where w is the j-bit binary representation of k.
For example, if there are 10 symbols, then we need four bits to represent them all. If a5 is the star symbol, we would encode that symbol as a0101.
Let's see an encoding for an example.

This example decides whether the input consists of a number of zeros that is a power of two. To encode the Turing machine as a whole, we really just need to encode its transition function. We'll start by encoding the black edge from the diagram. We are going from state zero, seeing the symbol zero; we go to state three, write symbol zero, and move the head to the right.
Remember that the order here is input state, input symbol, then output state, output symbol, and finally direction.
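The encoding scheme can be written out directly; this sketch reproduces the examples from the text (q3 with 6 states becomes q011, and a5 with 10 symbols becomes a0101), and strings a transition's five parts together in the order just described.

```python
from math import ceil, log2

def encode_state(k, num_states):
    """q_k -> 'q' + k in binary, padded to i bits where 2**i >= num_states."""
    i = max(1, ceil(log2(num_states)))
    return 'q' + format(k, '0{}b'.format(i))

def encode_symbol(k, num_symbols):
    """a_k -> 'a' + k in binary, padded to j bits where 2**j >= num_symbols."""
    j = max(1, ceil(log2(num_symbols)))
    return 'a' + format(k, '0{}b'.format(j))

def encode_transition(q, a, q2, b, direction, num_states, num_symbols):
    """Concatenate: input state, input symbol, output state, output symbol,
    and finally the direction ('L' or 'R')."""
    return (encode_state(q, num_states) + encode_symbol(a, num_symbols) +
            encode_state(q2, num_states) + encode_symbol(b, num_symbols) +
            direction)
```

Because every state uses i bits and every symbol uses j bits, the universal machine can parse an encoded transition unambiguously from left to right.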

Encoding Quiz - (Udacity)


Now, I'm not going to write out all the rest of the transitions, but I think it would be a good idea for you to do one more. So use this red box here to encode this red transition.

Building a Universal Turing Machine - (Udacity, Youtube)


Now, we are ready to describe how to build this all-powerful universal Turing machine. As input to the universal machine, we will give the encoding of the machine M and the encoding of the input w, separated by a hash. We write this as

⟨M⟩#⟨w⟩

The goal is to simulate M's execution when given the input w, halting in an accept or reject state, or not halting at all, and ultimately outputting the encoding of the output of M on w when M does halt. We'll describe a 3-tape Turing machine that achieves this goal of simulating M on w.
The input comes in on the first tape. First, we'll copy the description of the machine to the second tape and copy the initial state to the third tape. For example, the tape contents might end up like this.

Then we rewind all the heads and begin the second phase. Here, we search for the appropriate tuple in the description of the machine. The first element has to match the current state stored on tape three, and the symbol part has to match the encoding on tape 1. If no match is found, then we halt the simulation and put the universal machine in an accepting or rejecting state according to the current state of the machine being interpreted. If there is a match, however, then we apply the changes to the first tape and repeat.

Actually, interpreting a Turing machine description is surprisingly easy.


We've just seen how Turing machines are indeed reprogrammable, just like real-world computers. This lends further support to the Church-Turing thesis, but it also has significance beyond that. Since the input to a Turing machine can be interpreted as a Turing machine, this suggests that programs are a type of data. But arbitrary data can also be interpreted as a (possibly invalid) Turing machine. So is there any difference between data and program? Perhaps we can leave this question for the philosophers.

Abstraction - (Udacity, Youtube)


At this point, the character of our discussion of computability is going to change significantly. We've established the key properties of Turing machines: that they can do anything we mean by computation and that we can pass a description of a Turing machine to a Turing machine as input for it to simulate. With these points established, we won't need to talk about the specifics of Turing machines much anymore. There will be little about tapes or states, transitions or head positions. Instead, we will think about computation at a very high level, trusting that if we really had to, we could write out the Turing machine to do it. If we need to write out code, we will do so only in pseudocode or with pictures.
What is there left to talk about? Well, remember from the very first lesson that we argued that not all functions are computable, or as we later said, not all languages can be decided by Turing machines. We're now in a good position to talk about some of these undecidable languages. We had to wait until we established the universality of Turing machines because these languages are going to consist of strings that encode Turing machines. The rest of the lesson will review the definitions of recognizability and decidability, and then we'll talk about the positive side of the story: the languages about Turing machines that we CAN decide or recognize. As we'll see, there are plenty that we can't.

Language Recognizers - (Udacity, Youtube)


Recall that a Turing machine recognizes a language if it accepts every string in the language and does not accept anything that is not in the language; it could either reject or loop. This Turing machine here recognizes the language of binary strings containing a 1, but it loops on those that don't contain a 1.

In order to decide a language, the Turing machine must not only accept every string in the language but must also explicitly reject every string that is not in the language. This machine achieves that by not looping on the blanks.
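The contrast can be modeled in Python. Here a generator stands in for the recognizer's step-by-step behavior (it never stops yielding on strings without a 1, mirroring the machine that loops on the blanks), while the decider always halts with a verdict. Both are my own illustrative stand-ins for the machines in the figures.

```python
def recognizer_steps(x):
    """Model of the contains-a-1 recognizer: yields 'running' step by step,
    yields 'accept' if a 1 is seen, and loops forever (keeps yielding
    'running') once it runs off the end of a string with no 1."""
    for c in x:
        if c == '1':
            yield 'accept'
            return
        yield 'running'
    while True:                  # past the end of the input: loop on blanks
        yield 'running'

def decider(x):
    """The modified machine: detects the end of the string and rejects."""
    return 'accept' if '1' in x else 'reject'
```

On input `000`, `decider` halts with `reject`, while `recognizer_steps` just keeps reporting `running` no matter how many steps you observe.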

Recognizability and Decidability - (Udacity, Youtube)


Ultimately, however, we are not interested in whether a particular Turing machine recognizes or decides a language; rather, we are interested in whether there is a Turing machine that recognizes or decides the language.
Therefore, we say that a language is recognizable if there is a Turing machine that recognizes it, and we say that a language is decidable if there is a Turing machine that decides it.
A language is recognizable if there is a Turing machine that recognizes it.
A language is decidable if there is a Turing machine that decides it.

Now, looking at this, someone might object, "Shouldn't we say recognizable by a Turing machine and decidable by a Turing machine?" Of course, we could, and the statements would still be true. But we don't, the reason being that we strongly believe that if anything can do it, a Turing machine can! That's the Church-Turing thesis.
In an absolute sense, we believe that a language is recognizable by anything only if a Turing machine can recognize it, and a language is decidable by anything only if a Turing machine can decide it, and we use terms that reflect that belief.

Now, other terms are sometimes used instead of recognizable and decidable. Some say that Turing machines compute languages, so to go along with that, they say that a language is computable if there is a Turing machine that computes it.
Another equivalent term for decidable is recursive; mathematicians often prefer this word.
And those who use that term will refer to recognizable languages as recursively enumerable. Some also call these languages Turing-acceptable, and semi-decidable or partially decidable.
We should also make clear the relationship between these two terms. Clearly, if a language is decidable, then it is also recognizable; the same Turing machine works for both. It feels like it should also be true that if a language is recognizable and its complement is also recognizable, then the language is decidable. This is true, but there is a potential pitfall here that we need to make sure to avoid.

Decidability Exercise - (Udacity)

Suppose that we are given one machine M1 that recognizes a language L and another machine M2 that recognizes the complement of L. If we were to ask your average programmer to use these machines to decide the language, his first guess might go something like this.

This program will not decide L, however, and I want you to tell me why. Check the best answer.

Alternating Machines - (Udacity, Youtube)


Here is the alternating trick in some more detail. Suppose that M1 recognizes a language and M2 recognizes its complement. We want to decide the language.
In pseudocode, the alternating strategy might look like this.

In every step, we execute both machines for one more step than in the previous iteration. Note that it doesn't matter if we save the machines' configurations and start where we left off, or start over; the question is whether we get the right answer, not how fast.
The string has to be either in L or in the complement of L, so one of these machines has to halt after some finite number of steps, and when i hits that value, this program will give the right answer.
Overall then, we have the following theorem.
A language L is decidable if and only if L and its complement are both recognizable.
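A Python rendering of the alternating strategy looks like this. The helpers `run_m1(x, i)` and `run_m2(x, i)` are hypothetical stand-ins for "simulate the recognizer on x for i steps," returning `'accept'` once the machine has accepted within that many steps and `'running'` otherwise.

```python
def decide_by_alternation(run_m1, run_m2, x, max_i=10_000):
    """Decide L given a recognizer for L (run_m1) and one for its
    complement (run_m2).

    Since x lies either in L or in its complement, one of the recognizers
    accepts after some finite number of steps, so once i reaches that
    value the loop returns the right answer.  (max_i only exists to keep
    this finite demo from spinning; the real construction needs no bound.)
    """
    for i in range(1, max_i + 1):
        if run_m1(x, i) == 'accept':
            return True       # x is in L
        if run_m2(x, i) == 'accept':
            return False      # x is in the complement of L
    raise RuntimeError("step bound reached")
```

For instance, with toy recognizers that accept strings containing a 1 (after 3 steps) and strings without a 1 (after 5 steps), the loop returns True on `010` and False on `000`.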

Counting States - (Udacity)


Now we're going to go through a series of languages and try to figure out whether they and their complements are recognizable. First, let's examine the set of strings that describe a Turing machine that has at most 100 states. You can assume the particular encoding for Turing machines that we used, but any encoding will serve the same purpose.
Indicate whether you think that L is recognizable and whether L complement is recognizable. We don't have a way of proving that a language is not recognizable yet, so I've labeled the No option as unclear.

Halting on 34 - (Udacity)
Next, we consider the set of Turing machines that halt on the number 34 written in binary. Indicate whether L and L complement
are recognizable.

Accepting Nothing - (Udacity)


Let's consider another language, this time the set of Turing machine descriptions where the Turing machine accepts nothing.
Tell me: is either L or L complement recognizable?

Dovetailing - (Udacity, Youtube)


Here is the dovetailing trick, which lets you run a countable set of computations all at once. Well illustrate the technique for the
case where we are simulating a machine M on all binary strings with this table here.

Every row in the table corresponds to a computation or the sequence of configurations the machine goes through for the given
input. Simulating all of these computation means hitting every entry in this table. Note that we cant just simulate M on the
empty string first or we might just keep going forever, filling out the first row and never getting to the second. This is the same
problem that we encountered when trying to show that a countable union of countable sets is countable, or that the set of
rational numbers is countable.
And the solution is the same too. We go diagonal by diagonal, first simulating the first computation for one step. Then the
second computation for one step and the first computation for two steps, etc. Eventually every configuration in the table is
reached.

Thus, if we are trying to recognize the language of Turing machine descriptions where the Turing machine accepts something,
then a Turing machine in the language must accept some string after a finite number of steps.
That acceptance will correspond to some entry in the table, so we eventually reach it and accept.
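The diagonal scheduling can be sketched in ordinary Python. This is a toy model, not a real Turing-machine simulator: each computation is an iterator, and one call to next() stands in for one simulation step.

```python
import itertools

def dovetail(make_computation):
    """Run countably many computations 'in parallel'.

    make_computation(i) returns an iterator for the i-th computation;
    each next() is one step. On round k we start computation k and give
    one more step to every computation started so far, so every step of
    every computation is eventually reached.
    """
    running = []
    for k in itertools.count():
        running.append((k, make_computation(k)))
        survivors = []
        for i, it in running:
            try:
                yield (i, next(it))       # one more step of computation i
                survivors.append((i, it))
            except StopIteration:         # computation i halted
                pass
        running = survivors

# Toy computations: the i-th one produces 0, 1, ..., i and then halts.
g = dovetail(lambda i: iter(range(i + 1)))
steps = [next(g) for _ in range(30)]
```

Even if some computation ran forever, every other computation would still get its turns; that is exactly why a machine that accepts some string is eventually caught.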

Always Halting - (Udacity)


Let's consider one last language, the set of descriptions of Turing machines that halt on every input. Think carefully, and indicate
whether you think that either L or L complement is recognizable.

Conclusion - (Udacity, Youtube)


In this and the previous lessons, we've developed a set of ideas and definitions that sets the stage for understanding what we
can and cannot solve on a computer.
We've seen the Turing machine, an amazingly simple model that can only move a tape back and forth while reading and writing on
that tape.
We've seen that despite its simplicity this model captures the full power of computers now and forever.
We've seen how to consider Turing machine programs as data themselves and create a universal Turing machine that can simulate
those programs.
And at the end of this lesson, we saw some languages defined using programs as data that don't seem to be easily decidable.

In the next lecture we will show how to prove that many languages cannot be solved by a Turing machine, including the most
famous one, the halting problem.
Undecidability - (Udacity)
Introduction - (Udacity, Youtube)
As a computer scientist, you have almost surely written a computer program that just sits there spinning its wheels when you
run it. You don't know whether the program is just taking a long time or if you made some mistake in the code and the program
is in an infinite loop. You might have wondered why nobody put a check in the compiler that would test your code to see
whether it would stop or loop forever. The compiler doesn't have such a check because it can't be done. It's not that the
programmers are not smart enough, or the computers not fast enough: it is simply impossible to check arbitrary computer
code to determine whether or not it will halt. The best you can do is simulate the program to know when it halts, but if it doesn't
halt, you can never be sure it won't halt in the future.
In this lesson we'll prove this amazing fact and more: not only can you not tell whether a computer program halts, but you can't
determine virtually anything about the output of a computer. We build up to these results starting with a tool we've seen from
our first lecture, diagonalization.

Diagonalization - (Udacity, Youtube)

The diagonalization argument comes up in many contexts and is very useful for generating paradoxes and mathematical
contradictions. To show how general the technique is, let's examine it in the context of English adjectives.

Here I've created a table with English adjectives both as the rows and as the columns. Consider the row to be the word itself
and the column to be the string representation of the word. For each entry, I've written a 1 if the row adjective applies to the
column representation of the word. For instance, "long" is not a long word, so I've written a 0. "Polysyllabic" is a long word, so
I've written a 1. "French" is not a French word, it's an English word, so I've written a 0. And so forth.
So far, we haven't run into any problems. Now, let's make the following definition: a heterological word is a word that expresses
a property that its representation does not possess. We can add the representation to the table without any problems. It is a
long, polysyllabic, non-French word. But when we try to add the meaning to the table, we run into problems. Remember: a
heterological word is one that expresses a property that its representation does not possess. "Long" is not a long word, so it is
heterological. "Polysyllabic" is a polysyllabic word, so it is not heterological, and "French" is not a French word, so it is
heterological.

What about "heterological," however? If we say that it is heterological (causing us to put a 1 here), then it applies to itself and so
it can't be heterological. On the other hand, if we say it is not heterological (causing us to put a zero here), then it doesn't apply
to itself and so it is heterological. So there really is no satisfactory answer here. "Heterological" is not well-defined as an adjective.
For English adjectives, we tend to simply tolerate the paradox and politely say that we can't answer that question. Even in
mathematics the polite response was simply to ignore such questions until around the turn of the 20th century, when
philosophers began to look for a more solid logical foundation for reasoning and for mathematics in particular.
Naively, one might think that a set could be an arbitrary collection. But what about the set of all sets that do not contain
themselves? Is this set a member of itself or not? This paradox, posed by Bertrand Russell, wasn't satisfactorily resolved until the
1920s with the formulation of what we now call Zermelo-Fraenkel set theory.
Or from mathematical logic, consider the statement "This statement is false." If this statement is true, then it says that it is false.
And if this statement is false, then it says so and should be true. It turns out that falsehood in this sense isn't well-defined
mathematically.

At this point, you've probably guessed where this is going for this course. We are going to apply the diagonalization trick to
Turing machines.

An Undecidable Language - (Udacity, Youtube)


Here is the diagonalization trick applied to Turing machines. We'll let M_1, M_2, ... be the set of all Turing machines. Turing
machines can be described with strings, so there are a countable number of them and therefore such an enumeration is
possible. We'll create a table as before. I'll define the function

f(i, j) = 1 if M_i accepts <M_j>
          0 otherwise

For this example, I'll fill out the table in some arbitrary way. The actual values aren't important right now.

Now consider the language L, consisting of string descriptions of machines that do not accept their own descriptions, i.e.

L = {<M> | <M> ∉ L(M)}.


Let's add a Turing machine M_L that recognizes this language to the grid.

Again we run into a problem. The row corresponding to M_L is supposed to have the opposite values of what is on the diagonal.
But what about the diagonal element of this row? What does the machine do when it is given its own description? If it accepts
itself, then <M_L> is not in the language L, so M_L should not have accepted itself. On the other hand, if M_L does not accept its
string representation, then <M_L> is in the language L, so M_L should have accepted its string representation!
Thankfully, in computability, the resolution to this paradox isn't as hard to see as in set theory or mathematical logic. We just
conclude that the supposed machine M_L that recognizes the language L doesn't exist.

Here it is natural to object: "Of course it exists. I just run M on itself and if it doesn't accept, we accept." The problem is that M on
itself might loop, or it might just run for a very long time. There is no way to tell the difference.
The end result, then, is that the language L of string descriptions of machines that do not accept their own descriptions is not
recognizable.
Recall that in order for a language to be decidable, both the language and its complement have to be recognizable. Since L is
not recognizable, it is not decidable, and neither is its complement, the language where the machine does accept its own
description. We'll call this D_TM, D standing for diagonal:

D_TM = {<M> | <M> ∈ L(M)}


These facts are the foundation for everything that we will argue in this lesson, so please make sure that you understand these
claims.
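A finite version of the table makes the claim concrete. The 0/1 values below are arbitrary toy data (any table works); flipping the diagonal produces a row that cannot equal any row of the table.

```python
# f[i][j] = 1 if machine i accepts the description of machine j
# (arbitrary toy values for illustration).
f = [
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
]

# The diagonal language's row: entry j is 1 exactly when machine j
# does NOT accept its own description.
diagonal_row = [1 - f[j][j] for j in range(len(f))]

# Row i always disagrees with diagonal_row in column i, so no machine
# listed in the table recognizes the diagonal language.
for i, row in enumerate(f):
    assert row[i] != diagonal_row[i]
    assert row != diagonal_row
```

The same one-column disagreement is what rules out M_L in the infinite table.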

Dumaflaches - (Udacity)
If you think back to the diagonalization of Turing machines, you will notice that we hardly referred to the properties of Turing
machines at all. In fact, except at the end, we might as well have been talking about a different model of computation, say the
dumaflache. Perhaps, unlike Turing machines, dumaflaches halt on every input. Such models exist: a model that allowed one
step for each input symbol would satisfy this requirement.
How do we resolve the paradox then? Can't we just build a dumaflache that takes the description of a dumaflache as input and
then runs it on itself? It has to halt, so we can reject if it accepts and accept if it rejects, achieving the needed inversion. What's
the problem? Take a minute to think about it.

Mapping Reductions - (Udacity, Youtube)


So far, we have only proven that one language is unrecognizable. One technique for finding more is the mapping reduction, where
we turn an instance of one problem into an instance of another.
Formally, we say a language A is mapping-reducible to a language B (A ≤_M B) if there is a computable function f where for every
string w,

w ∈ A ⟺ f(w) ∈ B.

We write this relation between languages with the less-than-or-equal-to sign with a little M on the side to indicate that we are
referring to mapping reducibility.
It helps to keep in your mind a picture like this.

On the left, we have the language A, a subset of Σ*, and on the right we have the language B, also a subset of Σ*.
In order for the computable function f to be a reduction, it has to map

each string in A to a string in B, and

each string not in A to a string not in B.

The mapping doesn't have to be one-to-one or onto; it just has to have this property.
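As a toy illustration of the definition (using numbers rather than strings, with languages invented just for this example), here is a computable f with the required property for A = even naturals and B = odd naturals:

```python
def f(n):                      # the reduction: clearly computable
    return n + 1

def in_A(n):                   # A = even naturals
    return n % 2 == 0

def in_B(n):                   # B = odd naturals
    return n % 2 == 1

# w in A  <=>  f(w) in B, spot-checked over a finite range
assert all(in_A(n) == in_B(f(n)) for n in range(1000))
```

This f happens to be one-to-one, but as the text notes, a reduction need not be one-to-one or onto.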

Some Trivial Reductions - (Udacity, Youtube)


Before using reductions to prove that certain languages are undecidable, it sometimes helps to get some practice with the idea
of a reduction itself, as a kind of warm-up. With this in mind, we've provided a few programming exercises. Good luck!

EVEN <= {Even} - (Udacity)


{John} <= Complement of {John} - (Udacity)
{Jane} <= HALT - (Udacity)
Reductions and (Un)decidability - (Udacity, Youtube)
Now that we understand reductions, we are ready to use them to help us prove decidability and, even more interestingly,
undecidability.
Suppose then that we have a language A that reduces to B (i.e. A ≤_M B), and let's say that I want to know whether some
string x is in A.
If there is a decider for B, then I'm in luck. I can use the reduction, which is a computable function that takes in one string
and outputs another. I just need to feed in x, take the output, and feed that into the decider for B. If the decider accepts, then I know
that x is in A. If it rejects, then I know that it isn't.

This works because, by the definition of a reduction, x is in A if and only if R(x) is in B. And by the definition of a decider, this is
true if and only if D accepts R(x). Therefore, the output of D tells me whether x is in A. If I can figure out whether an arbitrary
string is in B, then by the properties of the reduction, this also lets me figure out whether a string is in A. We can say that the
composition of the reduction with the decider for B is itself a decider for A.
Thus, the fact that A reduces to B has four important consequences for decidability and recognizability. The easiest to see are:
If B is decidable, then A is also decidable. (As we've seen, we can just compose the reduction with the decider for B.)
If B is recognizable, then A is also recognizable. (Same logic as above.)

The other two consequences are just the contrapositives of these.


If A is undecidable, then B is undecidable. (The composition of the reduction and a decider for B can't be a decider for A. Since we
are assuming that there is a reduction, the only possibility is that the decider for B doesn't exist. Hence, B is undecidable.)
If A is unrecognizable, then B is unrecognizable. (Same logic as above.)
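The composition argument can be sketched with toy languages invented for this illustration (real deciders are Turing machines, not Python functions, so this only shows the shape of the argument):

```python
def R(n):                 # reduction from A = even naturals to B = odd naturals
    return n + 1

def decider_B(n):         # an assumed decider for B
    return n % 2 == 1

def decider_A(n):         # a decider for A, obtained purely by composition
    return decider_B(R(n))

assert decider_A(4)       # 4 is in A
assert not decider_A(7)   # 7 is not in A
```

Running the contrapositive in reverse: if no decider_A can exist, and R exists, then decider_B cannot exist either.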

Remember the Consequences - (Udacity)


Let's do a quick question on the consequences of there being a reduction between two languages.

A Simple Reduction - (Udacity, Youtube)


Now we are going to use a simple reduction to show that the language B, consisting of the descriptions of Turing machines that
accept something, i.e.

B = {<M> | L(M) ≠ ∅},

is undecidable.

Our strategy is to reduce the diagonal language to it. In other words, we'll argue that deciding B is at least as hard as deciding
the diagonal language. Since we can't decide the diagonal language, we can't decide B either.
Here is one of many possible reductions.

The reduction is a computable function whose input is the description of a machine M, and it's going to build another machine
N in this python-like notation. First, we write down the description of a Turing machine by defining this nested function. Then we
return that function. An important point is that the reduction never runs the machine N: it just writes the program for it!
Note here that, in this example, N totally ignores the actual input that is given to it. It just accepts if M accepts its own description;
otherwise, it loops or rejects. Hence, N is either going to be a machine that accepts everything or a machine that doesn't accept
anything, depending on the behavior of M.
In other words, the language of N will be the empty set in one case and Sigma-star in the other. A decider for B would be able to
tell the difference, and therefore tell us whether M accepted its own description. Therefore, if B had a decider, we would be able
to decide the diagonal language, which is impossible. So B cannot be decidable.
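In the python-like notation of the lesson, such a reduction might be sketched as below. Here simulate, accept, and loop are hypothetical machine primitives, not real functions; the key structural point is that the reduction only constructs N and returns it, never running it.

```python
def reduction(M):
    def N(x):                # N ignores its input x entirely
        if simulate(M, M):   # hypothetical: runs M on <M>; may loop forever
            accept()         # M accepted itself, so L(N) = Sigma-star
        loop()               # otherwise L(N) is the empty set
    return N                 # the program for N is written, never run

# Building N is always safe: nothing inside N's body executes here.
N = reduction("<description of some machine M>")
assert callable(N)
```

Because reduction never calls N, it terminates on every input, which is exactly what "computable function" requires of a mapping reduction.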

The Halting Problem - (Udacity, Youtube)


Next, we turn to the question of halting. As we have seen, not being able to tell whether a program will halt or not plays a central
role in the diagonalization paradox, and it is at least partly intuitive that we can't tell whether a program is just taking a long time
or if it will run forever. It shouldn't be surprising, then, that given an arbitrary program-input pair, we can't decide whether the
program will halt on that input. But actually, the situation is much more extreme. We can't even decide if a program will halt
when it is given no input at all: just the empty string.
Let's go ahead and prove this:
The language of descriptions of Turing machines that halt on the empty string, i.e.

H_TM = {<M> | M halts on ε},

is undecidable.

We'll do this by reducing from the diagonal language. That is, we'll show the halting problem is at least as hard as the diagonal
problem. Here is one of many possible reductions.

The reduction creates a machine N that simply ignores its input and runs M on itself. If M rejects itself, then N loops. Otherwise,
N accepts.
At this point it might seem we've just done a bit of symbol manipulation, but let's step back and realize what we've just seen. We
showed that no Turing machine can tell whether or not a computer program will halt or remain in a loop forever. This is a
problem that we care about, and we can't solve it on a Turing machine or any other kind of computer. You can't solve the
halting problem on your iPhone. You can't solve the halting problem on your desktop, no matter how many cores you have. You
can't solve the halting problem in the cloud. Even if someone invents a quantum computer, it won't be able to solve the halting
problem. To misquote Nick Selby: If you want to solve the halting problem, you're at Georgia Tech, but you still can't do that!

Filtering - (Udacity, Youtube)


So far, the machines we've made in our reductions (i.e. the N's) have been relatively uncomplicated: they all either accepted
every string or no strings. Unfortunately, reductions can't always be done that way, since the machine that always loops and the
machine that always accepts might both be in, or both not be in, the language we're reducing to. In these cases, we need N to
pay attention to its input. Here is an example where we will need to do this: the language of descriptions of Turing machines
where the Turing machine accepts exactly one string, i.e.

S = {<M> | |L(M)| = 1}.


It doesn't make much of a difference which undecidable language we reduce from, so this time we will reduce from the halting
problem to S. Again, there are many possible reductions. Here is one.

We run the input machine M on the empty string. If M loops, then so will N: we don't accept one string, we accept none. On
the other hand, if M does halt on the empty string, then we make N act like a machine in the language S. The empty string is as
good as any, so we'll test whether N's input x is equal to it and accept or reject accordingly. This works because if M halts on
the empty string, then N accepts just one string (the empty one) and so is in S. On the other hand, if M doesn't halt on the
empty string, then N won't halt on (and therefore won't accept) anything, and therefore N isn't in S.
In the one case, the language of N is {ε}. In the other case, the language of N is the empty set. A decider for the
language S can tell the difference, and therefore we'd be able to decide whether M halted on the empty string or not. Since this is
impossible, a decider for S cannot exist.
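The same sketch style as before works here, again with simulate, accept, and reject as hypothetical machine primitives. The difference is that N now inspects its input x:

```python
def reduction(M):
    def N(x):
        simulate(M, "")      # hypothetical: loops forever if M loops on ""
        if x == "":          # M halted: accept exactly one string
            accept()
        else:
            reject()
    return N                 # constructed, never run, by the reduction

# Again, building N never executes its body.
N = reduction("<description of some machine M>")
assert callable(N)
```

If M halts on the empty string, L(N) = {""} and |L(N)| = 1; if M loops, L(N) is empty, matching the two cases in the argument above.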

Not Thirty-Four - (Udacity)


Now you get a chance to practice doing reductions on your own. I've been using a python-like syntax in the examples, so this
shouldn't feel terribly different. I want you to reduce the language LOOP to the language L of Turing machines that do not
accept 34 in binary.
As we'll see later in the lesson, it's not possible for the Udacity site to perfectly verify your software, but if you do a
straightforward reduction, then you should pass the tests provided.

0n 1n - (Udacity)
Now for a slightly more challenging reduction. Reduce H { } to the language L of Turing machines that accept strings of the
n n
form 0 1 .

Which Reductions Work? - (Udacity)


We've seen a few examples, and you've practiced writing reductions of your own at this point. Now, I want you to test your
understanding by telling me which of the following statements about these two languages is true. Think very carefully.

Rices Theorem - (Udacity, Youtube)


Once you have gained enough practice, these reductions begin to feel a little repetitive, and it's natural to wonder whether there
is a theorem that would capture them all. Indeed, there is, and it is traditionally called Rice's theorem after H.G. Rice's 1953
paper on the subject. This is a very powerful theorem, and it implies that we can't say anything about a computer just based on
the language that it recognizes.
So far, the pattern has been that we have wanted to show that some language L was undecidable, where this language L consisted
of descriptions of Turing machines whose language has a certain property. It's important to note here that the language L
can't depend on the machine M itself, only on the language it recognizes.
Two important things have been true about this language.

1. Membership can only depend on the set of strings accepted by M, not on the machine M itself (like the number of states or
something like that).
2. The language can't be trivial, either including or excluding every Turing machine. We'll assume that there is a machine M_1 in
the language and another, M_2, outside the language.

Recall that in all our reductions, we created a machine N that either accepts nothing or else has some other behavior
depending on the behavior of the input machine M.
Similarly, there are two cases for Rice's theorem: either the empty set has the property P that defines L, and therefore every
machine that doesn't accept anything is in the language L, or else the empty set is not in P. Let's look at the case where the
empty set is not in P first.

In that case, we reduce from H_TM. The reduction looks like this: N just runs M with empty input. If M halts, then we define N to
act like the machine M_1.
Thus, N acts like M_1 (a machine in the language L) if M halts on the empty string, and loops otherwise. This is exactly what we
want.
Now for the other case, where the empty set is in P.

In this case, we just replace M_1 by M_2 in the definition of the reduction, so that N behaves like M_2 if M halts on the empty
string.
This is fine, but we need to reduce from the complement of H_TM: that is, from the set of descriptions of machines that loop on
the empty input. Otherwise, we would end up accepting when we wanted to not accept and vice versa.
All in all then, we have proved the following theorem.

Slightly more intuitively, we can say the following. Let L be a set of strings representing Turing machines having two key
properties:
1. If M_1 and M_2 accept the same set of strings, then their descriptions are either both in or both out of the language. This just
says that the language only depends on the behavior of the machine, not its implementation.
2. The language can't be trivial: there must be a machine whose description is in the language and a machine whose description is
not in the language.

If these two properties hold, then the language is undecidable.

Undecidable Properties - (Udacity)


Using Rice's theorem, we now have a quick way of detecting whether certain questions are decidable. Use your knowledge of
the theorem to indicate which of the following properties is decidable. For clarity's sake, let's say that a virus is a computer
program that modifies the data on the hard disk in some unwanted way.

Conclusion - (Udacity, Youtube)


I want to end this section on computability by revisiting the scene we used at the beginning of the course, where the teacher
was explaining functions in terms of ordered pairs, and the students were thinking of what they had to do to x to get y.

The promise of our study of computability was to better appreciate the difference between these understandings, and I hope
you will agree that we have achieved that. We have seen, by a counting argument, that there are many functions that are not
computable in any ordinary sense of the word. We made precise what we meant by computation, going all the way back to Turing's
inspiration from his own experience with pen and paper to formalize the Turing machine. We have seen how this model can
compute anything that any computer today, or envisioned for tomorrow, can. And lastly, we have described a whole family of
uncomputable functions through Rice's theorem.

Towards Complexity - (Udacity, Youtube)


Ever since the pioneering work of Turing and his contemporaries such as Alonzo Church and Kurt Gödel, mathematical
logicians have studied the power of computation, connecting it to provability as well as giving new insights on the nature of
information, randomness, even philosophy and economics. Computability has a downside: just because we can solve a
problem on a Turing machine doesn't mean we can solve it quickly. What good is having a computer solve a problem if our sun
explodes before we get the answer?

So for now we leave the study of computable functions and languages and move to computational complexity, trying to
understand the power of efficient computation and the famous P versus NP problem.
P and NP - (Udacity)
Introduction - (Udacity, Youtube)
So far in the course, we have answered the question: what is computable? We modeled computability by Turing machines and
showed that some problems, like the halting problem, cannot be computed at all.
In the next few lectures we ask: what can we compute quickly? Some problems, like adding a bunch of numbers or solving
linear equations, we know how to solve quickly on computers. But how about playing the perfect game of chess? We can write
a computer program that searches the whole game tree, but that computation won't finish in our lifetime or even the lifetime of
the universe. Unlike computability, we don't have clean theorems about efficient computation. But we can explore what we
probably can't solve quickly through what is known as the P versus NP problem.

P represents the problems we can solve quickly, like your GPS finding a short route to your destination. NP represents problems
where we can check that a solution is correct, such as solving a Sudoku puzzle. In this lesson we will learn about P, NP, and
their role in helping us understand what we may or may not be able to solve quickly.

Friends or Enemies - (Udacity, Youtube)


We'll illustrate the distinction between P and NP by trying to analyze a world of friends and enemies. Everyone in this world is
either a friend or an enemy, and we'll represent this by drawing an edge between all of the friends, like so.

Given all this information, there are several types of analysis that you might want to do, some easier and some harder. For
instance, if you wanted to run a dating service, you would be in pretty good shape. Say that you wanted to maximize the
number of matches that you make and hence the number of happy customers. Or perhaps you just want to know if it's possible
to give everyone a date. Well, we have efficient algorithms for finding matchings, and we'll see some in a future lesson. Here, in
this example, it's possible to match everyone up, and such a matching is fairly easy to find.

Contrast this with the problem of identifying cliques. By a clique, I mean a set of people who are all friends with each other. For
instance, here is a clique of size three: every pair of members has an edge between them, and cliques of that size aren't too hard to
find.

As we start to look for larger cliques, however, the problem becomes harder and harder to solve.
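A brute-force search makes the blow-up visible: trying every group of k people costs on the order of n-choose-k checks, which explodes as the graphs and cliques grow. This is a plain illustration on a toy graph, not a clever algorithm.

```python
from itertools import combinations

def has_clique(edges, vertices, k):
    """Check whether some k vertices are pairwise connected."""
    adj = {frozenset(e) for e in edges}
    return any(
        all(frozenset(pair) in adj for pair in combinations(group, 2))
        for group in combinations(vertices, k)   # n-choose-k candidate groups
    )

# Toy friendship graph: a triangle 0-1-2 plus an extra edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
assert has_clique(edges, range(4), 3)        # the triangle
assert not has_clique(edges, range(4), 4)    # no clique of size four
```

For k around n/2 the number of candidate groups grows exponentially in n, which is why this approach stops scaling quickly.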

Find a Clique - (Udacity)


In fact, finding a clique of size four, even for a relatively small graph like this one, isn't necessarily easy. See if you can find one.

P and NP - (Udacity, Youtube)


Being a little more formal now, we define P to be the set of problems solvable in polynomial time. By polynomial time, we mean
that the number of Turing machine steps is bounded by a polynomial. We'll formalize this more in a moment. Bipartite
matching was one example of this class of problems.
NP we define as the class of problems verifiable in polynomial time. This includes everything in P, since if a problem can be
solved in polynomial time, a potential solution can be verified in that time, too. Most computer scientists strongly believe that
this containment is strict. That is to say, there are some problems that are efficiently verifiable but not efficiently solvable, but we
don't have a proof of this yet. The clique problem that we encountered is one of the problems that belongs in NP but, we think,
does not belong in P.
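Verification, by contrast, is cheap: given a proposed clique, we only need to check every pair inside it, which is polynomial in the size of the input. Again a toy illustration:

```python
from itertools import combinations

def verify_clique(edges, candidate):
    """Check a proposed clique: roughly k*(k-1)/2 edge lookups."""
    adj = {frozenset(e) for e in edges}
    return all(frozenset(pair) in adj for pair in combinations(candidate, 2))

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
assert verify_clique(edges, [0, 1, 2])       # a genuine clique
assert not verify_clique(edges, [0, 1, 3])   # (1, 3) is not an edge
```

This gap, exponential search versus polynomial checking, is the intuition behind putting clique in NP but (we believe) not in P.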

There is one more class of problems that we'll talk about in this section on complexity, and that is the set of NP-complete
problems. These are the hardest problems in NP, and we call them the hardest because any problem in NP can be efficiently
transformed into an NP-complete problem. Therefore, if someone were to come up with a polynomial algorithm for even one
NP-complete problem, then P would expand out in this diagram, making P and NP into the same class. Finding a polynomial
solution for clique would do this, so we say that clique is NP-complete. Since solving problems doesn't seem to be as easy as
checking answers to problems, we are pretty sure that NP-complete problems can't be solved in polynomial time, and therefore
that P does not equal NP.
To computer science novices the difference between matching and clique might not seem to be a big deal, and it is surprising
that one is so much harder than the other. In fact, the difference between a polynomially solvable problem and an NP-complete
one can be very subtle. Being able to tell the difference is an important skill for anyone who will be designing algorithms for the
real world.

Delicacy of Tractability - (Udacity, Youtube)


As we discussed, one way to see the subtlety of the difference between problems in P and those that are NP-complete is to
compare what it takes to solve seemingly similar real-world problems.
Consider the shortest path problem. You are given two locations and you want to find the shortest valid route between them.
Your phone does this in a matter of milliseconds when you ask it for directions, and it gives you an exact answer according to
whatever model for distance it is using. This is computationally tractable.
On the other hand, consider this warehouse scenario where a customer places an order for several different items and a person
or a robot has to go around and collect them before going to the shipping area for them to be packed. This is called the
traveling salesman problem, and it is NP-complete. This problem also comes up in the unofficial guide to Disney World, which
tries to tell you how to get to all the rides as quickly as possible.

This explains why your phone can give you directions, but supply chain logistics (just figuring out how things should be routed)
is a billion dollar industry.

Actually, however, we don't even need to change the shortest path problem much to get an NP-complete problem. Instead of asking
for the shortest path, we could ask for the longest simple path. We have to say simple so that we don't just go around in cycles
forever.
This isn't the only possible pairing of similar P and NP-complete problems either. I'm going to list some more. If you aren't
familiar with these problems yet, don't worry. You will learn about them by the end of the course. Vertex cover in bipartite graphs
is polynomial, but vertex cover in general graphs is NP-complete.
A class of optimization problems called linear programming is in P, but if we restrict the solutions to integers, then we get an
NP-complete problem. Finding an Eulerian cycle in a graph, where you touch each edge once, is polynomial. On the other hand,
finding a Hamiltonian cycle that touches each vertex once is NP-complete.
And lastly, figuring out whether a boolean formula with two literals per clause is satisfiable is polynomial, but if there are three
literals per clause, then the problem is NP-complete.

Unless you are familiar with some complexity theory, problems in P aren't always easy to tell from those that are NP-complete.
Yet, in the real world, when you encounter a problem it is very important to know which sort of problem you are dealing with. If
your problem is like one of the problems in P, then you know that there should be an efficient solution, and you can avail yourself
of the wisdom of many other scientists who have thought hard about how to efficiently solve these problems. On the other
hand, if your problem is like one of the NP-complete problems, then some caution is in order. You can expect to be able to find
exact solutions for small enough instances, and you may be able to find a polynomial algorithm that will give an approximate
solution that is good enough, but you should not expect to find an exact solution that will scale well for all cases.
Being able to know which situation you are in is one of the main practical benefits of studying complexity.

Running Time Analysis - (Udacity, Youtube)


Now, we are going to drill down into the details and make many of the notions we've been talking about more precise. First we
need to define running time.
We'll let M be a Turing machine (single-tape, multi-tape, random access: the definition works for all of them). The running time of
the machine is then a function f over the natural numbers, where f(n) is the largest number of steps taken by the machine on an
input string of length n. We can extend this definition to machines that don't halt as well by making their running time infinite. We
always consider the worst case.

Let's illustrate this idea with an example. Consider the single-tape machine in the figure above that takes binary input and tests
whether the input contains a 1. Let's figure out the running time for strings of length 2. We need to consider all the possible
strings of length 2, so we make a table and count the number of steps. The largest number of steps is 3, where we read both
zeros and then the blank symbol. Therefore, f(2) = 3.
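The worst-case count can be reproduced with a small stand-in for the machine. This is a sketch, where one loop iteration models one step of the scan-for-a-1 machine:

```python
from itertools import product

def steps(w):
    """Steps taken by the 'does the input contain a 1?' machine on input w."""
    count = 0
    for symbol in w + "_":    # "_" models the blank cell after the input
        count += 1
        if symbol == "1":     # halt (accept) as soon as a 1 is read
            break
    return count

def f(n):
    """Worst case over all binary inputs of length n."""
    return max(steps("".join(bits)) for bits in product("01", repeat=n))

assert steps("10") == 1       # halts on the very first symbol
assert f(2) == 3              # worst case is "00": read 0, 0, then the blank
```

Maximizing over all inputs of each length is exactly the "worst case" in the definition above.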

Asymptotic Analysis - (Udacity, Youtube)


Now, it's not very practical to write down every running time function exactly, so computer scientists use various levels of
approximation. For complexity, we use asymptotic analysis. We'll do a very brief review here, but if you haven't seen this idea
before, you should take a few minutes to study it on your own before proceeding with the lesson.
In words, the set O(f(n)) is the set of functions g such that g(n) ≤ c·f(n) for sufficiently large n. Making this notion of sufficiently
large n precise, we end up with

O(f(n)) = {g(n) | ∃ c, N such that for every n ≥ N, g(n) ≤ c·f(n)}.

Even though we've defined O(f(n)) as a set, we write g(n) = O(f(n)) instead of using the inclusion sign. We also say that g is
order f.
This definition can be a little confusing, but it should feel like the definition of a limit from your calculus class. In fact, we can
restate the condition as saying that the limsup of the ratio of g over f is bounded by a constant.
An example also helps. Take the function g(n) = n^2 - n + 10.

We can argue that this is order n^2 by choosing c = 1 and N = 10. For every n ≥ 10, n^2 is greater than or equal to g(n).

We also could have chosen c = 10 and N = 1.

Note that the big-O notation does not have to give a tight bound. Thus, g = O(n^3) too. Setting c = 1 and N = 3 works for
this.
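These constant choices are easy to check numerically. Here is a small sketch (my own, not from the lesson) that tests each (c, N) pair over a finite range; of course a finite check only spot-tests the asymptotic claim, which the algebra above actually proves.

```python
# Numeric spot-check of the witnesses chosen for g(n) = n^2 - n + 10.

def g(n):
    return n * n - n + 10

# c = 1, N = 10 witnesses g(n) = O(n^2)
assert all(g(n) <= 1 * n**2 for n in range(10, 1000))

# c = 10, N = 1 also works
assert all(g(n) <= 10 * n**2 for n in range(1, 1000))

# c = 1, N = 3 witnesses the looser bound g(n) = O(n^3)
assert all(g(n) <= 1 * n**3 for n in range(3, 1000))

print("all bounds hold on the tested range")
```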

Big O Question - (Udacity)


Once we've established the running time for an algorithm, we can analyze other algorithms that use it as a subroutine much
more easily. Consider this question. Suppose that algorithm A has running time O(n), and A is called by algorithm B O(log n)
times, and algorithm B runs for an additional O(log^2 n) time afterwards. What is the tightest bound on the running time of
B?

The Class P - (Udacity, Youtube)


We are now ready to formally define the class P. Most precisely,

P is the set of languages recognized by an order n^k deterministic Turing machine for some natural number k.

There are several important things to note about this definition.


First is that P is a set of languages. Intuitively, we talk about it as a set of problems, but to be rigorous we have to ultimately define
it in terms of languages.
Second is the word deterministic. We haven't seen a non-deterministic Turing machine yet, but one is coming. Deterministic just
means that given a current state and tape symbol being read, there is only one transition for the machine to follow.
An n^k-time machine here is one with running time of order n^k, as we defined these terms a few minutes ago.

Perhaps the most interesting thing about this definition is the choice of any k in the natural numbers. Why is this the right
definition? After all, if k is 100 then deciding the language isn't tractable in practice.
The answer is that P doesn't exactly capture what is tractable in practice. It's not clear that any mathematical definition would
stand the test of time in this regard, given how often computers change, or be relevant in so many contexts. This choice does
have some very nice properties, however.
1. It matches tractability better than one might think. In practice, k is usually low for polynomial algorithms, and there are plenty of
interesting problems not known to be in P.
2. The definition is robust to changes to the model. That is to say, P is the same for single-tape, multi-tape machines, Random
Access machines and so forth. In fact, we pointed out that the running times for each of those models are polynomially related
when we introduced them.
3. P has the nice property of closure under the composition of algorithms. If one algorithm calls another polynomial algorithm as a
subroutine a polynomial number of times, then that algorithm is still polynomial, and the problem it solves is in P. In other words, if we do
something efficient a reasonably small number of times, then the overall solution will be efficient. P is exactly the smallest class of
problems containing linear-time algorithms which is closed under composition.

Problems and Encodings - (Udacity, Youtube)


We've defined P as a set of languages, but ultimately, we want to talk about it as a set of problems. Unfortunately, this isn't as
easy as it might seem. The encoding rules we use for turning abstract problems into strings can affect whether the language is
in P or not. Let's see how this might happen.
Consider the question: does G have a Hamiltonian cycle, that is, a cycle that visits all of the vertices? Here is a graph and here is its
adjacency matrix.

A natural way to represent the graph as a string is to write out its adjacency matrix in scanline order as done in the figure above.
But this isn't the only way to encode the graph. We might do something rather inefficient.
The scanline encoding for this graph represents the number 170 in binary. We could choose to represent the graph in essentially
unary.
We might represent the graph as 342 zeros followed by 170 ones. The fact that there are 2^9 = 512 symbols total indicates that it's a
3x3 matrix, and converting 170 back into binary gives us the entries of the adjacency matrix.

This is a very silly encoding, but there is nothing invalid about it.
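Both encodings can be made concrete with a short sketch. This is my own illustration; the 9-bit scanline string below is an assumption about the example matrix, chosen so that it reads as 170.

```python
# The concise scanline encoding vs. the "silly" padded encoding of the
# same 3x3 adjacency matrix.

bits = "010101010"            # 3x3 adjacency matrix in scanline order (9 bits)
n = int(bits, 2)
assert n == 170

# Concise encoding: just the 9 bits themselves.
concise = bits

# Padded encoding: 2^9 = 512 symbols total, with n of them ones.
total = 2 ** len(bits)
padded = "0" * (total - n) + "1" * n
assert len(padded) == 512 and padded.count("1") == 170

# Decoding the padded string recovers the same matrix: the total length
# tells us the matrix has 9 entries, and the count of ones gives 170.
k = len(padded).bit_length() - 1          # 9 matrix entries
recovered = format(padded.count("1"), f"0{k}b")
assert recovered == bits
print(recovered)
```

The padded input is exponentially longer than the concise one, which is exactly why a slow algorithm can look "polynomial" relative to it.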
This language, it turns out, is in P, not because it allows the algorithm to exploit any extra information or anything like that, but
just because the input is so long. The more sensible, concise encoding isn't known to be in P (and probably isn't, by an
overwhelming consensus of complexity theorists). Thus, a change in encoding can affect whether a problem is in P, yet it's
ultimately problems that we are interested in, independent of the particulars of the encoding.
We deal with this problem essentially by ignoring unreasonable representations like this one. As long as we consider any
reasonable encoding (think about what xml or json would produce from how you would store it in computer memory), the
particulars won't change the membership of the language in P, and hence we can talk at least informally about problems being
in P or not.

Which are in P - (Udacity)


Now that we have defined P, I want to illustrate how easy it can be to recognize that a problem is in P. Recognizing that a
problem is NOT in P is a little harder, so for this exercise assume that if the brute-force algorithm is exponential, so is the best
algorithm. Consider a finite set U of the integers and the following three problems. Check the ones that you think are in P.
(Again, if there isn't an obvious polynomial-time algorithm, assume that there isn't one.)

Nondeterministic TMs - (Udacity, Youtube)


From the class P, we now turn to the class NP. At the beginning of the lesson, I said that NP is the class of problems verifiable in
polynomial time. This is true, but it's not how we typically define it. Instead, we define NP as the class of problems solvable in
polynomial time on a nondeterministic Turing machine, a variant that we haven't encountered before.
Nondeterminism in computer science is often misunderstood, so put aside whatever associations you might have had with the
word. Perhaps the best way to understand nondeterministic Turing machines is by contrasting a nondeterministic computation
with a deterministic one. A deterministic computation starts in some initial state, and then the next state is uniquely
determined by the transition function. There is only one possible successor configuration.
And to that configuration there is only one possible successor.
And so on and so forth, until an accepting or rejecting configuration is reached, if one is reached at all.

On the other hand, in a nondeterministic computation, we start in a single initial configuration, but it's possible for there to be
multiple successor configurations. In effect, the machine is able to explore multiple possibilities at once. This potential splitting
continues at every step. Sometimes there might be just one possible successor state, sometimes there might be three or more.
For each branch, we have all the same possibilities as for a deterministic machine.
It can reject.
It can loop forever.
It can accept.

If the machine ever accepts in any of these branches, then the whole machine accepts.
The only change we need to make to the 7-tuple of the deterministic Turing machine to make it nondeterministic is to
modify the transition function. An element of the range is no longer a single (state, tape-symbol to write, direction to
move) tuple, but a set of all such possibilities.

δ : Q × Γ → {S | S ⊆ Q × Γ × {L, R}}

This set of all subsets used in the range here is often called a power set.
The only other change that needs to be made is in when the machine accepts. It accepts if there is any valid sequence of
configurations that results in an accepting state. Naturally, then, it rejects only when every branch reaches a reject state. If
there is a branch that hasn't rejected yet, then we need to keep computing in case it accepts.
Therefore, a nondeterministic machine that never accepts and that loops on at least one branch will loop.

Which Language - (Udacity)


Here is the state transition diagram for a simple nondeterministic Turing machine. The machine starts out in q0, and then it can
move the head to the right on 0s and 1s, OR, on a 1, it can transition to state q1.
The fact that there are two transitions out of state q0 when reading a 1 is the nondeterministic part of the machine. In branches
where the transition to state q1 is followed, the Turing machine reads one more 0 or 1 and then expects to hit the end of the
input. Remember that by convention, if there is no transition specified in one of these state diagrams, then the machine simply
moves to the reject state and halts. That keeps the diagrams from getting cluttered.
My question to you then is: what language does this machine recognize? Check the appropriate answer.

Composite Numbers - (Udacity, Youtube)


To get more intuition for the power of nondeterminism, let's see how much more efficient it makes deciding the language of
composite numbers, that is, numbers that are not prime. The task is to decide the string representations of numbers that are the
product of two integers greater than 1.
One deterministic solution looks like this.

Think of the flow diagram as capturing various modules within the deterministic Turing machine. We start by initializing some
number p to 1. Then we increment it and test whether p squared is greater than x. If it is, then trying larger values of p won't
help us, and we can reject. If p squared is no larger than x, however, then we test to see if p divides x. If it does, we accept. If
not, then we try the next value of p.
Each iteration of this loop requires a number of steps that is polynomial in the number of bits used to represent x. The trouble is
that we might end up needing on the order of the square root of x iterations of this outer loop in order to find the right p or confirm
that one doesn't exist. This is what makes the deterministic algorithm slow. Since the value of x is exponential in its input size
(remember that it is represented in binary), this deterministic algorithm is exponential.

On the other hand, with nondeterminism we can do much better. We initialize p so that it is represented on its own tape as the
number 1 written in binary. Then we nondeterministically modify p. By having two possible transitions for the same state and
symbol pair, we can non-deterministically append a bit to p. (The non-deterministic transitions are in orange.)
Next, we check to see if we have made p too large. If we did, then there is no point in continuing, so we reject.
On the other hand, if p is not too big, then we nondeterministically decide either to
append a zero to p ,
append a 1 to p , or

leave p as it is and go see if it divides x.

If there is some p that divides x, then some branch of the computation will set p accordingly. That branch will accept, and so the
whole machine will. On the other hand, if no such p exists, then no branch will accept, and the machine won't either. In fact, the
machine will always reject, because every branch of computation will be rejected in one of the two possible places.
This nondeterministic strategy is faster because it only requires log x iterations of this outer loop. The divisor p is set one bit at
a time and can't use more bits than x, the number it's supposed to divide.
Thus, while the deterministic algorithm we came up with was exponential in the input length, it was fairly easy to come up with a
nondeterministic one that was polynomial.
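The deterministic flow diagram above translates directly into code. This sketch is my own illustration, not from the lecture; the function name is made up.

```python
# Trial division, following the flow diagram: initialize p = 1, increment,
# stop once p*p > x.  Each iteration is cheap, but the number of iterations
# can be about sqrt(x), which is exponential in the bit length of x.

def is_composite(x):
    p = 1
    while True:
        p += 1
        if p * p > x:          # larger p cannot yield a new factor pair
            return False       # x has no divisor in (1, x): not composite
        if x % p == 0:         # p divides x, so x = p * (x // p)
            return True

assert is_composite(15)        # 3 * 5
assert not is_composite(13)    # prime
print(is_composite(2 ** 16 + 1))
```

Note that 2^16 + 1 = 65537 is prime, so the loop runs all the way up to p ≈ 256 before rejecting, illustrating the sqrt(x) behavior.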

The Class NP - (Udacity, Youtube)


We are almost ready to define the class NP. First, however, we need to define running time for a nondeterministic machine,
because it operates differently from a deterministic one.
Since we think about all of these possible computations running in parallel, the running time for each computational path is the
path length from the initial configuration.

And the running time of the machine as a whole is the maximum number of steps used on any branch of the computation. Note
that once we have a bound on the length of any accepting configuration sequence, we can avoid looping by just creating a
timeout.

NP is the set of languages recognized by an O(n^k) time nondeterministic Turing machine for some natural number k.

In other words, it's the set of languages recognized in polynomial time by a nondeterministic machine. NP stands for
nondeterministic polynomial time.
Nondeterminism can be a little confusing, but it helps to remember that a string is recognized if it leads to any accepting
computation, i.e. any accepting path in this tree. Note that any Turing machine that is a polynomial recognizer for a language
can easily be turned into a polynomial decider by adding a timeout, since all accepting computations are bounded in length by a
polynomial.

NP Equals Verifiability Intuition - (Udacity, Youtube)


At the beginning of the lesson, we identified NP as those problems for which an answer can be verified in polynomial time.
Remember how easy it was to check that a clique was a clique? In the more formal treatment, however, we defined NP as those
languages recognized by nondeterministic machines in polynomial time. Now we get to see why these mean the same thing.
To get some intuition, we'll revisit the example of finding a clique of size 4. We already discussed how a clique of size 4 is easy
to verify, but how is it easy to find with a non-deterministic machine? The key is to use the non-determinism to create a
branch of computation for each subset of size 4, and then use our verification strategy to decide whether any of those subsets
corresponds to a clique.
Remember that if any sequence of configurations accepts, then the non-deterministic machine accepts.
One branch of computation might choose these 4 and then reject because its not a clique.

Another branch might choose these 4 and also reject.

But one branch will choose the correct subset and this will accept.

And, that's all we need. If one branch accepts then the whole non-deterministic machine does, as it should. There is a clique of
size 4 here.

NP Equals Verifiability - (Udacity, Youtube)


Now for the more formal argument.

A verifier for a language L is a deterministic Turing machine V such that L is equal to the set of strings w for which there is
another string c such that V accepts the pair (w, c).

In other words, for every string w ∈ L, there is a certificate c that can be paired with it so that V will accept, and for every
string not in L, there is no such string c. It's intuitive to think of w as a statement and of c as the proof. If the statement is true,
then there should be a proof for it that V can check. On the other hand, if w is false, then no proof should be able to convince
the verifier that it is true.
A verifier is polynomial if its running time is bounded by a polynomial in |w|.

Note that this w is the same one as in the definition. It is the string that is a candidate for the language. If we included the
certificate c in the bound, then the bound would become meaningless, since we could make c as long as necessary. That's a
polynomial verifier.
We claim
The set of Languages that have polynomial time verifiers is the same as NP.

The key to understanding this connection is once again this picture of the tree of computation performed by the
nondeterministic machine.

If a language is in NP, then there is some nondeterministic machine that recognizes it, meaning that for every string in the
language there is an accepting computation path. The verifier can't simulate the whole tree of the nondeterministic machine in
polynomial time, but it can simulate a single path. It just needs to know which path to simulate.

But this is what the certificate can tell it. The certificate can act as directions for which turns to make in order to find the
accepting computation of the nondeterministic machine. Hence, if there is a nondeterministic machine that can recognize a
language, then there is a verifier that can verify it.
Now, we'll argue the other direction. Suppose that V verifies a language. Then, we can build a nondeterministic machine whose
computation tree will look a bit like a jellyfish. At the very top, we have a high degree of branching as the machine
nondeterministically appends a certificate c to its input.

Then it just deterministically simulates the verifier. If there is any certificate that causes V to accept, the nondeterministic
machine will find it. If there isn't one, then the nondeterministic machine won't accept.
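As a concrete illustration of the verifier side, here is a sketch of a polynomial-time verifier for the clique example from earlier. The code is my own; the graph representation and function names are assumptions, not from the lecture.

```python
# A polynomial-time verifier for clique: the input w is a pair (graph, k)
# and the certificate c is a candidate set of k vertices.  Checking all
# pairs of the certificate takes only O(k^2) edge lookups.

from itertools import combinations

def verify_clique(graph, k, certificate):
    """graph: set of frozenset edges; certificate: iterable of vertices."""
    vs = set(certificate)
    if len(vs) != k:
        return False
    # Every pair of certificate vertices must be joined by an edge.
    return all(frozenset((u, v)) in graph for u, v in combinations(vs, 2))

edges = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}
assert verify_clique(edges, 3, [1, 2, 3])      # a triangle: accepted
assert not verify_clique(edges, 3, [2, 3, 4])  # edge 2-4 missing: rejected
```

The nondeterministic machine's branching corresponds to trying every possible value of `certificate`; any single accepting branch makes the whole machine accept.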

Which is in NP - (Udacity)
Now that we've defined NP and defined what it means to be verifiable in polynomial time, I want you to apply this knowledge to
decide whether several problems are in NP. First, is a graph connected? Second, does a graph have a set of k vertices with no edges
between them? This is called the independent set problem. And lastly, will a given Turing machine M accept exactly one string?

Conclusion - (Udacity, Youtube)


In this lesson, we have introduced the P and NP classes of problems. As we said in the beginning, the class P informally
captures the problems we can solve efficiently on a computer, and the class NP captures those problems whose answer we can
verify efficiently on a computer. Our experience with finding a clique in a graph suggests that these two classes are not the
same. Cliques are hard to find but easy to check. And it is simply intuitive that solving a difficult problem should be
harder than just checking a solution.
Nevertheless, no one has been able to prove that P is not equal to NP. Whether these two classes are equivalent is known
as the P versus NP question, and it is the most important open problem in theoretical computer science today, if not in all of
mathematics. We'll discuss the full implications of the question in a later lecture, but for now, I'll just mention that in the year
2000, the Clay Mathematics Institute named the P versus NP problem as one of the seven most important open questions in
mathematics and has offered a million-dollar bounty for a proof that determines whether or not P = NP. Fame and fortune await
the person who settles the P vs NP problem, but many have tried and failed.
Next class we look at the NP-complete problems, the hardest problems in NP.

NP-Completeness - (Udacity)
Introduction - (Udacity, Youtube)
This lecture covers the theory of NP-completeness, the idea that there are some problems in NP so general and so expressive
that they capture all of the challenges of solving any problem in NP in polynomial time. These problems provide important
insight into the structure of P and NP, and form the basis for the best arguments we have for the intractability of many important
real-world problems.

The Hardest Problems in NP - (Udacity, Youtube)


With our previous discussion of the classes P and NP in mind, you can visualize the classes P and NP like this.

Clearly, P is contained inside of NP, and we are pretty sure that this containment is strict. That is to say, there are some
problems in NP but not in P, where the answer can be verified efficiently but can't be found efficiently. In this picture, then, you
can imagine the harder problems being at the top.
Now, suppose that you encounter some problem where you know how to verify the answer, but where you think that finding an
answer is intractable. Unfortunately, your boss or maybe your advisor doesn't agree with you and keeps asking for an efficient
solution.
How would you go about showing that the problem is in fact intractable? One idea is to show that the problem is not in P. That
would indeed show that it is not tractable, but it would do much more. It would show that P is not equal to NP. You would be
famous. As we talked about in the last lecture, whether P is equal to NP is one of the great open questions in mathematics. Let's
rule that option out. I don't want to discourage you from trying to prove this theorem necessarily. You just should know what you are
getting into.
Another possible approach is to show that if you could solve your problem efficiently, then it would be possible to solve another
problem efficiently, one that is generally considered to be hard. If you were working in the 1970s, you might have shown that a
polynomial solution to your problem would have yielded a polynomial solution to linear programming. Therefore, your problem
must be at least as hard as linear programming. The trouble with this approach is that it was later shown that linear programming
actually is polynomially solvable. Hence, the fact that your problem is as hard as linear programming doesn't mean much
anymore. The class P swallowed linear programming. Why couldn't it swallow your problem as well? This type of argument isn't
worthless, but it's not as convincing as it might be.

It would be much better to relate your problem to a problem that we knew was one of the hardest in the class NP, so hard that
if the class P were to swallow it, P would have to swallow all of NP. In other words, we would have to move the P
borderline all the way to the top. Such a problem would have to be NP-complete, meaning that we can reduce every language
in NP to it. Remarkable as it may seem, it turns out that there are lots of such languages, satisfiability being the first for which
this was proved. In other words, we know that it has to be at the top of this image. Turning back to how to show that your
problem is intractable, short of proving that P is not equal to NP, the best we can do is to reduce an NP-complete problem like
SAT to your problem. Then your problem would be NP-complete too, and the only way your problem could be polynomially
solvable is if everything in NP is.
There are two parts to this argument.

The first is the idea of a reduction. We've seen reductions before in the context of computability. Here the reductions will not only
have to be computable, but computable in polynomial time. This idea will occupy the first half of the lesson.
The second half will consider the idea of NP-completeness, and we will go over the famous Cook-Levin theorem, which shows that
boolean satisfiability is NP-complete.

Polynomial Reductions - (Udacity, Youtube)


Now for the formal definition of a polynomial reduction. This should feel very similar to the reductions that we considered when
we were talking about computability.

A language A is polynomially reducible to a language B (written A ≤P B) if there is a polynomial-time computable function f
where, for every string w,

w ∈ A ⟺ f(w) ∈ B.

The key difference from before is that we now require that the function be computable in polynomial time, not just that it
be computable. We will also say that f is a polynomial-time reduction of A to B.
Here is the key implication of there being a polynomial reduction of one language to another. Let's suppose that I want to
know whether a string x is in the language A, and suppose also that there exists a polynomial-time decider M for the language
B.

Then, all I need to do is take the machine or program that computes this function, let's call it N,
and feed my string x into it, and then feed that output into M. The machine M will tell me if f(x) is in B. By the definition of a
reduction, this also tells me whether x is in A, which is exactly what I wanted to know. I just had to change my problem into one
encoded by the language B and then I could use B's decider.
Therefore, the composition of M with N is a decider for A, by all the same arguments we used in the context of computability.
But is it a polynomial decider?

Running Time of Composition - (Udacity)


We just found ourselves considering the running time of the composition of two Turing machines. Think carefully, and give the
most appropriate bound for the running time of the composition of M which runs in time q with N which runs in time p. You can
assume that both N and M at least read all of their input.

Polynomial Reductions Part 2 - (Udacity, Youtube)


We just argued that if N is polynomial and we take its output and feed it into M, which is also polynomial in its input size, then
the resulting composition of M with N is also polynomial. Therefore, we can add "in polynomial time" to our claim in the figure.

By this argument, we've proved the following important theorem.


If A is polynomially reducible to B, and B is in P, then A must be in P.

Just convert the input string to prepare it for the decider for B, and return what the decider for B says.

What Do Reductions Imply - (Udacity)


Here is a question to test your understanding of the implications of the existence of a polynomial reduction. Suppose that A
reduces to B, which of the statements below follow?

Independent Set - (Udacity, Youtube)


To illustrate the idea of a polynomial reduction, we're going to reduce the problem of finding an Independent Set in a graph to
that of finding a Vertex Cover. In the next lesson, we're going to argue that if P is not equal to NP, then Independent Set is not in P,
so Vertex Cover can't be either. But that's getting ahead of ourselves. For the benefit of those not already familiar with these
problems, we will state them briefly.
First, we'll consider the independent set problem and see how it is equivalent to the Clique problem that we talked about when
we introduced the class NP.

Given a graph, a subset S of the vertices is an independent set if there are no edges between vertices in S.

These two vertices here do not form an independent set because there is an edge between them.

However, these three vertices do form an independent set because there are no edges between them.

Clearly, each individual vertex forms an independent set, since there isn't another vertex in the set for it to have an edge with,
and the more vertices we add, the harder it is to find new ones to add. Finding a maximum independent set, therefore, is the
interesting question. Phrased as a decision problem, the question becomes: given a graph G, is there an independent set of
size k?

Find an Independent Set - (Udacity)


In order to understand the reduction, it is critical that you understand the independent set problem, so here is a quick question
to test your understanding. Mark an independent set of size 3.

Vertex Cover - (Udacity, Youtube)


Now lets define the other problem that will be part of our example reduction, Vertex Cover.
Given a graph G, a subset S of the vertices forms a vertex cover if every edge is incident on S (i.e. has at least one endpoint in S).

To illustrate, consider this graph here. These three shaded vertices do not form a vertex cover.

On the other hand, these two vertices do form a vertex cover because every edge is incident on one of them.

Clearly, the set of all vertices is a vertex cover, so the interesting question is how small a vertex cover we can get. Phrased as a
decision question, where we are given a graph G, it becomes: is there a vertex cover of size k?
Note that this problem is in NP. It's easy enough to check whether a subset of vertices is of size k and whether it covers all the
edges.

Find a Vertex Cover - (Udacity)


Going forward, it will be very important to understand this vertex cover problem, so we'll do a quick exercise to give you a
chance to test your understanding. Mark a vertex cover of size 3 in the graph below.

Vertex Cover = Ind Set - (Udacity, Youtube)


Having defined the Independent Set and Vertex Cover problems, we will now show that Vertex Cover is as hard as Independent
Set. In general, finding a reduction can be very difficult. Sometimes, however, it can be as simple as playing around with a
definition.

Notice that both in the examples shown here and in the exercises, the set of vertices used in the vertex cover was the
complement of the set of vertices used in the independent set. Let's see if we can explain this.
The set S is an independent set if there are no edges within S. By within, I mean that both endpoints are in S.
That's equivalent to saying that every edge is not within S, or that every edge is incident on V − S.
But that just says that V − S is a vertex cover! The complement of an independent set is always a vertex cover and vice-versa.

Thus, we have the observation that

A subset of vertices S is an independent set if and only if V − S is a vertex cover.

As a corollary then,

A graph G has an independent set of size at least k if and only if it contains a vertex cover of size at most |V| − k.

The reduction is therefore fantastically simple: given a graph G and a number k, just change k into |V| − k.
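Both the observation and the reduction can be checked by brute force on a small instance. This sketch is my own; the example graph and all function names are made up for illustration.

```python
# Brute-force check that S is an independent set exactly when V - S is a
# vertex cover, plus the (trivial) reduction itself.

from itertools import combinations

def is_independent_set(edges, s):
    # No edge may have both endpoints inside s.
    return all(not (u in s and v in s) for u, v in edges)

def is_vertex_cover(edges, s):
    # Every edge must have at least one endpoint inside s.
    return all(u in s or v in s for u, v in edges)

def reduce_is_to_vc(vertices, k):
    # The graph is passed through unchanged; only k is transformed.
    return len(vertices) - k

vertices = {1, 2, 3, 4}
edges = [(1, 2), (2, 3), (3, 4)]

# Verify the complementation fact over every subset of the vertices.
for r in range(len(vertices) + 1):
    for s in combinations(vertices, r):
        s = set(s)
        assert is_independent_set(edges, s) == is_vertex_cover(edges, vertices - s)

print(reduce_is_to_vc(vertices, 2))  # 2: seek a vertex cover of size |V| - k
```

Because the transformation only rewrites the number k, it clearly runs in polynomial time, which is all the definition of a polynomial reduction requires.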

Transitivity of Reducibility - (Udacity, Youtube)


The polynomial reducibility of one problem to another is a relation, much like the "at most" relation that you've seen since elementary
school. While it doesn't have ALL the same properties, it does have the important property of transitivity. For example, we've
seen how Independent Set is reducible to Vertex Cover, and I claim that Vertex Cover is reducible to Hamiltonian Path, a problem
closely related to Traveling Salesman. From these facts it follows that Independent Set is reducible to Hamiltonian Path.

Let's take a look at the proof of this theorem. Let M be the program that computes the function that reduces A to B, and let N
be the program that computes the function that reduces B to C. To turn an instance of the problem A into an instance of the
problem C, we just pass it through M and then pass that result through N.

This whole process can be thought of as another computable function R.


Note that like M and N, R is polynomial time. I'll ask you to help me show why in a minute.
Thus, x is in A if and only if M(x) is in B because M implements the reduction and M(x) is in B if and only if N(M(x)) is in C
because N implements that reduction. The composition of N and M is the reduction R, so overall we have that x is in A if and
only if R(x) is in C, just as we wanted.

NP Completeness - (Udacity, Youtube)


If you have followed everything in the lesson so far, then you are ready to understand NP-completeness, an idea behind some of
the most fascinating structure in the P vs NP question. You may have heard optimists say that we are only one algorithm away
from proving that P is equal to NP. What they mean is that if we could solve just one NP-complete problem in polynomial time, then
we could solve them all in polynomial time. Here is why.
Formally, we say
A language L is NP-complete if L is in NP and if every other language in NP can be reduced to it in polynomial time.

Recalling our picture of P and NP from the beginning of the lesson, the NP-complete problems are at the top of NP, and we
called them the hardest problems in NP. We can't have anything higher that's still in NP, because anything in NP can be
reduced to an NP-complete problem. Also, if any one NP-complete problem were shown to be in P, then P would extend up and
swallow all of NP.

It's not immediately obvious that an NP-complete problem even exists, but it turns out that there are lots of them, and in fact
they seem to occur more often in practice than problems in the intermediate zone, which are not NP-complete
and so far have not been proved to be in P either.
Historically, the first natural problem to be proved to be NP-Complete is called Boolean formula satisfiability or SAT for short.

SAT is NP-Complete

This was shown to be NP-Complete by Stephen Cook in 1971 and independently by Leonid Levin in the Soviet Union around
the same time. The fact that this problem is NP-complete is extremely powerful, because once you have one NP-Complete
problem, you just need to reduce it to other problems in NP to show that they too are NP-complete. Thus, much of the theory of
complexity can be said to rest on this theorem. This is the high point of our study of complexity.

CNF Satisfiability - (Udacity, Youtube)


For our purposes, we won't have to work with the most general satisfiability problem. Rather, we can restrict ourselves to a
simpler case where the boolean formula has a particular structure called conjunctive normal form, or CNF, like this one here.

First, consider the operators. The ∨ indicates a logical OR, the ∧ indicates a logical AND, and the bar over top indicates
logical NOT.
For one of these formulas, we first need a collection of variables: x, y, z for this example. These variables appear in the formula

as literals. A literal can be either a variable or the variable's negation: for example, x, or the negation ȳ of y, etc.
At the next higher level, we have clauses, which are disjunctions of literals. You could also say a logical OR of literals. One
clause is what lies inside a parentheses pair.
Finally, we have the formula as a whole, which is a conjunction of clauses. That is to say, all the clauses get ANDed together.
Thus, this whole formula is in conjunctive normal form. In general, there can be more than two clauses that get ANDed together.
That covers the terms we'll use for the structure of a CNF formula.
As for satisfiability itself, we say that

A boolean formula is satisfiable if there is a truth assignment for the formula, a way of assigning the variables true and false,
such that the formula evaluates to true.

The CNF-satisfiability problem is: given a CNF formula, determine whether that formula has a satisfying assignment.
Clearly, this problem is in NP, since given a truth assignment, it takes time polynomial in the number of literals to evaluate the
formula. Thus, we have accomplished the first part of showing that satisfiability is NP-complete. The other part, showing that
every problem in NP is polynomial-time reducible to it, will be considerably more difficult.

Does it Satisfy - (Udacity)


To check your understanding of boolean expressions, I want you to say whether the given assignment satisfies this formula.

Find a Satisfying Assignment - (Udacity)


Just because one particular truth assignment doesn't satisfy a boolean formula doesn't mean that there isn't a satisfying
assignment. See if you can find one for this formula. Unless you are a good guesser, this will take longer than just testing
whether a given assignment satisfies the formula.

Cook Levin - (Udacity, Youtube)


Now that we've understood the satisfiability problem, we are ready to tackle the Cook-Levin theorem. Remember that we have
to turn any problem in NP into an instance of SAT, so it's natural to start with the thing that we know all problems in NP have in
common: there is a nondeterministic machine that decides them in polynomial time. That's the definition, after all.
Therefore, let L be an arbitrary language in NP and let M be a nondeterministic Turing machine that decides L in time p(n),
where p is bounded above by some polynomial.
An accepting computation, or sequence of configurations, for the machine M can be represented in a tableau like this one here.

Each configuration in the sequence is represented by a row, where we have one column for the state, one for the head position,
and then columns for each of the values of the first p(n) squares of the tape. Note that no other squares can be written to, because
there just isn't time to move the head that far in only p(n) steps.
Of course, the first row must represent the initial configuration, and the last one must be in an accepting state in order for the
overall computation to be accepting.
Note that it's possible that the machine will enter an accepting state before step p(n), but we can just stipulate that when this
happens, all the rest of the rows in the table have the same values. This is like having the accept state always transition to itself.
The Cook-Levin theorem then consists of arguing that the existence of an accepting computation is equivalent to being able to
satisfy a CNF formula that captures all the rules for filling out this table.

The Variables - (Udacity, Youtube)


Before doing anything else, we first need to capture the values in the tableau with a collection of boolean variables. Well start
with the state column.

We let Q_ik represent whether, after step i, the state is q_k. Similarly, for the position column, we define H_ij to represent whether
the head is over square j after step i. Lastly, for the tape contents, we define S_ijk to represent whether, after step i, square j
contains the k-th symbol of the tape alphabet.
Note that as we've defined these variables, there are many truth assignments that are simply nonsensical. For instance, every
one of these Q variables could be assigned a value of True, but in a given configuration sequence the Turing machine can't be
in all states at all times. Similarly, we can't assign them to be all False; the machine has to be in some state at each time step.
We have the same problems with the head position variables and the variables for the squares of the tape.
All of this is okay. For now, we just need a way to make sure that any accepting configuration sequence has a corresponding
truth assignment, and indeed it must. For any way of filling out the tableau, the corresponding truth assignment is uniquely
defined by these meanings. It is the job of the boolean formula to rule out truth assignments that don't correspond to valid
accepting computations.
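To make the variable scheme concrete, here is one way the Q, H, and S variables might be numbered so that each gets a distinct index, as a SAT solver would require. The parameter values and the numbering order are assumptions made purely for illustration.

```python
# A sketch of numbering the tableau variables Q_ik, H_ij, S_ijk so they can
# be handed to a standard SAT solver. The sizes below (steps, states, squares,
# symbols) are illustrative parameters, not values from the lesson.

def make_variable_numbering(steps, num_states, squares, num_symbols):
    """Assign each Q/H/S variable a distinct positive integer."""
    index = {}
    counter = 1
    for i in range(steps + 1):                 # configuration after step i
        for k in range(num_states):
            index[("Q", i, k)] = counter       # state is q_k after step i
            counter += 1
        for j in range(squares):
            index[("H", i, j)] = counter       # head over square j after step i
            counter += 1
        for j in range(squares):
            for k in range(num_symbols):
                index[("S", i, j, k)] = counter  # square j holds symbol k
                counter += 1
    return index

idx = make_variable_numbering(steps=3, num_states=2, squares=3, num_symbols=2)
# (3+1) rows, each with 2 + 3 + 3*2 = 11 variables:
print(len(idx))  # 44
```

The total number of variables is polynomial in p(n), which is the first hint that the whole formula will have polynomial size.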

Configuration Clauses - (Udacity, Youtube)


Having defined the variables, we are now ready to move on to building the boolean formula. We are going to start with the
clauses needed so that, given a satisfying assignment, it's clear how to fill out the table. Never mind for now whether the
corresponding sequence has valid transitions. For now, we just want the individual configurations to be well-defined.
First, we have to enforce that at each step the machine is in some state.
Hence for all i, at least one of the state variables for step i has to be true:

(Q_i0 ∨ Q_i1 ∨ … ∨ Q_ir) for all i

Note that r denotes the number of states here. In this context, it is just a constant. (The input to our reduction is the string that
we need to transform into a boolean formula, not the Turing machine description.)
The machine also can't be in two states at once, so we need to enforce that constraint as well by saying that for every pair of
state variables for a given time step, one of the two has to be false:

(¬Q_ij ∨ ¬Q_ij′) for all i and all j ≠ j′
Together these sets of clauses enforce that the machine corresponding to a satisfying truth assignment is in exactly one state
after each step.
For the position of the head, we have a similar set of clauses. The head has to be over some square on the tape, but it can't be
in two places:

(H_i0 ∨ H_i1 ∨ … ∨ H_ip(n)) for all i

(¬H_ij ∨ ¬H_ij′) for all i and all j ≠ j′
Lastly, each square on the tape has to have exactly one symbol. Thus, for all steps i and squares j, there has to be some symbol
on the square, but there can't be two:

(S_ij0 ∨ S_ij1 ∨ … ∨ S_ij|Γ|) for all i and j

(¬S_ijk ∨ ¬S_ijk′) for all i, j and all k ≠ k′
So far, we have the clauses
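Generating these "exactly one" clause families is mechanical. Here is a small sketch; the integer-literal convention (+v for a variable, -v for its negation, as in the DIMACS format) is an assumption made for illustration.

```python
# A sketch of generating the "exactly one" clauses for a group of variables
# (e.g. the state variables Q_i0 ... Q_ir for one step i). Literals are
# integers: +v for the variable, -v for its negation (DIMACS convention).

from itertools import combinations

def exactly_one(variables):
    """At least one variable is true, and no two are true at once."""
    clauses = [list(variables)]                  # (v1 OR v2 OR ... OR vr)
    for a, b in combinations(variables, 2):
        clauses.append([-a, -b])                 # (NOT va OR NOT vb)
    return clauses

print(exactly_one([1, 2, 3]))
# [[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]
```

For r variables this emits 1 + r(r-1)/2 clauses, and the same helper would serve the state, head-position, and tape-symbol families alike.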

The other clauses related to individual configurations come from the start and end states.
The machine must be in the initial configuration at step 0. This means that the initial state must be q0:
Q_00
The head must also be over the first position on the tape:
H_01

The first part of the tape must contain the input w. Letting k1 k2 … k|w| be the encoding of the input string, we include the clauses
(S_01k1) ∧ (S_02k2) ∧ … ∧ (S_0|w|k|w|)
The rest of the tape must be blank to start:
(S_0(|w|+1)0) ∧ (S_0(|w|+2)0) ∧ … ∧ (S_0p(n)0)
These additional clauses are summarized here

Transition Clauses - (Udacity, Youtube)


Any truth assignment that satisfies the clauses we have defined so far indicates how the tableau should be
filled out. Moreover, the tableau will always start out in the initial configuration and end up in an accepting one. What is not
guaranteed yet is that the transitions between the configurations are valid.
To see how we can add constraints that will give us this guarantee, we'll use this example.

Suppose that the transition function tells us that if the machine is in state q3 and it reads a 1, then it can do one of two things:
it can switch to q0, leave the 1 alone, and move the head to the right,
or it can switch to state q4, write a 0, and move the head to the left.

To translate this rule into clauses for our formula, it's useful to make a few definitions. First we need to enumerate the tape
alphabet so that we can refer to the symbols by number.

Next, we define a tuple (k, ℓ, k′, ℓ′, d) to be valid if the triple of the state q_k′, the symbol s_ℓ′, and the direction d is
in the set δ(q_k, s_ℓ).

For example, the tuple (3, 2, 0, 2, R) is valid.


The first two numbers here indicate which transition rule applies (use the enumeration of the alphabet to translate the symbols),
and the last three indicate what transition is being made. In this case, it is in the set defined by δ, so this is a valid
transition.
On the other hand, the tuple (3, 0, 4, 2, R) is invalid. The machine can't switch to state 4, write a 1, and move the head to
the right. That's not one of the valid transitions given that the machine is in state 3 and has just read a blank.
Now, there are multiple ways to create the clauses needed to ensure that only valid transitions are followed. Many proofs of
Cook's theorem start by writing out boolean expressions that directly express the requirement that one of the valid transitions be
followed. The difficulty with this approach is that the intuitive expression isn't in conjunctive normal form, and some boolean
algebra is needed to convert it. That is to say, we don't immediately get an intuitive set of short clauses to add to our formula. On
the other hand, if we rule out the invalid transitions instead, then we do get a set of short, intuitive clauses that we can just add to
the formula.
To illustrate, in order to make sure that this invalid transition is never followed, for every step i and position j, we add a clause
that starts like this:

(¬H_ij ∨ ¬Q_i3 ∨ ¬S_ij0 ∨ …

It starts with three literals that ensure that the transition rules for being in q3 and reading the blank symbol actually do apply. If the head
isn't in the position we're talking about, the state isn't q3, or the symbol being read isn't the blank symbol, then the clause is
immediately satisfied. The clause can also be satisfied if the machine behaves in any way that is different from what this
particular invalid transition would cause to happen:

… ∨ ¬H_(i+1)(j+1) ∨ ¬Q_(i+1)4 ∨ ¬S_(i+1)j2)

The head could have moved differently, the state could have changed differently, or a different symbol could have been written.
Another way to parse this clause is as saying: if the (q3, blank) transition rule applies, then the machine can't have changed in this
particular invalid way. Logically, remember that A implies not B is equivalent to not A or not B. That's the logic we are using here.

Transition Clauses Cont - (Udacity, Youtube)

Now let's state the general rule for creating all the transition clauses. Recall that a tuple (k, ℓ, k′, ℓ′, d) is valid if switching to
state q_k′, writing the symbol s_ℓ′, and moving in direction d is an option given that the machine is currently in state q_k and
reading symbol s_ℓ.

For every step i, position j, and invalid tuple, then, we include in the formula this clause. The first part tests whether the truth
assignment is such that the transition rule applies,
and the next three literals ensure that this invalid transition wasn't followed. This is just the generalization of the example we saw
earlier.
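As a sketch, here is how the clause for a single invalid tuple might be emitted. The symbolic variable tuples and the helper name are illustrative choices; a real reduction would map them onto solver indices, and this sketch ignores the boundary squares of the tape.

```python
# For every step i, square j, and invalid tuple (k, l, k2, l2, move), emit
# the clause ruling out that behaviour. Variables are referred to
# symbolically: ("-H", i, j) stands for the literal NOT H_ij, and so on.

def transition_clause(i, j, invalid):
    k, l, k2, l2, move = invalid
    j2 = j + 1 if move == "R" else j - 1   # head position after the move
    return [
        ("-H", i, j), ("-Q", i, k), ("-S", i, j, l),          # rule applies...
        ("-H", i + 1, j2), ("-Q", i + 1, k2), ("-S", i + 1, j, l2),  # ...so forbid this change
    ]

# The invalid tuple (3, 0, 4, 2, R) from the example, at step 5, square 7:
print(transition_clause(i=5, j=7, invalid=(3, 0, 4, 2, "R")))
```

The full formula would contain one such six-literal clause for every (step, square, invalid tuple) combination; since the machine is fixed, the number of invalid tuples is a constant.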

Cook Levin Summary - (Udacity, Youtube)


We're almost done with the proof that satisfiability is NP-complete, but before making the final arguments I want to remind you
of what we're trying to do.
Consider some language L in NP, and suppose that someone wants to be able to determine whether strings are in this language.
The Cook-Levin theorem argues that because L is in NP, there is a nondeterministic Turing machine that decides it, and it uses
this fact to create a function, computable in polynomial time, that takes any input string x and outputs a boolean formula that is
satisfiable if and only if the string x is in L.

That way, any algorithm for deciding satisfiability in polynomial time would be able to decide every language in NP.
Only two questions remain.
Is the reduction correct?
Is it polynomial in the size of the input?

Let's consider correctness first. If x is in the language, then clearly the output formula f is satisfiable. We can just use the truth
assignment that corresponds to an accepting computation of the nondeterministic Turing machine on x. That will
satisfy the formula f. That much is certainly true.
How about the other direction? Does the formula being satisfiable imply that x is in the language? Take some satisfying
assignment for f. As we've argued:
the corresponding tableau is well-defined (only one of the state variables can be true at any time step, etc.),
the tableau starts in the initial configuration,
every transition is valid,
the configuration sequence ends in an accepting configuration.

That's all that is needed for a nondeterministic Turing machine to accept, so this direction is true as well.
Now, we have to argue that the reduction is polynomial.
First, I claim that the running time is polynomial in the output formula length. There just isn't much calculation to be done
besides iterating over the combinations of steps, head positions, states, and tape symbols and printing out the associated terms
of the formula.
Second, I claim that the output formula length is polynomial in the input length. Let's go back and count.

The clauses pertaining to the states require O(p(n) log n) string length. The number of states of the machine is a constant
in this context. The p(n) factor comes from the number of steps of the machine. The log n factor comes from the fact that we have
to distinguish the literals from one another, which requires log n bits. In all these calculations, that's where the log factor comes
from.
For the head position, we have O(p(n)³ log n) string length: one factor of p(n) comes from the number of time steps and two more
from all pairs of head positions.

There are O(p(n)²) combinations of steps and squares, so the family of clauses for the tape symbols requires O(p(n)² log n) length as well.

The other clauses related to individual configurations come from the start and end states. The initial configuration clauses require a
mere O(p(n) log n), and the constraint that the computation be accepting only requires O(log n) string length.
The transition clauses might seem like they would require a high-order polynomial of symbols, but remember that the size of the
nondeterministic Turing machine is a constant in this context. Therefore, the fact that we might have to write out clauses for all
pairs of states and tape symbols doesn't affect the asymptotic analysis. Only the ranges of the indices i and j depend on the size
of the input string, so we end up with O(p(n)² log n).

Adding all those together still leaves us with a polynomial string length. So, yes, the reduction is polynomial! And Cook's
theorem is proved.

Conclusion - (Udacity, Youtube)


Congratulations!! You have seen that one problem, satisfiability, captures all the complexity of the P versus NP problem. An efficient
algorithm for satisfiability would imply P = NP. If P is different from NP, then there can't be any efficient algorithm for satisfiability.
Satisfiability has efficient algorithms if and only if P = NP.

Steve Cook, a professor at the University of Toronto, presented his theorem on May 4, 1971 at the Symposium on the Theory of
Computing at the then Stouffer's Inn in Shaker Heights, Ohio. But his paper didn't have an immediate impact, as the satisfiability
problem was mainly of interest to logicians. Luckily a Berkeley professor, Richard Karp, also took interest and realized he could
use satisfiability as a starting point: if you can reduce satisfiability to another problem in NP, then that problem must be
NP-complete as well. Karp did exactly that, and a year later he published his famous paper on "Reducibility among
combinatorial problems," showing that 21 well-known combinatorial search problems were also NP-complete, including, for
example, the clique problem.
Today tens of thousands of problems have been shown to be NP-complete, though the ones that come up most often in practice
tend to be closely related to one of Karp's originals.
In the next lesson, we'll examine some of the reductions that prove these classic problems to be NP-complete, and try to
convey a general sense for how to go about finding such reductions. This sort of argument might come in handy if you need to
convince someone that a problem isn't as easy to solve as they might think.

NPC Problems - (Udacity)


Introduction - (Udacity, Youtube)
In the last lecture, we proved the Cook-Levin theorem, which shows that CNF satisfiability is NP-complete. We argued directly
that an arbitrary problem in NP can be turned into the satisfiability problem with a polynomial reduction. Of course, it is possible
to make such a direct argument for every NP-complete problem, but usually this isn't necessary. Once we have one NP-complete problem, we can simply reduce it to other problems to show that they too are NP-complete.
In this lesson, we'll use this strategy to prove that a few more problems are NP-complete. The goal is to give you a sense of the
breadth of the class of NP-complete problems and to introduce you to some of the types of arguments used in these
reductions.

Basic Problems - (Udacity, Youtube)


Here is the state of our knowledge about NP-complete problems given just what we proved in the last lesson.

We've shown that we can take an arbitrary problem in NP and reduce it to CNF satisfiability; that is what the Cook-Levin
theorem showed.
We also did another polynomial reduction, one from the Independent Set/Clique problem to the Vertex Cover problem. Again, I'm
treating Independent Set and Clique here as one, because the problems are so similar. Much of this lesson will be concerned
with connecting these two links up into a chain.

First, we are going to reduce general CNF-SAT to 3-SAT, where each clause has exactly three literals. This is a critical reduction,
because 3-SAT is much easier to reduce to other problems than general CNF. Then we are going to reduce 3-CNF to
Independent Set, and by transitivity this will show that Vertex Cover is NP-complete. Note that this is very convenient, because it
would have been messy to try to reduce every problem in NP to these problems directly. Finally, we will reduce 3-CNF to the
Subset Sum problem, to give you a fuller sense of the types of arguments that go into reductions.

CNF 3CNF - (Udacity)


First, we are going to reduce general CNF-SAT to 3-CNF-SAT. Before we begin, however, I want you to help me articulate what
we are trying to do. Here is a question: choose from the options at the bottom and enter the corresponding letter here so that
this description matches the reduction we are trying to achieve.

Strategy and Warm up - (Udacity, Youtube)

Here is our overall strategy. We're going to take a CNF formula f and turn it into a 3CNF formula f′, which will include some
additional variables, which we'll label Y. This formula f′ will have the property that, for any truth assignment t for the original
formula f, t will satisfy f if and only if there is a way t_Y of assigning values to Y so that t extended by t_Y satisfies f′.
Let's illustrate such a mapping for a simple example. Take this disjunction of 4 literals:

(z1 ∨ z2 ∨ z3 ∨ z4)

Note that the zi are literals here, so they could be x1 or its negation, etc. Remember too that disjunction means logical OR; the whole
clause is true if any one of the literals is.
We will map this clause of 4 into two clauses of 3 by introducing a new variable y and forming the two clauses

(z1 ∨ z2 ∨ ¬y) ∧ (y ∨ z3 ∨ z4).

Let's confirm that the desired property holds. If the original clause is true, then we can set y to be z1 ∨ z2. If z1 ∨ z2 is true, then
the first clause will be true by itself, and y will satisfy the other clause; if z1 ∨ z2 is false, then z3 or z4 must be true, satisfying the
second clause, and ¬y satisfies the first. Going the other direction, suppose that we have a truth
assignment that satisfies both of these clauses. Suppose first that y is true. That implies that z1 ∨ z2 is true and therefore so is the
clause of four literals. Next, suppose that y is false. Then z3 or z4 must be true, and again so must be the clause of four. If you have
understood this, then you have understood the crux of why this reduction works. More details to come.
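Since the claim only involves finitely many assignments, we can check it exhaustively. This short script verifies, for all 16 assignments to z1 through z4, that the clause of four literals is satisfied exactly when some value of y satisfies both 3-literal clauses.

```python
# Brute-force check of the warm-up claim: (z1 OR z2 OR z3 OR z4) is true
# exactly when some value of the new variable y satisfies
# (z1 OR z2 OR NOT y) AND (y OR z3 OR z4).

from itertools import product

for z1, z2, z3, z4 in product([False, True], repeat=4):
    original = z1 or z2 or z3 or z4
    extendable = any(
        (z1 or z2 or not y) and (y or z3 or z4)
        for y in [False, True]
    )
    assert original == extendable
print("claim verified for all 16 assignments")
```

This kind of exhaustive sanity check only works because the clause is of fixed size; it is a debugging aid, not part of the reduction itself.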

Transforming One Clause - (Udacity, Youtube)


We are going to start by looking at how to convert just one clause from CNF to 3-CNF. This is going to be the heart of the
argument, because we will be able to apply this procedure to each clause of the formula independently. Therefore, consider one
clause consisting of literals z1 through zk. There are four cases, depending on the number of literals in the clause.
The case where k = 3 is trivial, since nothing needs to be done.
When there are two literals in a clause, a simple trick suffices. We just need to boost the number of literals up to
three. To do this, we include a single extra variable and trade in the clause

(z1 ∨ z2)
for these two clauses,

(z1 ∨ z2 ∨ y1) ∧ (z1 ∨ z2 ∨ ¬y1),

one with the literal y1 and one with the literal ¬y1. If a truth assignment satisfies z1 ∨ z2, then clearly both of these clauses
are satisfied too. On the other hand, taking a satisfying assignment for the pair of clauses, the y-literal has to be false in one of
them, so z1 ∨ z2 must be true.
We can play the same trick when there is just one literal z1 in the original clause. This time we need to introduce two new
variables, which we'll call y1 and y2. Then we replace the original clause with four. See the chart below.

Note that if z1 is true, then all four of the clauses are true too. On the other hand, in any truth assignment for
z1, y1, and y2 that satisfies these clauses, the y-literals will both be false in one of the four clauses. Therefore, z1 must be
true.
Lastly, we have the case where k > 3. Here we introduce k − 3 new variables and use them to break up the clause according to
the pattern shown above.
Let's illustrate the idea with this example:

(z1 ∨ z2 ∨ z3 ∨ z4 ∨ z5 ∨ z6).
Looking at this, we might wish that we could have a single variable that captured whether any one of the first 4 literals were
true. Let's call this variable y3. Then we could express this clause as the 3-literal clause

(y3 ∨ z5 ∨ z6).
At first, I said we wanted y3 to be equal to z1 ∨ z2 ∨ z3 ∨ z4, but it's actually sufficient, and considerably less trouble, for
y3 to just imply it. This idea is easy to express as a clause:

(z1 ∨ z2 ∨ z3 ∨ z4 ∨ ¬y3)
Either y3 is false, in which case this implication doesn't apply, or one of z1 through z4 had better be true. Together, these two clauses
imply the first one: if y3 is true, then one of z1 through z4 is, and so is the original; and if y3 is false, then z5 or z6 is true and so is
the original.
Also, if the original is satisfied, then we can just set y3 equal to the disjunction of z1 through z4 and these two clauses will be
satisfied as well.
Note that we went from having a longest clause of length 6 to having one of length 5 and another of length 3. We can apply this
strategy again to the clause that is still too long: we introduce another variable y2 and trade the long clause in for a shorter
clause and another one with three literals. Of course, we can then play this trick again, and eventually we'll have only 3-literal
clauses.

The example is for k = 6, but an inductive argument shows that this works for any k greater than 3.
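The four cases above can be collected into a single recursive procedure. This is only a sketch: literals are encoded as nonzero integers (+v for a variable, -v for its negation), and fresh y-variables are drawn from a counter; both conventions are assumptions made for illustration.

```python
# Per-clause CNF -> 3CNF transformation, following the four cases above.

def clause_to_3cnf(clause, next_var):
    """Return (list of 3-literal clauses, next unused variable number)."""
    k = len(clause)
    if k == 3:
        return [clause], next_var
    if k == 2:
        y = next_var                      # (z1 OR z2) -> two clauses
        return [clause + [y], clause + [-y]], next_var + 1
    if k == 1:
        y1, y2 = next_var, next_var + 1   # (z1) -> four clauses
        z = clause[0]
        return ([[z, y1, y2], [z, y1, -y2], [z, -y1, y2], [z, -y1, -y2]],
                next_var + 2)
    # k > 3: peel off the last two literals into (y OR z_{k-1} OR z_k),
    # and recurse on the shorter clause with NOT y appended.
    y = next_var
    head, tail = clause[:-2], clause[-2:]
    rest, next_var = clause_to_3cnf(head + [-y], next_var + 1)
    return rest + [[y] + tail], next_var

clauses, _ = clause_to_3cnf([1, 2, 3, 4, 5, 6], next_var=7)
print(clauses)  # [[1, 2, -9], [9, 3, -8], [8, 4, -7], [7, 5, 6]]
```

Applied to a clause of six literals, it introduces k − 3 = 3 new variables and produces four 3-literal clauses, matching the pattern in the example; for k = 4 it reproduces the warm-up split exactly.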

Extending A Truth Assignment - (Udacity)


Here is an exercise intended to help you solidify your understanding of the relationship between the original clause and the set
of 3-literal clauses that the transformation produces. I want you to extend the truth assignment below so that it satisfies the given
3CNF formula. This formula is the image of this clause here under the transformation that we've discussed.

CNF SAT - (Udacity, Youtube)


Recall that our original purpose was to find a transformation of the formula such that a truth assignment satisfies the original if
and only if it can be extended to satisfy the transformed formula.

We've seen how to do this transformation for a single clause, and actually this is enough. We can just transform each clause
individually, introducing a new set of variables for each; all of the same arguments about extending or restricting the truth
assignment will hold.
Let's illustrate what the transformation of a multi-clause formula looks like with an example. Consider this formula here,

(z11 ∨ z12) ∧ (z21 ∨ z22 ∨ z23 ∨ z24 ∨ z25)


where I've indexed the literals z with two indices now, the first referring to the clause it's in and the second being its
enumeration within the clause.
The first clause has only two literals, so we transform it into two clauses with 3 literals by introducing a new variable y11. It
gets the first 1 because it was generated by the first clause.
We transform the second clause, with 5 literals, into three clauses, introducing the two new variables y21 and y22. Note that
these are different from the variables used in the clauses generated by the first original clause. Since all of these sets of
variables are disjoint, we can assign them independently of each other and apply all the same arguments as we did to individual
clauses. The results are shown below.

That's how CNF can be reduced to 3CNF. This transformation runs in polynomial time, making the reduction polynomial.
We just reduced the problem of finding a satisfying assignment to a CNF formula to the problem of finding a satisfying assignment to a
3CNF formula. At this point, it's natural to ask: can we go any further? Can we reduce this problem to 2CNF? Well no, not
unless P = NP. There is a polynomial-time algorithm for finding a satisfying assignment to a 2CNF formula, based on finding strongly connected
components. Therefore, if we could reduce 3CNF to 2CNF, then P = NP.
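For completeness, here is a sketch of that polynomial-time 2CNF algorithm (usually attributed to Aspvall, Plass, and Tarjan): build the implication graph, compute its strongly connected components, and declare the formula unsatisfiable exactly when some variable lands in the same component as its negation. The integer-literal encoding and helper names are illustrative, and for clarity the sketch uses recursive DFS; a production version would use an explicit stack.

```python
# 2SAT in polynomial time via strongly connected components (Kosaraju's
# method). Literals are nonzero integers: +v for a variable, -v for its
# negation. Each clause (a OR b) yields implications (-a -> b), (-b -> a).

from collections import defaultdict

def two_sat(num_vars, clauses):
    graph, rgraph = defaultdict(list), defaultdict(list)
    for a, b in clauses:
        graph[-a].append(b); rgraph[b].append(-a)
        graph[-b].append(a); rgraph[a].append(-b)

    nodes = [v for i in range(1, num_vars + 1) for v in (i, -i)]
    seen, order = set(), []
    def dfs1(u):                      # first pass: record finish order
        seen.add(u)
        for w in graph[u]:
            if w not in seen:
                dfs1(w)
        order.append(u)
    for u in nodes:
        if u not in seen:
            dfs1(u)

    comp = {}
    def dfs2(u, label):               # second pass: label components
        comp[u] = label
        for w in rgraph[u]:
            if w not in comp:
                dfs2(w, label)
    for u in reversed(order):
        if u not in comp:
            dfs2(u, u)

    # Satisfiable iff no variable shares a component with its negation.
    return all(comp[i] != comp[-i] for i in range(1, num_vars + 1))

print(two_sat(2, [(1, 2), (-1, 2), (-1, -2)]))  # True  (x1=F, x2=T works)
print(two_sat(1, [(1, 1), (-1, -1)]))           # False (forces x1 and NOT x1)
```

Both passes are linear in the size of the implication graph, so the whole procedure runs in time linear in the formula, which is why a 3CNF-to-2CNF reduction would collapse P and NP.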
So 3CNF is as simple as the satisfiability question gets, and it admits some very clean and easy-to-visualize reductions that
allow us to show that other problems are NP-complete. We'll go over two of these: first, the reduction to Independent Set (or Clique) and
then to Subset Sum.

3CNF INDSET - (Udacity, Youtube)

At the beginning of the lesson, we promised that we could link up these two chains and use the transitivity of reductions to show
that Vertex Cover and Independent Set are NP-complete. We now turn to the last link in this chain and will reduce 3CNF-SAT
to Independent Set. Since we've already argued that these problems are in NP, that will complete the proofs.
Here then is the transformation, or reduction, we need to accomplish. The input is a 3CNF formula with m clauses, and we need
to output a graph that has an independent set of size m if and only if the input is satisfiable.
We'll illustrate on this example.
(a ∨ b ∨ c) ∧ (¬b ∨ c ∨ d) ∧ (a ∨ ¬b ∨ ¬c)

For each literal in the formula, we create a vertex in a graph. Then we add edges between all vertices that came from the same
clause. We'll refer to these as the within-clause edges, or simply the clause edges. Then we add edges between all literals that
are contradictory. We'll refer to these as the between-clause edges, or the contradiction edges.

The implementation of this transformation is simple enough to imagine, and I'll leave it to you to convince yourself that it can be
done in polynomial time.

Which Edges Don't Belong - (Udacity)

Here is a question to make sure you understand the reduction just given. Consider the formula below and the associated graph.
Indicate the edges that would NOT have been output by the transformation just described.
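The graph construction just described can be sketched in a few lines. Literals are encoded as nonzero integers (+v / -v), and vertices are named by (clause, position) pairs; both are encoding assumptions made for illustration.

```python
# Build the 3CNF -> Independent Set graph: one vertex per literal occurrence,
# clause edges within each clause, contradiction edges between complementary
# literals in different clauses.

from itertools import combinations

def formula_to_graph(clauses):
    vertices = [(c, pos) for c, clause in enumerate(clauses)
                         for pos in range(len(clause))]
    edges = set()
    for c, clause in enumerate(clauses):
        for u, w in combinations(range(len(clause)), 2):
            edges.add(((c, u), (c, w)))               # within-clause edges
    for (c1, p1) in vertices:
        for (c2, p2) in vertices:
            if c1 < c2 and clauses[c1][p1] == -clauses[c2][p2]:
                edges.add(((c1, p1), (c2, p2)))       # contradiction edges
    return vertices, edges

# (a OR b OR c) AND (NOT b OR c OR d) AND (a OR NOT b OR NOT c),
# with a, b, c, d numbered 1..4:
clauses = [[1, 2, 3], [-2, 3, 4], [1, -2, -3]]
vertices, edges = formula_to_graph(clauses)
print(len(vertices), len(edges))  # 9 13
```

The formula is satisfiable exactly when this graph has an independent set of size m = 3, which is the equivalence proved next.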

Proof that 3CNF ≤P INDSET - (Udacity, Youtube)

Next, we are going to prove that the transformation just described does in fact reduce 3CNF satisfiability to Independent Set.
We'll start by arguing that if f is satisfiable, then the graph G that is output has an independent set of size m, the number of
clauses in f. Let t be a satisfying assignment. In our example, let's take the one that makes a true, b false, c false and d false,
and we'll set the complements accordingly. Then we choose one literal from each clause of the formula that is true under the
truth assignment to form our set S. Thus, in our example I might choose these literals, as indicated by the circled T's in the
figure below.

Clearly, the size of the set is m, the number of clauses. Because the vertices come from distinct clauses, there can't be any
within-clause edges, and because the truth assignment t doesn't contradict itself, there can't be any contradiction or
between-clause edges either. Therefore, S must be an independent set.
Let's prove the other direction next: if the graph G output by the proposed reduction has an independent set of size m, then f is
satisfiable. We start with an independent set of size m in the graph. Here, I have marked an independent set in our example
graph.

The fact that there can be no within-clause edges in S implies that the literals in S must come from distinct clauses. The fact
that there can be no between-clause edges implies that the literals in S are non-contradictory. Therefore, any assignment
consistent with the literals of S will satisfy f. Here, the choice of literals implies that a must be true and that b and c must be
false, but d can be set to be true or false, and we still have a satisfying assignment.
So that completes the proof that Independent Set is as hard as 3CNF, and that completes the chain

CNF ≤P 3-CNF ≤P Independent Set ≤P Vertex Cover,

showing that both Independent Set and Vertex Cover are NP-complete.
Next, we are going to branch out, both in the tree here and in the types of arguments we'll make, by considering the Subset Sum
problem.

Subset Sum - (Udacity, Youtube)


Before reducing 3CNF to Subset Sum, we have to define the Subset Sum problem. We are given a multiset a1, …, am,
where each element is a positive integer. A multiset, by the way, is just a set that allows the same element to be included
more than once. Thus, a multiset might contain three 5s, two 20s and just one 31, for example. We are also given a number k. The
problem is to decide whether there is a subset of this multiset whose members add up to k.
One instance of this problem is partitioning, and I'll use this particular example to illustrate.

Imagine that you wanted to divide assets evenly between two parties: maybe we're picking teams on the playground, trying to
reach a divorce settlement, dividing spoils among the victors of war, or something of that nature. Then the question becomes:
is there a way to partition the set so that things come out even, each side getting exactly half?
In this case, the total is 18, so we can choose k to be equal to 9 and then ask if there is a way to get a subset to sum to 9.
Indeed, there is. We can choose the two 1s and the 7, and that will give us 9. Choosing 4 and 5 would work too, of course.
That's the Subset Sum problem, then. Note that the problem is in NP, because given a subset, it takes polynomial time to add up
the elements and see if the sum equals k. Finding the right subset, however, seems much harder. We don't know of a way to do
it that is necessarily faster than just trying all 2^m subsets.

A Subset Sum Algorithm - (Udacity)


We are going to show that Subset Sum is NP-complete, but here is a simple algorithm that solves it. W is a two-dimensional array of booleans, and W[i][j] indicates whether there is a subset of the first i elements that sums to j. There are only two ways that this can be true: either there is a way to get j using only elements 1 through i - 1, or there is a way to get j - a_i using the first i - 1 elements, and then we include a_i to get j.
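In Python, the table-filling idea might be sketched like this (an illustrative sketch, not the lecture's own code; the function name is ours):

```python
def subset_sum(a, k):
    """Decide whether some subset of the multiset a sums to k.

    W[i][j] is True iff some subset of the first i elements sums to j.
    """
    m = len(a)
    W = [[False] * (k + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        W[i][0] = True                      # the empty subset sums to 0
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            # Either j is reachable without a[i-1], or we include a[i-1].
            W[i][j] = W[i - 1][j] or (j >= a[i - 1] and W[i - 1][j - a[i - 1]])
    return W[m][k]
```

For the partition example above, subset_sum([1, 1, 4, 5, 7], 9) returns True. Note that the table has (m+1)(k+1) entries, and k can be exponential in the length of its binary encoding, which is why an algorithm like this does not contradict NP-completeness.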

For now, I just want you to give the tightest valid bound on the running time for a Random Access Machine.

3-CNF to Subset Sum - (Udacity, Youtube)


Here is the reduction of 3-CNF SAT to Subset Sum. We'll illustrate the reduction with an example, because writing out the transformation in its full generality can get a little messy. Even with just an example, the motivation for the intermediate steps might not be clear at the time. It should all become clear at the end, however.
We're going to create a table with columns for the variables and columns for the clauses. The rows of the table are going to be numbers in our subset, and we'll have one column represent the ones place, another the tens place, and so forth.

The collection of numbers will consist of two numbers for every variable in the formula: one that we include when t(x_i) is true (we'll notate those with y) and another that we include when it is false (we'll notate those with z).
In the end, we want to include either y_i or z_i for every i, since we have to assign the variable x_i one way or the other. To get a satisfying assignment, we also have to satisfy each clause, so we'll want the numbers y_i and z_i to reflect which clauses are satisfied as well. The number y_1 sets x_1 to true, so we put a 1 in that column to indicate that choosing y_1 means assigning x_1. Since this assigns x_1 to be true, it also means satisfying clauses 1 and 3 in our example.

Choosing z_1 also would assign the variable x_1 a value, but it wouldn't satisfy any clauses. Therefore, that row is the same as for y_1 except in the clause columns, which are all set to zero.

We do the analogous procedure for y_2 and z_2: the literal x_2 appears in clause 1, and the negated literal of x_2 appears in clauses 2 and 3.
We can do the same for variables x_3 and x_4. These then are the numbers that we want to include in our set A.
It remains to choose our desired total k. For each of the variable columns the desired total is exactly 1: we assign each variable one way or the other. For the clause columns, however, the total just has to be greater than 0; we just need one literal in the clause to be true in order for the clause to be satisfied. Unfortunately, that doesn't yield a specific k that we need to sum to.
The trick is to add more numbers to the table.

These all have zeros in the places corresponding to the variables, and exactly one 1 in a column for a clause. Each clause j gets two numbers that have a 1 in the jth column. We'll call them g_j and h_j.
This allows us to set the desired digit to 3 in the clause columns. Given a satisfying assignment, the corresponding choice of y and z numbers will have at least a 1 in all the clause columns, but no more than 3. All the 1s and 2s can be boosted up to 3 by including the g and h numbers. Note that if some clause is unsatisfied, then including the g and h numbers isn't enough to bring the total to 3, because there are only two of them.
That's the construction. For each variable, the set of numbers to choose from includes two numbers y and z which correspond to the truth setting, and for each clause it includes g and h so that we can boost up the total in the clause column to three where needed.
This construction just involves a few for-loops, so it's easy to see that the construction of the set of numbers is polynomial time.
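As a sanity check, here is one way those for-loops might look (our own sketch; the digit layout and the clause representation are assumptions for illustration, with each clause given as a list of signed variable indices):

```python
def reduce_3cnf_to_subset_sum(n, clauses):
    """Build (A, k) for a formula with variables 1..n.

    Each number has one base-10 digit per variable and per clause
    (variable digits first). Clauses are lists of nonzero ints:
    +i stands for the literal x_i, -i for its negation.
    """
    m = len(clauses)

    def number(var_index, clause_indices):
        digits = [0] * (n + m)
        if var_index is not None:
            digits[var_index] = 1
        for j in clause_indices:
            digits[n + j] = 1
        return int("".join(map(str, digits)))

    A = []
    for i in range(1, n + 1):
        # y_i: setting x_i true satisfies the clauses containing x_i.
        A.append(number(i - 1, [j for j, c in enumerate(clauses) if i in c]))
        # z_i: setting x_i false satisfies the clauses containing not-x_i.
        A.append(number(i - 1, [j for j, c in enumerate(clauses) if -i in c]))
    for j in range(m):
        A.append(number(None, [j]))   # g_j
        A.append(number(None, [j]))   # h_j
    # Target: a 1 in every variable column and a 3 in every clause column.
    k = int("1" * n + "3" * m)
    return A, k
```

For instance, with n = 2 and the single clause containing x_1 and x_2, the numbers are 101, 100, 11, 10, 1, 1 and k = 113; choosing 101 (y_1), 11 (y_2) and one slack number sums to 113.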

Proof that 3CNF reduces to SUBSET SUM - (Udacity, Youtube)


Next, we prove that this reduction works. Let f be a 3CNF formula with n variables and m clauses. Let A be the multiset over the positive integers and k be the total number output by the transformation.

First, we show that if f is satisfiable then there is a subset of A summing to k. Let t be the satisfying assignment. Then we include y_i in S if x_i is true under t, and we'll include z_i otherwise. As for the g and h families of numbers, if there are fewer than three literals of clause j satisfied under t, then include g_j. If there are fewer than two, then include h_j as well. In total, the sum of these numbers must be equal to k.
In the other direction, we argue that if there is a subset of A summing to k, then there is a satisfying assignment. Suppose that S is a subset of A summing to k. Then the impossibility of any carrying of the digits in any sum implies that exactly one of y_i or z_i must have been included. Therefore, we can define a truth assignment t where t(x_i) is true if y_i is included in S and false otherwise. This must satisfy every clause; otherwise, there would be no way that the total in the clause places could be 3.
Altogether, we've seen that Subset Sum is in NP and we can reduce 3-CNF SAT, an NP-complete problem, to it, so Subset Sum must be NP-complete.

Other Bases - (Udacity)


Here is an exercise to encourage you to consider carefully the reduction just given. We interpreted the rows of the table we
constructed as the base 10 representation of the collection of numbers that the reduction output. What other bases would have
worked as well? Check all that apply.

Conclusion - (Udacity, Youtube)


So far, we've built up a healthy collection of NP-complete problems, but given that there are thousands of problems known to be NP-complete, we've only scratched the surface of what is known. In fact, we haven't even come close to Karp's mark of 21 problems from his 1972 paper.
If you want to go on and extend the set of problems that you can prove to be NP-complete, you might consider reducing subset sum to the Knapsack problem, where one has a fixed capacity for carrying stuff and wants to pack the largest subset of items that will fit.
Another classic problem is that of the Traveling Salesman. He has a list of cities that he wants to visit, and he wants to know the order that will minimize the distance that he has to travel. One can prove that this problem is NP-complete by first reducing Vertex Cover to the Hamiltonian Cycle problem (which asks if there is a cycle in a graph that visits each vertex exactly once) and then Hamiltonian Cycle to Traveling Salesman.
Another classic problem that is used to prove other problems are NP-complete is 3D-matching. 2D matching can be thought of as the problem of making as many compatible couples as possible from a set of people. 3D matching extends this problem by further matching them with a home that they would enjoy living in together.

Of course, there are many others. The point of this lesson, however, is not so that you can produce the needed chain of
reductions for every problem known to be NP-complete. Rather, it is to give you a sense for what these arguments look like and
how you might go about making such an argument for a problem that is of particular interest to you.

The reductions we've given as examples are a fine start, but if you want to go further in understanding how to use complexity to understand real-world problems, take a look at the classic text by Garey and Johnson on Computers and Intractability. Even though it is from the same decade as the original Cook-Levin result and Karp's list of 21 NP-complete problems, it is still one of the best. As Lance says, a good computer scientist shouldn't leave home without it.
Dynamic Programming - (Udacity)
Intro to Algorithms - (Udacity, Youtube)
In this and the remaining lessons for the course, we turn our attention to the study of polynomial algorithms. First, however, a
little perspective.

In the beginning of the course, we defined the general notion of a language. Then we began to distinguish between decidable languages and undecidable ones. Remember that there were uncountably many undecidable ones but only countably many decidable ones. In comparison, the decidable ones should be infinitesimally small, but we'll give the decidable ones this big circle anyway because they are so interesting to us.
In the section of the course on Complexity, we considered several subclasses of the decidable languages:
- P, which consisted of those languages decidable in polynomial time,
- NP, which consisted of the languages that could be verified in polynomial time (represented with the purple ellipse here, and that includes P),
- and then we distinguished certain problems in NP which were the hardest, and we called these NP-complete. We visualize these in this green band at the outside of NP, since if any one of these were in P, then P would expand and swallow all of NP (or equivalently, NP would collapse to P).
In this section, we are going to focus on the class P, the set of polynomially decidable languages. The overall tone here will feel a little more optimistic. In the previous sections of the course, many of our results were of a negative nature: No, you can't decide that language, or No, that problem isn't tractable unless P=NP. Here, the results will be entirely positive. No longer will we be excluding problems from a good class; we will just be showing that problems are in the good class P: Yes, you can do that in polynomial time, or It's even a low order polynomial and we can solve huge instances in practice.
It certainly would be sad if this class P didn't contain anything interesting. One rather boring language in P, for instance, is the language of lists of numbers no greater than 5. Thankfully, however, P contains some much more interesting languages, and we use the associated algorithms every day to do things like sort email, find a route to a new restaurant, or apply filters to enhance the photographs we take. The study of these algorithms is very rich, often elegant, and ever changing, as computer scientists all over the world are constantly finding new strategies and expanding the set of languages that we know to be in P. There are non-polynomial algorithms too, but since computer scientists are most concerned with practical solutions that scale, they tend to focus on polynomial ones, as will we.
In contrast to the study of computability and complexity, the study of algorithms is not confined to a few unifying themes and does not admit a clear overall structure. Of course, there is some structure in the sense that many various real-world applications can be viewed as variations on the same abstract problem of the type we will consider here. And there are general patterns of thinking, or design techniques, that are often useful. We will discuss a few of these. And sometimes even abstract problems are intimately related, when on the surface they may not seem similar. We will see this happen too. Yet, despite these connections, in large part, problems tend to demand that they be solved by their own algorithm that involves a particular ingenuity, at least if they are to yield the best running time.
If the topic is so varied, how then can a student understand algorithms in any general way, or become better at finding algorithms for his own problems? There are two good, complementary ways.
First, and most important, is practice. The more problems that you try to solve efficiently, the better you will become at it and the more perspective you will gain.
Second is to study the classic and most elegant algorithms in detail. Here, we will do this for a few problems often not covered in undergraduate courses.

Even better is if you can combine the two approaches. Don't just follow along with pen and paper. Pause and see if you can anticipate the next step, and work ahead both on the algorithm and the analysis. Keep that advice in mind throughout our discussion of algorithms.

Introduction - (Udacity, Youtube)


Our first lesson is not on one algorithm in particular but on a design technique called dynamic programming. This belongs in the same mental category as divide and conquer: it's not an algorithm in itself because it doesn't provide fixed steps to follow; rather, it's a general strategy for thinking about problems that often produces an efficient algorithm.
The term dynamic programming, by the way, is not very descriptive. The word programming is used in an older sense of the word that refers to any tabular method, and the word dynamic was used just to make the technique sound exciting.
With those apologies for the name, let's dive into the content.

Optimal Similar Substructure - (Udacity, Youtube)


The absolutely indispensable element of dynamic programming is what we'll call an optimal similar substructure (often simply called optimal substructure). By this, I mean that we have some hard problem that we want to solve, and we think to ourselves, oh, if only we had the answer to these two similar, smaller subproblems, this would be easy. Because the subproblems are similar to the original, we can keep playing this game, letting the problems get smaller and smaller until we've reached a trivial case.

Well, at first, this feels like an ideal case for recursion. Since the subproblems are similar, perhaps we can use the same code and just change the parameters. Starting from the hard problem on the right, we could recurse back to the left. The difficulty is that many of the recursive paths visit the same nodes, meaning that such a strategy would involve wasteful recomputation.
This is sometimes called one of the perils of recursion, and it's often illustrated to beginning programmers with the example of computing the Fibonacci sequence. Each number in the Fibonacci sequence is the sum of the previous two, with the first two numbers both being 1 to get us started. That is,

F_0 = 1, F_1 = 1, and F_n = F_{n-1} + F_{n-2}.

This hard problem of computing the nth number in the sequence depends on solving the slightly easier problems of computing the (n-1)th and the (n-2)th elements in the sequence.
Computing the (n-1)th element depends on knowing the answers for n-2 and n-3. Computing the (n-2)th depends on knowing the answers for n-3 and n-4, and so forth.

Thinking about how the recursion will operate, notice that we'll need to compute F_{n-2} once for F_n and once for F_{n-1}, so there's going to be some repeated computation here, and it's going to get worse the further to the left we go.
How bad does it get? The top level problem of computing the nth number will only be called once, and it will call the problem of computing the (n-1)th number once.
Computing n-2 needs to happen once for each time that the two computations that depend on it are called: once for n-1 and once for n.
Similarly, computing n-3 needs to happen once for every time that the problems that depend on it are called, so it gets called two plus one for a total of three times.
Notice that each number here is the sum of the two numbers to the right, so this is the Fibonacci sequence, and the number of times that the (n-k)th number is computed will be equal to the kth Fibonacci number, which is roughly the golden ratio raised to the kth power. The recursive strategy is exponential!

There are two ways to cope with this problem of repeated computation.
One is to memoize the answers to the subproblems. After we solve it the first time, we write ourselves a memo with the answer,
and before we actually do the work of solving a subproblem we always check our wall of memos to see if we have the answer
already.
Alternatively, we can solve the subproblems in the right order, so that anytime we want to solve one of the problems we are sure
that we have solved its subproblems already.

The latter can always be done, because the dependency relationships among the subproblems must form a directed acyclic graph. If there were a cycle, then we would have circular dependencies and recursion wouldn't work either. We just find an appropriate ordering of the subproblems, so that all the edges go from left to right, and then we solve the subproblems in left-to-right order. This is the approach we'll take for this lesson. It tends to expose more of the underlying nature of the problem and to create faster implementations than using recursion and memoizing.
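Both coping strategies are easy to see for Fibonacci (a small illustrative sketch, using the lesson's F_0 = F_1 = 1 convention):

```python
from functools import lru_cache

# Approach 1: memoize -- cache each answer the first time we compute it.
@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return 1
    return fib_memo(n - 1) + fib_memo(n - 2)

# Approach 2: solve the subproblems in dependency order, left to right.
def fib_bottom_up(n):
    f = [1] * (n + 1)
    for i in range(2, n + 1):
        f[i] = f[i - 1] + f[i - 2]   # each value depends only on earlier ones
    return f[n]
```

Either way, computing F_n now takes a linear number of additions instead of an exponential number of calls.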

Edit Distance Problem - (Udacity, Youtube)


The first problem we'll consider is the problem of sequence alignment, or edit distance. This often comes up in genetics or as part of spell-checkers.
For example, suppose that we are given two genetic sequences and we want to line them up as best as possible, minimizing the insertions, deletions and changes to the characters that would be needed to turn one into the other; and we'll do this according to some cost function that represents how likely these changes are to have occurred in nature.

In general, we say that we are given two sequences X and Y over some alphabet, and we want to find an alignment between them, that is, a subset of the pairs between the sequences so that each letter appears in only one pair and the pairs don't cross, like in the example above.
For completeness, we give the formal definition of an alignment, though for our purposes the intuition given above is sufficient.
An alignment is a subset A of {1, ..., m} x {1, ..., n} such that for any two distinct pairs (i, j) and (i', j') in A, i differs from i' and j differs from j' (each letter appears in at most one pair), and i < i' implies j < j' (the pairs don't cross).

The cost of an alignment comes from two sources:

- One is the number of unmatched characters, which we can write as n + m - 2|A|. This corresponds to the number of insertions or deletions, or matching with the dash character in our figures.
- The other part of the cost comes from matching two of the characters, α(x_i, y_j). Typically, this is zero when the two characters are the same.

In total,

c(A) = n + m - 2|A| + Σ_{(i,j) in A} α(x_i, y_j).

The problem then is to find an alignment of the sequences that minimizes this cost.

Sequence Alignment Cost - (Udacity)


To make sure we understand this cost function, I want you to suppose that the function alpha is zero if the two characters are
the same and 1 otherwise, and to calculate the cost of the alignment given below.

Prefix Substructure - (Udacity, Youtube)


The key optimal substructure for the sequence alignment problem comes from aligning prefixes of the sequences that we want to align. We define c(i, j) to be the minimum cost of aligning the first i characters of X with the first j characters of Y. Since X has m characters and Y has n, our overall goal is to compute c(m, n) and the alignment that achieves it.
Let's consider the problem of calculating c(m, n). There are three cases to consider.

First, suppose that we match the last two characters of the sequences together. Then the cost would be the minimum cost of aligning the prefix through m-1 of X and the prefix through n-1 of Y, plus the cost associated with matching the last two characters together.
Another possibility is that we leave the last character of the X sequence unmatched. Then the cost would be the minimum cost of aligning the prefix through m-1 of X and the prefix through n of Y, plus 1 for leaving X_m unmatched.
And the last case, where we leave Y_n unmatched instead, is analogous.
Of course, since c(m, n) is defined to be the minimum cost of aligning the sequences, it must be the minimum of these three possibilities. Notice, however, that there was nothing special about the fact that we were using m and n here, and the same argument applies to all combinations of i, j. The problems are similar. Thus, in general,

c(i, j) = min{ c(i-1, j-1) + α(X_i, Y_j), c(i-1, j) + 1, c(i, j-1) + 1 }.


Including the base cases where the cost of aligning a sequence with the empty sequence is just the length of the sequence,

c(0, j) = j and c(i, 0) = i


we have a recursive formulation: the optimal solution expressed in terms of optimal solutions to similar subproblems.

Sequence Alignment Algorithm - (Udacity, Youtube)


As you might imagine, a straightforward recursive implementation would involve an unfortunate amount of recomputation, so we'll look for a better solution. Notice that it is natural to organize our subproblems in a grid like this one.

Knowing C(3,3), the cost of aligning the full sequences, depends on knowing C(3,2), C(2,2) and C(2,3).
Knowing C(2,3) depends on knowing C(2,2), C(1,2), and C(1,3), and indeed, in general, to figure out any cost, we need to know the costs to the north, west and northwest in this grid.
These dependencies form a directed acyclic graph, and we can linearize them in a number of ways. We might go in scanline order, left to right, or even along the diagonals. We'll keep it simple and do a scanline order. First, we need to fill in the answers for the base cases, and then it's just a matter of passing over the grid and taking the minimum of the three possibilities outlined earlier.
Once we've finished, we can retrace our steps by looking at the west, north, and northwest neighbors and figuring out what step we took, and that will allow us to reconstruct the alignment.
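The scanline fill might look like this in Python (our illustrative sketch; alpha is the character-matching cost function, and reconstruction of the alignment itself is omitted):

```python
def alignment_cost(x, y, alpha):
    """Minimum cost of aligning sequence x with sequence y.

    c[i][j] is the minimum cost of aligning the first i characters of x
    with the first j characters of y; unmatched characters cost 1 each.
    """
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        c[i][0] = i                 # base case: align against the empty string
    for j in range(n + 1):
        c[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            c[i][j] = min(c[i - 1][j - 1] + alpha(x[i - 1], y[j - 1]),  # match the pair
                          c[i - 1][j] + 1,                              # leave x_i unmatched
                          c[i][j - 1] + 1)                              # leave y_j unmatched
    return c[m][n]
```

With alpha returning 0 for equal characters and 1 otherwise, this computes the classic edit distance.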

Sequence Alignment Exercise - (Udacity)


Let's practice computing these costs and extracting the alignment with an exercise. We'll align snowy with sunny, and we'll use an alpha value with a penalty of 1.5 for flipping a character. This makes flipping a character more expensive than an insertion or deletion.
Compute the minimum cost and write out a minimum cost alignment here, using a single dash to indicate where a character isn't matched.

Sequence Alignment Summary - (Udacity, Youtube)


To sum up the application of dynamic programming to the sequence alignment problem, recall that we gave an algorithm which, given two sequences of lengths m and n, found a minimum cost alignment in O(mn) time. We did a constant amount of work for each element in the grid.

Dynamic programming always relies on the existence of some optimal similar substructure. In this case, it was the minimum cost of aligning prefixes of the sequences that we wanted to align. The key recurrence was that the cost of aligning the first i characters of one sequence with the first j of the other was the minimum of the cost of matching the last characters, the cost of leaving the last character of the first sequence unmatched, and the cost of leaving the last character of the second sequence unmatched.

Chain Matrix Multiplication - (Udacity, Youtube)


Now we turn to another problem that we'll be able to tackle with the dynamic programming paradigm, Chain Matrix Multiplication. We are given a sequence of matrices of sizes m_0 x m_1, m_1 x m_2, etc., and we want to compute their product efficiently. Note how the dimensions here are arranged so that such a product is always defined: the number of columns in one matrix always matches the number of rows in the next.
First, recall that the product of an m x n matrix with an n x p matrix produces an m x p matrix, and that the cost of computing each entry is n. Each entry can be thought of as the vector dot product between the corresponding row of the first matrix and the column of the second. Both vectors have n entries. In total, we have about mnp additions and as many multiplications.

Next, recall that matrix multiplication is associative. Thus, as far as the answer goes, it doesn't matter whether we multiply A and B first and then multiply that product by C, or if we multiply B and C first and then multiply A by that product:

(AB)C = A(BC)
The product will be the same, but the number of operations may not be.
Let's see this with an example. Consider the product of a 50 x 20 matrix A with a 20 x 50 matrix B with a 50 x 1 matrix C, and let's count the number of multiplications using both strategies.

If we multiply A and B first, then we spend 50 * 20 * 50 multiplications on computing AB. This matrix will have 50 rows and 50 columns, so its product with C will take 50 * 50 * 1 multiplications, for a total of

50 * 20 * 50 + 50 * 50 * 1 = 52,500.

On the other hand, if we multiply B and C first, this costs 20 * 50 * 1 multiplications. This produces a 20 x 1 matrix, and multiplying it by A costs 50 * 20 * 1, for a total of only

20 * 50 * 1 + 50 * 20 * 1 = 2,000,

a big improvement. So, it's important to be clever about the order we choose for multiplying these matrices.
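The arithmetic above is easy to check (a trivial sketch, using the dimensions from the example):

```python
def mult_cost(m, n, p):
    """Scalar multiplications needed for an (m x n) times (n x p) product."""
    return m * n * p

# A is 50 x 20, B is 20 x 50, C is 50 x 1.
cost_ab_first = mult_cost(50, 20, 50) + mult_cost(50, 50, 1)  # (AB)C
cost_bc_first = mult_cost(20, 50, 1) + mult_cost(50, 20, 1)   # A(BC)
```

cost_ab_first comes to 52,500, while cost_bc_first is only 2,000.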

Subchain Substructure - (Udacity, Youtube)


In dynamic programming, we always look for optimal substructure and try to find a recursive solution first.
We can gain some insight into the substructure by writing out the expression trees for various ways of placing parentheses.
Consider these examples.

This suggests that we try all possible binary trees and pick the one that has the smallest computational cost.
Starting from the top level, we would consider all n - 1 ways of partitioning into left and right subtrees, and for each one of these possible partitions, we would figure out the cost of computing the subtrees and then multiplying them together.

To figure out the cost of the subtrees themselves, we would need to consider all possible partitions into left and right subtrees, and so forth.
More precisely, let C(i, j) be the minimum cost of multiplying A_i through A_j. For each way of partitioning this range, the cost is the minimum cost of computing each subtree, plus the cost of combining them, and we just want to take the minimum over all such partitions:

C(i, j) = min over k with i <= k < j of { C(i, k) + C(k+1, j) + m_{i-1} m_k m_j }.

The base case is the product of just one matrix, which doesn't cost anything:

C(i, i) = 0.

CMM Algorithm - (Udacity, Youtube)


Now that we have this recursive formulation, we are ready to develop the algorithm. Let's convince ourselves first that the recursive formulation would indeed involve some needless recomputation.
Take the two top-level partitions of ABCDE: on the one hand (ABC) and (DE), and on the other (ABCD) and E.

Clearly, we will have to compute the minimum cost of multiplying (ABC) in the left problem. But we are going to have to compute it on the right as well, since we need to consider pulling D off from the (ABCD) chain too. We end up re-doing all of the work involved in figuring out how to best multiply ABC over again. As we go deeper in the tree, things get recomputed more and more times.
Fortunately for us, there are only n choose 2 subproblems, so we can create a table and do the subproblems in the right order.

The entries along the diagonal are base cases, which have cost zero: a product of one matrix doesn't cost anything. Our goal is to compute the value in the upper right corner, C(1, n).
Consider which subproblems this depends on. When the split represented by k is k = 1, we are considering the costs C(1, 1) and C(2, n). When k = 2, we consider the problems C(1, 2) and C(3, n). In general, every entry depends on all the entries to the left and below it in the table.
There are many ways to linearize this directed acyclic graph of dependencies, but the most elegant is to fill in the diagonals going towards the upper right corner. In the code, we have let s indicate which diagonal we are working on. The last step, of course, is just to return this final cost.
The binary tree that produced this cost can be reconstructed from the values of k that yielded these minimum values. We just need to remember the split we used for each entry in the table.
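Filling in the diagonals might be coded like this (an illustrative sketch; m is the list of dimensions m_0, ..., m_n, and the split table records the k that achieved each minimum):

```python
def chain_matrix_cost(m):
    """Minimum scalar multiplications for A_1 ... A_n, where A_i is m[i-1] x m[i].

    C[i][j] holds the minimum cost of multiplying A_i through A_j.
    """
    n = len(m) - 1
    C = [[0] * (n + 1) for _ in range(n + 1)]      # diagonal C[i][i] = 0
    split = [[0] * (n + 1) for _ in range(n + 1)]
    for s in range(1, n):                          # s is the diagonal j - i
        for i in range(1, n - s + 1):
            j = i + s
            C[i][j] = float("inf")
            for k in range(i, j):                  # try every top-level partition
                cost = C[i][k] + C[k + 1][j] + m[i - 1] * m[k] * m[j]
                if cost < C[i][j]:
                    C[i][j] = cost
                    split[i][j] = k                # remember the best split
    return C[1][n], split
```

For the earlier ABC example, chain_matrix_cost([50, 20, 50, 1]) reports a minimum of 2000, splitting at k = 1, i.e. A(BC).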

CMM Exercise - (Udacity)


Let's do an exercise to practice this chain matrix multiplication algorithm. Consider matrices A, B, C, D with dimensions given below. Use the dynamic programming algorithm to fill out the table to the right, and then place the parentheses in the proper place below to minimize the number of scalar multiplications. Don't use more parentheses than necessary in your answer.

CMM Summary - (Udacity, Youtube)


Let's summarize what we've learned about the Chain Matrix Multiplication problem. There is an algorithm which, given the dimensions of the n matrices, finds an expression tree that computes the product with the minimum number of multiplications in O(n^3) time. Recall that there were order n^2 entries in the table, and we spent order n time figuring out what value to write in each one.

The optimal similar substructure that we exploited was the minimum cost of evaluating the subchains, the key recurrence saying that the cost of each partition is the cost of evaluating each part plus the cost of combining them, and of course, we want to take the minimum cost over all such partitions.

All Pairs Shortest Path - (Udacity, Youtube)


Now, we turn to our last example of dynamic programming and consider the problem of finding shortest paths between all pairs of vertices in a graph.
If you are rusty on ideas like paths in graphs and Dijkstra's algorithm, it might help to review them before proceeding with this part of the lesson.
Note that the idea of shortest is in terms of weights or costs on the edges. These might be negative, so we don't refer to them as lengths or distances, to avoid confusion, but we retain the word shortest with regard to the path instead of saying lightest or cheapest. Unfortunately, the use of this mixed metaphor is standard.
Here is the formal statement of the problem. Given a graph and a function over the edges assigning them weights, we want to find the shortest (i.e. the minimum weight) path between every pair of vertices.
Recall that for the single source problem, where we want to figure out the shortest path from one vertex to all others, we can use Dijkstra's Algorithm, which takes O(V log V + E) time when used with a Fibonacci heap. But this algorithm requires that the weights all be non-negative, an assumption that we don't want to make.

For graphs with negative weights, the standard single source solution is the Bellman-Ford algorithm, which takes time O(VE).
Now, we can run these algorithms multiple times, once with each vertex as the source. If we visualize the problem of finding the shortest path between all pairs as filling out a table, each run of Dijkstra or Bellman-Ford fills out one row of the table. We are running the algorithm V times, so we just add a factor of V to the running times.

For the case where we have negative weights on the edges and the graph is dense, we are looking at a time that is V to the fourth power. Using dynamic programming, we're going to be able to do better.

Shortest Path Substructure - (Udacity, Youtube)


Since we are using dynamic programming, the first thing we look for is some optimal similar substructure. Recall that the key realization in understanding single source shortest path algorithms like Dijkstra and Bellman-Ford was that subpaths of shortest paths are shortest paths. So if the shortest path between u and v happens to go through two vertices x and y, then the subpath between x and y must be a shortest path. If there were a shorter one, then it could replace the subpath in the path from u to v to produce a shorter u-v path. This type of argument is sometimes called cut and paste, for obvious reasons.

By the way, throughout I'll use squiggly lines to indicate a path between two vertices and a straight line to indicate a single edge.
Unfortunately, by itself this substructure is not enough.
Sure, we might argue that a shortest path from u to v takes a shortest path from u to a neighbor of v first, but how do we find those shortest paths? The subproblems end up having circular dependencies.
One idea is to include the notion of path length, defined by the number of edges used. If we knew all shortest paths that used only k - 1 edges, then by examining the neighbors of v, we could figure out a shortest path with k edges to v. We'll let δ(u, v, k) be the weight of the shortest path that uses at most k edges. Then the recurrence is that δ(u, v, k) is the minimum of

- δ(u, v, k - 1), where we don't use that potential kth edge, and
- the minimum over neighbors x of v of δ(u, x, k - 1) + w(x, v), that is, the distance from u to a neighbor of v using only k - 1 edges, plus the weight of that last step.

This strategy works, and it yields the matrix multiplication shortest paths algorithm that runs in time O(V^3 log V). See CLRS for details.

We are going to take a different approach that will yield a slightly faster algorithm and allow us to remove that log factor.
We're going to recurse on the set of potential intermediate vertices used in the shortest paths.
Without loss of generality, we'll assume that the vertices are 1 through n for convenience of notation, i.e. V = {1, …, n}.
Consider the last step of the algorithm, where we have allowed vertices 1 through n − 1 to be intermediate vertices and just
now, we are considering the effect of allowing n to be an intermediate vertex. Clearly, our choices are either using the old path,
or taking the shortest path from u to n and from n to v. In fact, this is the choice not just for n, but for any k. To get from u to v
using only intermediate vertices 1 through k, we can either use k, or not.

Therefore, we define δ(u, v, k) to be the minimum weight of a path from u to v using only 1 through k as intermediate vertices.
Then the recurrence becomes

δ(u, v, k) = min{ δ(u, v, k − 1), δ(u, k, k − 1) + δ(k, v, k − 1) }

In the base case, where no intermediate vertices are allowed, the weights on the edges provide all the needed information:

δ(u, v, 0) = w(u, v) if (u, v) ∈ E, and ∞ otherwise

The Floyd-Warshall Algorithm - (Udacity, Youtube)


With this recurrence defined, we are now ready for the Floyd-Warshall Algorithm. Note that if we were to simply implement this
with recursion, we would do a lot of recomputation: δ(u, k, k − 1) would be computed for every vertex v, and δ(k, v, k − 1)
would be computed for every vertex u.

As you might imagine, we are going to fill out a table. The subproblems have three indices, but thankfully only two will be
required for us. We'll create a two-dimensional table d indexed by the source and destination vertices of the path. We initialize it
for k = 0, where no intermediate vertices are allowed.

Then we allow the vertices one by one. For each source-destination pair, we update the weight of the shortest path, accounting
for the possibility of using k as an intermediate vertex. Note that when i or j is equal to k, the weight won't change, since a
vertex can't be useful in a shortest path to itself. Hence we don't need to worry about using an outdated value in the loop.
To extract not just the weights of the shortest paths but also the paths themselves, we keep a predecessor table that contains
the second-to-last vertex on a path from u to v.
Initially, when all paths are single edges, this is just the other vertex on the incident edge. In the update phase, we either leave
the predecessor alone if we aren't changing the path, or we switch it to the predecessor of the k-to-j path if the path via k is
preferable.
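The table-filling just described can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's exact pseudocode; the edge-dictionary representation and variable names are my own choices here. It maintains both the distance table d and the predecessor table.

```python
INF = float('inf')

def floyd_warshall(n, edges):
    """All-pairs shortest paths. `edges` maps (u, v) -> weight; vertices
    are 0..n-1. Returns (d, pred), where d[u][v] is the weight of a
    shortest u-v path and pred[u][v] is its second-to-last vertex."""
    # Base case k = 0: only direct edges, no intermediate vertices allowed.
    d = [[0 if u == v else edges.get((u, v), INF) for v in range(n)]
         for u in range(n)]
    pred = [[u if (u, v) in edges else None for v in range(n)]
            for u in range(n)]
    # Allow intermediate vertices one at a time.
    for k in range(n):
        for u in range(n):
            for v in range(n):
                if d[u][k] + d[k][v] < d[u][v]:
                    d[u][v] = d[u][k] + d[k][v]
                    # The path now ends the way the k-to-v path ends.
                    pred[u][v] = pred[k][v]
    return d, pred
```

For example, with edges {(0, 1): 3, (1, 2): 1, (0, 2): 10}, the direct 0-2 edge of weight 10 is replaced by the weight-4 path through vertex 1.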

Floyd-Warshall Algorithm Exercise - (Udacity)


To check our understanding of the Floyd-Warshall algorithm, let's do an iteration as an exercise. Suppose that the table d is in the
state below after the k = 3 iteration, in which vertex 3 was allowed to be an intermediate vertex along a path, and I want you to fill
in the table on the right with the values of d after the final iteration, where k = 4.

All Pairs Summary - (Udacity, Youtube)


Now, let's summarize what we've shown about the All-Pairs shortest path problem. The Floyd-Warshall Algorithm, which is
based on dynamic programming, finds the shortest path for all pairs of vertices in a weighted graph in time O(V³). Recall that
V times we have to fill out the whole table of size V².
The key optimal substructure came from considering optimal paths via a restricted set of vertices 1…k.
That gave us the recurrence, where the shortest path between two vertices either used the new intermediate vertex, or it didn't.

Transitive Closure - (Udacity, Youtube)


The Floyd-Warshall Algorithm has a neat connection to finding the transitive closure of mathematical relations. Consider a
relation R over a set A. That is to say, R is a subset of A × A. For example, A might represent sports that one can watch on
television, and the relation might be someone's viewing preferences.
Maybe we know some individual prefers the NBA to college basketball and to college football, and that he prefers college
football to pro football. Since a relation is just a collection of ordered pairs, it makes sense to represent them as a directed
graph.

Given these preferences, we would like to be able to infer that this individual prefers the NBA over the NFL as well.
In effect, if there is a path from one vertex to another, we would like to add a direct edge between them. In set theory, this is
called the transitive closure of a relation. Given what we know already, there is a fairly simple solution: just give each edge
weight 1 and run Floyd-Warshall! The distance will be infinity if there is no path, and it will be the minimum number of edges traversed
otherwise.
This is more information than we really need, however. We really just want to know whether there is a path, not how long it is.
Hence, in this context we use a slight variant where the entries in the table are all booleans, 0 or 1, instead of integers, but
otherwise, the algorithm is essentially the same.
We initialize the table so that entry i, j is 1 if (a_i, a_j) is in the relation and zero otherwise. Note that I'm letting a_1 through a_n be
the set A here. Then we keep on adding potential intermediary elements and updating the table accordingly. We have that i and
j are in the relation if either they are in the relation already or they are linked together by some a_k.

Often, we are interested not in the transitive closure of a relation but in the reflexive transitive closure. In this case, we just set
the diagonal elements of the relation to be true from the beginning.
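The boolean variant above (often called Warshall's algorithm) can be sketched in Python as follows. The sports example is encoded with hypothetical indices I chose for illustration: 0 = NBA, 1 = college basketball, 2 = college football, 3 = NFL.

```python
def transitive_closure(n, pairs, reflexive=False):
    """Boolean Floyd-Warshall over a relation on {0, ..., n-1}.
    `pairs` is a set of (i, j) tuples; returns the closure as a table."""
    t = [[(i, j) in pairs for j in range(n)] for i in range(n)]
    if reflexive:
        # For the reflexive transitive closure, set the diagonal up front.
        for i in range(n):
            t[i][i] = True
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # i relates to j if it did already, or via the new element k.
                t[i][j] = t[i][j] or (t[i][k] and t[k][j])
    return t
```

With pairs {(0, 1), (0, 2), (2, 3)}, the closure contains (0, 3): the NBA is preferred to the NFL via college football.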

Conclusion - (Udacity, Youtube)


There are many other applications of dynamic programming that we haven't touched on, but the three we covered provide a
good sample and help illustrate the general strategy. Perhaps the most important lesson is that any time you have found a
recursive solution to a problem, think carefully about what the subproblems are and whether there is overlap that you can exploit.
Note that this won't always work. Dynamic programming will not yield a polynomial algorithm for every problem with a
recursive formulation. Take satisfiability, for example. Pick a variable, set it to true, and eliminate the variable from the
formulation. We are left with another boolean formula, and if it's satisfiable then so was the original. The same is true if we set
the variable to false, and the original formula will have a satisfying assignment if either of the other two do. This is a perfectly
legitimate recursive formulation, yet there isn't enough overlap between the subproblems to create a polynomial algorithm, at least
not unless P is equal to NP.
In the next lesson, we'll examine the Fast Fourier Transform. This is much more a divide and conquer problem than it is a
dynamic programming one, though many of the same themes of identifying subproblems and re-using calculations will appear.
Keep this in mind as we study it.

Fast Fourier Transform - (Udacity)


Introduction - (Udacity, Youtube)
In this lesson, we will examine the Fast Fourier Transform and apply it in order to obtain an efficient algorithm for convolving two
sequences of numbers.
If you have seen the Fourier Transform before in the context of mathematics, physics, or engineering, this lesson may have a
different flavor from what you are used to. We won't be using it to solve differential equations or to characterize the behavior of
an electrical circuit. Instead, we will be focused on the much more mundane task of multiplying polynomials, and doing it
quickly.
This algorithmic aspect of the Fourier transform is actually almost as old as the Fourier Transform itself, appearing in an early
19th century paper by Gauss on interpolation. The transform itself, by the way, gets its name from Jean Baptiste Fourier's 1807
paper on the propagation of heat through solids. Gauss's trick seems to have been largely forgotten until Cooley and Tukey
published a paper on the Fast Fourier transform in 1965. Tukey was apparently somewhat reluctant to publish the paper,
because he thought it was a simple observation and the how-to questions of algorithms were still considered second-class at
the time. Well, much has changed since then. Their paper is now one of the most cited in the scientific literature, and the idea is
considered one of the most elegant in algorithm design.

Prerequisites - (Udacity, Youtube)


Before beginning this lesson, it might be worth brushing up on a few concepts so that you don't have to go back and review
them later. We will be using complex numbers, and in particular, their polar representation. Just a familiarity with the basics is
required here. We will also be using a little linear algebra, the ideas of matrix inverse and orthogonal matrices being the most
important. And since we will be using a divide and conquer strategy, it might be a good idea to review the Master Theorem
briefly.

Convolution - (Udacity, Youtube)


The Fast Fourier Transform is an instance of the Discrete Fourier Transform which, as we've said, has its own significance in
various branches of mathematics, physics, and engineering (signal processing most especially). In the study of algorithms,
however, the Fast Fourier Transform is most interesting for its role in a very practical and very fast way of convolving two
sequences of numbers.
We'll illustrate convolution by an example. We are given two sequences of numbers a and b as shown below, and we want to
obtain a new sequence defined by the formula

c_k = Σ_{i=0}^{k} a_i b_{k−i}

We can visualize convolution by reversing b and lining it up with a so that the zeroth element of b is under the kth element
of a. This is the alignment for k = 0.

Then, we multiply all elements that overlap and add up all these products. For k = 0, this is just 2 · 1 = 2. Therefore, c_0 = 2.
For k = 1, we slide the reversed b sequence one unit to the right and perform the analogous calculation: c_1 = 0 · 1 + 2 · 0 = 0.
We continue sliding b along and doing these sums until there is no more overlap left.

Convolution has many applications, but the one that will be most convenient for us to talk about is multiplying polynomials.
Given the coefficients of two polynomials, we can find the coefficients of the product just by convolving the two sequences of
coefficients.
In fact, we can easily repeat the example we just did but in the context of polynomial multiplication. Once the sequence b is
reversed, multiplying corresponding elements gives all the terms with a given power in the exponent of the variable x. For
example, this alignment calculates all the x² terms and yields the coefficient c_2.

How long does this process take? Well, for each element in the longer sequence, we had to do as many multiplications and
additions as there are elements in the shorter sequence. Sometimes it was a little shorter around the edges, but on average it
was at least half this length. Therefore, we can say that convolving two sequences via the naive strategy outlined here takes
Θ(nm) operations, where n and m are the lengths of the two sequences. The Fast Fourier Transform will give us a way to
improve upon this.
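The naive strategy is easy to write down directly from the formula for c_k. Here is a minimal Python sketch (the example sequences in the assertion are my own, since the lesson's figure isn't reproduced here):

```python
def convolve_naive(a, b):
    """Direct O(n*m) convolution: c[k] = sum over i of a[i] * b[k - i]."""
    n, m = len(a), len(b)
    c = [0] * (n + m - 1)
    for i in range(n):
        for j in range(m):
            # a[i] * b[j] contributes to the coefficient of x^(i+j).
            c[i + j] += a[i] * b[j]
    return c
```

For instance, convolving [1, 2] with [3, 1] multiplies the polynomials (1 + 2x)(3 + x) and gives [3, 7, 2].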

Representations of Polynomials - (Udacity, Youtube)


So far, we've assumed that polynomials are represented by their coefficients. For example, A(x) = 2 − x − x². If you have
worked with polynomial interpolation or fitting before, however, you will know that an order-n polynomial is uniquely
characterized by its values at any n points. (The order of a polynomial, by the way, is the number of coefficients used to define it,
or the degree plus one.) Hence, we might just as well represent a polynomial by its values at a sequence of inputs as by its
coefficients. For example, that same polynomial could be represented by saying that A(−1) = 2, and A(0) = 2, and

A(1) = 0.

Going from the coefficient representation to the value representation can be thought of as matrix multiplication. To calculate A at some value, I take the dot
product of the corresponding row of the matrix, consisting of the powers of the argument x, with a column vector consisting
of the coefficients of A.

This matrix, where the rows are geometric progressions of the values x_i, is important enough that it gets its own name: it is called
a Vandermonde matrix. Its determinant is the product of the differences of all of the values for x:

det(V) = ∏_{1 ≤ i < j ≤ n} (x_j − x_i)

As long as these are distinct, the matrix is invertible and we can recover the coefficients given the values!

Multiplying Polynomials - (Udacity, Youtube)


Now that we've seen an alternative way of representing polynomials, let's turn back to the problem of multiplying them.
Multiplying via the convolution equation takes Θ(nm) time, as we've seen.
If the polynomials are represented as values, however, we can just multiply the corresponding values to obtain the values of the
product.

Note that I did have to start with a number of points that was equal to the order of the product here.
The fact that multiplying in the value representation is so much faster suggests that it might make sense to convert to the value
representation, do the multiplication there, and then interpolate back to the coefficient representation.
We'll visualize the process like this. First we convert from the coefficient representation to the value representation.
Then we multiply the corresponding values to get the values of C. Then we interpolate to get back the coefficients of the
product.

Multiplication Exercise - (Udacity)


Let's multiply two polynomials together with this process. Start with the coefficients over here. Write their values here. Multiply
them together. I'll do the interpolation, since that computation is a bit tedious.

Multiplying Polynomials Continued - (Udacity, Youtube)


As you might have intuited from the exercise, some more cleverness will be needed to make this process efficient. Even
evaluation of polynomials at arbitrary points will take quadratic time. For each point, we need to do a number of multiplications
and additions proportional to the number of coefficients.

The most efficient way to do this is via Horner's Rule. You can also think about filling out the Vandermonde matrix and then
doing the matrix multiplication. Regardless, for arbitrary points we end up with a quadratic running time.
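As a quick sketch of Horner's Rule in Python (the function name and coefficient ordering are my own conventions): each point costs one multiplication and one addition per coefficient, so n points with n coefficients is quadratic.

```python
def horner(coeffs, x):
    """Evaluate a polynomial at x using Horner's rule, O(n) per point.
    `coeffs` lists a_0, a_1, ..., a_{n-1}, lowest power first."""
    acc = 0
    for a in reversed(coeffs):
        # a_0 + x*(a_1 + x*(a_2 + ...)), built from the inside out.
        acc = acc * x + a
    return acc
```

For example, horner([2, 1, 1], 2) evaluates 2 + x + x² at x = 2, giving 8.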
Multiplying in the value domain takes order n + m time, since we just multiply values for corresponding inputs x_j. This was fine.
Interpolation involves solving a system of equations with n + m equations and n + m unknowns. By Gaussian elimination, this would
take O((n + m)³) in the worst case. There is also a method called Lagrange interpolation that allows us to do this in time that is
just quadratic.
Is there any hope? Well, yes there is. All of these running times pertain to an arbitrary set of points, but since we are only
interested in the coefficients of C, we get to choose the points! As it turns out, this freedom is very powerful.

Divide and Conquer Inspiration - (Udacity, Youtube)


At this point, we've seen how polynomials can be represented by their values at a set of distinct inputs and how multiplying
polynomials is easy when they are represented this way. The problem is that we are really interested in the coefficients. Recall
that the coefficients of the two polynomials can represent any two sequences that we want to convolve.
To exploit the speed of multiplying in the value representation, therefore, we need an efficient way to evaluate a polynomial at
some distinct input points and an efficient way to interpolate the result back to the coefficient representation.
We'll focus on optimizing for quick evaluation first. Our goal is to evaluate a polynomial A of order N at N points. Note that I've
made the order of the polynomial and the number of points the same here. We can always pad the coefficients with zeros,
effectively increasing the order, and we can always add more points.

As you did the calculation for the exercise, you may have taken advantage of the fact that the input values were arranged in
positive-negative pairs. For higher order polynomials, this advantage becomes greater. All the even terms are the same for x and
−x, and the odd terms are just the negatives of each other.

Let's define A_e to be the polynomial whose coefficients are the even coefficients of A,

A_e(x) = a_0 + a_2 x + a_4 x² + …

and define A_o to be the polynomial whose coefficients are the odd coefficients of A,

A_o(x) = a_1 + a_3 x + a_5 x² + …

Then we can write

A(x) = A_e(x²) + x A_o(x²)

and

A(−x) = A_e(x²) − x A_o(x²).

We get two points for the price of one!
More formally, let's say that we choose x_i such that x_{i+N/2} = −x_i for i ∈ {0, …, N/2 − 1}. Then we can compute the values
two at a time by computing A_e and A_o at x_i² and using them in these equations.
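These two identities are easy to check numerically. Here is a small Python sketch (the helper names are illustrative) that evaluates A at x and −x from one evaluation each of A_e and A_o:

```python
def evaluate(coeffs, x):
    """Direct evaluation, lowest power first."""
    return sum(a * x**i for i, a in enumerate(coeffs))

def eval_pair(coeffs, x):
    """Two points for the price of one:
    A(x)  = Ae(x^2) + x*Ao(x^2)
    A(-x) = Ae(x^2) - x*Ao(x^2)"""
    ae = evaluate(coeffs[0::2], x * x)   # even-index coefficients
    ao = evaluate(coeffs[1::2], x * x)   # odd-index coefficients
    return ae + x * ao, ae - x * ao
```

For example, with coefficients [1, 2, 3, 4] at x = 2, eval_pair returns (49, −23), matching direct evaluation at 2 and −2.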
Overall, we've changed the problem from evaluating a polynomial of order N at N points to evaluating two polynomials of half
the order at half the number of points. This is good, but at best we've only reduced the running time by a constant factor. We
need to be able to apply this strategy recursively to change the asymptotic running time. A set of points that would allow us to
do that would be very special indeed.

Roots of Unity - (Udacity, Youtube)


The desire to apply the trick of using positive-negative pairs recursively leads us to a very special set of points called the roots
of unity.
Recall that our goal is to compute two values of a polynomial for the price of one by dividing the terms up into the even and odd
powers.
From here on, we'll assume that the number of points is equal to the order of the polynomial and that this number is a power of
2. In the context of polynomial multiplication, N will be the power of two that is at least as great as the number of coefficients of the
product of the two polynomials, and we will pad the coefficients of the polynomials being multiplied with zeros as needed to
make their number equal to N as well.

In order to be able to do this computation recursively, we need the sequence x to have the following properties:

first, the points should all be distinct (x_j ≠ x_{j'} unless j = j'). Otherwise, our efforts will be wasted and we won't have enough points to interpolate back to find the coefficients.

second, the points should be symmetric (or anti-symmetric, depending on how you want to look at it). That is to say, we want

x_{j+N/2} = −x_j.

lastly, we want all of these properties to apply to the squares of these numbers, so that we can use the trick again recursively.

If your polynomials are over some unusual field, then it may make sense to choose an unusual set of values for x. For most
applications, however, the coefficients will be integers, or reals, or complex numbers, and the choice of x will be the complex
roots of unity.
We define

ω_N = e^{2πi/N}

and we let

x_j = ω_N^j

Let's visualize these points in the complex plane for N = 8.

All of the points have magnitude one, so they will be arranged around the unit circle, and the angle from the positive real axis
will be determined by the exponent. Thus, ω to the jth power will be j/N of the way around the unit circle.
Let's confirm that all the desired properties hold. Indeed, the points are unique, as j is always less than N, so there is no wrap-around.
The symmetric property holds because adding N/2 to j corresponds to an increase of π in the angle. This has the effect of
rotating by half the circle or, equivalently, multiplying by negative one.
The recursive property is the hardest to confirm. Notice, however, that for all of the points that have odd powers in the exponent,
squaring these numbers makes the exponent even. Thus, ω³ when squared becomes ω⁶. The point ω⁵ becomes ω¹⁰, which
wraps around and becomes ω².
Moreover, each of the even powers is the square of exactly two of the other points. Which points? Just divide the exponent by
two. That gives you one. For example, for ω⁴, it's ω². Where is the other point? On the opposite side of the circle, of course:
ω⁶. The additional N/2 in the exponent becomes an additional N when the point is squared, meaning that it maps to the same
place.
The result of all of this is that when we square these numbers, the odd powers of ω go away.

Once we are left with only the even powers, however, it doesn't make sense to express these points in terms of ω₈ any more. We
end up with the 4th roots of unity instead of the 8th roots. This same logic applies to any N that is a power of two.
It is worth noting here how few of the properties of complex numbers were necessary for the recursion we needed. In fact, there
are number-theoretic algorithms that use modular arithmetic and avoid the difficulty with precision associated with working with
complex numbers.

FFT Example - (Udacity, Youtube)


Let's illustrate the FFT scheme by looking at the case where N = 4. That means that ω = i, and we want to evaluate the
polynomial at the points 1, i, −1, and −i.
Recall that we want to use our odd-even decomposition and recycle as much of the computation as possible. Therefore, we
reduce the problem from computing A at the fourth roots of unity to computing A_e and A_o at the second roots of unity, plus
some effort to combine the results.

To compute A_e and A_o at the 2nd roots of unity, we apply the same strategy again, first rewriting them in terms of the even
and odd coefficients, and recycling as much of the computation as possible.

Each of the two previous problems has been reduced to evaluating an order-one polynomial at one point. But this is trivial, as it
only involves the constant term. The upward pass of the recursion then fills in all these intermediate values, eventually giving us
the values of A at the fourth roots of unity.

FFT Exercise - (Udacity)


To help solidify our understanding of how this process works, let's do an example. I want you to use the Fast Fourier Transform
strategy to evaluate a polynomial at the 4th roots of unity.

FFT Algorithm - (Udacity, Youtube)


Having seen an example for N = 4, let's now state the Fast Fourier Transform precisely for the general case. As input we have a
sequence of numbers a_0, …, a_{N−1}, where N is a power of two, and we want to return the values of the corresponding
polynomial at the Nth roots of unity.

We'll state this as a recursive algorithm. The base case is where N is equal to one, in which case we just return the single-element sequence. If
N > 1, then we call the FFT recursively, once with the even coefficients and once with the odd ones. Then we combine the results, taking care of paired
values together. Notice the difference in sign on the contribution from the odd powers.

How long does this take? We traded one problem of size N for two problems of size N/2, plus Θ(N) work for all the arithmetic
in this loop. That is,

T(N) = 2T(N/2) + Θ(N).

By the master theorem, this gives us a running time of Θ(N log N). This is much better than the Θ(N²) time from the naive
evaluation by Horner's rule or matrix multiplication.

There is one other wrinkle that I want to add to the algorithm, and that is to say that this parameter here can be any primitive
Nth root of unity. The real key is only that its N powers give all the Nth roots of unity. We can add ω as a parameter to the algorithm.
This will come in handy later.
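The recursive algorithm just stated can be sketched directly in Python using complex arithmetic. This is a minimal illustration rather than the lesson's exact pseudocode; as suggested above, the root of unity ω is passed in as a parameter.

```python
import cmath

def fft(a, omega):
    """Recursive FFT. `a` has length N, a power of two; `omega` is a
    primitive Nth root of unity. Returns [A(omega^0), ..., A(omega^(N-1))]."""
    n = len(a)
    if n == 1:
        return list(a)            # base case: a constant polynomial
    # Recurse on even and odd coefficients; omega^2 is a primitive
    # (N/2)th root of unity, so the same properties hold one level down.
    even = fft(a[0::2], omega * omega)
    odd = fft(a[1::2], omega * omega)
    out = [0j] * n
    w = 1 + 0j
    for j in range(n // 2):
        out[j] = even[j] + w * odd[j]           # A(omega^j)
        out[j + n // 2] = even[j] - w * odd[j]  # A(omega^(j+N/2)) = A(-omega^j)
        w *= omega
    return out
```

For example, fft([2, 1, 1, 0], i) evaluates 2 + x + x² at 1, i, −1, −i, giving 4, 1 + i, 2, 1 − i (up to floating-point error).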

Butterfly Network - (Udacity, Youtube)


Before moving on from the FFT, I want to take another look at the connections between the various subproblems.

This network is called the butterfly network because these connections over here on the left look a little like a butterfly.
Also note that there is a unique left-to-right path between every node on the right and every node on the left.
Another thing to note is that the sequence of evens and odds on these polynomials can be translated to binary. Thus, ooe becomes
110. Under this transformation, these numbers indicate which coefficient of the original polynomial gets returned: even
corresponds to grabbing the numbers with zero in the lowest-order bit, and odd corresponds to grabbing the numbers with 1 in the
lowest-order bit.
It will also be instructive to write out the power of ω in the value here on the right-hand side. It turns out that these numbers on
the left act as instructions for any path to this node from any node on the right.

These numbers on the right act as instructions for how to get here from any node on the left.

Recap of Progress - (Udacity, Youtube)


Recall that our original goal was not just to evaluate a polynomial at the roots of unity but rather to multiply two polynomials
together, or even more generally, to convolve two sequences of numbers. The Fast Fourier transform would seem to
only get us a little past half-way.
Let's take a step back and see where we are in trying to find a faster way to multiply polynomials.

We have an N log N way to evaluate the polynomials. We can multiply them in the value representation easily in N time. But the
interpolation remains a problem. Remember that this runtime involved solving a system of equations involving the Vandermonde
matrix.

Vandermonde at Roots of Unity - (Udacity, Youtube)


Let's see what the Vandermonde matrix looks like at the complex roots of unity.

Each row corresponds to the powers of a value. The powers of 1 are all 1. The powers of ω are 1, ω, ω², ω³, …. The next value is
ω², so its powers are 1, ω², ω⁴, ω⁶, ….

In general, element kj of the matrix is

M_N(ω)[k][j] = ω^{(k−1)(j−1)}

This matrix has some very special properties. For our purposes, however, the key one can be summarized with the following
claim.
Let ω be a primitive Nth root of unity. Then

M_N(ω) M_N(ω⁻¹) = N·I.
For the proof, consider element kj of this product. This will be the sum over ℓ of the corresponding powers of ω^{k−1} and ω^{−(j−1)}.
That is,

Σ_{ℓ=0}^{N−1} ω^{(k−1)ℓ} ω^{−(j−1)ℓ}

Gathering terms in the exponent, this becomes

Σ_{ℓ=0}^{N−1} ω^{(k−j)ℓ}.

Now, if k = j, then every term is 1 and the sum is N. Otherwise, we recognize this as a geometric series and rewrite it as the
ratio

(1 − ω^{(k−j)N}) / (1 − ω^{(k−j)}) = 0 when k ≠ j

Raising any root of unity to the Nth power is just 1, so this expression is zero when k ≠ j. Thus, we have that entry kj is N if
k = j, and 0 otherwise, proving the claim.
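We can also check the claim numerically for a small N. A quick Python sketch (using 0-based indices, so entry [k][j] is ω^{kj}, which matches the 1-based formula above):

```python
import cmath

def vandermonde(n, omega):
    """M_N(omega) with 0-based indices: entry [k][j] = omega^(k*j)."""
    return [[omega ** (k * j) for j in range(n)] for k in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]

n = 4
omega = cmath.exp(2j * cmath.pi / n)
P = matmul(vandermonde(n, omega), vandermonde(n, 1 / omega))
# Up to floating-point error, P is N times the identity matrix.
```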
This claim is terribly important. Recall that evaluating a polynomial at the roots of unity corresponded to multiplying the coefficients
by the matrix M_N(ω), and we used the FFT to do that. Now we see that we can multiply these values by the inverse of this matrix,
also using the FFT, to recover the coefficients given the values! This was why it was key that the FFT work with any primitive root
of unity.

Inverse FFT - (Udacity, Youtube)

This realization about the Vandermonde matrix leads us to the inverse fast Fourier transform. We are given the values of a
polynomial at the roots of unity, and we want to recover the coefficients of the polynomial. The algorithm is fantastically simple
given what we've established so far: just run the regular FFT, only passing in the inverse of the root of unity that we used the
first time. Then divide by N.

Recall that the values that we received as input were equal to the Vandermonde matrix times the coefficients. By multiplying the
vector of these values by the Vandermonde matrix at ω⁻¹ via the FFT, we end up with N times the original
coefficients. Hence, we just need to divide by N to recover the original coefficients.
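As a sketch of this idea in Python (the forward FFT is repeated here so the snippet is self-contained; the names are illustrative), the inverse transform is just the forward transform at ω⁻¹ followed by a division by N:

```python
import cmath

def fft(a, omega):
    """Recursive FFT, parameterized by a primitive root of unity."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], omega * omega)
    odd = fft(a[1::2], omega * omega)
    out = [0j] * n
    w = 1 + 0j
    for j in range(n // 2):
        out[j] = even[j] + w * odd[j]
        out[j + n // 2] = even[j] - w * odd[j]
        w *= omega
    return out

def inverse_fft(values):
    """Recover coefficients from values at the Nth roots of unity:
    run the FFT with omega^(-1), then divide every entry by N."""
    n = len(values)
    omega = cmath.exp(2j * cmath.pi / n)
    return [v / n for v in fft(values, 1 / omega)]
```

A round trip (coefficients to values and back) returns the original coefficients, up to floating-point error.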

Inverse FFT Example - (Udacity)


To solidify our understanding of this inverse FFT, let's do an exercise. Suppose that A is at most a cubic and its values at the
fourth roots of unity are as shown here. Find the coefficients of A.

Putting It All Together - (Udacity, Youtube)


Now that we've seen how to invert the Fast Fourier Transform, we are ready to put all the pieces together. Recall that we started
the lesson with this idea for multiplying polynomials: convert to the value representation, multiply the values, and then convert
back.
Rounding up to the nearest power of two, we might have written these running times like so in terms of our parameter N.
With the Fast Fourier Transform, we were able to do the evaluation not in quadratic time but in linearithmic time, N log N.
And even better, in a sense, we were able to solve the interpolation problem using this strategy too, replacing a slower operation
with, once again, the Fast Fourier transform and time N log N.

The conclusion is that

order-N polynomials in their coefficient representation can be multiplied in O(N log N) time.

Remember that polynomial multiplication was just a convenient way to think about the more general problem of convolution.
Therefore, in general, convolving an n-long sequence with an m-long sequence need only take time O((n + m) log(n + m)), a
remarkable and truly powerful result.
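The whole pipeline can be sketched in Python as follows (the forward FFT is repeated so the block stands alone, and the rounding at the end assumes integer inputs): pad to a power of two, evaluate both sequences with the FFT, multiply pointwise, and interpolate back with the inverse FFT.

```python
import cmath

def fft(a, omega):
    """Recursive FFT, parameterized by a primitive root of unity."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], omega * omega)
    odd = fft(a[1::2], omega * omega)
    out = [0j] * n
    w = 1 + 0j
    for j in range(n // 2):
        out[j] = even[j] + w * odd[j]
        out[j + n // 2] = even[j] - w * odd[j]
        w *= omega
    return out

def convolve(a, b):
    """Convolve two integer sequences in O(N log N) via the FFT."""
    # Pad to the next power of two at least len(a) + len(b) - 1.
    n = 1
    while n < len(a) + len(b) - 1:
        n *= 2
    omega = cmath.exp(2j * cmath.pi / n)
    fa = fft(list(a) + [0] * (n - len(a)), omega)
    fb = fft(list(b) + [0] * (n - len(b)), omega)
    # Multiply in the value representation, then interpolate back.
    fc = [x * y for x, y in zip(fa, fb)]
    c = [v / n for v in fft(fc, 1 / omega)]
    return [round(v.real) for v in c[:len(a) + len(b) - 1]]
```

This agrees with the naive quadratic convolution; for instance, convolve([1, 2], [3, 1]) gives [3, 7, 2].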

Conclusion - (Udacity, Youtube)


That concludes our discussion of the Fast Fourier Transform. From a practical perspective, probably the most important thing to
remember from the lesson is that convolution can be done in O(N log N) time, not O(N²) as one might naively think. Don't be
needlessly intimidated by the need to perform convolution in an application, and be aware that image and signal processing
libraries use this technique.
From the perspective of algorithm design, the Fast Fourier Transform falls into the divide and conquer strategy family, along with
mergesort and Strassen's algorithm for matrix multiplication, for those who are familiar with those algorithms. It has some
resemblance to the dynamic programming algorithms that we studied too, however. Instead of just having one problem, we
have multiple problems, as we want to evaluate a polynomial at multiple values, and their subproblems overlap in a way
captured by the butterfly network. The butterfly network, by the way, is a fascinating structure in its own right and is sometimes
used in massively parallel computers.
Another thing to appreciate from the lesson is the strange twists that our development of the algorithm took. We started by
thinking about the general problem of convolution, but the algorithm came much more specifically from thinking about the
special case of polynomial multiplication. We started by only considering sequences of integers, yet complex numbers became
an essential part of the algorithmic solution. Sometimes the ideas you need come from unexpected places, so soak up as
much mathematics as you can. You never know when it might come in handy.

Maximum Flow - (Udacity)


Introduction - (Udacity, Youtube)
Over the next four lessons, we will discuss two very important problems: finding a maximum flow in a graph and solving a linear
programming optimization. These problems are P-complete, meaning that any problem that can be solved on a Turing machine
in polynomial time can be turned into one of these problems. We will also see an important restricted case in the bipartite
matching problem, and then we will explore the connections among these problems in a discussion of the principle of Duality, which
is the inspiration for a large class of approximation algorithms.

In this first lesson, we discuss the problem of finding a maximum flow through a network. A network here is essentially anything
that can be modeled as a directed graph, and by flow, we mean the movement of some stuff through this medium, from some
designated source to a destination. Typically we want to maximize the flow. That is to say, we want to route as many packets as
possible from one computer across a network to another, or we want to ship as many items as possible from our factory to our stores. Even
if we don't actually want to move as much as possible across our network, understanding the maximum flow can give us
important information. For instance, in communication networks, it tells us something about how robust we are to link failures.
And even more abstractly, maximum flow problems can help us figure out things that seem unrelated to networks, like which teams
have been eliminated from playoff contention in sports.
Given the variety and importance of these applications, it should be no surprise that maximum flow problems have been
studied extensively, and we have some sophisticated and elegant algorithms for solving them. Let's dive in.

Flow Networks - (Udacity, Youtube)


We begin with a definition. A flow network consists of, among other things, a directed graph G. We'll disallow antiparallel edges
(i.e. (u, v) and (v, u) can't both be in the graph) to simplify some of the equations. This is not a serious limitation.
We'll distinguish two special vertices: a source, typically labeled s (this is where whatever is flowing through the network starts
from), and a sink, typically labeled t (this is where whatever is flowing ends up). We call all other vertices internal. To keep our
equations a little simpler, we'll assume that there are no incoming edges to the source and no outgoing edges from the sink.
Associated with every pair of vertices is a capacity, which indicates how much flow it is possible to send directly between the two
vertices. We'll assume that these capacities are nonnegative integers. This will make some of the arguments easier, and it's not a
serious limitation. In fact, the last algorithm we see will overcome it. Also note that if there is no edge in the graph, then the
capacity is defined to be zero. That's the flow network.

The flow itself is a function from pairs of vertices to the nonnegative integers, say f : V × V → Z+. By this definition, the flow must be nonnegative. It also can't exceed the capacity for any pair of vertices. That is to say,

f(u, v) ≤ c(u, v).


Also, we require that between any two vertices at least one direction of the flow be zero:

f(u, v) · f(v, u) = 0.

It doesn't make sense to have flow going from one vertex to another and right back again.
Lastly, we require that flow be conserved at every internal vertex. We define

f^in(v) = Σ_{u ∈ V} f(u, v)

to be the flow into a vertex and

f^out(v) = Σ_{u ∈ V} f(v, u)

to be the flow out, and we require the two to be equal,

f^in(v) = f^out(v),

for every internal vertex v.
For example, in the lower left node of the diagram below,

we have 4 + 2 = 6 units of flow coming in and 5 + 1 = 6 units of flow going out. Intuitively, this just means the internal nodes can't generate or absorb any of the stuff that's flowing. Those are the jobs of the source and the sink.
The overall value of the flow is defined as the flow going out of the source, or equivalently, the flow going into the sink:

v(f) = Σ_{u ∈ V} f(s, u) = Σ_{u ∈ V} f(u, t).
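These rules can be checked mechanically. Here is a small Python sketch (the function names and the dictionary-based representation are mine, not from the lesson) that verifies the capacity, antiparallel, and conservation constraints for a candidate flow and computes its value:

```python
def is_valid_flow(c, f, s, t, vertices):
    """Check the flow constraints against capacities c.

    c and f are dicts mapping (u, v) pairs to nonnegative integers;
    missing pairs are treated as 0."""
    cap = lambda u, v: c.get((u, v), 0)
    flow = lambda u, v: f.get((u, v), 0)
    for u in vertices:
        for v in vertices:
            # Flow may not exceed capacity...
            if flow(u, v) > cap(u, v):
                return False
            # ...and at least one direction must be zero.
            if flow(u, v) * flow(v, u) != 0:
                return False
    for v in vertices:
        if v in (s, t):
            continue  # conservation applies only to internal vertices
        f_in = sum(flow(u, v) for u in vertices)
        f_out = sum(flow(v, u) for u in vertices)
        if f_in != f_out:
            return False
    return True

def flow_value(f, s, vertices):
    # The value is the total flow leaving the source.
    return sum(f.get((s, u), 0) for u in vertices)
```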

Fill In the Flow - (Udacity)


Now, we'll do a quick exercise to sharpen our understanding of these rules for the flow. Fill in the blanks with the appropriate numeric values.

Tricks of the Trade - (Udacity, Youtube)


In our definitions so far, we've limited our flow networks in several ways to make our treatment of them a little simpler. Before going any further, I want to show some tricks of the trade that allow all of the theorems and algorithms that we'll study to be applied to more general networks.

One such limitation is the need for all of the capacities on the edges to be integers. We can extend all our arguments to rational capacities just by multiplying the capacities by the least common multiple of the denominators to create integer capacities. This just amounts to a change of units in our measurement of flow.

Another limitation we've imposed is that no antiparallel edges are allowed in the network. This forces us to choose a direction for the flow between every pair of vertices. In general, however, it might not be clear in which direction the flow should go before solving the max-flow problem. It's possible to cope with this problem with some slightly less elegant analysis, or, just to convince yourself that the theorems still hold, you can add an artificial vertex between the two nodes and route the reverse flow through there.

Another limitation of our model is that we've limited ourselves to single-source, single-sink networks. At first it might seem that we couldn't handle a network like this one, which has two sources and three sinks. Actually, however, this situation is quite easy to deal with. We can just add an artificial source node and connect it to the others with infinite-capacity edges, and similarly add an artificial sink.

So don't let that little limitation trouble you either.
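The multi-source, multi-sink trick is only a few lines in code. A sketch (the artificial vertex names '_s' and '_t' are my own and assumed not to collide with existing vertices):

```python
def add_super_terminals(c, sources, sinks):
    """Collapse multiple sources/sinks into one of each by adding an
    artificial super-source and super-sink joined to the real ones by
    edges of unbounded capacity."""
    inf = float('inf')
    c2 = dict(c)
    for s in sources:
        c2[('_s', s)] = inf   # super-source feeds every real source
    for t in sinks:
        c2[(t, '_t')] = inf   # every real sink drains to the super-sink
    return c2, '_s', '_t'
```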

Residual Networks - (Udacity, Youtube)


Everything we've done so far has been to set up the rules of the game. We've defined a network and the notion of a flow over it, and we've justified some of the simplifications we've made. Now, we are going to turn to the task of actually finding a maximum flow.
The search will be an incremental one. We'll start with a suboptimal flow and then look for a way to increase it. Suppose that we are given a particular flow over a network and we want to explore how we might change it.

Perhaps we realize that we can increase the flow by routing it through just the top and bottom paths and leaving out the middle. This is equivalent to adding a flow like the one shown in orange.

Notice that the flows through the middle cancel out. By adding this flow to the original, we get the desired result.

Alternatively, if we just wanted to re-route the flow through the top link, we could add a circular flow, which would then re-route the flow.

In fact, all possible changes that we might make to our original flow can be characterized as another flow, only different rules apply. Certainly, if we've used up some of the capacity on an edge, we can't use the full capacity in the flow we're going to add. We capture the rules for the flow that we are allowed to add with the notion of a residual flow network. We start by defining the residual capacity for all pairs of vertices.

If there is an edge between the pair, then the residual capacity is just however much capacity is left over after our original flow uses up some of it. For reverse edges, the capacity is however much flow went over the edge in the opposite direction; we can just unsend this much flow. Everywhere else the residual capacity is zero.
The edges of our residual network are just the places where the residual capacity is positive. Keeping the network sparse helps in the analysis.
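Following the definition, the residual network can be computed directly from a flow. A Python sketch (names are mine), keeping only the positive residual capacities as edges:

```python
def residual_network(c, f, vertices):
    """Residual capacities: leftover capacity on forward edges,
    'unsendable' flow on reverse edges; zero everywhere else.
    Assumes no antiparallel edges, as in the lesson."""
    cf = {}
    for u in vertices:
        for v in vertices:
            if c.get((u, v), 0) > 0:
                r = c[(u, v)] - f.get((u, v), 0)   # leftover capacity
                if r > 0:
                    cf[(u, v)] = r
                if f.get((u, v), 0) > 0:
                    cf[(v, u)] = f[(u, v)]          # flow we can undo
    return cf
```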

Find the Residual Errors - (Udacity)


Now for a quick exercise on residual networks. Consider the flow over the flow network on the left, and I want you to mark the
edges with errors in this attempt at a residual network on the right.

Augmentations - (Udacity, Youtube)


Ultimately, as we try to find a maximum flow, we are going to start with a suboptimal flow and then augment it with a flow we find in the residual network. Let's see how this works. We'll start with a flow f in G. Then, drawing the residual network, we'll let f′ be a flow there.

Of course, it obeys the residual capacity constraints. Then, we can add these two flows together using this special definition:

(f + f′)(u, v) = f(u, v) + f′(u, v) − f′(v, u)   if (u, v) ∈ E
(f + f′)(u, v) = 0                               otherwise

Note that at most one of f′(u, v) and f′(v, u) can be positive in the (u, v) ∈ E case.

Now, I claim without proof that

The augmented flow f + f′ is a flow in the original network G, and its value is just the sum of the values of the two individual flows.

This is pretty easy to verify with the equations, but it's not very illuminating, so we'll skip it here and focus on the intuition instead. Is this augmented flow a flow in the original network G? It fits within the capacity constraints, essentially by construction of the residual capacities. And it conserves flow because both f and f′ do. So, yes, it is a valid flow.
The flow out from the source is a simple sum, which makes it a linear function, so yes, it makes sense that the flow value of the sum should be the sum of the flow values as well.
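The special addition is mechanical enough to sketch in a few lines of Python (the function name is mine; f is the original flow and fp plays the role of f′):

```python
def add_flows(E, f, fp):
    """Combine a flow f on edge set E with a residual flow fp:
    add forward flow, subtract any flow 'unsent' over the reverse
    edge, and define the result to be zero off the edge set."""
    out = {}
    for (u, v) in E:
        val = f.get((u, v), 0) + fp.get((u, v), 0) - fp.get((v, u), 0)
        if val > 0:
            out[(u, v)] = val
    return out
```

For instance, augmenting a one-path flow by a residual flow along a disjoint path simply merges the two paths.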

The Ford-Fulkerson Algorithm - (Udacity, Youtube)


With all of this notation and background behind us, we are ready for the Ford-Fulkerson method. As we'll see, it's not really specific enough to merit being called an algorithm, though we'll sometimes call it that anyway.

We begin by initializing the flow to zero. Then, while there is a path from the source s to the sink t in the residual graph, we calculate the minimum capacity along such a path and then augment with a flow that has this value along that path. Once there are no more paths in the residual graph, we just return the current flow f that we've found.
Let's see how this works on a toy example. [See Video]
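The method can be sketched concretely in Python. This sketch is my own (the method itself leaves the choice of augmenting path unspecified, so here a simple depth-first search finds some s-t path in the residual graph):

```python
def ford_fulkerson(c, s, t):
    """Ford-Fulkerson on integer capacities (no antiparallel edges).
    c maps (u, v) pairs to capacities. Returns the max-flow value."""
    cf = dict(c)  # residual capacities start equal to the capacities
    vertices = {u for e in c for u in e}

    def find_path():
        # The method doesn't say how to find a path; any search works.
        parent = {s: None}
        stack = [s]
        while stack:
            u = stack.pop()
            for v in vertices:
                if v not in parent and cf.get((u, v), 0) > 0:
                    parent[v] = u
                    stack.append(v)
        if t not in parent:
            return None
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        return path[::-1]

    value = 0
    while (p := find_path()) is not None:
        # Augment by the minimum residual capacity along the path.
        bottleneck = min(cf[(u, v)] for u, v in p)
        for u, v in p:
            cf[(u, v)] -= bottleneck
            cf[(v, u)] = cf.get((v, u), 0) + bottleneck
        value += bottleneck
    return value
```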

Does the algorithm terminate? Remember that the capacities are all integral, so each augmenting flow f′ has to have value at least 1; otherwise, the path wouldn't be in the graph. Therefore, we can't have more iterations than the maximum value for a flow. So, yes, it terminates.
How much time does it spend per iteration? Finding a path can be done with breadth-first search or depth-first search in time proportional to the number of edges. Constructing the residual graph also takes this amount of time, as it has at most twice the number of edges of the original graph. And, of course, updating the flow requires a constant amount of arithmetic per edge. All in all, then, we have just O(|E|) time for each iteration, that is, time proportional to the number of edges.

This is a good start for the analysis, but it leaves us with some unanswered questions. Most important, perhaps, is whether the returned flow is a maximum flow. Sure, it's maximal in the sense that we can't augment it any further, but how do we know that with a different set of augmenting paths, or perhaps with an entirely different strategy altogether, we wouldn't have ended up with a greater flow value?
Also, this bound on the number of iterations is potentially exponential in the size of the input, leaving us with an exponential algorithm. Perhaps there is some way to improve the analysis or revise the algorithm to get a better runtime.
These two questions will occupy the remainder of the lesson. We'll start by showing that Ford-Fulkerson does indeed produce a maximum flow, and then we'll see about improving the running time.

Flows and Cuts - (Udacity, Youtube)


If your intuition tells you that the incremental approach of the Ford-Fulkerson method produces a maximum flow, that's a good thing. Your intuition is correct. But it's important that it be correct for the right reasons. Not all incremental approaches work. Often in optimization, we get stuck in local minima or local maxima. We need to argue either that we never make a mistake (analyses of greedy algorithms typically have this feel) or we need to be able to argue that the rules for incrementing allow us to undo any mistakes that may have been made along the way. The latter will be the case for maximum flow.
The argument we'll make is complex and will require us to introduce a new notion, that of a minimum cut in a graph, along the way, but in the end it is rewarding and provides an example of how to find elegance in analysis.
So far, we know that the network will have some possible nonnegative flow values, but it's not exactly clear how high this range goes or where the results of Ford-Fulkerson fit in. One observation is that the flow value can't exceed the capacity of the edges coming out of the source. Remember that the flow value is defined to be the sum of the flows over these edges. Separating s from t in this way constitutes a cut in the graph, which, as we'll see, is an upper bound on the value of any flow.

It's possible that there will be smaller cuts as well, which will shrink the range of possible flow values for us.
Even better, however, we'll see that the flow produced by Ford-Fulkerson has the same value as a cut, and since this cut serves as an upper bound on all possible flows, this flow must be a maximum. In other words, the two arrows in the diagram will meet at the value of the flow produced by Ford-Fulkerson. That's where this argument will end up.

Cut Basics - (Udacity, Youtube)


We start by making a more precise definition of an s-t cut, saying that

A cut is a partition (A, B) of the vertices such that the source s is in the set A and the sink t is in the set B.

For example, in this network here, the green nodes might be in A and the orange ones in B.

Or we might have this cut here.

The vertices within one side of the partition don't have to be connected in the definition.
Next, we observe that if f is an s-t flow and (A, B) is an s-t cut, then the value of the flow is the flow out of A minus the flow into A, or equivalently, the flow into B minus the flow out of B:

v(f) = f^out(A) − f^in(A) = f^in(B) − f^out(B).

For the first cut shown above, we have 2 + 6 entering and 2 leaving for a total of 6. For the second cut shown, we have 1 + 1 + 5 + 1 units exiting A and 2 units entering for a total of 6, which is indeed the flow.
As you might imagine, the proof comes from the conservation of flow. We start with the definition of the flow value and then add a zero in the form of the conservation equation for each node in A.
For every edge where both vertices are in A, the terms cancel, leaving us with just the outgoing edges from A to B and the incoming edges from B to A. These sums, then, are just the flows out of and into A, as stated by the theorem. Check this for yourself with the equations.

Cut Capacity - (Udacity, Youtube)


From this notion of a cut, it is natural to ask how much flow could possibly come across the cut. Clearly, it can't exceed the sum of the capacities of the crossing edges. This intuition gives rise to the idea of the capacity of a cut, which is just the sum of the capacities of edges crossing over from A into B:

c(A, B) = Σ_{u ∈ A, v ∈ B} c(u, v).
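The asymmetry in this definition (only edges leaving A count; edges from B back into A do not) is easy to get wrong, so here is a one-function Python sketch (the name is mine):

```python
def cut_capacity(c, A, B):
    """Capacity of an s-t cut (A, B): the sum of capacities of edges
    crossing from A into B. Edges from B back into A do not count."""
    return sum(cap for (u, v), cap in c.items() if u in A and v in B)
```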

Cut Capacity Calculation - (Udacity)


Let's do a quick exercise on cut capacity. What is the capacity of the cut shown in the above flow network?

Cut Capacity Continued - (Udacity, Youtube)


Intuitively, it should be clear that the capacities of s-t cuts represent upper bounds on the amount of flow that can go from s to t. We state this as the following lemma:
Let f be a flow and let (A, B) be an s-t cut in a flow network. Then

v(f) ≤ c(A, B).


The proof goes like this. The value of the flow is the flow out of the set A minus the flow into the set A:

v(f) = f^out(A) − f^in(A).

We can just drop the second term, leaving us with just f^out(A):

v(f) ≤ f^out(A) = Σ_{u ∈ A, v ∈ B} f(u, v).

This is bounded by the capacities on the edges crossing the cut, and that sum is the capacity of the cut:

v(f) ≤ Σ_{u ∈ A, v ∈ B} f(u, v) ≤ Σ_{u ∈ A, v ∈ B} c(u, v) = c(A, B).

Note from this proof that the inequality will be tight when there is no flow re-circulating back into A and all the crossing edges are saturated, i.e. the flow is equal to the capacity. Keep this in mind.

The Max-Flow Min-Cut Theorem - (Udacity, Youtube)


We are now ready for the climactic big theorem of this lesson, the Max-Flow Min-Cut Theorem.
The following are equivalent:

1. f is a maximum flow in G.
2. G_f has no augmenting paths.
3. There exists an s-t cut (A, B) such that v(f) = c(A, B).

This is the realization of the strategy outlined earlier, where we introduce the notion of the cut, show that it serves as an upper bound on the flow, and then show that Ford-Fulkerson produces a flow with the same value as a cut.

As an immediate corollary we have then that


The Ford-Fulkerson approach returns a maximum flow.

Let's see the proof of the Max-Flow Min-Cut theorem. We start by showing that if f is a maximum flow in a network, then it has no augmenting paths in the residual network. Well, suppose not, and let f′ be an augmenting flow.

Then we can augment f by f′, and the flow of the sum will be the sum of the flows. The value of the augmenting flow is positive, so we have created a new, greater flow, contradicting the fact that f was a maximum flow:

v(f + f′) = v(f) + v(f′) > v(f).

Next, we show that the fact that the residual network has no augmenting paths implies that there is an s-t cut with the same capacity as the value of the flow. This is the real heart of the theorem.
Let A be the set of vertices reachable from the source s in the residual network, and let B be all the other vertices.

Flows Across the Cut - (Udacity)

We'll make the next step in the proof an exercise. If (u, v) goes from A to B, what does that imply about the flow? Similarly, if it goes from B to A, what does that imply about the flow? Answer both questions.

The Max-Flow Min-Cut Theorem Continued - (Udacity, Youtube)


We have concluded that any forward edge from A to B must be saturated and any backward edge from B to A must be unused.
Before going any further, let's illustrate this with an example. Here we have a flow on the left and the corresponding residual graph on the right.

Note that there are no paths from s to t in the residual graph. The vertices that can be reached from s are marked in green and the other ones in orange. It's easy to see that all the edges from the green to the orange vertices are saturated, and the edges from the orange to the green are empty, just like the theorem claims.
Recall that for any cut, the value of a flow is the flow out of the partition containing the source minus the flow into that partition.
As we've just argued, however, in this case there is no flow back into the source partition. Moreover, the flow out saturates all the crossing edges, so it's just the sum of the capacities across the cut, which is by definition the cut capacity:

v(f) = Σ_{u ∈ A, v ∈ B} f(u, v) = Σ_{u ∈ A, v ∈ B} c(u, v) = c(A, B).

That completes that part of the theorem.


Lastly, we need to show that the existence of a cut with the same capacity as a flow implies that the flow is a maximum one. This follows immediately from our earlier argument that the cut capacity is an upper bound on the max flow. There can't be a bigger flow, or it would violate the bound.
That completes the theorem. These three statements are equivalent: the maximum flow is equal to the minimum capacity cut, and the Ford-Fulkerson approach returns a maximum flow.

If you followed all of that, congratulations! The max-flow/min-cut theorem is one of the classic theorems in the study of algorithms and a wonderful illustration of duality, which we'll discuss in a later lesson. For now, however, we are not quite ready to leave maximum flow yet.

Bad Augmentations - (Udacity)


So far, we've shown that Ford-Fulkerson does in fact produce a maximum flow. Now, we turn to the question of its running time.
Recall that the only bound we have so far on the number of iterations is the value of the flow itself, since each augmentation must increase the flow value by at least one.
Let's see if this bound is tight. Consider the flow network below and give the maximum number of augmenting paths that Ford-Fulkerson might find.

Better Augmentations - (Udacity, Youtube)


Obviously, we made some pretty poor choices in picking the augmenting paths. One idea for picking better paths is to prefer heavier ones. Another is to prefer shorter paths. Note that the poor choices of paths here were the longer ones.

Scaling Algorithm - (Udacity, Youtube)


This idea that we should prefer heavier flows brings us to the scaling algorithm. If heavy flows are good, why not choose the heaviest possible flow? We could do this by starting with an empty graph and then adding the edge with the largest remaining residual capacity until there is an s-t path, but this would be unnecessarily slow. Instead, we fix a schedule of thresholds for the residual capacity and lower the threshold when there are no more augmenting paths in the residual network.
We start by defining the residual network with threshold Δ to include only those edges whose residual capacity is at least Δ:

G_f(Δ) = (V, {(u, v) : c_f(u, v) ≥ Δ}).

Note that when Δ is 1, this is equivalent to the traditional residual network. Then we can state the algorithm as follows. We initialize the flow to be zero and the parameter Δ to be a trivial upper bound on the flow along a single path.
Then, while Δ ≥ 1, we look for an augmenting path p in G_f(Δ) and use it to augment the flow. Once all such paths are exhausted, we cut Δ in half.
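The algorithm can be sketched in Python as follows. This is my own sketch (in particular, it initializes Δ to the largest power of two not exceeding the maximum capacity C, one common choice of trivial upper bound):

```python
def max_flow_scaling(c, s, t):
    """Capacity scaling: only augment along paths all of whose residual
    edges have capacity >= delta, halving delta when such paths run out."""
    cf = dict(c)
    vertices = {u for e in c for u in e}
    C = max(c.values())
    delta = 1
    while delta * 2 <= C:
        delta *= 2   # largest power of two not exceeding C

    def find_path(threshold):
        # DFS restricted to residual edges meeting the threshold.
        parent = {s: None}
        stack = [s]
        while stack:
            u = stack.pop()
            for v in vertices:
                if v not in parent and cf.get((u, v), 0) >= threshold:
                    parent[v] = u
                    stack.append(v)
        if t not in parent:
            return None
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        return path[::-1]

    while delta >= 1:
        while (p := find_path(delta)) is not None:
            b = min(cf[(u, v)] for u, v in p)
            for u, v in p:
                cf[(u, v)] -= b
                cf[(v, u)] = cf.get((v, u), 0) + b
        delta //= 2
    # The flow value is the total capacity used on edges out of s.
    return sum(c.get((s, v), 0) - cf.get((s, v), 0) for v in vertices)
```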

Some of the analysis we can do just by inspection. Letting C be the initial value for Δ, we see that we use O(log C) iterations of the outer loop.
The steps of the inner loop cost only O(|E|) time.
The big question, then, is how many iterations of this inner loop there are.

Analysis of Scaling - (Udacity, Youtube)


Here, we will prove that

The maximum number of iterations in a scaling phase is at most 2|E|.

The key lemma will be that

If G_f(Δ) has no s-t paths, then there is an s-t cut (A, B) such that c(A, B) ≤ v(f) + |E|(Δ − 1).

The proof will feel a lot like the max-flow min-cut proof. We let A be the set of vertices reachable from s in G_f(Δ), and we let B be the complement of this set in V. Edges from A to B must have residual capacity at most Δ − 1,

(u, v) ∈ A × B ⟹ c(u, v) − f(u, v) ≤ Δ − 1,

and edges from B to A can't have flow more than Δ − 1, or else the reverse edge would appear in the graph:

(u, v) ∈ B × A ⟹ f(u, v) ≤ Δ − 1.
The flow is then

v(f) = Σ_{(u,v) ∈ A×B} f(u, v) − Σ_{(u,v) ∈ B×A} f(u, v)
     ≥ Σ_{(u,v) ∈ A×B} (c(u, v) − (Δ − 1)) − Σ_{(u,v) ∈ B×A} (Δ − 1)
     ≥ c(A, B) − |E|(Δ − 1).


With this lemma complete, we now return to the main claim we want to prove.

The base case, where Δ = C, is trivial: there can be at most one iteration. For subsequent phases, we let f be the flow after a phase of Δ-augmentations and let g be the flow before, which would be after a phase with threshold 2Δ or 2Δ + 1.
The flow f is at most the maximum flow f*, but this is at most the capacity of the s-t cut induced by the residual graph with threshold 2Δ + 1 from the previous phase. Our lemma then says

v(f) ≤ v(f*) ≤ c(A, B) ≤ v(g) + 2Δ|E|.

Now let k be the number of iterations used to go from g to f. Each iteration increases the flow by at least Δ, so

kΔ ≤ v(f) − v(g) ≤ 2Δ|E|.

Hence, k ≤ 2|E|: we can have used at most 2|E| iterations.
This then completes the analysis of the scaling algorithm. We have at most O(log C) iterations of the outer loop and O(|E|) iterations of the inner loop, with each one costing O(|E|) time. The total is then O(|E|² log C).

The Edmonds-Karp Algorithm - (Udacity, Youtube)


So far we've explored the idea that we should prefer heavier augmenting paths. It turns out that the idea of using shorter paths also gives rise to an efficient algorithm. This is the Edmonds-Karp algorithm, also discovered independently by Dinic in the Soviet Union. It is exactly Ford-Fulkerson, only we make sure to choose a minimum-length path.

The cost of an iteration is O(|E|) as always, as we just use breadth-first search to find the shortest s-t path. If we can bound the number of iterations more tightly, we'll have a better bound than for the naive Ford-Fulkerson. Indeed, we will be able to bound it by |V||E|, showing that

The Edmonds-Karp algorithm returns a maximum flow in O(|E|²|V|) time.
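The only change from the generic method is the path-selection rule. Here is a sketch of the breadth-first search step that finds a fewest-edge augmenting path in a residual graph (the function name is mine; cf maps residual edges to their capacities):

```python
from collections import deque

def shortest_augmenting_path(cf, s, t):
    """BFS over positive residual capacities; returns a fewest-edge
    s-t path as a list of edges, or None if t is unreachable."""
    parent = {s: None}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == t:
            break
        for (a, b), r in cf.items():
            if a == u and r > 0 and b not in parent:
                parent[b] = u
                q.append(b)
    if t not in parent:
        return None
    path, v = [], t
    while parent[v] is not None:
        path.append((parent[v], v))
        v = parent[v]
    return path[::-1]
```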

Analysis of Edmonds-Karp - (Udacity, Youtube)


Now for the analysis. Mostly, I'll just try to share the intuition. We want to show that only O(|V||E|) augmentations are used.
To see this, we define something called a level graph. Formally, we say that

The level of a vertex v, denoted ℓ(v), is the shortest-path distance from the source vertex s.

The level graph is the subgraph of the original graph that includes only those edges that go from one level to the next, i.e. edges

(u, v) : ℓ(v) = ℓ(u) + 1.

A graph and the corresponding level graph are shown below.

Three edges have been deleted because they go between vertices of the same level or back up a level.
We first observe that augmenting along a shortest path only creates longer paths. Say that we push flow along the topmost path as shown below.

This introduces back edges along the path. Note, however, that these back edges are useless for creating a path of the same length. In fact, because they go back up a level, any path that uses one of them must use two more edges than the augmenting path we just used.
Next, we observe that the shortest-path distance must increase every |E| iterations. Every time we use an augmenting path, we delete an edge from the level graph: the one that got saturated by the flow.
From our example, suppose that the middle edge was saturated in an augmentation along this path.

This edge won't come back into the level graph until the minimum path length has increased. As we've already argued, the reverse edges are useless until we are allowed to use longer paths.
Lastly, there are only |V| possible shortest-path lengths, so that completes the theorem.
Notice that with the Edmonds-Karp time bound of O(|E|²|V|), we've eliminated the dependence of the running time on the capacities. This means that the algorithm is now strongly polynomial, and actually we can eliminate the requirement that the capacities be integers entirely.

Dinic's Algorithm - (Udacity, Youtube)


There is one more refinement to the algorithm that I can't resist, and that is due to Dinic. He actually published his algorithm in 1970, two years before Edmonds and Karp.
His key insight is that the work of computing the shortest path can be recycled, so that a full recomputation only needs to happen when the shortest-path distance changes, not for every augmenting path.

As with all augmenting-flow strategies, we start with an initial flow of zero. Then we repeat the following. We build a level graph from the residual flow network and let k be the length of a shortest path from s to t. Then, while there is a path from source to sink that has this length k, we use it to augment the flow and then update the residual capacities.
We repeat this until there are no more s-t paths, and we return the current flow.
Turning to the analysis, we'll call one iteration of this outer loop a phase, and we'll be able to argue that each phase increases the length of the shortest s-t path in the residual network by at least 1. The principle here is the same as for Edmonds-Karp: augmenting by a shortest-path flow doesn't create a shorter augmenting path. Hence, once we have exhausted all paths of a given length, the next shortest path must be at least one edge longer. Hence there are only O(|V|) phases.
Within a phase, the level graph is built by breadth-first search, so that costs only O(|E|).
The hardest part of the argument will be that this loop altogether takes only O(|V||E|) time. We'll see this argument in a second.

Altogether, we will show that Dinic's algorithm takes O(|E||V|²) time, which is an improvement upon Edmonds-Karp's O(|E|²|V|) time.

Analysis of Dinic's Algorithm - (Udacity, Youtube)


We turn to the key part of the analysis, where we show that each phase takes O(|V||E|) time. As with Edmonds-Karp, we will use a level graph. In this case, however, the algorithm actually builds the graph, whereas in Edmonds-Karp we simply used it for analysis. The level graph can be built by running breadth-first search and saving all forward edges, while ignoring backward and lateral ones.
When we augment the flow along a path, say along the topmost path, we introduce reverse edges into the residual graph. Recall that these are always backward edges in the level graph and hence aren't useful in building a path equal to or shorter than the previous shortest path. Well, if the new edges are useless, why rebuild the level graph of G_f when the old one will serve just as well? We can just update the residual capacities.
More precisely, given the possibly outdated level graph, we can build a path from the source just by making the first vertex on the adjacency list the next vertex on the path.
If this generates a path to t, then we augment the flow and update the residual capacities.
If it doesn't, then we delete the last vertex in the path from the level graph.
In this example, we would first find a path from s to t, and maybe the middle edge is the bottleneck. Its capacity gets set to zero and it gets deleted. Next, we would build a path again from s, and this time we would run into a dead end. So we delete this vertex and continue.
There are only |V| vertices in the graph, so we can't run into more than |V| dead ends, and every augmentation deletes the bottleneck edge, so we can't perform more than |E| augmentations. Overall, then, we won't try more than O(|E|) paths.
The process of building these paths and augmenting flows is just proportional to the path length, however, making the overall time cost of a phase just O(|V||E|).

Taking this altogether, we have O(|V|) phases, each costing O(|V||E|) time, for a total of O(|V|²|E|) time, as desired.
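Putting the phases together, here is a compact Python sketch of Dinic's algorithm (the names are mine; the per-vertex index `it` plays the role of the adjacency-list pointer described above, so dead ends are never revisited within a phase):

```python
from collections import deque

def dinic(c, s, t):
    """Dinic's algorithm sketch. c maps (u, v) to integer capacity."""
    cf = dict(c)
    vertices = {u for e in c for u in e}
    adj = {v: [] for v in vertices}
    for (u, v) in c:
        adj[u].append(v)
        adj[v].append(u)  # reverse edges may appear in the residual graph

    def levels():
        # Breadth-first search assigns each vertex its distance from s.
        lvl = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in lvl and cf.get((u, v), 0) > 0:
                    lvl[v] = lvl[u] + 1
                    q.append(v)
        return lvl

    def blocking_flow(u, limit, lvl, it):
        # Push flow along level-graph edges only; it[u] skips dead ends.
        if u == t:
            return limit
        while it[u] < len(adj[u]):
            v = adj[u][it[u]]
            if lvl.get(v) == lvl[u] + 1 and cf.get((u, v), 0) > 0:
                pushed = blocking_flow(v, min(limit, cf[(u, v)]), lvl, it)
                if pushed > 0:
                    cf[(u, v)] -= pushed
                    cf[(v, u)] = cf.get((v, u), 0) + pushed
                    return pushed
            it[u] += 1
        return 0

    total = 0
    while True:
        lvl = levels()           # one phase per level graph
        if t not in lvl:
            return total
        it = {v: 0 for v in vertices}
        while (pushed := blocking_flow(s, float('inf'), lvl, it)) > 0:
            total += pushed
```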

Conclusion - (Udacity, Youtube)


We have now seen a few improvements to the naive version of the Ford-Fulkerson method, but this isn't the end of the story. There are some even more sophisticated approaches, also based on the idea of augmenting flows, that improve the runtime even further. There is also a family of algorithms called push-relabel that allow internal vertices to absorb flow during intermediate phases of the algorithm. In practice, these seem to perform the best.

Beyond just seeing algorithms in this lesson, we examined the max-flow min-cut theorem. This was more than just a trick for proving the correctness of Ford-Fulkerson. It's part of a larger pattern of duality that provides important insight in a variety of contexts. We'll explore this more fully in our lesson on Duality.

Bipartite Matching - (Udacity)


Introduction - (Udacity, Youtube)
In this lesson, we'll discuss the problem of finding a maximum matching in a bipartite graph, a problem intimately related to finding a maximum flow in a network. Indeed, our initial algorithm will come from a reduction to max flow, and at first, maximum bipartite matching might seem like just a special case. As we examine the problem more carefully, however, we'll see some special structure, and eventually we will use this insight to create a better algorithm.

Bipartite Graphs - (Udacity, Youtube)


We begin our discussion by defining the notion of a bipartite graph.
An undirected graph is bipartite if there is a partition of the vertices into sets L and R (think of these as left and right) such that every edge has one vertex in L and one in R. For example, this graph here is bipartite.

I can label the green vertices as L and the orange ones as R, and every edge is between a green and an orange vertex.
A few observations are in order. First, saying that a graph is bipartite is equivalent to saying that it is two-colorable, for those familiar with colorings.
Next, let's take this same graph and add this edge here to make it non-bipartite.

Note that I've introduced an odd-length cycle, and indeed, saying that a graph is bipartite is equivalent to saying that it has no odd-length cycles.
For graphs that aren't connected, it's possible that there will be some ambiguity in the partition of the vertices, so sometimes the partition is included in the definition of the graph, as in G = (L, R, E).
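The equivalence with two-colorability gives a simple test for bipartiteness: alternate colors along a breadth-first search, which fails exactly when an odd cycle is present. A sketch (names are mine):

```python
from collections import deque

def bipartition(graph):
    """Two-color an undirected graph by BFS. graph maps each vertex to
    a list of its neighbors. Returns (L, R), or None if some odd-length
    cycle makes the graph non-bipartite."""
    color = {}
    for start in graph:
        if start in color:
            continue  # handle each connected component separately
        color[start] = 0
        q = deque([start])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    q.append(v)
                elif color[v] == color[u]:
                    return None  # odd-length cycle: not bipartite
    L = {v for v in color if color[v] == 0}
    R = {v for v in color if color[v] == 1}
    return L, R
```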

Bipartite Graph Quiz - (Udacity)


Here is a quick exercise on bipartite graphs. I've drawn a graph here that is not bipartite, and I want you to select one or two edges for deletion so that it becomes bipartite.

Matching - (Udacity, Youtube)


Our next important concept is the matching.
Given a graph, a subset of the edges is a matching if no two edges share (or are incident on) the same vertex.

Note that the graph doesn't have to be bipartite. Take, for example, this graph here.

The two edges marked in orange constitute a matching. By the way, we'll refer to an edge in a matching as a matched edge, so these two edges are matched, and we'll refer to a vertex in a matched edge as a matched vertex. Here, there are four matched vertices.
A maximum matching is a matching of maximum cardinality.

Note that a maximal matching is not necessarily a maximum matching. For instance, the matching shown above is maximal because I can't add any more edges and have it still be a matching. On the other hand, it is not maximum because here is a matching that has greater cardinality.

This happens in bipartite graphs too. It is possible to have a maximal matching that is not maximum because there is a greater matching.

Applications - (Udacity)
Now that we know what bipartite graphs and matchings are, lets consider where the problem of finding a maximum matching in
a bipartite graph comes up in real-world applications. Actually, well make this a quiz where I give you some problem
descriptions and you tell me which can be cast as maximum matching problems in bipartite graphs.
First consider the compatible roommate problem. Here some set of individuals give a list of habits and preferences and for each
pair we decide if they are compatible. Then we want to match roommates so as to get a maximum number of compatible
roommates.
Next, consider the problem of assigning taxis to customers so as to minimize total pick-up time.
Another application is assigning professors to classes that they are qualified to teach. Obviously, we hope to be able to offer all of the classes we want.
And lastly, consider matching organ donors to patients who are likely to be able to accept the transplant. Of course, we want to
be able to do as many transplants as possible.
Check all the applications that can be naturally cast as a maximum matching problem.

Reduction to Max Flow - (Udacity, Youtube)


At this point, we've defined a bipartite graph and the notion of a matching. Now we are going to see the connection with maximum flow.
Intuitively, a maximum matching problem should feel like we are trying to push as much stuff from one side of the partition to the other as possible. It should be no surprise then that there turns out to be an easy reduction to the maximum flow problem, which we've already studied.

We build a flow network that has the same set of vertices, plus two more which serve as the source and the sink. We then add
edges from the source to one half of the partition and edges from the other half of the partition to the sink. Edges are given a
direction from the source side to the sink side. All the capacities are set to 1. Setting the capacities of the new edges to 1 is
important to ensure that the flow from or to any vertex isn't more than 1.

Having constructed the graph, we then run Ford-Fulkerson on it, and we return the edges with positive flow as the matching. Actually, all edges will have flows 0 or 1, as we'll see.
The time analysis here is rather simple. Building the network is O(E), maybe O(V), depending on the representation used. This is small, however, compared to the cost of running Ford-Fulkerson, which is O(EV). Note that V is a bound on the total capacity of the flow and hence a bound on the total number of iterations. In this particular case, this gives a better bound than given by the scaling algorithm or by Edmonds-Karp.
Lastly, of course, returning the matched edges is just a matter of saying, for each vertex on the left, which vertex on the right it sends flow to. That's just O(V) time.
Clearly, Ford-Fulkerson is the dominant part and we end up with an algorithm that runs in time O(EV).
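The construction just described can be sketched directly in Python. The source and sink names 's' and 't' are our own (assumed distinct from the graph's vertices, at most one edge per pair), and the augmenting-path search is a simple depth-first Ford-Fulkerson on the residual capacities:

```python
def matching_via_max_flow(L, R, edges):
    """Reduce bipartite matching to max flow: unit capacities everywhere,
    a source 's' into every left vertex, every right vertex into sink 't'."""
    cap, adj = {}, {}                    # residual capacities, adjacency lists

    def add(u, v):                       # directed unit-capacity edge u -> v
        cap[(u, v)] = 1
        cap.setdefault((v, u), 0)        # residual (backwards) edge
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    for u in L:
        add('s', u)
    for v in R:
        add(v, 't')
    for u, v in edges:
        add(u, v)

    def augment(u, seen):                # DFS for a path in the residual network
        if u == 't':
            return True
        seen.add(u)
        for v in adj.get(u, []):
            if cap[(u, v)] > 0 and v not in seen and augment(v, seen):
                cap[(u, v)] -= 1         # push one unit of flow along the path
                cap[(v, u)] += 1
                return True
        return False

    while augment('s', set()):           # each augmentation adds one unit of flow
        pass
    # the matched edges are the original edges now carrying one unit of flow
    return {(u, v) for u, v in edges if cap[(v, u)] == 1}
```

Each call to `augment` is one Ford-Fulkerson iteration, and since the max flow is at most V, the loop body runs O(V) times, giving the O(EV) bound discussed above.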

Reduction Correctness - (Udacity, Youtube)


We assert the correctness of the reduction with the following theorem.

Consider a bipartite graph G and the associated flow network F. The edges of G which have flows of 1 in a maximum flow obtained by Ford-Fulkerson on F constitute a maximum matching in G.

To prove this theorem, we start by observing that the flow across any edge is either zero or one. Any augmenting path will have a bottleneck capacity of 1, so we'll always augment flows by 1, but 1 is also the maximum flow that can cross any edge.

The conservation of flow then implies that each vertex is in at most one edge that we return. On the left-hand side, we only have one unit of capacity going in, so we can't have more than one unit of flow going out. On the right, we have only one unit of capacity going out, so we can't have more than one unit of flow going in. This means that the set of edges in the original graph that have flow 1 represents a matching.
Is it a maximum matching? Well, if there were a larger matching, it would be trivial to construct a larger flow just by sending flow along those edges, so yes, it must be a maximum matching.

Deeper Understanding - (Udacity, Youtube)


At this point, it's tempting to be satisfied with the result. After all, we have found a relatively low-order polynomial solution for the problem.
On the other hand, the sort of flow networks we are dealing with here are very specialized, having all unit capacities and a bipartite structure. Our algorithm doesn't exploit this at all.
In the remainder of the lecture, we'll look at this special structure more closely, gain some additional insights about bipartite matchings, and finally these will lead us to a faster algorithm.

Augmenting Paths - (Udacity, Youtube)


We start this search for a deeper understanding with the augmenting path. Recall that in our treatment of the maximum flow problem, we were given a flow over some graph and we defined the residual network, which captured the ways in which we were allowed to modify this flow. This included adding backwards edges that went in the opposite direction to edges in the original graph. Augmenting paths then were paths over this residual network from the source to the sink that increased (or augmented) the flow.

In a network that arises from a bipartite matching problem, there are a few special phenomena that are worth noting. First,
observe that all intermediate flows found by Ford-Fulkerson correspond to matchings as well. If there is flow across an internal
edge, then it belongs in the matching.

Also, because flows along the edges are either 0 or 1, there are no antiparallel edges in the residual network. That is to say that
either the original edge is there or the reverse is, never both. Moreover, the matched edges are the ones that have their direction
reversed. Also, only unmatched vertices have an edge from the source or to the sink. Matched vertices have these edges
reversed.
The result of all of this is that any augmenting path must start at an unmatched vertex and then alternately follow an unmatched,
then a matched edge, then an unmatched edge, and so forth until it finally reaches an unmatched vertex on the other side of the
partition.
This realization allows us to strip away much of the complexity of flow networks and define an augmenting path for a bipartite
matching in more natural terms.
We start by defining the more general concept of an alternating path.

Given a matching M, an alternating path is one where the edges are alternately in M and not in M. An augmenting path is an alternating path where the first and last edges are unmatched.

For example, the path shown in blue on the right is an augmenting path for the matching illustrated in purple on the left.

We use it to augment the matching by making the unmatched edges matched and the matched ones unmatched along this path. This always increases the size of the matching because before we flipped, there was one more unmatched edge than matched edge along the path, so when we exchange the matched and unmatched edges, we increase the size of the matching by one.
In fact, we can restate the Ford-Fulkerson method purely in these terms.

1. Initialize M = ∅.
2. While there is an augmenting path p, update M = M ⊕ p.
3. Return M.

(Here the ⊕ operator denotes the symmetric difference, i.e. A ⊕ B = (A ∪ B) − (A ∩ B).)
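Restated this way, the method is only a few lines of code. The sketch below (a common formulation of augmenting-path search, sometimes attributed to Kuhn) finds each augmenting path by depth-first search and applies the symmetric-difference flip implicitly by re-matching along the path; the adjacency-dict input format is our own assumption:

```python
def max_bipartite_matching(adj_L):
    """Start with an empty matching and repeatedly flip an augmenting path.

    adj_L maps each left vertex to the list of right vertices adjacent to it."""
    match_R = {}                  # right vertex -> left vertex it is matched to

    def try_augment(u, visited):
        # Search for an augmenting path starting at left vertex u.
        for v in adj_L[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be re-matched elsewhere;
            # flipping along the alternating path grows the matching by one.
            if v not in match_R or try_augment(match_R[v], visited):
                match_R[v] = u
                return True
        return False

    for u in adj_L:               # one augmentation attempt per left vertex
        try_augment(u, set())
    return {(u, v) for v, u in match_R.items()}
```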

Vertex Cover - (Udacity, Youtube)


Now we turn to the concept of a vertex cover, which will play a role analogous to the one played by the concept of a minimum cut in our discussion of maximum flows.

Given a graph G = (V, E), S ⊆ V is a vertex cover if every edge is incident on a vertex in S.

We also say that S covers all the edges. Take this graph, for example. If we include the lower left vertex marked in orange, then
we cover three edges.

By choosing more vertices so as to cover more edges, we might end up with a cover like this one.

One pretty easy observation to make about a vertex cover is that its size serves as an upper bound for the size of a matching in the graph.
In any given graph, the size of a matching is at most the size of a vertex cover.

The proof is simple. Clearly, for every edge in the matching at least one vertex must be in the cover, and all of these vertices
must be distinct because no vertex is in two matched edges.

Find a Min Vertex Cover - (Udacity)


Let's get a little practice with vertex covers. Find a minimum vertex cover (i.e. one of minimum cardinality) for the graph below.

Max Matching Min Vertex Cover - (Udacity, Youtube)

Now, we are ready for the matching equivalent of the maxflow/mincut theorem, the max-matching/min vertex cover theorem.

The proof is very similar. We begin by showing that if M is a maximum matching, then it admits no augmenting paths. Well, suppose not. Then there is some augmenting path, and if we augment M by this path, we get a larger matching, meaning that M was not maximum as we had supposed.
Next, we argue that if M admits no augmenting paths, then there exists a vertex cover of the same size as M. This is the most interesting part of the proof. We'll let H be the set of vertices reachable via alternating paths from unmatched vertices in L (the left-hand side of the partition).
We can visualize this definition by starting with some unmatched vertices in L, then following their edges to some set in R, then including the vertices in L that these are matched with, etc.

Note that because M doesn't admit any augmenting paths, all of these paths must terminate in some matched vertex in L.
Let's draw the rest of the graph here as well.

We have some matched vertices in L, the vertices in R that they are matched to, and some unmatched vertices in R. Note that H and its complement correspond to the minimum cut we used when discussing flows. To get a min vertex cover, we select the part of H that is in R and the part of L that is not in H.

We call this set S. This set S contains exactly one vertex of each edge in M, so |S| = |M|.
Next, we need to convince ourselves that S is really a vertex cover. The edges we need to worry about are those from L ∩ H to R ∖ H. Such an edge cannot be matched by our definition, and any such unmatched edge would place the vertex in R into H. Therefore, there are no such edges and S is a vertex cover.
Finally, we have to prove that the existence of a vertex cover that is the same size as a matching implies that the matching is a maximum. This follows immediately from our discussion that a vertex cover is an upper bound on the size of a matching.
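The construction in this proof is itself an algorithm: given a maximum matching, we can compute a minimum vertex cover. A rough sketch, assuming edges are given as (left, right) pairs and that the supplied matching really is maximum:

```python
def min_vertex_cover(L, R, edges, matching):
    """König-style construction from the proof above: H is everything
    reachable by alternating paths from unmatched vertices in L; the cover
    is (L - H) together with (R intersect H)."""
    matched_L = {u for u, v in matching}
    partner = {}
    for u, v in matching:
        partner[u] = v
        partner[v] = u
    H = set()
    frontier = [u for u in L if u not in matched_L]   # unmatched roots in L
    H.update(frontier)
    while frontier:
        nxt = []
        for u in frontier:                  # u is on the left side
            for (a, b) in edges:
                if a == u and b not in H and partner.get(u) != b:
                    H.add(b)                # follow an unmatched edge into R
                    w = partner.get(b)      # then the matched edge back to L
                    if w is not None and w not in H:
                        H.add(w)
                        nxt.append(w)
        frontier = nxt
    return (set(L) - H) | (set(R) & H)
```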

The Frobenius-Hall Theorem - (Udacity, Youtube)


Before we turn to finding a faster algorithm for finding a max matching, there is a classic theorem related to matchings that we should talk about. This is the Frobenius-Hall theorem. For a subset of the vertices X, we'll use N(X) to indicate the union of the neighbors of the individual vertices in X.
If we consider this graph here, then the neighbors of the orange vertices will be the green ones.

Note that the fact that the neighborhood of X is larger than X bodes well for the possibility of finding a match for all the vertices
on the left hand side. At least, there is a chance that we will be able to find a match for all of these vertices.
When this is not the case, as seen below, then it is hopeless.

Regardless of how we match the first two, there will be no remaining candidates for the third vertex.
We can make this intuition precise with the Frobenius-Hall Theorem which follows from the max-matching/min vertex cover
argument.
Given a bipartite graph G = (L, R, E), there is a matching of size |L| if and only if for every X ⊆ L, we have that |N(X)| ≥ |X|.

The forward direction is the simpler one. We let M be a matching of the same size as the left partition, let X be any subset of
this side of the partition, and we let Y be the vertices that X is matched to.

Because the edges of the matching don't share vertices, |Y| = |X|. Yet Y ⊆ N(X), implying that |X| = |Y| ≤ |N(X)|. That is, the neighborhood of X is at least the size of X.
The other direction is a little more challenging. Suppose not. That is, there is a maximum matching M, and it has fewer than |L|
edges. We let H be the set of vertices reachable via an alternating path from an unmatched vertex in L. This is the same picture
used in the max-matching/min-vertex cover argument. There is at least one such unmatched vertex by our assumption here.

The neighborhood of the left side of H is just the right side of H by construction, but the left-hand side must be strictly larger, because the matched vertices on either side effectively cancel each other out, leaving the unmatched vertices in L as extras.
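For small graphs, the Frobenius-Hall condition can be checked directly by brute force over all subsets of L, which makes a handy sanity check (exponential in |L|, so for illustration only):

```python
from itertools import combinations

def hall_condition_holds(L, neighbors):
    """Brute-force check of the Frobenius-Hall condition: every subset X of
    the left side must satisfy |N(X)| >= |X|.

    `neighbors` maps each left vertex to its set of right-side neighbors."""
    for k in range(1, len(L) + 1):
        for X in combinations(L, k):
            NX = set().union(*(neighbors[u] for u in X))
            if len(NX) < len(X):
                return False   # X is a violating set: no matching saturates L
    return True
```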

Perfect Matchings - (Udacity)


Our treatment of matchings wouldn't be complete without talking about the notion of a perfect matching.

In a bipartite graph G = (L, R, E), a matching M is perfect if |M| = |L| = |R|.

That is to say, all of the vertices are matched. To review some of the key concepts from the lesson so far, we'll do a short exercise. Assume that the left and right sides are of the same size. Which of the following implies that there is a perfect matching?

Toward a Better Algorithm - (Udacity, Youtube)


Now that we have a deeper understanding of the relationship between max-matching and max-flow problems, we are ready to understand a more sophisticated algorithm. Back in the maximum-flow lecture we considered two ways to improve over the naive Ford-Fulkerson. One was to prefer heavier augmenting paths, ones that pushed more flow from the source to the sink. In the max-matching context this doesn't make much sense because all augmenting paths have the same effect: they increase the matching by one.
The other idea was to prefer the shortest augmenting paths, and there was Dinic's further insight that the breadth-first search need only be done when the shortest path length changed, not once for every augmentation. Pursuing these ideas gives us the Hopcroft-Karp algorithm.

The Hopcroft-Karp Algorithm - (Udacity, Youtube)


The Hopcroft-Karp algorithm goes like this.

We first initialize the matching to the empty set. Then we repeat the following: first build an alternating level graph rooted at the unmatched vertices in the left partition L, using a breadth-first search. Here the original graph is shown on the left and the associated level graph on the right.

Having built this level graph, we use it to augment the current matching with a maximal set of vertex-disjoint shortest augmenting paths.
We accomplish this by starting at the unmatched vertices in R and tracing our way back. Having found a path back to an unmatched vertex in L, we delete the vertices along the path as well as any orphaned vertices in the level graph. (See the video for an example.)
Note that we only achieve a maximal set of vertex-disjoint paths here, not a maximum. Once we have applied all of these augmenting paths, we go back and rebuild the level graph and keep doing this until no more augmenting paths are found. At that point we have found a maximum matching and we can return M.
From the description, it is clear that each iteration in this loop (what we call a phase from now on) takes only O(E) time. The first part is accomplished via breadth-first search and the second also amounts to just a single traversal of the level graph.
The key insight is that only about √V phases are needed.
Our overall goal, then, is to prove the theorem stated in the figure below.
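Putting the pieces together, here is one common way the algorithm is sketched in Python. The dictionary-based input format and the use of a `None` sentinel for free right vertices are our own choices; the BFS layers the left vertices, and the DFS peels off vertex-disjoint shortest augmenting paths within those layers:

```python
from collections import deque

INF = float('inf')

def hopcroft_karp(adj_L):
    """adj_L: left vertex -> list of right neighbors.  Each bfs()/dfs() round
    below is one phase: a BFS builds the level structure, then DFSs flip a
    maximal set of vertex-disjoint shortest augmenting paths."""
    match_L = {u: None for u in adj_L}
    match_R = {}
    dist = {}

    def bfs():
        queue = deque()
        for u in adj_L:
            if match_L[u] is None:        # unmatched left vertices are roots
                dist[u] = 0
                queue.append(u)
            else:
                dist[u] = INF
        dist[None] = INF                  # distance to the nearest free right vertex
        while queue:
            u = queue.popleft()
            if dist[u] < dist[None]:
                for v in adj_L[u]:
                    w = match_R.get(v)    # w is None when v is free
                    if dist.get(w, INF) == INF:
                        dist[w] = dist[u] + 1
                        if w is not None:
                            queue.append(w)
        return dist[None] != INF          # an augmenting path exists

    def dfs(u):
        for v in adj_L[u]:
            w = match_R.get(v)
            if dist.get(w, INF) == dist[u] + 1 and (w is None or dfs(w)):
                match_L[u], match_R[v] = v, u   # flip edges along the path
                return True
        dist[u] = INF                     # dead end: prune u from this phase
        return False

    size = 0
    while bfs():                          # one iteration of this loop = one phase
        for u in adj_L:
            if match_L[u] is None and dfs(u):
                size += 1
    return size, {(u, v) for u, v in match_L.items() if v is not None}
```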

Matching Differences - (Udacity)


Our first step is to understand the difference between the maximum matching that the algorithm finds and the intermediate matchings that the algorithm produces along the way. Actually, we'll state the key claim in terms of two arbitrary matchings M and M′, and I want you to help me out. Think about the graph containing edges that are in M but not in M′, or vice-versa, and tell me which of the following statements are true.
(Watch the answer on the Udacity site)
The key result is the following.

If M′ is a maximum matching and M is another matching, then M ⊕ M′ contains |M′| − |M| vertex-disjoint paths that augment M.

Shortest Augmenting Paths - (Udacity, Youtube)


This next lemma will characterize the effect of choosing to augment by a shortest path.

This lemma also has an important corollary.

Analysis of a Phase - (Udacity, Youtube)


Our next lemma states the key property of a phase of the Hopcroft-Karp algorithm.
Each phase increases the length of the shortest augmenting path by at least two.

Let Q be a shortest augmenting path after the phase, with path length k. It's impossible for |Q| to be less than k by a previous lemma, which showed that augmenting by a shortest augmenting path never decreases the minimum augmenting path length. On the other hand, |Q| being equal to k would imply that Q is vertex-disjoint from all paths found in the phase. But then it would have been part of the phase, so this isn't possible either.
Thus, |Q| > k, and |Q| is also odd because Q is augmenting, so |Q| ≥ k + 2, completing the lemma.

Number of Phases - (Udacity, Youtube)


Now that we know that each phase must increase the length of the shortest augmenting path by at least two, we are ready to bound the number of phases. Specifically, the number of phases used by Hopcroft-Karp is O(√V). Note that the trivial bound saying that there can be only one phase per possible augmenting path length isn't good enough. That would still leave us with an O(V) bound.
In reality, we will have a phase for length 1 and length 3, probably for length 5, maybe not for length 7, and so forth, but as we consider greater lengths the ones for which we will have augmenting phases get sparser and sparser.

The key argument will be that after roughly √V phases, there will only be √V phases left. Let M′ be the matching found by Hopcroft-Karp and let M be the matching after the first √V phases. Because each phase increased the shortest augmenting path length by two, the length of the shortest augmenting path in M is at least 2√V + 1.

Hence, no augmenting path in the difference between M and M′ can have shorter length. This implies that there are at most |V| divided by this length augmenting paths in the difference; we just run out of vertices. If there can be only so many augmenting paths in the difference, then M can't be too far away from M′: certainly, it is within √V.

Hence M will only be augmented √V more times, meaning that there can't be more than √V more phases. We have √V phases to make all of the augmenting paths long enough so that there can only be √V more possible augmentations. That completes the theorem.

In summary, then, the Hopcroft-Karp algorithm yields a maximum bipartite matching in O(E√V) time.

Conclusion - (Udacity, Youtube)


That concludes our lesson on max matchings in bipartite graphs. If the topic of matchings is interesting to you, I suggest exploring matchings in general graphs, instead of just the bipartite ones we studied here, and also taking a look at minimum-cost matchings and the Hungarian algorithm. Good references abound.

Linear Programming - (Udacity)


Introduction - (Udacity, Youtube)
The subject for this lesson is linear programming. We've seen some very general tools and abstractions in this course, but it would be hard to argue that any other combines the virtues of simplicity, generality and practicality as well as linear programming does. It is simple enough to be captured with just a few matrix expressions and inequalities, general enough that it can be used to solve any problem in P in polynomial time, and practical enough that it helped revolutionize business and industry in the middle of the twentieth century. In many ways, this is algorithms at its best.

Preliminaries - (Udacity, Youtube)


This lesson begins by reviewing the two-dimensional linear programming problems that high school students often solve in their Algebra 2 classes. Then, it extends the equations to n dimensions and captures the essential intuition, that optimal solutions are at the corners of the allowed region, with the Fundamental Theorem of Linear Programming. Finally, it covers the simplex algorithm, a very practical way of solving these optimizations.
There are many good references for linear programming. This treatment will follow most closely David Luenberger's Linear and Nonlinear Programming.
Before we begin, however, I should mention that parts of this lecture will use notation and some ideas from linear algebra. Some
notation we will use is summarized in the image below.

As far as concepts go, the ideas of how to represent systems of equations as matrices, of linear independence of vectors, matrix rank, and the inverse of a matrix should all be familiar. If they aren't, then it would be a good idea to refresh your understanding before watching the rest of this lesson.
As always, it is recommended that you watch with pencil and paper handy so that you can pause the video and work out details
on your own as needed.

HS Linear Programming - (Udacity, Youtube)


I want to begin our discussion of linear programming with a kind of problem that you likely first encountered in a High School algebra class. A graduate student is trying to balance research and relaxation time. He figures that eating, sleeping, and commuting leave him with 14 hours in the day for other activities. He has also found that after two hours of work, he needs to relax for a half-hour before he can work effectively for another hour again. Of course, his advisor wants him to work as much as possible.
We'll let x1 be the amount of time spent on research and x2 the amount of time spent relaxing. Then we can express the graduate student's time management problem as the following optimization.

max    x1
s.t.   x1 + x2 ≤ 14
       x1 − 2x2 ≤ 2
       x1, x2 ≥ 0

We express the fact that he only has 14 hours for these activities by saying that x1 + x2 is at most 14. We express the fact that he needs a half-hour of relaxation for each hour of work beyond the first two with the second constraint x1 − 2x2 ≤ 2. Of course, he can't spend negative time on either of these activities, so we need to add those constraints as well. The overall goal is to maximize time worked, so we make that our objective function, and we want to maximize it subject to these constraints.
Now in HS, your teacher probably asked you to begin by graphing the inequalities. When we do this, we see that the constraints generate the following polytope.

Perhaps your HS teacher didn't use the word polytope, but that's what this region here is. Each constraint restricts our solution to half of the plane, called a half-space, and a polytope is the intersection of half-spaces.
After you graph this region, the solution can be picked out as one of the vertices. In this case, it's pretty easy to see that it's this one on the right, which is at the intersection of the two problem constraints. Maybe, if the formula were a little more complicated and you weren't sure, you could have tested each one of the vertices and picked the one with the highest objective value.
Why is the optimal solution at one of the vertices? Well, remember that in this problem and all similar ones from High School, the objective function, the thing we're optimizing, is linear. The only thing that matters is how far we can move in a certain direction, in this case the x1 direction, but it could be any direction in the plane.
If you like, you can think of there being a giant magnet infinitely far away pulling our point x in a certain direction.

In trying to get as close as possible to the magnet, this point must end up at one of the vertices.
If some point is interior, then we can clearly improve by moving in this direction. If a point is on an edge, then we can improve by moving along this edge. The only time we couldn't improve in this way would be if the edge were perpendicular to the direction we wanted to move in. But then both vertices on either side of the segment have the same value and therefore are also optimal solutions.
Thinking more abstractly, there isn't always an optimal solution, as there is in this particular case. If I eliminate one constraint as shown below, then the polytope is unbounded in the gradient direction for our objective.

In this case, we can keep moving our point x further and further, getting greater values for our objective. You give me an x, I've got a better one! Hence there is no optimal solution.
On the other hand, if I put back that constraint and add another one, we might find that they are contradictory.

There is no way to satisfy them. If there are no solutions, there can't be an optimal one.
So those are the three things that can happen: the constraints can create a bounded region and we find an optimum, the region
can be unbounded, in which case we might find an optimum or the problem might be unbounded, or the region can be empty.
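The "test every vertex" method generalizes directly to code. Below is a sketch that writes each constraint as a·(x, y) ≤ b, intersects every pair of constraint boundary lines, discards infeasible intersections, and returns the best vertex. It assumes the feasible region is bounded and non-empty, which, as just discussed, is not always the case:

```python
from itertools import combinations

def solve_2d_lp(constraints, objective):
    """Maximize objective . (x, y) subject to constraints [(a1, a2, b), ...]
    meaning a1*x + a2*y <= b.  Non-negativity is expressed the same way,
    e.g. (-1, 0, 0) means -x <= 0."""
    EPS = 1e-9

    def feasible(p):
        return all(a1 * p[0] + a2 * p[1] <= b + EPS for (a1, a2, b) in constraints)

    vertices = []
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if abs(det) < EPS:
            continue                      # parallel boundaries: no vertex
        x = (b * c2 - a2 * d) / det       # Cramer's rule for the 2x2 system
        y = (a1 * d - b * c1) / det
        if feasible((x, y)):
            vertices.append((x, y))
    return max(vertices, key=lambda p: objective[0] * p[0] + objective[1] * p[1])
```

On the graduate student's problem this returns the vertex (10, 4), the intersection of the two problem constraints.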

Workout Plan - (Udacity)


Let's do a quick exercise on this High School linear programming. Here's the problem:
A movie actor is developing a workout plan. He will burn 12 Calories for every minute of step-aerobics he does and 4 for every minute of stretching he does. The workout must include 5 minutes of stretching and must last no longer than 40 minutes in total. The actor wants to burn as many Calories as possible.
We'll let x be the number of minutes spent on step-aerobics and let y be the number of minutes spent stretching.
I want you to express the actor's problem as a linear program and give the optimal values for x and y in the boxes below.

To n Dimensions - (Udacity, Youtube)


Linear Programming is largely just the generalization of this sort of problem solving to n dimensions, instead of just the two that we've used so far.

(Note that inequality over matrices means that the inequality holds for each element.)
When we first encounter a linear programming optimization problem, it might not be in this form. In fact, the only requirements for an optimization problem being a linear program are that both the objective function and the constraints, whether inequalities or equalities, be linear. If this is true, then we can always turn it into a canonical form like this one.
Here are the key transformations.

Things get a little more interesting when we go from one of the inequalities to an equality. Here we introduce a new variable,
called a slack or surplus variable depending on the inequality.
There is also the problem of free variables that are allowed to be negative. There are two ways to cope with one of these. If it is

involved in an equality constraint, then you can often simply eliminate it through substitution. Otherwise, you can replace it with the difference of two new non-negative variables.

Transformation Quiz - (Udacity)


Let's do a quick exercise to practice these transformations.

Favored Forms - (Udacity, Youtube)


By these various transformations, it's possible to write any linear program in a variety of forms. Two forms, however, tend to be the most convenient and widely used. First is what we'll call the symmetric form (we'll see why it gets that name when we consider duality), and second is the standard form. The key difference between the two is that we have changed the inequality for an equality.

To better understand the relationship between these two forms, I'm going to write the standard form in terms of the symmetric form.

To convert the m inequalities to equalities we introduce m slack variables x_{n+1}, …, x_{n+m}. Of course, this means that we need to augment our matrix A as well so that these slack variables can do their job. And c also needs to be augmented so that we can multiply it with the new x without changing the value of the objective function.

Geometrically, we've switched our optimization from being over a polytope in n dimensions (note the inequalities in the symmetric form) to being over the intersection of a flat (note the equality constraints) with the cone defined by the positive coordinate axes (note the non-negativity constraints) in n + m dimensions.
We expect that an optimum for the symmetric problem will be at one of the vertices of the polytope, where n of the hyperplanes defined by the constraints intersect. That is to say, of these n + m constraints (m from A and n from the non-negativity of x), n must hold with equality, or be tight in the common parlance. Some might come from A, others from the non-negativity constraints, but there will always be n tight constraints.
Over in standard form, the notion of whether the constraints from A are tight or not is captured by the slack variables that we introduced. A slack variable is zero if and only if the corresponding constraint is tight. Thus, at least n of the variables will be zero when we are at a vertex of the original polytope. In fact, if I tell you which n variables are zero, and these correspond to a linearly independent set of constraints, then you can construct the rest of the variables based on the equality constraints.
Now, so far I have kept on using the number of variables n and the number of constraints m from the symmetric form, even as
we talk about the standard form. In general, however, when discussing the standard form we redefine the new n to be the total
number of variables (the old n + m ).
One other thing to note about this equality form is that we require the matrix A to have rank m, where m is the number of constraints. That is to say, the rows should be linearly independent. If the rows aren't linearly independent, there are two possibilities. One is that the constraints are inconsistent, meaning that there is no solution. The other possibility is that the constraints are redundant, meaning that some of them can be eliminated. So from now on we'll assume that A has full rank.

Basic Solutions and Feasibility - (Udacity, Youtube)


From this point onward, we will consider linear programs in the standard form, where the constraints are expressed as equalities. We will have an underdetermined system of equations that has lots of solutions, and we will try to find the one that maximizes the objective function. Now, if you have ever had to come up with a solution to an underdetermined system on your own, you will have noticed that it's easiest to find one by simply setting some of the coefficients to zero so as to create a square system and then solving for the rest. In effect, this is what solvers like Matlab do as well. These solutions are called basic solutions, and it turns out that to solve linear programs, basic solutions are the only ones that we will need to consider. The trick, however, is figuring out which coefficients need to be set to zero. With this intuition in mind, let's dive into the details.
I want to define some vocabulary that will be useful going forward.
First, we say that a vector x is a solution if it solves

Ax = b.
A basic solution is one generated as follows: we'll pick an increasing sequence of m column numbers so that the corresponding columns are independent and call the resulting matrix B. This is easiest to see when the chosen columns are the first m, and we'll use this convention for most of our treatment.
We define xB = B⁻¹b and then embed this in the longer vector x, putting in the values from xB for columns in our sequence and zero otherwise. Then x is a basic solution.
Really, all that we're trying to accomplish here is to let xB get multiplied with the columns of A corresponding to B and have zero multiplied with all the other columns. (Remember post-multiplication corresponds to column operations.)

So that's a basic solution, and we call it basic because it came from our choice of this linearly independent set of columns, which forms a basis for the column space. From the basis, we get a basic solution.
It is possible for more than one basis to yield the same basic solution if some of the entries of xB are zero. Such a solution is called degenerate. This corresponds to a vertex being the intersection of more than n hyperplanes in the symmetric form.
So far, this vocabulary only addresses the equality constraints. Adding in the non-negativity constraints on the variables, we use
the word feasible. Thus, a feasible solution is a solution that has all non-negative entries, and a basic feasible solution is
one that comes from a basis as described above and has all non-negative entries.
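The recipe above (pick m independent columns, solve the square system B xB = b, pad with zeros) can be sketched as follows. Exact rational arithmetic via the standard `fractions` module avoids round-off, and the chosen columns are assumed to be linearly independent:

```python
from fractions import Fraction

def basic_solution(A, b, cols):
    """Form B from the chosen columns of A, solve B x_B = b by Gaussian
    elimination, and embed x_B in a full-length vector (zeros elsewhere)."""
    m, n = len(A), len(A[0])
    # Augmented system [B | b] over exact rationals.
    M = [[Fraction(A[i][j]) for j in cols] + [Fraction(b[i])] for i in range(m)]
    for col in range(m):                       # forward elimination with pivoting
        pivot = next(r for r in range(col, m) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]   # normalize the pivot row
        for r in range(m):
            if r != col and M[r][col] != 0:
                M[r] = [u - M[r][col] * v for u, v in zip(M[r], M[col])]
    x = [Fraction(0)] * n
    for k, j in enumerate(cols):
        x[j] = M[k][m]                         # the solved value of x_B
    return x
```

For example, with A = [[1, 2, 1], [0, 1, 1]], b = [5, 2], and columns 0 and 1 chosen, the basic solution is x = [1, 2, 0].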

Find a Basic Solution - (Udacity)


Now for an exercise on basic solutions: given the equations below, find a basic solution for x. Remember that since there are only two rows in the matrix, your solution must not have more than two non-zero entries.

Fundamental Theorem of LP - (Udacity, Youtube)


So far, we've reminded ourselves of the basics of linear programming by examining it in two dimensions. Then we built up some vocabulary and notation that allow us to extend these notions to n dimensions. Now, we are ready for the culmination of all this work in the fundamental theorem of linear programming, which captures the idea that the optimal solutions should be at the corners (also called extreme points) of the feasible region, and tells us that we need only consider basic feasible solutions.

We'll start by proving the first point of the theorem statement above.
Let x be a feasible solution, and we'll consider only the positive entries. Without loss of generality, let's assume that they are the
first p. Then it must be the case that

x_1 a_1 + ... + x_p a_p = b.
That is, after all, part of what it means to be feasible.
Case 1: Suppose first that the columns a_1, ..., a_p are linearly independent.
Then it's not possible that p should be greater than m. If p = m, then x is basic, and we're done. The quantity p could be less
than m, but then we would just add columns as needed until we formed a basis. That covers the independent case.
Case 2: Suppose that a_1, ..., a_p are linearly dependent.
That means that there are coefficients y_1, ..., y_p such that

y_1 a_1 + ... + y_p a_p = 0

with at least one of these coefficients being positive. We'll then choose

ε = min{x_i / y_i | y_i > 0}.


Then, multiplying the above equation in y by ε and subtracting it from x_1 a_1 + ... + x_p a_p = b, we end up with another feasible
solution. This one, however, has at most p − 1 positive entries because our choice of ε sent at least one of them to zero.
We can then repeat this argument as needed to reduce the problem to case 1.
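The case-2 reduction can be checked numerically. Here is a small sketch (all numbers made up): a feasible solution with three positive entries whose columns are dependent is reduced to one with fewer positive entries, exactly as in the argument above.

```python
import numpy as np

A = np.array([[1.0, 1.0, 2.0],
              [0.0, 1.0, 1.0]])
b = np.array([4.0, 2.0])

x = np.array([1.0, 1.0, 1.0])   # feasible: A @ x = b and x >= 0
y = np.array([1.0, 1.0, -1.0])  # dependency: A @ y = 0, some y_i > 0

# epsilon = min over positive y_i of x_i / y_i
eps = min(x[i] / y[i] for i in range(len(y)) if y[i] > 0)
x_new = x - eps * y             # subtract eps times the dependency

print(np.allclose(A @ x_new, b))          # True: still a solution
print((x_new >= 0).all())                 # True: still feasible
print((x_new > 0).sum() < (x > 0).sum())  # True: fewer positive entries
```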
Now, onto part 2 of the theorem, which shows that if there is an optimal feasible solution, there is an optimal basic feasible
solution. This will feel similar to the first part. We let x be an optimal feasible solution, meaning that not only is it a solution but it
also has the highest possible dot-product with c.
As before, we'll let columns a_1, ..., a_p correspond to the non-zero entries of x and consider first the case where these are linearly
independent. The situation is the same as before: p being greater than m is impossible, equal means that it is a basic solution,
and less just means that it is a degenerate solution.
This case is simple.
If the columns are dependent, then we have a set of coefficients y_1, ..., y_p with at least one positive so that

y_1 a_1 + ... + y_p a_p = 0.

Note, however, that for ε sufficiently close to zero (both positive and negative), x − εy is feasible. Since we assumed that x is an
optimal solution, we conclude that c^T y = 0; otherwise, we could choose the sign of ε so as to make

c^T x < c^T (x − εy).

Therefore, we can set ε to the same choice as before, the one that sent one of the coefficients x_1, ..., x_p to zero, without
changing the objective value. By repeating this argument, we eventually reach case 1.
We've just seen how the fundamental theorem of linear programming tells us that we can always achieve an optimal value for the
program with a basic solution. Moreover, basic solutions come from a choice of m linearly independent columns for the basis.
Remember this key point going forward.

Brute Force Algorithm - (Udacity)


The fundamental theorem of linear programming immediately suggests an algorithm where we just try all the possible bases and
take the best generated basic feasible solution, as outlined here. And my question for you is: why is this algorithm problematic
or impractical? Check all that apply.
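For concreteness, the brute-force idea can be sketched as follows (a simple illustration, not an efficient implementation; the example program is made up):

```python
import itertools
import numpy as np

def brute_force_lp(A, b, c):
    """Try every basis; return the best basic feasible solution and its value."""
    m, n = A.shape
    best_x, best_val = None, -np.inf
    for cols in itertools.combinations(range(n), m):
        B = A[:, cols]
        if abs(np.linalg.det(B)) < 1e-9:
            continue                      # columns dependent: not a basis
        x = np.zeros(n)
        x[list(cols)] = np.linalg.solve(B, b)
        if (x < -1e-9).any():
            continue                      # basic, but not feasible
        if c @ x > best_val:
            best_x, best_val = x, c @ x
    return best_x, best_val

# max x1 + x2  s.t.  x1 + 2*x2 <= 4 and x1 <= 3, in standard form with slacks
A = np.array([[1.0, 2.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])
b = np.array([4.0, 3.0])
c = np.array([1.0, 1.0, 0.0, 0.0])

x, val = brute_force_lp(A, b, c)
print(x, val)   # optimum at x1 = 3, x2 = 0.5 with value 3.5
```

The trouble, of course, is that the number of candidate bases, n choose m, grows exponentially with the size of the program.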

Simplex Equations - (Udacity, Youtube)


Now, we'll talk about a much more efficient approach called the Simplex Algorithm. This actually will not be polynomial time in
the worst case, but as a practical matter this algorithm is efficient enough to be widely used. The Simplex algorithm allows us to
move from one basic feasible solution to a better one, and by moving to better and better basic feasible solutions, we eventually
reach the optimum.
For convenience, let's suppose that our current basis consists of the first m columns of A. We'll call this submatrix B and the
remaining submatrix D. It will also be convenient to partition x and c in an analogous fashion. Overall, then, we can rewrite our
standard form as

For the simplex algorithm, we want to consider the effects of swapping out one of the current basis columns for another one. To
do this, we first want to identify a good candidate for moving into the basis, one that will improve the objective function. As the
program stands, however, it is not immediately clear which, if any, are good candidates. Sure, for some x_i the coefficient might be
positive, but raising that value might force others to change because of the constraints, making the whole picture rather
opaque. Therefore, it will be convenient to parameterize our ability to move around in the flat defined by the equality constraints
solely in terms of x_D, the variables that we are thinking about moving into the basis.

To this end, we solve the equality constraint for xB so that we can substitute for it where desired. First we substitute it into the
objective function, and through a little manipulation, we get this expression.

The constant term here doesn't matter, since we are only considering the effects of changing x_D. The quantity that is
multiplied with x_D, however, is important enough that it deserves its own name. Let's call it r_D:

r_D^T = c_D^T − c_B^T B^{-1} D

In our reframing of the problem, therefore, we want to maximize r_D^T x_D. How about the other constraints? Well, the first one goes
away with the substitution. The requirement that x_B remain non-negative, however, remains.
Substituting our equation for x_B, we get the linear program

max   r_D^T x_D

s.t.  B^{-1} D x_D ≤ B^{-1} b
      x_D ≥ 0

where, of course, x_D must remain non-negative as well.


Note that x_D = 0, the current situation for the algorithm, is feasible, and it is very easy to see a way to improve the objective value
just by looking at the vector r_D. Our real goal, however, is not just to climb uphill but to figure out which column should enter
the basis.

Who Enters the Basis - (Udacity)


Actually then, we'll make this a quiz! What makes for a good candidate to enter the basis? Check the best answer.

The answer is that any column corresponding to a positive entry of r_D is a good candidate. We want the entry to be positive
because the corresponding element of x_D is also going to be positive as we increase it. Just picking the greatest entry of r_D
doesn't work because this still might be negative.
This idea then becomes the basis for the simplex algorithm. Pick q such that r_q > 0 and let x_D = θ e_q, where e_q is just the unit
vector along the qth coordinate axis.
This choice simplifies the optimization even further, since x_D is now just proportional to the qth column of D. We'll define
u = B^{-1} D e_q and v = B^{-1} b. Now we have

max   θ r_D^T e_q

s.t.  θ u ≤ v
      x_q ≥ 0

Who Exits the Basis - (Udacity)


Of course, the greater the θ, the greater the value, so we want to make it as big as possible. But how big can we make it?
Let's make this another quiz.

The answer is the first expression,

θ = min{v_i / u_i : u_i > 0}.

Unless u_i is positive, we can make θ as big as we want without running into the constraint. Of these constraints, we'll hit the
one with the lowest ratio first.
Setting θ to this value makes one of these constraints tight and sends the corresponding entry of x_B to zero. Remember that
this constraint came from the requirement that x_B be nonnegative. We can then bring d_q into our basis, kick out the column
corresponding to the constraint that became tight, and repeat.

Simplex Algorithm - (Udacity, Youtube)


In more detail, we can express the simplex algorithm as follows:
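As a sketch, the loop just described can be written in Python roughly as follows. This assumes a starting feasible basis is supplied, uses the lecture's r_D / u / v / θ notation, and omits an anti-cycling rule such as Bland's; the example program at the bottom is made up.

```python
import numpy as np

def simplex(A, b, c, basis):
    """Maximize c @ x subject to A @ x = b, x >= 0, from a feasible basis."""
    m, n = A.shape
    basis = list(basis)
    while True:
        nonbasis = [j for j in range(n) if j not in basis]
        B_inv = np.linalg.inv(A[:, basis])
        v = B_inv @ b                                        # current x_B
        r = c[nonbasis] - c[basis] @ B_inv @ A[:, nonbasis]  # r_D
        if (r <= 1e-9).all():                # no positive entry of r_D
            x = np.zeros(n)
            x[basis] = v
            return x, c @ x                  # optimal
        q = nonbasis[int(np.argmax(r))]      # entering column
        u = B_inv @ A[:, q]                  # u = B^{-1} a_q
        if (u <= 1e-9).all():
            raise ValueError("LP is unbounded")
        # ratio test: theta = min over u_i > 0 of v_i / u_i
        theta, leave = min((v[i] / u[i], i) for i in range(m) if u[i] > 1e-9)
        basis[leave] = q                     # pivot: swap the columns

# max x1 + x2  s.t.  x1 + 2*x2 + s1 = 4,  x1 + s2 = 3  (all variables >= 0)
A = np.array([[1.0, 2.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])
b = np.array([4.0, 3.0])
c = np.array([1.0, 1.0, 0.0, 0.0])

x, val = simplex(A, b, c, basis=(2, 3))   # start from the slack basis
print(x, val)                             # x1 = 3, x2 = 0.5, value 3.5
```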

Simplex Example - (Udacity, Youtube)


Simplex Correctness - (Udacity, Youtube)
We have now described the simplex algorithm and seen it illustrated on a simple example. Next, we argue that the algorithm is
correct, giving us an optimal basic feasible solution for bounded linear programs and reporting that unbounded ones are unbounded.
Let's consider the bounded case first. We begin by recognizing that at each step we make some progress, usually improving the
objective value. We have to be careful here in the case of degenerate basic solutions. Going back to the algorithm, remember
that we pick a new column to go into the basis because it corresponds to a positive value in r_D, and hence increasing it
increases the objective value. The trouble is that if the current basic solution is degenerate, i.e. v has a zero entry, then it's
possible that we won't get to move in this direction at all. The nightmare scenario is that we end up in some kind of cycle.
There are two ways of coping with this challenge. One is to perturb the constraints slightly. The other is to give preference to
lower-indexed columns both for entering and leaving the basis. This is known as Bland's rule. In either case, we can be assured
of making some progress in each step, but the argument is a little tricky.
Clearly, we have a finite number of steps, because there are only n choose m possible bases, and because we make progress,
we don't cycle among them.
Now, we just need to make sure that we don't stop too early. There are two possible ways the algorithm can terminate: either
because u is nonpositive, or because r_D ≤ 0. Let's consider termination because of u first. This turns out to be pretty trivial.
If u ≤ 0, then we get to keep going in the direction of a_q as far as we want, increasing the objective value the whole way, so
clearly the problem is unbounded. We won't terminate falsely on that score.
Let's consider the case where we terminate after examining r_D then. Let x* be an optimal feasible solution and let x be the
current suboptimal basic solution in the simplex algorithm.
Recall that once I choose that basis, I can solve for x_B like so,

then substitute back into the objective function. Note that the choice of basis (partitioning the columns of A into B and D) is
done on the basis of x, not x*.

This expression here is the same r_D that we obtained in the simplex method. Because x is suboptimal, we obtain a strict
inequality when replacing x with x*.
Note, however, that x_D is zero while x*_D is nonnegative, so for the inequality to hold, at least one entry of r_D must be positive.
Hence, if there is a better solution, the simplex algorithm won't terminate.
That wraps up the case where the program is bounded. How about when it is unbounded? By the same argument just given, we
won't terminate in the case where r_D ≤ 0. The algorithm also can't run forever, because it avoids cycling and can thus only visit
each of the n choose m bases once. The only possible remaining outcome is termination after inspecting u, as desired.

Getting Started - (Udacity, Youtube)


If you have been paying careful attention, you will have noticed that the simplex algorithm started with a basic feasible solution.
Now, in some cases basic feasible solutions are easy to find just by inspection, but not always.
For these harder cases, we can create an auxiliary program to help us. First, we negate the constraint equations as necessary so
that we have b ≥ 0.
Then we create our auxiliary program as follows.

We introduce artificial variables y that represent the slack between Ax and b, require these variables to be non-negative, and
then try to minimize their sum.
For this auxiliary program, it is easy to find a basic feasible solution: just set x = 0 and y = b. Therefore, we can start the
simplex algorithm. If we find that the optimum value is zero, then we can start our original program with the values in x. On the
other hand, if the optimum is greater than zero, that means that the original problem was infeasible. This is sometimes called the
Two-Phase approach for solving LPs.
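Setting up the phase-one program is mechanical. Here is a small sketch (the example data are made up) showing the row flips, the augmented matrix, and the obvious starting basic feasible solution x = 0, y = b:

```python
import numpy as np

A = np.array([[1.0, -1.0],
              [-2.0, 1.0]])
b = np.array([2.0, -1.0])

# Negate rows as needed so that b >= 0
neg = b < 0
A[neg] *= -1
b[neg] *= -1

m, n = A.shape
A_aux = np.hstack([A, np.eye(m)])    # artificial variables y get identity columns
c_aux = np.concatenate([np.zeros(n), -np.ones(m)])  # maximize -sum(y)

x0 = np.concatenate([np.zeros(n), b])   # starting BFS: x = 0, y = b
print(np.allclose(A_aux @ x0, b))       # True: x0 satisfies the constraints
print((x0 >= 0).all())                  # True: and it is feasible
```

From here, running the simplex algorithm on (A_aux, b, c_aux) starting at x0 either drives sum(y) to zero, giving a feasible start for the original program, or proves the original program infeasible.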

Conclusion - (Udacity, Youtube)

The simplex method as it is described here was first published by George Dantzig in 1947. Fourier apparently had a similar idea
in the early 19th century, and Leonid Kantorovich had already used the method to help the Soviet army in World War II. It was
Dantzig's publication, however, that led to the widespread application of the method to industry, as the lessons of operations
research learned from the war began to be applied to the wider economy and fuel the post-war economic boom. It remains a
popular algorithm today.
As practical as the algorithm was, theoretical guarantees on its performance remained poor, and in fact, in 1972 Klee and Minty
showed that its worst-case complexity is exponential. It wasn't until 1979 that Khachiyan published the ellipsoid algorithm
and showed that linear programs can be solved in polynomial time. His results were improved upon in 1984 by Karmarkar,
whose method turned out to be practical enough to be competitive with the Simplex method for real-world problems. Both of
these algorithms take shortcuts through the middle of the polyhedron instead of always going from vertex to vertex.
In the next lecture, we'll talk about the duality paradigm, which arises out of linear programming and has been the source of
many insights and the inspiration for new algorithms. Even with a whole other lesson, however, we are only able to scratch the
surface of the huge body of knowledge surrounding this fundamental problem, which has shown itself to be of deep importance in
both theory and practice.

Duality - (Udacity)
Introduction - (Udacity, Youtube)
Every linear program, it turns out, has a dual program which mirrors the behavior of the original. In this lesson, we will examine
this phenomenon to give us a chance to apply some of the knowledge we gained about linear programs, as well as to deepen
our understanding of some other problems that we've already studied. See if you can guess which problems they are as the
lesson goes along.

Bounding an LP - (Udacity)
I want to start off our discussion with a little exercise where we try to find an upper bound on the value of a linear program. We'll
start with this linear program here,

and were going to take a linear combination of these inequality constraints to obtain a bound on the objective function.
Multiplying the first inequality by y_1 and the second by y_2, and adding them together, we obtain this inequality here. Note that it
is important that the y's be non-negative to avoid reversing the inequality.

If we choose y_1 and y_2 such that 6 ≤ 2y_1 + y_2 and 2 ≤ y_1 + 2y_2, then the objective function can be at most the left-hand side
of our new inequality, which can be at most the right.

The quantity 2y_1 + 3y_2 then becomes an upper bound on our objective function.


For this exercise, I want you to choose y1 and y2 to make this bound as tight as possible.

Dual Programs - (Udacity, Youtube)


Associated with every linear program is a so-called dual program, which is also a linear program. This definition is most elegant
when stated in terms of the symmetric form. Indeed, now you see why this form gets the name symmetric.

As we saw in the exercise, the dual program can be thought of as the problem of minimizing an upper bound on the primal.
Note that for all feasible y, we have that b^T y is at least y^T Ax, using the constraint from the primal and the nonnegativity of y. And this
is at least c^T x, using the constraint from the dual and the nonnegativity of x.

In fact, we just proved the Weak Duality Lemma, which states that if x is feasible for the primal problem and y is feasible for the
dual problem, then c^T x is at most b^T y.
Another thing to note here is that if your primal problem isn't in this exact form, you can always convert it, then look at the
corresponding dual and simplify. Often, however, it is easier just to remember that the dual is the problem of bounding the
primal as tightly as possible. For instance, if we change the inequality in the primal to equality, then we can proceed by the
same argument, only the first inequality becomes an equality, and we don't have to rely on y being non-negative. Everything else
is the same.
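The Weak Duality Lemma is easy to check numerically. The sketch below (all numbers made up) verifies c^T x ≤ b^T y for one feasible pair in the symmetric form:

```python
import numpy as np

# Primal: max c @ x  s.t.  A @ x <= b, x >= 0
# Dual:   min b @ y  s.t.  A.T @ y >= c, y >= 0
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.array([2.0, 3.0])
c = np.array([6.0, 2.0])

x = np.array([0.5, 1.0])   # feasible for the primal
y = np.array([3.0, 0.5])   # feasible for the dual

assert (A @ x <= b + 1e-12).all() and (x >= 0).all()
assert (A.T @ y >= c - 1e-12).all() and (y >= 0).all()
print(c @ x, b @ y)        # 5.0 7.5: c^T x <= b^T y, as weak duality promises
```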

Duality Theorem - (Udacity, Youtube)


Here is the picture so far. We have our primal program over here, trying to be maximized, and we have our dual program here, trying
to be minimized, and the obvious question is: do they ever meet?

Well, the answer is "Yes, they always do." More precisely, we state this as follows in the Duality Theorem:
If either the primal problem or the dual has a feasible optimal solution, then so does the other, and the optimal objective
values are equal. If either problem has an unbounded objective value, then the other is infeasible.

We'll start the proof by showing the second part. Suppose the primal is unbounded and y is feasible for the dual. (We're going
to show that both of these can't be true.) By weak duality, b^T y ≥ c^T x for all feasible x. Since the primal is unbounded, however,
I can find an x that gives me a value as high as I want. Whatever the value of b^T y is, I can find a feasible x such that c^T x is larger,
which creates a contradiction. The case where the dual is unbounded is analogous.
Now, we return to the first part: if either the primal problem or the dual has a feasible optimal solution, then so does the other,
and the optimal objective values are equal. Let's start with the primal having a finite optimal solution. From this it follows that
there is a finite basic optimal solution by the Fundamental Theorem of LP. Let's let the basis be the first m columns of the matrix
A as usual and divide x and c up accordingly. (As usual, B stands for basic here.)
Recall then from the simplex algorithm that the vector r_D, which represented the effects of moving along one of the directions in
x_D, had to be nonpositive. I.e.,

0 ≥ r_D^T = c_D^T − c_B^T B^{-1} D.

Otherwise, this basic solution wasn't optimal. Now, we're going to actually construct a solution for the dual. Letting

y^T = c_B^T B^{-1},

we have that y^T D ≥ c_D^T from the nonpositivity of r_D. Therefore,

y^T A = [y^T B, y^T D] = [c_B^T, c_B^T B^{-1} D] ≥ [c_B^T, c_D^T] = c^T.

We conclude that y is feasible for the dual.
Moreover, by substitution, we see that

y^T b = c_B^T B^{-1} b = c_B^T x_B,

where x is the basic optimal solution. By weak duality, this is the best we can do, so y also is optimal.

Dual Optimal Solutions - (Udacity, Youtube)


With this proof, we have actually shown something even stronger than the Duality Theorem we set out to show, because we
have actually given a way to determine a dual optimal solution. We'll start with the linear program in standard form, as usual,
and we'll let the columns of the matrix B form an optimal basis, meaning that it generates an optimal basic feasible solution.
Then y^T, defined as c_B^T B^{-1}, is an optimal solution to the dual problem by our previous argument. Moreover, the optimal values are
equal.
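This recipe is easy to apply numerically. In the sketch below (the program is made up), y is computed as c_B^T B^{-1} from an optimal basis, and we check both dual feasibility and equality of the objective values:

```python
import numpy as np

# Standard form: max c @ x  s.t.  A @ x = b, x >= 0
A = np.array([[1.0, 2.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 1.0]])
b = np.array([4.0, 3.0])
c = np.array([1.0, 1.0, 0.0, 0.0])
basis = [0, 1]                       # an optimal basis for this instance

B = A[:, basis]
x_B = np.linalg.solve(B, b)          # optimal basic solution: x_B = B^{-1} b
y = np.linalg.solve(B.T, c[basis])   # y^T = c_B^T B^{-1}

print((y @ A >= c - 1e-9).all())            # True: y is dual feasible
print(np.isclose(y @ b, c[basis] @ x_B))    # True: the optimal values agree
```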

Dual Solution Calculation - (Udacity)


Let's do an exercise on this idea of a dual optimal solution. Given that x as shown here is an optimal basic solution to the linear
program below, find the dual optimal solution.
We'll let y_1 correspond to the first constraint and y_2 correspond to the second.

Duality of Max Matching - (Udacity, Youtube)


By now, we have seen this picture several times, where we have one quantity that we are trying to maximize and another which
serves as an upper bound that we are trying to minimize, and luckily the two meet at some point that is optimal for both. We
have just seen this with our primal and dual linear programs,
but we saw it earlier in the semester as well with our max-flow/min-cut problem,
and also with our max-matching and vertex cover problems in bipartite graphs.

It's natural to ask, are these phenomena all related? Well, yes they are, and probably the easiest way to see that is to realize that
all of these can be characterized as linear programs and their duals.
Let's take a look at the duality of maximum matching in bipartite graphs first.
We'll let the variable x_ij indicate whether the edge (i, j) should be included in the matching. Then, as a linear program, the problem
becomes to maximize the number of matched edges subject to the constraints that no vertex in L can be matched more than
once and no vertex in R can be matched more than once. Of course, we can't have negatively matched edges.

To build the dual program, we let y_i and y_j be the variables corresponding to these constraints, and we want to minimize their
sum because the constraint vector here is just all ones.

For the constraints, observe that the coefficients in the objective function are 1 and that any x_ij appears once in the equation for
i and once in the equation for j.
Hence y_i + y_j ≥ 1. And of course y_i and y_j can't be negative.

The interpretation here is straightforward: vertex i is in the cover if and only if y_i = 1, and similarly vertex j is in the cover if and
only if y_j = 1.
Every edge must have at least one vertex in the cover, and we are trying to minimize the size of the cover.
So we have just seen how maximum bipartite matching can be expressed as a linear program, and its dual turned out to
have a natural interpretation as the vertex cover problem. This is really neat. Every decision problem in P can ultimately be converted to a
linear program, just because linear programming is P-complete, but not every conversion will result in variables and a
dual program that have such intuitive interpretations. When this happens, it often gives us a way to gain deeper insight into a
problem and its structure.
As you might have guessed, this happens also for the max-flow/min-cut problem, and we'll explore that next.
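In combinatorial terms, weak duality here says that every matching is at most as large as every vertex cover, and the duality theorem (König's theorem in this setting) says the two optima meet. A brute-force sketch on a made-up bipartite graph:

```python
from itertools import combinations

edges = [("l1", "r1"), ("l1", "r2"), ("l2", "r2"), ("l3", "r3")]
vertices = {v for e in edges for v in e}

def is_matching(es):
    """No vertex appears in more than one chosen edge."""
    used = [v for e in es for v in e]
    return len(used) == len(set(used))

def is_cover(vs):
    """Every edge has at least one endpoint among the chosen vertices."""
    return all(u in vs or v in vs for u, v in edges)

max_matching = max(len(es) for r in range(len(edges) + 1)
                   for es in combinations(edges, r) if is_matching(es))
min_cover = min(len(vs) for r in range(len(vertices) + 1)
                for vs in combinations(sorted(vertices), r) if is_cover(vs))

print(max_matching, min_cover)   # 3 3: the two optima meet
```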

Duality of Max Flow - (Udacity, Youtube)


For completeness, we'll go ahead and explore the duality in the maximum flow problem as well. We can cast it as a linear
programming problem by letting f_uv be the flow and c_uv be the capacity across an edge (u, v).
Our goal is to maximize the flow out of the source s, subject to the conservation-of-flow constraint and the capacity constraint.
Of course, flows must be nonnegative as well.
Of course, flows must be nonnegative as well.

To express the dual, we'll use y_u for the conservation constraint at vertex u and y_uv for the capacity constraint at edge (u, v). Two
subscripts mean a capacity constraint; one subscript means a conservation constraint.
The dual problem is to minimize the sum over all edges of c_uv y_uv. Note that the y_u's have no role in the objective function
because their coefficients are zero.

The constraints for the dual involve several cases. We'll consider first those arising from the objective function coefficients being
one for edges out of the source. Each such flow appears once in the capacity constraint and once in the conservation equation for
the receiving vertex.

The case for edges going into the sink is analogous. The flow is present in the capacity constraint and in the conservation-of-flow
equation for the sending vertex. These must be at least zero because the objective function coefficient is zero.

For all other edges, the flow appears in the capacity constraint and BOTH conservation-of-flow equations. Again, the
coefficient in the objective function is zero, so that becomes the constraint. And these dual variables have to be nonnegative.
The interpretation of these dual variables can be a little tricky, so I'm going to rearrange the constraints to isolate the capacity
variables on the left-hand side.

This makes it a little easier to see what is going on. Actually, I think this would make a good exercise.

Interpretation of y - (Udacity)
Suppose that y is a basic optimal feasible solution for the given LP. Which statements are part of an interpretation of y as an s-t
cut, say (A, B)?

Conclusion - (Udacity, Youtube)


In this lesson, we defined the dual of a linear program and showed how this dual program can be seen as the problem of making
a certain kind of bound on the primal program as tight as possible. Then, we saw how maximum flow and maximum bipartite
matching can be expressed as linear programs, and how the minimum s-t cut and vertex cover problems were their duals.
If this idea of duality appeals to you, I suggest looking at the primal-dual algorithm for solving a linear program. Although this is
not a practical strategy for solving general linear programs, it has served as the inspiration for many algorithms for specific
types of problems. In fact, the Ford-Fulkerson algorithm, which we studied, and the famous Hungarian algorithm for finding
minimum-cost matchings can be seen as primal-dual algorithms. Those are good places to start.
More broadly, the type of duality that we've studied has profound implications for game theory, convex geometry, and convex
optimization. Keep this lesson in mind when you encounter those topics.

Approximation Algorithms - (Udacity)


Introduction - (Udacity, Youtube)
So far in our discussion of algorithms, we have restricted our attention to problems that we can solve in polynomial time. These
aren't the only interesting problems, however. As we saw in our discussion of NP-completeness, there are many practical and
important problems for which we don't think there are polynomial algorithms. In some situations, the exponential algorithms we
have are good enough, but we can't obtain any guarantees of polynomial efficiency for these problems unless P=NP.
Besides resorting to exponential algorithms, we also have the option of approximation. For instance, we might not be able to
find the minimum vertex cover in polynomial time, but we can find one that is less than twice as big as it needs to be. And many
other NP-complete problems admit efficient approximations as well. This idea of approximation will be the subject of this
lesson.

An Approximation for Min Vertex Cover - (Udacity, Youtube)


We'll start our discussion with a very simple approximation algorithm, one for the minimum vertex cover problem.

As input, we are given a graph and we want to find the smallest set of vertices such that every edge has at least one end in this
set. Recall that this problem is NP-complete. We reduced maximum independent set to it earlier in the course.
The approximation algorithm goes like this.

We start with an empty set, and then while there is still an edge that we haven't covered yet, we choose one of these edges
arbitrarily and add both vertices to the set. Then we remove all the edges incident on u and all those incident on v, since those
edges are now covered. Next, we pick another edge and remove all edges incident on it. And so on, and so forth, until there
aren't any edges left. At the end of this process, the set C that we've picked must be a cover. (See the video for an animation.)
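The steps just described can be sketched in a few lines of Python (the example graph is made up):

```python
def approx_vertex_cover(edges):
    """2-approximation: repeatedly take both endpoints of an uncovered edge."""
    cover = set()
    remaining = list(edges)
    while remaining:
        u, v = remaining[0]          # choose an uncovered edge arbitrarily
        cover |= {u, v}              # add BOTH of its vertices
        remaining = [(a, b) for (a, b) in remaining
                     if a not in cover and b not in cover]
    return cover

# A star centered at 0 plus the edge (3, 4); the optimal cover {0, 3} has size 2.
edges = [(0, 1), (0, 2), (0, 3), (3, 4)]
cover = approx_vertex_cover(edges)
print(all(u in cover or v in cover for u, v in edges))  # True: it is a cover
print(len(cover) <= 2 * 2)                              # True: at most twice optimal
```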

Looking back at the original graph, it's not too hard to see that a cover obtained in this way need not be a minimum one. Here is
a cover obtained with the algorithm (orange) and an optimal cover (green).

In this case, the ApproxVC algorithm returned a cover twice as large as the optimal cover. Fortunately, this is as bad as it gets.

Given a graph G that has minimum vertex cover C*, the algorithm ApproxVC returns a vertex cover C such that |C| / |C*| ≤ 2.

To prove this, it is useful to consider the set of edges chosen by the algorithm at the start of each iteration. We'll call this set M.
We use the letter M here because this set is a maximal matching. It pairs off vertices in such a way that no vertex is part of
more than one pair. In other words, this set of edges must be vertex disjoint. That means that in order to cover just this set of
edges, any vertex cover must include at least one vertex from each. Therefore, |M| ≤ |C*|.
Since C* is a minimum cover, the set C chosen by the algorithm can only be larger. It is of size 2|M|. Altogether, then,

|M| ≤ |C*|  and  |C| = 2|M| ≤ 2|C*|.

Dividing through by |C*| then gives the desired result.

C and C* - (Udacity)
Given this theorem, let's explore the relationship between the size of the optimum cover and the one returned by our ApproxVC
algorithm with a quick question. Suppose that G is a graph with a minimum vertex cover C* and that our ApproxVC algorithm
returned a set of vertices C. Fill in the blanks below so as to make these statements as strong as possible.

Lower Bounding the Optimum - (Udacity, Youtube)


Even though the algorithm we just discussed for minimum vertex cover was very simple, it illustrates some key ideas found in
many approximation schemes.
Consider this figure here, illustrating the possible sizes of the minimum vertex cover.

We have the size of the set returned by our algorithm, |C|, and the size of the optimal one, |C*|. We would like to be able to find some
GUARANTEE about the relationship between the two. The trouble, of course, is that we don't know the optimal value. Actually,
finding the optimal value is NP-complete, and that's why we are searching for an approximation algorithm anyway.

We resolve this dilemma by finding a lower bound on the size of the optimal cover in the maximal matching that the algorithm
finds. Then we find the UPPER bound on our approximation in terms of this lower bound. Note that the approximation is
not enough to tell us enough about the optimum value to allow us to solve the decision version of the problem: does the graph have
a vertex cover of a particular size?
Our approximation might be twice the optimum value,

or it might find an exact solution.

Since we can't tell which situation we are in, we can't tell where in this range the minimum vertex cover falls.

Optimization - (Udacity, Youtube)


In our discussion of computability and complexity, we focused on deciding languages. As we began our discussion of
algorithms, however, we began to talk about optimization problems instead without ever formally defining them. This was fine
then, but as we discuss approximation algorithms, we are going to circle back to some of the ideas we encountered in
complexity and we need a formal way to connect them.
Therefore, we are going to define an optimization problem, using min vertex cover as an example to illustrate, and this will allow
us to give a formal definition of an approximation algorithm.

The first thing we need is a set of problem instances. For the example of minimum vertex cover this is just the set of undirected
graphs. For each instance, there is a set of feasible solutions. For min vertex cover, this is the set of covers for the graph. Next,
we need an objective function, the thing we are trying to optimize. For min vertex cover this is the size of the cover. And we
need to say whether we are minimizing or maximizing the objective. For min vertex cover, we are minimizing it of course.
Relating this back to our treatment of complexity, we say that an NP-optimization problem is one where these first three criteria
are computable in polynomial time. That is to say, there is a polynomial algorithm that says whether the input instance is valid,
one that can check whether a solution is feasible for the given instance, and one that can evaluate the objective function.
Now, every optimization problem has a decision version of the form: is the optimum value at most some value for a min problem, or at
least some value for a max problem? For minimum vertex cover, we ask if there is a cover of size less than some threshold. With this in
mind, we can then say an optimization problem is NP-hard if its decision version is. A problem is NP-hard, by the way, if an NP-complete problem can be reduced to it. In our example, min vertex cover is NP-hard because the decision version is.
Remember, we reduced from the maximum independent set problem.

So that's how optimization relates to complexity.

Approximation Algorithms - (Udacity, Youtube)


Next, I want to use these definitions of an optimization problem to define an approximation algorithm.

Note that it's okay for the approximation factor to be a function of the instance size.

Also, when we are working with maximization problems instead of minimization ones, this inequality gets reversed. Our previous
result for the min vertex cover problem can be stated in terms of this definition by saying that min vertex cover has a polynomial-time
factor-2 approximation algorithm.

An Approximation for Maximum Independent Set - (Udacity)


As we saw earlier in the course, the complement of a minimum vertex cover is a maximum independent set. It's natural to ask,
then: can we turn our factor-2 algorithm for minimum vertex cover into an algorithm for maximum independent set? Such an
approach yields what about maximum independent set?

An FPTAS for Subset Sum - (Udacity, Youtube)


The subset sum problem admits a fully polynomial time approximation scheme, or FPTAS for short. Recall that the decision
version of subset sum was whether, given a set of numbers, there was any subset that summed to a particular value. The
optimization version of this is to maximize the sum without going over some threshold t.

This problem admits a very important class of approximations.


For any ε > 0, there is an O((n² log t)/ε) time, factor (1 + ε) algorithm for subset sum.

The smaller the epsilon, the better the approximation, but the worse the running time.
This is a remarkable result. It may be intuitive that one should be able to trade off spending more time for a better
approximation guarantee, but it isn't always the case that we get to do so arbitrarily as in this theorem. Because this isn't a
particular algorithm but rather a kind of recipe for producing an algorithm with the right properties, we call this a polynomial time
approximation scheme, or PTAS for short. For every ε you choose, there is an algorithm that can approximate that well.
This approximation scheme is extra special because the running time is polynomial in 1/ε as well as polynomial in the size of the
input. Therefore, we say that this is a fully polynomial time approximation scheme. The alternative would be for the epsilon to
appear in one of the exponents, perhaps. Then it would just be a polynomial time approximation scheme.
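To make the flavor of such a scheme concrete, here is a minimal sketch of the classic trimming idea often used to prove this theorem: keep a sorted list of achievable sums, discarding any sum within a small factor of one already kept. The function name and the per-step tolerance ε/(2n) are our own choices, not fixed by the course.

```python
def fptas_subset_sum(nums, t, eps):
    """Sketch of a trim-based FPTAS for subset sum.

    Returns a subset sum <= t that is at least OPT / (1 + eps),
    assuming 0 < eps < 1. Sums within a (1 + eps/(2n)) factor of an
    already-kept sum are trimmed, keeping the list polynomially short.
    """
    n = len(nums)
    delta = eps / (2 * n)  # per-step trimming tolerance
    sums = [0]             # sorted list of achievable sums kept so far
    for x in nums:
        # merge in the new sums that use x, dropping anything above t
        merged = sorted(set(sums) | {s + x for s in sums if s + x <= t})
        # trim: keep a sum only if it beats the last kept sum by a (1 + delta) factor
        kept, last = [], -1.0
        for s in merged:
            if s > last * (1 + delta):
                kept.append(s)
                last = s
        sums = kept
    return max(sums)
```

For instance, with nums = [104, 102, 201, 101], t = 308 and ε = 0.4, the exact optimum is 307, and the sketch is guaranteed to return a subset sum of at least 307/1.4.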

Traveling Salesman Problem - (Udacity, Youtube)


After hearing about these approximation schemes, the optimists may be saying: hey, maybe every NP-complete problem
admits an FPTAS. Unfortunately, this isn't true unless P=NP. There are some problems where approximating the optimum
within certain factors would lead to a polynomial algorithm for solving every problem in NP. This phenomenon is known as
Hardness of Approximation, and it occupies an important place in the study of complexity.
We'll illustrate this idea by showing that the Traveling Salesman Problem is hard to approximate to within any constant factor.
In case you haven't seen the traveling salesman problem before, it can be stated like this.

We are given a graph G. Usually, all the edges are present, and with each of them is associated some cost or distance. We'll
assume that all the edges are present, so we won't draw them all in our examples like this one here. The goal is to find the
minimum cost Hamiltonian cycle. That is to say, we want to visit each of the vertices, without ever visiting the same one twice,
at minimum total cost. This problem is NP-complete in general. And even a constant factor approximation is impossible unless
P=NP, as we will prove next.

Hardness of Approximation for TSP - (Udacity, Youtube)


Being a little more formal, we can say:

If P is not equal to NP, then for any constant α ≥ 1, there is no polynomial-time factor-α approximation algorithm for the Traveling
Salesman problem.

For the proof, we reduce from the Hamiltonian cycle problem, where we are given a graph, not necessarily complete this time, and
we want to know if there is a cycle that visits every vertex exactly once. Here then is how we set up the traveling salesman
problem. We assign a cost of 1 to every edge in the original graph G and assign a cost of α|V| + 1 to every edge not in the
original graph.

Clearly, then, if G has a Hamiltonian cycle, then the optimum for the traveling salesman problem has a cost of |V|, with a cost of
1 for every edge that it follows. A factor-α approximation would then find a Hamiltonian cycle with cost at most α|V|. Letting H*
be an optimal Hamiltonian cycle for the TSP problem and letting H be the cycle returned by the α-approximation, we have that

c(H) ≤ α|V|.

On the other hand, if the original graph G has no Hamiltonian cycle, then the cost of the one returned by the approximation
algorithm must be at least as large as the optimum, which must follow at least one edge not in the original graph. Hence,

c(H) ≥ c(H*) ≥ (|V| − 1) · 1 + α|V| + 1.

The first term on the right-hand side comes from the edges in the original graph, and the second from following one of the edges
not in the original graph. Simplifying gives the lower bound of (1 + α)|V|.
Therefore, to decide Hamiltonian cycle, we just run the α-approximation on the graph and compare the resulting cost to α|V|. If
it is larger, then there can't be a Hamiltonian cycle in the graph. On the other hand, if it is the same or smaller, then we can't have
used one of the edges not in the original graph, so there must be a Hamiltonian cycle.
Thus, if there were a polynomial-time constant-factor approximation for the Traveling Salesman problem, it would yield a polynomial
algorithm for Hamiltonian cycle, which is NP-complete. Unless P is equal to NP, no such approximation algorithm can exist.
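The reduction itself is easy to code. In this sketch (all names are ours), a brute-force tour solver stands in for the hypothetical approximation algorithm, just to exhibit the cost gap on a small graph:

```python
from itertools import permutations

def tsp_instance_from_graph(n, edges, alpha):
    """Cost matrix for the reduction: edges of G cost 1,
    non-edges cost alpha * n + 1 (n = |V|)."""
    big = alpha * n + 1
    cost = [[big] * n for _ in range(n)]
    for u, v in edges:
        cost[u][v] = cost[v][u] = 1
    return cost

def optimal_tour_cost(cost):
    """Brute-force optimum tour cost (stand-in for the approximation)."""
    n = len(cost)
    best = float("inf")
    for rest in permutations(range(1, n)):  # fix vertex 0 as the start
        tour = (0,) + rest
        best = min(best, sum(cost[tour[i]][tour[(i + 1) % n]] for i in range(n)))
    return best
```

On a 4-cycle (which is Hamiltonian) with α = 2, the optimum tour costs 4 ≤ α|V| = 8; on a 4-vertex path (no Hamiltonian cycle), every tour costs at least (1 + α)|V| = 12, so even a factor-2 approximation separates the two cases.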

Summary - (Udacity, Youtube)


So far, we've seen three different kinds of results related to approximation algorithms. We've seen a factor 2 approximation in the
example of vertex cover. We talked about Fully Polynomial Time Approximation Schemes in the context of subset sum. And
we've seen a hardness of approximation result in showing that the general traveling salesman problem cannot be approximated
to within any constant factor. That's a good sample of the types of results one sees in the study of approximation algorithms.
Before ending this lesson, however, I want to talk about one more classic result.

Metric TSP - (Udacity, Youtube)


Perhaps the traveling salesman problem can't be approximated in general, but if we insist that the cost function obey the
triangle inequality, as it does in many practical applications, then we can find an approximation. The triangle inequality, by the
way, just says that it is never faster to go from one vertex to another via a third vertex.

Here is the approximation algorithm. We start by building a minimum spanning tree. The usual approach here is to use one of
the greedy algorithms, either Kruskal's or Prim's, that are typically taught in an undergraduate class. In Kruskal's algorithm, the
idea is simply to take the cheapest edge between two unconnected vertices and add it to the graph, until a tree is formed. (See
the video for an animation.)

Next, we run a depth-first search on the tree, keeping track of the order in which the vertices are discovered. For this example,
let's label the vertices with the letters of the alphabet, and start from C. Then the discovery order would go something like this.

Note that this cycle follows along the tree at first, from c to b to e to h to i, but instead of backtracking to h, it goes directly to j.
Then, it goes directly to d, and so on.
This cycle seems to always be taking short-cuts compared to the traversal the depth-first search performed. For the general
Traveling Salesman problem, we can't be sure these are in fact short-cuts, because we can't assume the triangle inequality.
Where we do have the triangle inequality, however, these *will* be short-cuts, and as we'll see, that will be the key to the analysis.
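Here is a compact sketch of the algorithm just described, using Prim's algorithm for the spanning tree and a stack-based depth-first traversal for the discovery order. The implementation details (cost-matrix input, starting at vertex 0) are our own choices:

```python
def approx_metric_tsp(cost):
    """Factor-2 sketch for metric TSP: build an MST (Prim), then visit
    vertices in depth-first discovery order, shortcutting repeats."""
    n = len(cost)
    # Prim's algorithm: grow the tree outward from vertex 0
    in_tree = [False] * n
    in_tree[0] = True
    parent = [0] * n
    best = list(cost[0])  # cheapest known edge from each vertex into the tree
    children = [[] for _ in range(n)]
    for _ in range(n - 1):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        children[parent[u]].append(u)
        for v in range(n):
            if not in_tree[v] and cost[u][v] < best[v]:
                best[v], parent[v] = cost[u][v], u
    # depth-first discovery order = the shortcut Hamiltonian cycle
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(reversed(children[u]))
    return order
```

For example, on four points on a line with cost[i][j] = |i − j|, the returned order visits every vertex once and the resulting tour costs at most twice the optimum of 6.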

Correctness of Factor 2 TSP Approx - (Udacity, Youtube)


The algorithm just described, which we call ApproxMetricTSP, is an O(|V|²) time, factor 2 approximation algorithm for the metric
traveling salesman problem.

The process for building the min spanning tree is O(|V|²) for dense graphs, and the depth first search process takes time
proportional to the number of edges, which is the same as being proportional to the number of vertices for trees. That takes
care of the efficiency of the algorithm.
Now for the factor-two part. Consider this example here,

and let H* be a minimum cost Hamiltonian cycle over this graph. This is what an exact algorithm might output. Well, the cost of
the minimum spanning tree that the algorithm finds must be less than the total cost of the edges in this cycle. Otherwise, just
removing an edge from the cycle would create a lower cost spanning tree. (Remember that the costs must be non-negative.)
Thus, for a minimum spanning tree T, we have

∑_{e ∈ T} c(e) ≤ ∑_{e ∈ H*} c(e).

Now, let's draw a minimum cost spanning tree.

The cost of a depth-first search traversal is twice the sum of the costs of the edges in the tree, i.e. 2 ∑_{e ∈ T} c(e). This traversal
also starts and ends at the same vertex, so it's a cycle.
The trouble is that it's not Hamiltonian. Some vertices might get visited twice. It's easy enough, however, to count only the first
time that a vertex is visited. In fact, this is what ordering the vertices by their discovery time achieves.
By the triangle inequality, skipping over intermediate vertices can only make the path shorter, so the cost of this cycle is at most
the cost of the depth-first traversal. Thus,

c(H) ≤ 2 ∑_{e ∈ T} c(e) ≤ 2 ∑_{e ∈ H*} c(e) = 2c(H*).
Thats the proof of the theorem.


It may also be useful to view the argument on a scale like this one, with 0 at the bottom, maybe the cost of the most expensive
cycle at the top, and the optimal one somewhere in the middle.

As we argued, the cost of a minimum spanning tree must be less than the cost of the optimum cycle; we can just delete one
edge from the cycle and get a spanning tree. A depth first traversal of the spanning tree uses every edge twice, and therefore
costs twice the cost of the tree. Shortcutting all but the first visit to a vertex in this traversal gives a Hamiltonian cycle, which must
have no greater cost because of the triangle inequality.

A Tight Example Part 1 - (Udacity)


In response to any approximation result, it is natural to ask: is the analysis tight, or does the algorithm actually perform better,
even in the worst case, than the theorem says? Let's address that question for our metric TSP algorithm.
Here is an example graph.

Let the blue edges have cost 1 and the red ones have cost 2. Enter a minimum cost solution in the box below.

A Tight Example Part 2 - (Udacity)


Now, we consider what our approximation algorithm might have returned. Recall that the optimum is a Hamiltonian cycle of cost
6. The approximation algorithm begins by building a minimum spanning tree for the graph. Perhaps it chooses the star, like so.

Then, preferring lowest-indexed vertices, a depth-first traversal would produce this cycle.

Notice that every edge followed in this cycle is a red one, except the first and the last. Hence the cost is 2 · 6 − 2 = 10. The
ratio is therefore 10/6. But there wasn't really anything special about the fact that we were using 6 vertices here. We can form an
analogous graph for any n, letting the lighter edges be the union of a star and a cycle around the non-center vertices. All other
edges can be heavy.
My question to you then is how bad does the approximation get for 100 vertices? Give your answer in this box.

Conclusion - (Udacity, Youtube)


In this lesson, we've just scratched the surface of the vast literature on approximation algorithms. One could create a whole
course on the subject consisting just of results not much more complicated than the ones we've seen here. And, of course,
there are many more advanced results besides. The main take-away message, then, is that when you encounter a problem where
finding an optimum solution seems intractable, ask yourself: is an approximate solution good enough? You may find that
relaxing the optimality constraint makes the problem tractable.

Randomized Algorithms - (Udacity)


Introduction - (Udacity, Youtube)
In this lesson we are going to introduce a new element to our algorithms: randomization. In a full course on complexity or one
on randomized algorithms, we might go back to the definition of Turing machines, include randomness in the model, and then
argue that other models are equivalent. Here we are just going to assume that the standard built-in random procedures available
in most programming languages work. Of course, in reality, these only produce pseudorandom numbers, but for the purpose of
studying algorithms we assume that they produce truly random ones.
The lesson will use a few simple randomized algorithms to help motivate probability theory, and then use the basic theorems to
characterize the behavior of a few slightly more sophisticated algorithms. Some ideas that come up include independence,
expectation, Monte Carlo vs. Las Vegas algorithms, and derandomization, and in the end we'll tie our study of algorithms back to
complexity with a brief discussion of Probabilistically Checkable Proofs.

Verifying Polynomial Identities - (Udacity, Youtube)


Our first randomized algorithm will be one that verifies polynomial identities. Suppose that you are working at a company that is
building a numerical package for some parallel or distributed system. A colleague claims that he has come up with some clever
algorithm for expanding polynomial expressions into their coefficient form.
His algorithm takes in some polynomial expression and outputs a supposedly equivalent expression in coefficient form, but you
are a little skeptical that his algorithm works for the large instances that he claims it works on. You decide that you want to write
a test. That is to say, you want to verify that the polynomial represented by the input expression A is equivalent to the one
represented by the output B.

Being slightly more general, we can state the problem like this. Given representations of polynomials A and B having degree d,
decide whether they represent the same polynomial. Note that we are being totally agnostic about how A and B are
represented. We are just assured that A and B are indeed polynomials, and we have a way of evaluating them. Well, here is a
fantastically simple algorithm for deciding whether these two polynomials are equal.
1. Pick a random integer x in the range [1, 100d].

2. Evaluate the polynomials at this point.


3. Say that the polynomials are equal if they are equal at this point.

Why does this work? Well, the so-called Fundamental Theorem of Algebra says that any non-zero polynomial of degree d can have
at most d roots.

So if A and B are different, the bad case is that their difference has d roots and, what's worse, that all of them are integers in this
range [1, 100d]. Even so, the chance that the algorithm picks one of these is still only 1/100. So if the polynomials are the same,
we will always say so, but if they are different, then the chance that we say they are the same is only 1 in 100. This is pretty
effective, and if it is found that A is not equal to B in some case, your algorithm is so simple that there can't be much dispute
over which piece of code is incorrect.
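As code, the test really is this short. Here A and B are treated as black boxes we can evaluate, which matches the problem statement's agnosticism about representation (the function name is ours):

```python
import random

def probably_equal(A, B, d):
    """One-sided randomized test for two degree-<=d polynomials.

    If A and B represent the same polynomial, always returns True.
    If they differ, returns True with probability at most 1/100,
    since their nonzero difference has at most d roots in [1, 100d].
    """
    x = random.randint(1, 100 * d)  # uniform over the 100d integers
    return A(x) == B(x)
```

For example, probably_equal(lambda x: (x + 1) ** 2, lambda x: x * x + 2 * x + 1, 2) always returns True, since the two expressions represent the same polynomial.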

Discrete Probability Spaces - (Udacity, Youtube)


In the analysis of the algorithm for polynomial identity verification, I avoided the word probability because we hadn't defined
what it means yet. We'll do that now and use the algorithm to illustrate the meaning of the abstract mathematical terms.

A discrete probability space consists of a sample space Ω that is finite or countably infinite. This represents the set of
possible outcomes from whatever random process we are modeling. In the case of the previous algorithm, this is the value of x
that is chosen.
Second, a discrete probability space has a probability function with these properties. It must be defined from the set of subsets
of the sample space to the reals. Typically, we call a subset of the sample space an event E. For every event, the probability
must be at least 0 and at most 1. The probability of the whole sample space must be 1. And for any pairwise disjoint collection
of events, the probability of the union of the events must be the sum of the probabilities.
To illustrate this idea with the polyequal example, let's define the events F_i to be the set consisting of the single element i, i.e.
F_i = {i}. This corresponds to i being chosen as the value at which we test the polynomials. And we define the probability of
the event F_i as Pr(F_i) = 1/(100d). Now, these F_i aren't the only possible events. They are just the single-element sets of the
probability space. We need to define our function over all subsets. But actually, we have done so implicitly already because of

property 2c above. For any subset S of the sample space, we have that the subset is the union of the individual events F_i.
These are disjoint, so the probability of the union is the sum of the probabilities, and so the result is the size of the set divided
by the size of the sample space, as we would expect. That is to say,

Pr(S) = Pr(∪_{i ∈ S} F_i) = ∑_{i ∈ S} Pr(F_i) = |S|/|Ω|.
Let's confirm that this function meets all of the requirements of the definition. Property 2c holds because the size of a disjoint
union is the sum of the sizes of the individual sets. We see from the ratio Pr(S) = |S|/|Ω| that the probability of the whole sample
space is 1. And the probability of any event is between 0 and 1. So this example here is, in fact, a discrete probability space.
By the way, this example probability function is called uniform, because the probabilities of every single-element event, that is to
say the F_i here, are the same. Not all probability functions are like that.
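This uniform probability function is easy to write down exactly, say for d = 5; the helper name `pr` and the use of exact fractions are our own choices:

```python
from fractions import Fraction

d = 5
omega = set(range(1, 100 * d + 1))  # sample space: the 100d possible values of x

def pr(event):
    """Uniform probability function: Pr(S) = |S| / |Omega|."""
    return Fraction(len(event & omega), len(omega))
```

One can check the axioms directly: pr(omega) equals 1, every singleton has probability 1/(100d), and the probability of a disjoint union is the sum of the probabilities.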

Bounding Probabilities - (Udacity)


Here is a quick exercise on probability spaces. Consider one where the sample space is the integers 1 to 10, where the
probability of 1 or 2 is 3/10 and the probability of 2 or 3 is 4/10. Based on this information, I want you to give the tightest bound
you can on the probability of 1, 2, or 3.

Repeated Trials - (Udacity, Youtube)


Let's return to our polynomial identity verification algorithm and see if we can improve it. So far, we've seen how if the two
polynomials are equal, the algorithm will always say so. But if the polynomials are different, there is up to a 1/100 probability
that the algorithm will say that they are equal anyway. Maybe this isn't good enough. We want to do better. One idea is just to
change out the number 100 for a larger number. This works, but on a real computer we might start to run into range problems.
We don't want our algorithm to strain the difference between the theoretical model and practice.
Another solution is repeated trials. Instead of just testing for equality at one point, we'll do it at several random ones. Such an
algorithm might look like this.

We start out by assuming that the two polynomials are equal. Then we try k different values for x, and if we ever find a value for
which the two polynomials are not equal then we know that they aren't. Note that we could terminate as soon as a difference is
found, but this version of the algorithm makes the analysis a little more clear.
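A sketch of the k-trial version just described, keeping the no-early-exit structure for clarity (the function name is ours):

```python
import random

def probably_equal_k(A, B, d, k):
    """k independent trials of the single-point test.

    Equal polynomials: always returns True. Unequal polynomials:
    returns True with probability at most (1/100)**k.
    """
    equal = True
    for _ in range(k):
        x = random.randint(1, 100 * d)
        if A(x) != B(x):
            equal = False  # we could return here; kept to match the analysis
    return equal
```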
For simplicity, we'll make this k equal to 2 so that we can visualize the sample space with a 2D grid like so.

The row corresponds to the value of x in the first iteration, the column to the value of x in the second iteration. Now, the size of
the sample space is (100d)², and since there are d² pairs of roots for the difference between A and B, at most d² of these
possibilities make the algorithm fail. We'll let F be the event that the algorithm fails on unequal polynomials, that is, the event
that A(x) = B(x) for both chosen x. By symmetry, we can argue that all elements of the sample space should have
equal probability, so the probability of the algorithm failing on two unequal polynomials is

Pr(F) ≤ d²/(100d)² = 1/100².

That's 1/100th of the probability for just one trial.


We can also make the argument by following the actual process of the algorithm more closely. We let E1 be the event that the
polynomials are equal at x in the first iteration. In terms of our grid, this is the subset of the sample space consisting of a subset
of the rows.

As we argued before, this probability is Pr(E1) ≤ 1/100.

Similarly, we let E2 be the event that the polynomials are equal at x in the second iteration. In the grid, this event corresponds to
a certain subset of the columns. Again, these red columns take up at most 1/100th of the whole probability mass.

We are interested in the probability of both E1 and E2 happening, i.e. the intersection of these two events, represented as the
black region in our grid.

What fraction of the probability mass does it take up? Notice that in order for a sample to fall into the black region, the first
iteration must restrict us to the blue region. The probability of this happening is just the probability of E1. Then, from within
the blue region, we ask: what fraction of the probability mass does E2 take up? Well, that's just Pr(E1 ∩ E2)/Pr(E1), and we
want to multiply this quantity by Pr(E1) to get the result. This sort of ratio is common enough that it gets its own name and
notation. We notate it like so,

Pr(E2 | E1) = Pr(E1 ∩ E2) / Pr(E1),

and read this as the probability of E2 given E1. This is called a conditional probability. The interpretation is that it gives the
probability of E2 happening, given that E1 has already happened.
More specifically for our polynomial verification, this is the probability that the second iteration will pick a value where the
polynomials are equal, given that the first one did. Well, of course, this is the same probability as E2 happening regardless of
what happened in the first iteration, so this is just the probability of E2. This is a condition known as independence, and it
corresponds to our intuitive notion of one event not depending on another.
Substituting in these values, we find that this approach gives the same result as the other:

Pr(E1 ∩ E2) = Pr(E1) Pr(E2 | E1) = Pr(E1) Pr(E2) ≤ 1/100².

Independence and Conditional Probability - (Udacity, Youtube)


We just used the ideas of conditional probability and independence in the context of our polynomial verification algorithm. Now,
let's discuss these ideas in general. We define the conditional probability that an event E occurs given that event F occurs as
the ratio between the probability that both E and F occur divided by the probability that F occurs.

We can visualize this quantity using a traditional Venn diagram. We draw the whole sample space as a large rectangle, and
within it we draw the set F like so.

When we talk about the conditional probability of E given F, we are restricting ourselves to the set F within the sample space.
Thus, only the portion of E that is also in F is important to us. To make this a proper probability, we have to renormalize by
dividing by the probability of F. That way, the probability of E given F and the probability of not-E given F sum up to 1:

Pr(E | F) + Pr(Ē | F) = Pr(E ∩ F)/Pr(F) + Pr(Ē ∩ F)/Pr(F) = Pr(F)/Pr(F) = 1.
An interesting situation is where the probability of an event E given F is the same as the probability when F isn't given:

Pr(E | F) = Pr(E).

This implies that E and F are independent. One doesn't depend on the other. Formally, we say:
Two events E and F are independent if

Pr(E ∩ F) = Pr(E) Pr(F).


Note that this is a slightly more general statement than saying Pr(E | F) = Pr(E). The quantity Pr(E | F) isn't defined if Pr(F) = 0.

Bullseyes - (Udacity)
Here is a quick question on independence. Suppose that there is a 0.1 probability that Sam will get a bullseye each time that he
throws a dart. What is the probability that he gets 5 bullseyes in a row?

Monte Carlo and Las Vegas - (Udacity, Youtube)

Let's go back to our polynomial verification algorithm with repeated trials and review its behavior.

If the two input polynomials are equal, then the probability that the algorithm says so is 1. On the other hand, if the polynomials
are different, there is a chance that the algorithm will get the answer wrong, but this happens only with probability 1/100^k,
where k is the number of trials that we did. We just need to extend the argument we made before for k = 2 to general k.
The fact that our algorithm might sometimes return an incorrect answer makes it what computer scientists call a Monte Carlo
algorithm. And because it makes a mistake only when the polynomials differ, it is called a one-sided Monte Carlo algorithm.
This idea can be extended to arbitrary languages. Here, strings in the language represent equal polynomials. This algorithm only
makes mistakes on strings not in the language. Of course, it's possible for the situation to be reversed, so that the algorithm
makes mistakes on strings in the language. This is another kind of one-sided Monte Carlo algorithm. I should say that there are
also two-sided Monte Carlo algorithms, where both errors are possible, but regardless of what input is given, the answer is more
likely to be correct than not.
Suppose, however, that any possibility of error is intolerable. Can we still use randomization? Well, yes we can. Instead of
picking a new point uniformly at random from the 100d possible choices, we can pick one from the choices that we haven't
picked before. This is known as sampling without replacement, since we don't put the sample we took back in the pool
before choosing again. There are only d possible roots, so by the time we've picked the (d + 1)th point, we must have picked one
of the non-roots.
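A sketch of this Las Vegas version, using Python's random.sample, which samples without replacement (the function name is ours):

```python
import random

def equal_las_vegas(A, B, d):
    """Las Vegas test: evaluate at d + 1 distinct random points.

    A nonzero difference of degree <= d has at most d roots, so if the
    polynomials differ, at least one of the d + 1 points exposes it.
    Randomness affects only which points get checked, never the answer.
    """
    points = random.sample(range(1, 100 * d + 1), d + 1)  # without replacement
    return all(A(x) == B(x) for x in points)              # stops at first mismatch
```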

This algorithm still uses randomization, but nevertheless it always gives a correct answer. If the polynomials are equal, the
probability that the algorithm says so is 1. If they are unequal, the probability that it says they are equal is 0.
The fact that this algorithm never returns an incorrect answer makes it a Las Vegas algorithm. The randomization can affect
the running time, but not the correctness. If the polynomials are equal, the algorithm definitely takes d + 1 iterations, but when
they are unequal, it gets a little more complicated.
Let E_i be the event that A and B are equal at the ith element of the order array chosen randomly by the algorithm above. Then
we characterize the probability that the algorithm takes at least k steps as Pr(∩_{i=0}^{k−1} E_i). Note, however, that these events
are no longer independent. If A and B are equal at the first element of the list order, then that's one fewer root that could have
been chosen to be the second element. So how do we go about calculating this probability?

Sampling without Replacement - (Udacity)


Let's make this calculation for a concrete example an exercise. Suppose that the difference A − B is a polynomial of degree 7
with 7 roots in the set of integers 1 through 700. An algorithm samples 3 of these numbers uniformly without replacement. Give
an expression for the probability that all of these points are roots of the difference A − B.

Returning to the Las Vegas version of our polynomial identity verifier, we can write the probability that we don't detect a
difference in k iterations as the product

Pr(∩_{i=0}^{k−1} E_i) = Pr(E_0) Pr(E_1 | E_0) ⋯ Pr(E_{k−1} | E_0 ∩ ⋯ ∩ E_{k−2}) = ∏_{i=0}^{k−1} (d − i)/(100d − i) ≤ 1/100^k.

With a little more work, we can get a tighter bound than this, but for our purposes this simple bound works. Note that even
though the probabilities for the Las Vegas and Monte Carlo algorithms are the same, the meanings are different. In the Monte
Carlo algorithm, our analysis captured the probability of the algorithm returning a correct answer. In the Las Vegas algorithm,
our analysis says something about the running time for an algorithm that will always produce the correct answer.

Random Variables - (Udacity, Youtube)


So far, all the probabilistic objects we've discussed have been events: things that either happen or don't happen. As we go
further in our analysis, however, it will be convenient to talk about other random quantities: what is the value of a random die roll?
Or how many times did we have to repeat that procedure before we got an acceptable outcome? For this, we introduce the idea of
a random variable.
A random variable on a sample space Ω is a function X : Ω → ℝ.

For example, let X be the sum of two die throws. Then the sample space is Ω = {1, …, 6} × {1, …, 6}, and the random
variable X is the function that just adds those two numbers together,

X((i, j)) = i + j.

We use the notation X = a, where a is some constant, to indicate the event that X is equal to a. Thus, it is the set of elements
of the sample space for which the function X is equal to a. In our example,

(X = 3) = {(1, 2), (2, 1)}.
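The die-throw example can be written out directly; the event (X = a) is literally a set of outcomes:

```python
from itertools import product

omega = list(product(range(1, 7), repeat=2))  # sample space of two die throws
X = {w: sum(w) for w in omega}                # the random variable as a function

def event(a):
    """The event (X = a): the set of outcomes that X maps to a."""
    return {w for w in omega if X[w] == a}
```

For instance, event(3) returns {(1, 2), (2, 1)}, matching the example above.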

Expectation - (Udacity, Youtube)


Having defined a random variable, it now makes sense to talk about a random variable's average or expected value. Formally,
The expectation of a random variable X is

E[X] = ∑_{v ∈ X(Ω)} v · Pr(X = v),

where X(Ω) is the image of Ω under X.

As can be seen from the formula, the expectation is a weighted average of all the values that the variable could take on. For
example, let the random variable X be the number of heads in 3 fair coin tosses. Then, according to this definition, the expectation
would be:

0 · 1/8 for getting no heads,

1 · 3/8 for getting 1 head, as there are three possible tosses that could have come up heads,

2 · 3/8 for getting 2 heads, as there are three possible tosses that could have come up tails to give us two heads,

3 · 1/8 for getting 3 heads.

Adding these all up, we get 12/8 = 3/2. Now, if I asked you casually how many heads there will be in three coin tosses on
average, you probably would have said 3/2 rather quickly and without doing all this calculation. Each toss should get you 1/2 a
head, so with 3 you should get 3 halves, you might have reasoned.
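Both the weighted-average definition and the plain average over the eight equally likely outcomes can be checked by enumeration:

```python
from itertools import product
from collections import Counter

# the 8 equally likely outcomes of 3 fair coin tosses (1 = heads)
outcomes = list(product([0, 1], repeat=3))

# plain average of the number of heads over all outcomes
expectation = sum(sum(o) for o in outcomes) / len(outcomes)

# weighted-average definition: sum of v * Pr(X = v) over the values v
counts = Counter(sum(o) for o in outcomes)  # {0: 1, 1: 3, 2: 3, 3: 1}
weighted = sum(v * c / len(outcomes) for v, c in counts.items())
```

Both computations give 3/2, agreeing with the hand calculation above.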
In terms of our notation, we can express the argument like this. We let X_i be 1 if the ith fair coin toss is heads and 0 otherwise.
Then we say that the average number of heads in three tosses is

E[X_1 + X_2 + X_3] = E[X_1] + E[X_2] + E[X_3] = 3/2.

The key step in the proof is the first equality here, which says that the expectation of the sum is the sum of the expectations.
This is called the linearity of expectation, and as we will see, this turns out to be a very powerful idea.
In general,
For any two random variables X and Y and any constant a, the following hold:

E[X + Y] = E[X] + E[Y]

and

E[aX] = aE[X].

The expectation of the sum is the sum of the expectations, and we can just factor out constant factors from expectations.
Remember this theorem.

Counting Cycles in a Permutation - (Udacity)


Here is an exercise that will help illustrate the power of the linearity of expectation. In a random permutation π over 100
elements, what is the expected number of 3-cycles?
We can think about the permutation as defining a directed graph over 100 vertices, where every vertex has exactly one outgoing
and one incoming edge, which are defined by the permutation. Thus, if π(55) = 57, we would draw this edge here; if
π(57) = 58, we draw this edge here; and if π(58) = 55, we would draw this edge here. Together, these form a 3-cycle.

Use the linearity of expectation and write down the expected number of 3-cycles as a ratio here.

Quicksort - (Udacity, Youtube)


At this point, we've covered the basics of probability theory, so we'll be able to turn our focus to the algorithms themselves and
their analysis. Up first is the classic randomized quicksort algorithm.
In case you don't recall randomized quicksort from a previous algorithms course, here is the pseudocode.

To keep things simple, we'll assume that the elements to be sorted are distinct. This is a recursive algorithm, with the base case
being a list of 0 or 1 elements, where the list can simply be returned. For longer lists, we choose a pivot uniformly at random
from the elements of the list, and then split the list into two pieces: one with those elements less than the pivot and one with
those elements larger than the pivot. We then recursively sort these shorter lists and join them back together once they are
sorted.
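In Python, the pseudocode translates almost line for line (assuming distinct elements, as above):

```python
import random

def quicksort(lst):
    """Randomized quicksort on a list of distinct elements."""
    if len(lst) <= 1:              # base case: already sorted
        return lst
    pivot = random.choice(lst)     # pivot chosen uniformly at random
    smaller = [x for x in lst if x < pivot]
    larger = [x for x in lst if x > pivot]
    return quicksort(smaller) + [pivot] + quicksort(larger)
```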
The efficiency of the algorithm depends greatly on the choices of the pivots. We can visualize a run of the algorithm by drawing
out the recursion tree. I'll write out the list in sorted order so that we can better see what is going on, though the algorithm itself
will likely have these elements in some unsorted order.
The ideal choice of pivot is always the middle value in the list. This splits the list into two equal-size sublists. One consisting of
the larger elements, the other of the smaller elements. Then in the recursive calls, we split these lists into two pieces, until we
get down to the base case.

Because the sizes get cut in half with each call, there are only log n levels to the tree. Every element gets compared with a pivot,
so there are O(n) comparisons at each level, for a total of O(n log n) comparisons overall. That's if we are lucky and pick the
middle element for the pivot every time.
How about if we are unlucky? Suppose we pick the largest element in every iteration.

Then the size of the list only decreases by one in each iteration, so there are n levels. The first level requires n − 1 comparisons,
the second n − 2, and so forth, so that the total number of comparisons is an arithmetic sequence and therefore is O(n^2). This
is as bad as a naive algorithm like insertion sort. The natural question to ask, then, is how does quicksort behave on average? Is it
like the best case, where the pivot is chosen in the middle, the worst case that we have here, or somewhere in-between?

Analysis of Quicksort - (Udacity, Youtube)


Now, we'll analyze the average case performance of quicksort and show that it is O(n log n), just like the optimum case.
Suppose that the randomized quicksort algorithm is used to sort a list consisting of elements a_1 < a_2 < ⋯ < a_n, all of which
are distinct.
We'll let E_ijk be the event that a_i, the ith smallest element in the list, is separated from a_j, the jth smallest element in the list, by the
kth choice of a pivot. And we'll let X_ij be a variable that is 1 if the algorithm actually compares a_i with a_j and 0 otherwise. The
sum of the X_ij will count the number of comparisons that the algorithm uses.
Claim: For i < j,

E[X_ij] = 2/(j − i + 1).
For the proof, observe that

E[X_ij] = 1 · Pr(X_ij = 1) + 0 · Pr(X_ij = 0) = Pr(X_ij = 1).

In fact, the expectation of any zero-one variable is just the probability that the variable is equal to one.
The element a_i has to be separated from a_j by some pivot in the algorithm, and they won't be separated twice. Therefore,
fixing i and j, the events E_ijk are disjoint, so

Pr(X_ij = 1) = Σ_k Pr(X_ij = 1 ∩ E_ijk).

This argument is known as the law of total probability.


Each non-zero term here can be written using a conditional probability,

Σ_{k : Pr(E_ijk) > 0} Pr(X_ij = 1 ∩ E_ijk) = Σ_{k : Pr(E_ijk) > 0} Pr(X_ij = 1 | E_ijk) · Pr(E_ijk),

and it's the conditional probability that will be easiest to reason about.
Given that a_i is going to be separated from a_j by the kth pivot, it must be that they haven't been separated yet. So the list must
include a_i, a_j, every element in-between, and possibly some more elements outside this range.

Given that the separation does occur, the pivot must be chosen in the range [a_i, a_j]. The element a_i will only be
compared to a_j, however, if one of the two is chosen as the pivot. Therefore, given that the separation is going to occur, the
probability that it will actually involve a comparison is only 2 divided by the number of possible choices for the pivot, j − i + 1.
Substituting this value here will give us the answer.

Σ_{k : Pr(E_ijk) > 0} Pr(X_ij = 1 | E_ijk) · Pr(E_ijk) = Σ_{k : Pr(E_ijk) > 0} (2/(j − i + 1)) · Pr(E_ijk) = 2/(j − i + 1).

With the claim established, we argue by the linearity of expectation that the expected total number of comparisons is just the
sum of the expectations for these X_ij's:

Σ_{i<j} E[X_ij] = Σ_{i<j} 2/(j − i + 1) ≤ Σ_{i=1}^{n} Σ_{j=1}^{n} 2/j = O(n log n).

It's possible to do a tighter analysis, but for our purposes it is enough to use this loose bound, which becomes just the sum of
n harmonic series. Summing 1/j is rather like integrating 1/x, so the inner sum becomes a log and we get O(n log n) in total.
Overall, then, we can state our result as follows.
Overall, then we can state our results as follows.
For any input of distinct elements, quicksort with pivots chosen uniformly at random makes O(n log n) comparisons in
expectation.
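We can check this bound empirically by instrumenting the algorithm to count comparisons; in each call, the non-pivot elements are each compared with the pivot once (the function name and the constants in this experiment are my own):

```python
import math
import random

def quicksort_comparisons(a):
    """Number of comparisons made by one run of randomized quicksort."""
    if len(a) <= 1:
        return 0
    pivot = random.choice(a)
    smaller = [x for x in a if x < pivot]
    larger = [x for x in a if x > pivot]
    # every element except the pivot is compared against the pivot once
    return (len(a) - 1) + quicksort_comparisons(smaller) + quicksort_comparisons(larger)

n = 200
trials = 50
avg = sum(quicksort_comparisons(list(range(n))) for _ in range(trials)) / trials
# the expectation is roughly 2 n ln n, far below the worst case of ~n^2/2
print(avg, 2 * n * math.log(n))
```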

The average case is on the same order as the best case. This is comforting, but by itself it is not necessarily a good guarantee
of performance. It's conceivable that the distribution of running times could be very spread out, so that it would be possible for
the running time to be a little better than the guarantee or potentially much worse.

It turns out that this is not the case. The running times are actually quite concentrated around the expectation, meaning that one is
very unlikely to get a running time much larger than the average. This sort of argument is called a concentration bound, and if you
ever take a full course on randomized algorithms, a good portion will be devoted to these types of arguments.

A Minimum Cut Algorithm - (Udacity, Youtube)


Next, we consider a randomized algorithm for finding a minimum cut in a graph. Likely, this won't be as familiar as quicksort. We
are given a connected graph G, and the goal is to find a minimum-size set of edges whose removal from the graph causes the
graph to have two connected components.

This set of edges S is called a minimum cut set. Note that this is a different problem from the minimum s-t cuts that we
considered in the context of maximum flow. There are no two particular vertices that we are trying to separate here; any two
will do, and all the edges have equal weight. We could use the minimum s-t cut algorithm to help solve this problem, but I
think you will agree that this randomized algorithm is quite a bit simpler.
The algorithm operates by repeatedly contracting edges so as to join their two endpoints together. Once there are only two vertices
left, they define the partition. (See video for animated example.)

Now, this particular choice of edges led to a minimum cut set, but not all such choices would have. How then should we pick
an edge to contract?
It turns out that just picking a random one is a good idea. More often than not, this won't yield a correct answer, but as we'll see,
it will be right often enough to be useful.
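Here is one way to sketch the contraction algorithm in Python (the helper names are my own; super-vertices are tracked with a small union-find, and choosing an already-contracted edge and retrying is equivalent to choosing uniformly among the remaining edges):

```python
import random

def karger_cut_size(edges, n):
    """One run of the contraction algorithm on vertices 0..n-1.

    edges is a list of (u, v) pairs; parallel edges are kept throughout.
    Returns the size of the cut found, which is a minimum cut only
    with probability at least 2/(n(n-1))."""
    parent = list(range(n))            # union-find labels for super-vertices

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v

    remaining = n
    while remaining > 2:
        u, v = random.choice(edges)
        ru, rv = find(u), find(v)
        if ru == rv:                   # endpoints already merged: a self-loop
            continue
        parent[ru] = rv                # contract the edge
        remaining -= 1
    # edges between the two final super-vertices form the cut
    return sum(1 for u, v in edges if find(u) != find(v))
```

Repeating the run many times and keeping the smallest cut found boosts the success probability, as we will see.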

Analysis of Min Cut - (Udacity, Youtube)


The algorithm just presented is known as Karger's Min-Cut Algorithm, and it turns out that

Karger's minimum cut algorithm outputs a min-cut set with probability at least 2/(n(n − 1)).

Now at first, you might look at this result and ask, what good is that? The algorithm doesn't even promise to be right more
often than not. The trick is that we can call the algorithm multiple times and just take the smallest of the resulting cuts. If we do this
(n(n − 1)/2) log(1/δ) times, then there is at least a 1 − δ chance that we will have found a minimum cut set.

The proof of this corollary is that each call to the algorithm is independent, so the probability that all of the calls fail is at most

(1 − 2/(n(n − 1)))^((n(n−1)/2) log(1/δ)).

In general, 1 − x ≤ e^(−x), so applying that inequality, the n(n − 1)/2 factors cancel in the exponent and we are left with

exp(−log(1/δ)) = δ.

Thus, the probability that all the iterations fail is at most δ, so the chance that at least one of them succeeds is at least 1 − δ. The
bound 1 − x ≤ e^(−x) is extremely useful in analysis and is one that you should always have handy in your mental toolbox.
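This corollary translates directly into a formula for how many independent runs are needed; here is a small sketch (the function name is my own, and δ is written `delta`):

```python
import math

def runs_needed(n, delta):
    """Independent runs of Karger's algorithm so that the probability of
    missing a min cut is at most delta: t = (n(n-1)/2) * ln(1/delta)."""
    p = 2 / (n * (n - 1))              # success probability of a single run
    return math.ceil(math.log(1 / delta) / p)

n, delta = 100, 0.01
t = runs_needed(n, delta)
# by (1 - x) <= exp(-x), the failure probability is at most exp(-p t) <= delta
failure_bound = math.exp(-(2 / (n * (n - 1))) * t)
print(t, failure_bound)
```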
So if the theorem is true, we can boost it up into an effective algorithm just by repeating it, but why is the theorem true?
Consider a minimum cut set C, and let E_i be the event that the edge chosen in iteration i of the algorithm is not in C. Note that
there could be other minimum cut sets as well. For the analysis, however, we'll just consider the probability of picking this
particular one.
Returning the cut set C means not picking an edge in C in any iteration, so it's the intersection of all the events E_i, which we
can turn into this product,

Pr(∩_{i=1}^{n−2} E_i) = Pr(E_{n−2} | ∩_{i=1}^{n−3} E_i) ⋯ Pr(E_2 | E_1) · Pr(E_1),

as we've done before. We're just conditioning the probability of avoiding C in the ith iteration on having avoided it in all
previous ones.
We now make the claim
Claim:

Pr(E_{j+1} | ∩_{i=1}^{j} E_i) ≥ (n − j − 2)/(n − j),

which will be a little easier to analyze if we write it as 1 − 2/(n − j).

We'll warm up by just considering the probability of E_1, of avoiding the cut in the first iteration.
Letting the size of the cut be k, we have that every vertex must have degree at least k. Otherwise, the edges incident on the
smaller-degree vertex would form a smaller cut set. This then implies that |E| ≥ nk/2, since every vertex has degree at least k and
summing up the degrees over every vertex counts every edge exactly twice. Therefore, the probability of avoiding the cut set C is

Pr(E_1) = 1 − |C|/|E| ≥ 1 − k/(nk/2) = 1 − 2/n = (n − 2)/n.

The more general argument will be similar.


Given that the first j edges chosen were not in C, C is still a min cut set of the contracted graph. If it were not, then by taking the
edges of a smaller cut back in the original graph we would have a smaller cut set than C. Note that throughout we count parallel edges.
Again, letting k be the size of C, there must be at least k(n − j)/2 edges left. The n − j comes from the fact that
there are only n − j vertices left after j iterations.
With the same argument as before, we have that the probability of avoiding C in the (j + 1)th iteration, given that no edges in C
have been chosen yet, is

Pr(E_{j+1} | ∩_{i=1}^{j} E_i) = 1 − |C|/|E| ≥ 1 − k/((n − j)k/2) = 1 − 2/(n − j) = (n − j − 2)/(n − j),

as claimed.
Substituting back into our equation, we see that we are down to a 1/3 probability in the last iteration, 2/4 in the iteration before
that, etc. This product telescopes and leaves us with the bound of 2/(n(n − 1)), as claimed.
Altogether then, this extremely simple procedure has given us a fairly efficient algorithm for finding a minimum cut set.

Max 3 SAT - (Udacity, Youtube)


For our last algorithm, we will consider maximum 3-SAT. This will tie together randomized algorithms, approximation algorithms,
and complexity. We are given a collection of clauses, each with 3 literals, all coming from distinct variables. And we want
to output an assignment to the variables such that a maximum number of clauses is satisfied.

First, we are going to show that


For any 3CNF formula there is an assignment that satisfies at least 7/8 of the clauses.

Consider an assignment chosen uniformly at random. Define Y_j to be 1 if the clause c_j is satisfied and 0 otherwise. Since
all the literals come from distinct variables, there are eight possible ways of assigning them true or false values, but only
one of them will cause Y_j to be equal to 0. The rest satisfy the clause and cause Y_j to be 1. Thus, E[Y_j] = 7/8 for every j.
Now we consider the formula as a whole and let Y = Σ_{j=1}^{m} Y_j. Using the linearity of expectation, we have that

Σ_{v ∈ Y(Ω)} v · Pr(Y = v) = E[Y] = E[Σ_{j=1}^{m} Y_j] = (7/8)m.

The key realization is that this value represents a kind of average. This means that not all of the v in the sum on the left can be
less than the average on the right. There has to be a v ≥ 7m/8 for which the probability Pr(Y = v) is positive. Because this
probability is positive, there must be some assignment to the variables that achieves it. Therefore, there is always a
way to satisfy 7/8 of the clauses in any 3-SAT formula.
This technique of proof is called the expectation argument, and it is part of a larger collection of very powerful tools called the
probabilistic method, which was developed and popularized by the famous Paul Erdős.
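The 7/8 calculation is easy to verify by brute force over the eight assignments of one clause's variables (the encoding, with literal +i for x_i and −i for its negation, is my own):

```python
import itertools

def satisfied(clause, assignment):
    """A clause (a tuple of literals) is satisfied if any literal is true."""
    return any((lit > 0) == assignment[abs(lit)] for lit in clause)

clause = (1, -2, 3)   # x1 OR (NOT x2) OR x3
sat_count = sum(
    satisfied(clause, {1: a, 2: b, 3: c})
    for a, b, c in itertools.product([False, True], repeat=3)
)
# exactly one of the 8 assignments falsifies the clause
print(sat_count / 8)  # 0.875
```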

Approx Max 3 SAT - (Udacity, Youtube)


So far, we've seen that there must be an assignment that satisfies at least 7/8 of the clauses in any 3-CNF formula, and in fact,
the same argument gives us an algorithm that satisfies 7/8 on average: just pick a random assignment.
By itself, however, this does not provide any guarantee that we will actually find an assignment that satisfies 7/8 of the clauses.
To obtain such a guarantee, we are going to use a technique called derandomization that will take the randomness out of the
algorithm and give us a deterministic algorithm with a guaranteed 7/8-factor approximation.
An important part of the algorithm will be a subroutine that assigns a value to a variable and simplifies the clauses. We will call
this procedure instantiate.

Let's say that we set variable x_1 to True. Then any clauses not using x_1 are left alone. If a clause has the literal x_1 in it, then it
gets set to True. If a clause has the negated literal ¬x_1 in it, then we just eliminate that literal from the clause. If ¬x_1 was the
only literal in the clause, then we just set the clause to False.
Another important subroutine will be one that calculates the expected number of clauses that will be satisfied if the remaining
variables are assigned True or False uniformly at random.

Of course, if a clause is just True, it gets assigned a value of 1, and a False clause gets assigned 0. A clause with a single literal
gets assigned 1/2, two literals gets a value of 3/4, and three literals gets a value of 7/8. Remember that there is just one way to
assign the relevant variables so that the clause is false. The EofY procedure simply calculates these values for every clause and
sums them up.

With these subroutines defined, we can write down our derandomized algorithm as follows.

Start with an empty assignment to the variables. Then for each variable in turn, consider the formula resulting from its being set
to True and from its being set to False. Between these two, pick the one that gives a larger expected number of satisfied
clauses, assuming that the remaining variables were set at random. Note that we are using our knowledge of how a random
assignment would behave here, but we aren't actually using any randomization. Having picked the better of the two ways of
assigning the variable x_i, we update the set of clauses and record our assignment to x_i.
The reason this algorithm works is that it maintains the invariant that the expected number of clauses of C that would be
satisfied, if the remaining variables were assigned at random, is at least 7/8 of the total. This is true at the beginning, by our previous
theorem. And this expectation for C is always just the average of the expected numbers that would be satisfied in Cp and Cn. So
by picking the new C to be the one for which this EofY quantity is larger, the invariant is maintained. Of course, at the end, all of
the variables have been assigned, so computing the expectation for C amounts to just counting up the number of True
clauses. This technique is known as the method of conditional expectations, and it has a number of clever applications.
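The whole derandomized procedure fits in a few lines; here is a sketch under my own encoding, where a clause is a tuple of nonzero integer literals (+i for x_i, −i for its negation) that simplifies to the constants True or False:

```python
def expected_satisfied(clauses):
    """EofY: expected number of satisfied clauses if all remaining
    variables are set True/False uniformly at random."""
    total = 0.0
    for c in clauses:
        if c is True:
            total += 1                      # already satisfied
        elif c is False:
            continue                        # already falsified
        else:
            total += 1 - 2 ** -len(c)       # 1/2, 3/4, or 7/8
    return total

def instantiate(clauses, var, value):
    """Set variable `var` to `value` and simplify every clause."""
    result = []
    for c in clauses:
        if c is True or c is False:
            result.append(c)
        elif (var if value else -var) in c:  # some literal became true
            result.append(True)
        else:
            rest = tuple(lit for lit in c if abs(lit) != var)
            result.append(rest if rest else False)  # empty clause is false
    return result

def derandomized_max3sat(clauses, n_vars):
    """Method of conditional expectations: fix each variable to whichever
    value keeps the conditional expectation larger."""
    assignment = {}
    for var in range(1, n_vars + 1):
        if_true = instantiate(clauses, var, True)
        if_false = instantiate(clauses, var, False)
        if expected_satisfied(if_true) >= expected_satisfied(if_false):
            clauses, assignment[var] = if_true, True
        else:
            clauses, assignment[var] = if_false, False
    return assignment, sum(1 for c in clauses if c is True)
```

Because the invariant keeps the expectation at or above 7m/8, the count of True clauses at the end is at least 7/8 of the total.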

Overall, then, we have shown that there is a deterministic algorithm which, given any 3-CNF formula, finds an assignment that
satisfies at least 7/8 of the clauses. Remarkably, it turns out that this is the best we can do, assuming P is not equal to NP. For this
argument, we turn to the celebrated PCP theorem.

PCP Theorem - (Udacity, Youtube)


PCP stands for probabilistically checkable proofs. It turns out that if you take the verifiers we talked about when we defined the
class NP, and
1. you give them access to random bits, and
2. you give them random access into the certificate or proof,

then they become extremely efficient. These types of verifiers are called probabilistically checkable proof systems, and the
famous PCP theorem relates the set of languages they can verify under certain constraints back to the class NP.

In a course on complexity, we would place these proof systems within the larger context of other complexity classes and
interactive proof systems. For our purposes, however, the PCP theorem can be stated in this much more accessible way. We'll
let Φ denote the set of all 3CNF formulas. Remember that we are assuming that all clauses have exactly 3 literals and that they
come from 3 distinct variables. Then a version of the PCP theorem can be stated like this:

For any constant α > 7/8, there is a polytime computable function f : Φ → Φ such that for every 3CNF formula φ that has a
sufficient number of variables, we have that

1. if φ is satisfiable, then f(φ) is satisfiable, and

2. if φ is not satisfiable, then every assignment of the variables satisfies fewer than an α fraction of the clauses of f(φ).

So if φ is satisfiable, there is a way to satisfy all the clauses of f(φ). If φ is unsatisfiable, then you can't even get close to
satisfying all the clauses of f(φ). We've introduced a gap here, and this gap is extremely useful for proving the hardness of
approximation.
Many, many hardness of approximation results follow from this theorem. The most straightforward of them, however, is that you
can't do better than the 7/8 algorithm for max 3-SAT that we just went over, not unless P=NP. Why?

Well, suppose that I wanted to test whether a string x was in some language L in NP,
and at my disposal I had a polytime α-approximation for max 3-SAT, where α is strictly greater than 7/8.
Then, I could use the Cook-Levin reduction to transform my string into an instance φ of 3-SAT that will be satisfiable if and only if x
is in L.

Then, I can use the f function from the PCP theorem to transform this into another 3-SAT instance where either all the clauses are
satisfiable or fewer than an α fraction of them are.

Then, I just run the approximation algorithm on f(φ) and see whether the fraction of clauses satisfied is at least α or not. If it is, then
from the PCP theorem, I can reason that φ must have been satisfiable, and so from the Cook-Levin reduction x must have
been in L.
On the other hand, if the fraction of satisfied clauses is less than α, then f(φ) cannot have been satisfiable, so φ must not have
been satisfiable, so from the Cook-Levin reduction x must not be in L.
Using this reasoning, we have just found a way to decide an arbitrary language in NP in polynomial time. So if such an
approximation exists, then P=NP. Or, more likely, P is not equal to NP, so no such approximation algorithm can exist.
Many hardness of approximation proofs can be done in a similar way. All that's necessary is to stick in another transformation
here, transforming the 3-SAT problem that has this gap into another problem, which might have a potentially different gap, to
show that certain approximation factors would imply P=NP.

Conclusion - (Udacity, Youtube)


As we've seen, randomization can be a very useful tool in the design and analysis of algorithms, and it turns out that this is true
in practical programming too, as many real-world computer programs rely on pseudorandomness to achieve their desired
behavior. Nevertheless, it is an open question whether randomization actually helps in the sense of the complexity classes that
we discussed earlier in the course. It simply isn't known whether there is a language that can be decided in polynomial time
with a two-sided Monte Carlo algorithm that can't be decided with a normal Turing machine in polynomial time. This is known
as the P equals BPP question, and it's one of the major open problems in complexity.

Conclusion
With that, the study of algorithms has brought us back to complexity, and unfortunately, it also brings us to the end of the
course.
At the beginning, we said that by following the sort of rigorous arguments that we would make, you would be giving yourself a
kind of mental training that would be useful beyond the classroom. We certainly hope that you have found this to be true and
had some fun along the way.
You may not remember everything taught in the course, but here are a few points that I do hope will stay with you forever.
1. There are some things you can't compute at all, like the halting problem.
2. There are some things that we can't compute quickly, like traveling salesman and every other NP-complete problem.
3. There are some things that we can compute efficiently, and a little cleverness can go a long way, like with the Fast Fourier
transform and maximum flows.
More important than the specific problems we tackled were the ways to think about computational problems. You will come
across problems that you haven't seen before, and you can use the tools and techniques you've learned from this class to
determine the best course of action to find the proper algorithm. And if the problem is NP-complete, you needn't just give up,
but should try to find good heuristics or approximation algorithms.


Remember that computers these days don't just process one instruction at a time. Computers have many cores and are also
linked together, and we didn't cover tools like MapReduce that allow spreading the computational work among many machines.
And quantum computers, if they can be built, may solve harder problems like factoring numbers.
We hope we've challenged you and made you think about computing in new ways. Just remember: in a complex world, it's best
to keep it simple.
